Despite the wide availability of data, ensuring independent access to data sources has never been more crucial. How can researchers engage in data sharing while also protecting privacy?

The Stigler Center, of which ProMarket is a part, recently held its fifth annual conference on antitrust policy and regulation. This included a hands-on discussion about the successes and failures of data access initiatives and the challenges that must be overcome if we are to share data while also protecting user privacy.

“We’ve never had more data than we do today,” said Stigler Research Fellow and conference co-organizer Filippo Lancieri, “yet, [maintaining] data access has never been more fundamental for understanding policy tradeoffs.” Joining Lancieri was a panel of academic researchers, Laura Edelson (New York University), John de Figueiredo (Duke University), and Lior Strahilevitz (UChicago Law), along with Julia Angwin, founder of the data journalism outlet The Markup.

Edelson, a computer scientist, discussed her Cybersecurity for Democracy project. The project has studied a wide range of online content, from the real-world harms of consumer fraud to election disinformation in digital ads. Her lab found anti-democratic content prominent in the 2020 election cycle, and she noted that little has been done to prevent that from happening again. Her team grapples with the question: How do we find this content and make it less prevalent? “With network problems such as these, data is the solution to studying how problematic content manifests itself in practice,” she said. Her lab collects data through user-side crowdsourcing to study platforms rather than users, looking at how behavioral and interest-based ads are targeted. This data is made publicly available for use by other researchers and journalists. The lab aims to help the public better understand issues such as how ad spending changes over time, what ads’ purposes are, and who is spending money on them.

Strahilevitz provided an overview of the legal frameworks governing privacy, academic research, and scraping. Approaches to privacy are very different in Europe and the US, with omnibus legislation in the former and a sectoral approach in the latter. Recent legislation in California has given users the right to delete some data held by tech companies, bringing the state closer to the European model. Europe has historically taken a “negative liberties approach to academic research,” said Strahilevitz, and Article 5 of the General Data Protection Regulation (GDPR) provides institutional safeguards to protect privacy. More recently, the Digital Services Act (DSA) has moved beyond this “negative” approach and towards a more positive one. The act requires bigger tech companies to open up data, sets up a third party that will examine research proposals, and requires researchers to publish their findings, making them freely available to the public.

Julia Angwin, a proud alum of the University of Chicago’s student newspaper, the Chicago Maroon, discussed her work with The Markup, which focuses on data-driven investigative journalism. To conduct its work, the outlet scrapes websites but doesn’t track users, and so does not participate in the data exploitation economy. To investigate platform self-preferencing and algorithmic amplification, The Markup launched Citizen Browser, through which a nationally representative panel of 1,200 users was paid to take and submit snapshots of their individual Facebook feeds. “It’s like photojournalism within Facebook’s machine,” she said. The project has revealed, among other things, that Facebook has lied about what kind of content it is promoting and about recommending political groups in feeds. “Nobody is policing Facebook because nobody else has this data,” she said. “It’s like we are on the tarmac inspecting the plane as it is flying.”

Regardless of whether their research question is well-defined (‘investigative’ or ‘verification’ research) or not (‘exploratory’ research), researchers must navigate a delicate trade-off every time they work with user data, de Figueiredo explained. Are they protecting user privacy enough, or are they degrading the data by protecting privacy too much? He described two types of data that can be derived from the original dataset to suit the use case: synthetic data (simulated from a model, which retains statistical properties without representing real individuals) and partially synthetic data (where the researcher knows who the individuals are but not what their characteristics are). In each case, the researcher, and sometimes the policymaker deciding what to do with the research, must make a decision related to the “differential privacy” question: How much privacy loss can we tolerate in the dataset?
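
To make the privacy-loss question concrete, here is a minimal sketch of one standard differential privacy technique, the Laplace mechanism, applied to a simple count. It is an illustration only, not something presented by the panel: the function names, the true count of 1,000, and the epsilon values are all assumptions chosen for the example. A smaller epsilon means less privacy loss but noisier, less useful answers.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    # A count changes by at most 1 when one person is added or removed
    # (sensitivity 1), so the Laplace mechanism uses noise scale 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon, trials=2000, seed=0):
    # Average distance between the noisy answer and the truth.
    # The true count of 1000 is a hypothetical value for illustration.
    rng = random.Random(seed)
    true_count = 1000
    errors = [
        abs(private_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    ]
    return sum(errors) / trials


if __name__ == "__main__":
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean absolute error ~ {mean_abs_error(eps):.2f}")
```

Running the script shows the average error shrinking as epsilon grows, which is the trade-off de Figueiredo described: stronger privacy protection degrades the data, and weaker protection preserves its accuracy.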

Who, then, should ultimately decide what the legitimate privacy concerns and tradeoffs are for research projects? Should it be the users who decide to give consent? Users have the right to their own data, Edelson explained, especially in the case of projects requiring inherently private user data, such as using the Frances Haugen-leaked Facebook papers to study teen body image concerns. Approval from the Institutional Review Board (IRB) can add value in such instances, Strahilevitz suggested, but there is still a threat to user privacy from entities that don’t have to go through the IRB. “Consumers have a lot going on to be able to make informed decisions about their data,” he said, “and the research community needs to go beyond user consent and go towards a framework of consumer welfare and inalienable rights.” Edelson noted that, with companies such as Venmo, a lot of data sharing has already been normalized. Further, Angwin pointed out that journalism is inherently an adversarial activity where the question is sometimes not, “Are we invading privacy?” but rather, “Are we acting in the public interest?” In cases of collecting data at scale, she said, a journalist must ask, “Who is in power, and what are they trying to hide?”

Lancieri then facilitated a wide-ranging discussion on a number of related topics. How much should researchers trust anonymization and synthetic data access? An informed answer to this question will soon be possible, de Figueiredo said, once the Census Bureau decides whether to release synthetic data for the American Community Survey, one of its most widely used public datasets. Edelson spoke of a research project, Social Science One, whose privacy protections were so strict that the data was not usable, and stressed the need for unmediated access, free from the power structures that often control data. “Data is political,” concurred Angwin. “Who collects it matters. And the person who collects it has the power to shape the questions that can be asked from it.” The Washington Post, she noted, currently maintains the only open-source database of police use of force because none is provided by federal authorities.

What government body should decide which research gets greenlit? Since the government doesn’t have perfect information to make this decision, do we need mandated data access for researchers, as under Europe’s DSA? Strahilevitz noted that GDPR compliance costs are known to be very high for small companies. One solution could be to make data available via channels like the National Science Foundation, or through Freedom of Information Act (FOIA) requests, which are often used by reporters. “Any transparency regime that doesn’t allow journalists to do their job should not be on the table,” Angwin added.

But what about data analysis in other formats, like audio and video? Edelson said that, currently, speech-to-text involves expensive resources, and analyzing video is still years off. This is why YouTube is less analyzed than other platforms, Angwin observed.

In closing, Lancieri noted that, despite covering a wide range of topics, the conference didn’t even get to cover funding or encryption. The real question when it comes to data transparency and access, Angwin concluded, “is whether [companies like] Facebook itself even know what they’re doing. Are they too big, or at least bigger than the regulators, to govern themselves?”