On the Ethics of Data-Driven Journalism: of fact, friction and public records in a more transparent age

While ethics in journalism have been defined and upheld for decades, the context they’re practiced within has shifted with changes in technology and society. That’s true of public records as they become increasingly digitized and more liquid, flowing from file cabinets and county servers onto the World Wide Web. Security and privacy by obscurity are no longer enough to protect sensitive data, particularly as the traditional gatekeepers to such stores disappear and publication has become dramatically democratized around the world. In 2013, we’re just starting to really grapple with the issues that are presented here, despite years of data breaches in the public and private sector, and warnings from researchers about poor information minimization or redaction, lax security protocols or lapses in common sense.

As digital government comes to more towns and the capacity to do data-driven journalism becomes integrated into more newsrooms, these tensions and their consequences are only going to grow more difficult. If data journalists and their editors don’t apply the same ethical lens to the data they publish that has governed protecting sources, they will run the risk of violating the trust of the public they seek to inform.

Given the increased reach and velocity of digital media, data journalists must be more conscious of ethics than ever. Experienced and rookie journalists alike need to know how to turn data into journalism in a way that both protects and informs the public the data describes.

If you’re applying a data-driven lens, as Jeff Sonderman highlighted at the Poynter Institute, you’ll need to ask a series of basic questions.

“In every situation you face, there will be unique considerations about whether and how to publish a set of data,” he wrote. “Don’t assume data is inherently accurate, fair and objective. Don’t mistake your access to data or your right to publish it as a legitimate rationale for doing so. Think critically about the public good and potential harm, the context surrounding the data and its relevance to your other reporting. Then decide whether your data publishing is journalism.”

If we want to get a handle on the future that’s coming quickly to cities and countries around the world, it’s useful to look at the hotspots where conflict has already burned a hole in public imagination after that arrival. Three recent stories are useful for considering the tangle of ethical issues presented by different entities publishing public records:

  1. The “gun map” in upstate New York
  2. The use of mug shots by for-profit services
  3. The NSA papers published by The Guardian and The Washington Post

Aiming to make a data-driven statement and missing the target

In December 2012, a newspaper provoked a national conversation about the use of public records when the Westchester Journal News published a story about gun owners and (much more controversially) a map that listed the names and home addresses of every pistol permit-holder in New York State’s Westchester and Rockland counties. The Journal News had obtained the permit data through a Freedom of Information Law (FOIL) request and then geocoded it.

The outrage that resulted was instructive: this data was public and subject to a freedom of information law. As Reuters media columnist Jack Shafer argued, public records are public, so anyone can do what they want with them. Granting that the ability to do so exists, was it ethically sound to publish the names and addresses of permit holders?

Then and now, I’d say no: they added nothing to the story and put some people unnecessarily at risk. Al Tompkins, a senior faculty member at the Poynter Institute, agreed:

“Journalists broadcast and publish criminal records, drunk driving records, arrest records, professional licenses, inspection records and all sorts of private information,” he wrote, “but when we publish private information we should weigh the public’s right to know against the potential harm publishing could cause.”

The question of what to do about guns, maps and disturbing data in New York was answered in part by the state legislature, which passed legislation that created an anonymity exemption for gun permit holders. The issues this situation raised, however, will be central to data journalism in every state and country around the world. As more government records become digital, how they are used and the context that the data is given will only become more important.

The conflict over guns and data demonstrated how government data could be published by a media organization in a way that not only made citizens quite uncomfortable but put some at risk by making them more easily discoverable.

It also highlighted an issue with data quality and journalism: more than three quarters of the data in the gun map was inaccurate. The Journal News took the map offline in January 2013.

Given that such mistakes will inevitably become part of the journalism landscape, media organizations will have to grapple with the question of how to correct them. “We regret the error” has long been a fixture in fact-based journalism. Soon, as New York Times interactive developer Jacob Harris explored for Mozilla’s “Source”, more readers should expect to see programming corrections when acts of data-driven journalism go awry. Harris argues convincingly that every media entity that gathers, analyzes or publishes data will need to have and use a correction policy for errors in code, computation or collection.

Getting mugged by data online

In an ideal world, acts of journalism shine a light on dodgy practices and galvanize action from the people who have the power to change them. This past February, Jonathan Hochman published an op-ed at SearchEngineLand urging Google to crack down on the mugshot extortion racket, highlighting shady websites that scrape or download public records from government and publish booking photographs online. These sites then offer to remove the pictures for a fee. Until recently, those pictures were showing up in Google searches for names, significantly amplifying and codifying the impact of arrests.

While some issues are systemic and resistant to simple policy tweaks, from poverty to pollution to education to crime, others are more susceptible to shifts from powerful new gatekeepers to information. This is such an instance. In October 2013, in response to Hochman’s op-ed, Google changed its search algorithm to push these mugshot sites down in results. (That shift was also reported by the New York Times in a feature on online mug shots that made a much broader swath of the public aware of the ongoing “extortion-by-public-record” digital racket.) Credit card companies are also considering cutting off electronic payment services to these sites. That choice, along with Google’s shift, has caused some alarm among observers like GigaOm senior writer Mathew Ingram, who is concerned about the long-term implications of these shifts, and Tow Center director Emily Bell.


There are some truly thorny issues here for data editors to consider. On the one hand, these are public records that may hold significant value for the public interest. As Ingram notes, organizations like the Reporters Committee for Freedom of the Press don’t want to see media access to such information reduced because of misuse, as occurred with gun registry data in New York. On the other hand, these records have significant potential to damage the lives of people who lack the power, resources or influence to escape the consequences of their digital amplification, given how routinely potential employers and academic institutions search for applicants online.

What mugshots mean for public data is, in other words, a fine conundrum for the digital age. Data scientist Hilary Mason suggests moving beyond the frame of outright removal or unfettered online access.

“The debate around fixing this problem has focused on whether the data should be removed from the public entirely,” she wrote. “I’d like to see this conversation reframed around how we maintain the friction and cost to access technically public data such that it is no longer economically feasible to run these sorts of aggregated extortion sites while still maintaining the ability of journalists and concerned citizens to explore the records as necessary for their work.”

Given that potential misuse or distortion of government data by media and the public has often been cited by government officials and civil servants as a reason not to release it, data journalists, open government advocates and “civic hackers” do hold some responsibility to protect the privacy and security of the public described in the data. While embarrassment of public officials behaving badly is a terrible rationale for not publishing data, considering the impact upon the lives or professional prospects of private citizens should at least give journalists pause. Many kinds of data have been obscured by being bound up on paper, in file cabinets. Once that data becomes liquid, with the friction of physically visiting a reading room or file cabinet in a courtroom basement removed, the context for its collection can collapse rapidly.

On Wikileaks, Snowden and the role of intermediaries

Nowhere has the issue of “potential harm” been more contentious than when Wikileaks released data from the U.S. Department of Defense and Department of State to multiple news organizations in 2010 and 2011. Every media organization that reviewed classified cables or logs from the Pentagon had to decide not only whether to publish them but how, balancing the need to redact the names of people who might be put at risk against the public’s right to know what was done on their behalf by government. The technical capacity to move through millions of lines of messy data in proprietary formats, however, rests with only a limited number of news organizations. If the capacity to do data-driven journalism at scale isn’t democratized, this dynamic could enshrine traditional media power structures.

In 2013, journalists at The Guardian and The Washington Post faced similar decisions when they received documents from National Security Agency contractor Edward Snowden and subsequently published selected portions of them as The NSA Files and NSA Secrets. The New York Times and ProPublica later worked together to report on the documents.

In each case, the editors and reporters working on these stories have had to make difficult, important decisions about what information to publish. These aren’t novel calls, given many decades of correspondence from wars, tribunals or peace negotiations, but the sensitivity of the subject matter and global reach of published data create a context that can’t be ignored. While few journalists will come into possession of documents like the Snowden leaks over the course of their careers, the way that these stories were reported and collaboratively syndicated, and the way the data itself was protected, is going to be an important case study for generations to come.

Whither ethics in data?

For more on the ethics of data journalism, it’s worth reading a series of extracts from a draft book chapter by Birmingham City University professor Paul Bradshaw.

Portions of this post were excerpted from a draft report on data-driven journalism, due to be published by the Tow Center in spring 2014.


Alexander Howard is a Tow Fellow working on the Data Journalism Project at the Tow Center for Digital Journalism. The Data Journalism Project is made possible by generous funding from both The Tow Foundation and the John S. and James L. Knight Foundation. It includes a wide range of academic research, teaching, public engagement and development of best practices in the field of data and computational journalism. Follow Alexander Howard on Twitter @digiphile. To learn more about the Tow Center Fellowship Program, please contact the Tow Center’s Research Director Taylor Owen.