New Ideas for Automation in the Newsroom


How can automation help reduce the cost of story discovery in a newsroom? As a Tow fellow in 2015, I had the opportunity to explore this question by building a tool, available online at, that automates some of the exploratory data analysis associated with investigative work.

Newsrooms need automated investigative tools now more than ever because their human investigative resources have been decimated. Since the 2008 crash, among the most severely cut departments have been the ones that do long-term, in-depth investigative reporting. Enterprise ideas are the hardest kinds of story ideas to come up with: you don’t know what you don’t know about what’s going wrong in our public institutions. Most investigative stories are reactive, resulting from a whistleblower’s tip, rather than proactive, resulting from the reporter’s own inquiry. As newsrooms cut investigative resources, I wondered if we could build computational systems to help the remaining reporters do their jobs more easily and proactively discover story ideas in public data.

During my Tow fellowship, I explored whether a kind of custom software called a Story Discovery Engine could facilitate enterprise reporting on campaign finance issues. (It can.) I was curious about campaign finance in anticipation of the 2016 US presidential election. I had heard a lot about dark money and super PACs in the wake of the 2010 Citizens United decision, but I knew there was a vast amount I didn’t understand about this complex system. However, I did know that reporters who work with campaign finance data spend a lot of time doing the same routine things: downloading data from the FEC and other sources, cleaning it, associating data with known entities, and building basic visualizations. Any time you have routine processes, there is an opportunity for automation.

The last time I built a Story Discovery Engine, I optimized it to find a story in education data. This time, instead of starting with my own story idea, I started by interviewing other journalists. I specifically asked about the kinds of stories they look for, and what the indicators are that suggest a story might be hiding in the data.

To an outsider, these indicators are almost impossible to spot. But to these campaign finance gurus, the signs of corruption were clear as day. One common indicator is administrative overspending. Nonprofit organizations are required to report their income and spending. If the organization spends an unusually high percentage of its income on administration, it is often an indicator that something is amiss internally, and there is likely an opportunity for a story.

However, deciding what percentage is “unusually high” is a judgment call. It is also a judgment call to determine whether there is a story worth pursuing. Some fluctuation in administrative expenses is normal. There might be a perfectly good reason for an organization to have unusually high administrative expenses; a high percentage does not necessarily imply corruption. This ambiguity is the reason that it is unwise to build a system that claims to automatically identify investigative story ideas. It would be unfair (not to mention unethical) to accuse an organization or a public servant of misdeeds based on a naïve computational analysis. It requires human decision-making to fully consider what is going on in a given situation.
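A human-in-the-loop version of this indicator is easy to sketch. The following Python snippet is purely illustrative and is not code from Bailiwick: the 30 percent threshold, the organization names, and the field names are all assumptions. The code only filters and sorts; it surfaces candidates for a reporter to examine rather than declaring anything amiss.

```python
# Hypothetical sketch: flag organizations whose administrative spending
# ratio exceeds a reporter-configurable threshold. A flag is a prompt for
# human review, not an accusation -- a high ratio may have a benign cause.

def admin_ratio(income, admin_spending):
    """Fraction of income spent on administration (0.0 if no income)."""
    return admin_spending / income if income else 0.0

def flag_for_review(orgs, threshold=0.30):
    """Return (name, ratio) pairs above the threshold, sorted highest-first
    so a reporter can triage the most unusual cases."""
    flagged = [
        (org["name"], admin_ratio(org["income"], org["admin_spending"]))
        for org in orgs
    ]
    return sorted(
        [(name, ratio) for name, ratio in flagged if ratio > threshold],
        key=lambda pair: pair[1],
        reverse=True,
    )

orgs = [
    {"name": "Civic Fund", "income": 1_000_000, "admin_spending": 120_000},
    {"name": "Example PAC", "income": 500_000, "admin_spending": 260_000},
]
for name, ratio in flag_for_review(orgs):
    print(f"{name}: {ratio:.0%} of income on administration -- review?")
```

The division of labor is the point of the sketch: the software narrows thousands of filings down to a short list, and the judgment call about whether a flagged ratio is actually a story stays with the human.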

This human component is essential for newsrooms to remember. Computers can’t independently determine corruption. Currently, a human is necessary in any automated investigative system. Instead of a fully automated investigative system, in this case I built a human-in-the-loop investigative system. Most people dream of full automation: cars that drive themselves, robots that deliver packages. I don’t dream of this. I’m fine with a world that includes people. I like human judgment, as flawed as it is. I like the drama and the idiosyncrasies of human systems. The difference between a fully autonomous system and a human-in-the-loop system is like the difference between a drone and a jet pack. The drone is autonomous: it is programmed to go to a particular location, drop a bomb or take a picture, and then come back to base by itself. A jet pack (in theory) is designed to be strapped onto the back of a human being in order to accelerate the human’s effort. Both are legitimate system models, and each is useful for a different type of task.

Bailiwick, the system that I built, automates some of the grunt work associated with downloading FEC data, cleaning it, putting it into a database, organizing it into recognizable categories, and creating simple visualizations. The visualizations, plus the knowledge engineering layer that organizes the data into recognizable categories, allow a reporter to quickly make sense of complex data. Bailiwick analyzes data for each of the thousands of 2016 federal candidates and for 17,000+ active political committees. There is also an alerting function that allows a user to set up a personal profile of candidates or races to follow. A Pennsylvania reporter, for example, could choose to follow the two frontrunners in the PA Senate race and a handful of frontrunners in PA House races. Bailiwick sends an alert to the reporter via a private Slack channel every time there is a filing by a candidate the reporter follows. Setting automatic alerts, via Slack or a service like IFTTT or Zapier, has long been known as an effective way to use automation in the newsroom.
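The alerting pattern is simple enough to sketch in a few lines. This is not Bailiwick’s actual code; the webhook URL is a placeholder and the filing fields are illustrative assumptions. It posts a message to a private channel using Slack’s incoming-webhook mechanism, which accepts a JSON payload with a `text` field.

```python
# Hypothetical sketch of a filing alert pushed to a private Slack channel
# via an incoming webhook. The URL and filing fields are illustrative.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

def format_alert(filing):
    """Build the Slack message text for a new FEC filing."""
    return (
        f"New filing: {filing['committee']} ({filing['candidate']}) "
        f"filed form {filing['form_type']} on {filing['date']}"
    )

def send_alert(filing, webhook_url=SLACK_WEBHOOK_URL):
    """POST the alert to Slack's incoming-webhook endpoint."""
    payload = json.dumps({"text": format_alert(filing)}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status == 200

filing = {
    "committee": "Friends of Example",
    "candidate": "Jane Example",
    "form_type": "F3",
    "date": "2016-10-15",
}
print(format_alert(filing))
```

In a nightly pipeline, a script like this would run after the new filings are loaded, compare them against each reporter’s follow list, and call `send_alert` for every match.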

Bailiwick is not small. As of January 1, 2017, the system included 11,032 lines of Python; 13,158 lines of HTML; 40,113 lines of JavaScript; 13,601 lines of CSS; 2.9 million lines of text; 2,892 lines of markdown; and approximately 94.2 million records in the database. The number of records increases every time new filings are added to the FEC site. The system gathers small updates from FEC every night, and then it completely refreshes all of its data from FEC once a week.

Bailiwick is not designed for beginners, but rather for experienced reporters who are specifically looking to find stories in campaign finance data. Top-tier news organizations like the New York Times, ProPublica, or the Center for Public Integrity have software developers on staff who build custom, non-public-facing software for reporters to use on campaign finance data. Bailiwick is designed for organizations that don’t have internal developers.

For an investigative project, Bailiwick was relatively inexpensive to build. In Democracy’s Detectives: The Economics of Investigative Journalism, James T. Hamilton outlines some of the costs and benefits associated with complex investigative projects. He estimates that “Deadly Force,” the Washington Post’s 1999 Pulitzer Prize-winning story on D.C. police shootings, cost about $487,000 (in 2013 dollars) to create. Hamilton writes: “While accountability reporting can cost media outlets thousands of dollars, it generates millions in net benefits to society by changing public policy.” Watchdog reporting is good for society, but measuring its impact and its cost is not straightforward. The starting price varies by medium as well. Video is more expensive than print or digital: production costs for a single hour-long news documentary for a show like PBS Frontline start around $500,000 and can go upwards of $1 million.

A news app like Bailiwick starts at about $50,000 in up-front software development costs. It currently costs me about $1,000 a month to maintain; I plan to offer it for free online for about a year. Newsrooms are likely hesitant to commit to spending $50K on custom software projects. In the future, newsrooms could consider joining together to fund similar projects.

It is clear that news apps like Bailiwick can help newsrooms produce more data-driven investigative work and can lower the cost of discovery for these investigative stories. Newsrooms that want to adopt this kind of automation will need to commit to substantial up-front costs, and will need to plan for a development time frame that is longer than the customary time frame for daily or weekly news production. However, newsrooms excel at developing and meeting production timelines, so this challenge will likely not be an obstacle. It is also helpful to think about automated investigation as a kind of artisanal production. Data-driven enterprise stories are not easily mass-produced; they are artisanal products. Small-scale automation using human-in-the-loop systems is a cost-effective way to increase production of these high-quality stories, just as a small bakery might buy a large-capacity mixer in order to produce more loaves of bread. As newsrooms increase their production of watchdog reporting in the public interest, society will benefit.

Image via Sakena on Flickr.