On Beginning a Huge, Complicated Data Journalism Research Project

As a Tow Fellow this year, I will be creating a software system to help journalists detect campaign finance fraud. The software, which I call a Story Discovery Engine, will look at campaign finance filings and will alert journalists when irregularities exist. Each irregularity is potentially the basis for a story.

The Story Discovery Engine is based on a type of artificial intelligence called an expert system. You can read more about it in this AJR post or in this CACM story. Since I am at the very beginning of this huge, complicated project right now, I wanted to write a few words about what it is like to be a journalist embarking on a software development project.

I have an unusual background: before I was a journalist, I was a computer scientist. I started out writing software for long-distance telecom networks (remember when long distance phone calls cost more than local calls?). I also wrote software for interactive musical installations, for large-scale financial transactions, and for a variety of websites. Dabbling in these different disciplines as a developer was good practice for becoming a reporter. I learned that each discipline has its own workflow, its own vocabulary, and its own unanswered questions.

Making a Story Discovery Engine is more like writing software than writing a story, although there are certain similarities. Software development has a few steps: design, implementation, testing, and rollout. You loop or iterate through them a few times before you’re ready for the world. Just as journalists revise their stories, programmers revise their code. However, I usually write a story in a day or a couple of days. Writing software at the scale of the Story Discovery Engine takes weeks or months. The process is not unlike writing a book, come to think of it.

The first step is design. I asked myself: what kind of Story Discovery Engine do I want to make? In this case, I wanted to make one that would help me discover campaign finance fraud.

The core of the Engine is the conflict between what should be and what is. The “should be” is articulated in laws and policies. The “is” is articulated in data. For the campaign finance investigation, my first step was to pull the laws and policies, or rules, about campaign finance. There are state rules and federal rules; I focused on federal rules. These are about 600 pages of deadly boring statutes and regulations. I read them all. I took notes. It was grueling. At the end, I boiled them down to a set of rules that I could implement in programmable logic. Not all laws lend themselves to this practice, so it’s not like we’ll have government by computer anytime soon. But the quantitative rules are easy to implement. If there is a threshold of $2,500 for personal donations, for example, you can look to see who has donated more than $2,500.
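
To make the threshold example concrete, here is a minimal sketch in Python of how a quantitative rule might be checked against data. It is an illustration, not the Engine's actual code. The Donation record, the donors_over_limit function, and the sample donors are all hypothetical, and summing each donor's contributions before comparing against the limit is my own assumption about how such a rule would be applied.

```python
from dataclasses import dataclass

# The example threshold from the text, in dollars (not a statement of current law).
PERSONAL_DONATION_LIMIT = 2500


@dataclass
class Donation:
    """Hypothetical record: one contribution from one donor."""
    donor: str
    amount: float


def donors_over_limit(donations, limit=PERSONAL_DONATION_LIMIT):
    """Return {donor: total} for donors whose combined donations exceed the limit.

    Assumption: contributions are summed per donor before the comparison.
    """
    totals = {}
    for d in donations:
        totals[d.donor] = totals.get(d.donor, 0) + d.amount
    return {donor: total for donor, total in totals.items() if total > limit}


# Made-up sample data, for illustration only.
sample = [
    Donation("Donor A", 1500),
    Donation("Donor A", 1200),
    Donation("Donor B", 2500),
    Donation("Donor C", 400),
]

flagged = donors_over_limit(sample)
print(flagged)  # {'Donor A': 2700}
```

Donor B gives exactly $2,500 and is not flagged, because the rule as stated looks for donations of more than the threshold. Each flagged donor is an irregularity a reporter would still need to verify by hand.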

The next step was to assemble the data. The FEC makes a lot of data conveniently available through its API and through posting regular filings in a timely manner. Unfortunately, though these filings are timely, they are not always machine-readable. There is a big difference between human-readable and machine-readable content. Journalists are very familiar with this; people who want to obfuscate or make our lives difficult will often give us data that is not in machine-readable format, and then we have to spend more time negotiating for the data in the correct format or converting it to the right format ourselves. It’s a hassle.

At any rate, the talented folks at the New York Times are the leaders in this particular battle for machine-readable campaign finance data. They have made something called the Campaign Finance API. As you probably know, an API is something that allows you to request data from someone else’s application. The data is delivered in a standard format. Right now, that format is usually JSON or CSV.
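
As a rough sketch of why a standard format matters, the Python below reads the same two made-up records from a JSON string and from a CSV string and ends up with identical data. The payloads and the field names "donor" and "amount" are invented for illustration; they are not the response format of the Campaign Finance API or of the FEC, and no network request is made.

```python
import csv
import io
import json

# Hypothetical payloads; the field names are assumptions for this example.
JSON_PAYLOAD = '[{"donor": "Donor A", "amount": 1500}, {"donor": "Donor B", "amount": 2500}]'
CSV_PAYLOAD = "donor,amount\nDonor A,1500\nDonor B,2500\n"


def parse_json(text):
    """Turn a JSON array of objects into a list of plain dict records."""
    return [{"donor": r["donor"], "amount": float(r["amount"])} for r in json.loads(text)]


def parse_csv(text):
    """Turn CSV text with a header row into the same list of dict records."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"donor": r["donor"], "amount": float(r["amount"])} for r in reader]


json_records = parse_json(JSON_PAYLOAD)
csv_records = parse_csv(CSV_PAYLOAD)

# Both formats yield the same records, ready for whatever rule checks come next.
print(json_records == csv_records)  # True
```

Either way, a few lines of standard-library code are enough to load the data, which is the point of machine-readable formats. A scanned PDF of the same filing would need hand entry or error-prone conversion first.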

So. I have the rules, I have the data. What next?

Next is more serious design. I have to design the system, or write down what it is supposed to do. This comes in the form of a specification, or spec. Writing a spec is SUPREMELY important because it gives you a roadmap to the entire project. If I did not have a spec, I would not have a way to talk about the project, and the people building it would not know what I was talking about! We have lots of examples of documents that guide us in collaborative activities. For example: if you run a class, you write a syllabus. If you run a research project, you write a research proposal (which often turns into a grant application). If you write a book, you write a book proposal or outline. So if you write software, you write some kind of a spec, either a technical spec or a functional spec or both.

If you are a single newsroom developer and you are only writing code for yourself, or if you are working with a very small and functional newsroom team, you may think that you don’t need a spec. If you are writing only small bits of code, like web scrapers, you are probably right. But if you are doing anything more complex than scraping a single site, or if you are making something that will be used by more than one person, I would encourage you to write a spec. Writing a spec is like writing an outline for a paper or a long article or a book. It works out better in the long run, and it can save you when you get far down a path and forget where you were going in the first place.

You can get all formal with a spec, and follow the guidelines from the discipline of software engineering. Or, you can just write something that has enough detail to communicate. I chose to write something that has enough detail to communicate. My goals for my functional spec include:

  • Create a document to help keep my project organized
  • Work out some technical details in writing
  • Create a design document that can be used by other people on the team for guidance
  • Have a starting point for face-to-face communication

I also wrote a set of user stories, little scenarios that describe who will use the software and how they will use it. I thought this part would take a really long time. However, to my great surprise and delight, I quickly came up with four user stories that described most of what I wanted the system to do. It was one of those parting-storm-clouds-rainbow-sunshine moments: suddenly, the whole project seemed manageable and achievable in the time available. I love those moments.

Now that the project has been designed and organized, my next step is to implement it. That’s what I’ll be doing for the next several months. I’ll write another post after the development is underway to update you on how the project is going. In the meantime, for questions or suggestions, feel free to tweet me @merbroussard or get in touch via