Publishers Can Afford Data Journalism, Says ProPublica’s Scott Klein

This winter, Scott Klein made a prediction at the Nieman Lab that drew some attention: “in 2014, you will be scooped by a reporter who knows how to program.” As I noted at this blog, he was proven correct within the month, as enterprising journalists applied their data skills to creating scoops and audiences. Yesterday, The New York Times’ promising new data journalism venture, The Upshot, published the most popular story at, confirming that original data and reporting presented with context and a strong narrative remain a powerful, popular combination. It’s still atop the most-emailed, viewed and shared leaderboards this morning.

So, Klein was right. Again.

That’s not a huge surprise to me, or to anyone else in the data journalism world.

Klein, an assistant managing editor at ProPublica, draws on years of hands-on experience working with data, reporters and developers at one of the most important nonprofit news organizations in the world. Recently, his team has published projects like The Opportunity Gap, China’s Memory Hole and Prescriber Checkup. Klein co-founded DocumentCloud with Aron Pilhofer, the New York Times editor whose perspectives on technology and the news we featured here earlier this month. Before he came to ProPublica, Klein directed editorial and business application development for the, and worked at The New York Times.

This spring, he spoke with me about what he sees in the industry and gave an early read and review of the report that the Tow Center will publish next month. Our conversation, lightly edited for content and clarity, follows.

Is data-driven journalism too expensive?

News organizations are contracting and budgets are going down. Times are still very tough. That said, I suspect that some newsrooms say they can’t afford to hire newsroom developers when they really mean that their budget priorities lie elsewhere – priorities that are set by a senior leadership whose definition of journalism is pretty traditional and often excludes digital-native forms. I also hear a lot from people trying to get data teams started in their own newsrooms that the advice that newsroom leaders get is that newsroom developers are “unicorns” whom they can’t afford. Big IT departments sometimes play a confounding role here.

I suspect many metro papers can actually afford one or two journalist/developers — and there’s a ton of amazing projects a small team can do. For years, the Los Angeles Times ran one of the best news application shops in the country with only two dedicated staffers (they still do great work, of course, and the team has grown). If doing data journalism well is a priority of the organization, making it happen can fit into your budget.

What’s changed today?

Lots, of course, has changed since Philip Meyer’s pioneering days in the 1960s. One is that the amount of data available for us to work with has exploded. Part of this increase is because open government initiatives have caused a ton of great data to be released. Not just through portals like — getting big data sets via FOIA has become easier, even since ProPublica launched in 2008.

Another big change is that we’ve got the opportunity to present the data itself to readers — that is, not just summarized in a story but as data itself. In the early days of CAR, we gathered and analyzed information to support and guide a narrative story. Data was something to be summarized for the reader in the print story, with of course graphics and tables (some quite extensive), but the end goal was typically something recognizable as a words-and-pictures story.

What the Internet added is the ability to show people the actual data and let them look through it for themselves. It’s now possible, through interaction design, to help people navigate their way through a data set just as, through good narrative writing, we’ve always been able to guide people through a complex story.

Is this new state of affairs really different?

It’s a tectonic change both in the sense that it’s slow and gradual, and in the sense that it’s reshaping the entire landscape.

Data was always central to journalism. In the oldest newspapers, from the 17th century, you can find data. Correspondents would write about the prices of commodities in faraway cities (along with court gossip) for the benefit of merchants doing international business. Commodity prices, the contents of arriving cargo ships, and even the names of visiting businessmen were a big part of the daily mission of newspapers as they started to become more common.

As technology got better in the late 18th century and readers started demanding a different kind of information, the data that appeared in newspapers got more sophisticated and was used in new ways. Data became a tool for middle-class people to use to make decisions, and not just facts to deploy in an argument or information useful to elite business people.

The change we’re experiencing thanks to the web increases the role of presentation of the data itself, both in great data visualization and in great exploratory graphics like news applications. We can show people “the back of the baseball card” on a large scale. We’ve got the tools, and the readers can understand it and make use of it. I feel like that’s as big a change as we’ve ever experienced, but I’m biased.

Do people want to read the data?

If it’s done well, people have a really big appetite to see the data for themselves.

Look how many people understand — and love — incredibly sophisticated and arcane sports statistics. We ought to be able to trust our readers to understand data in other contexts too. If we’ve done our jobs right, most people should be able to go to our “Prescriber Checkup” news application, search for their doctors and see how their prescribing patterns compare to their peers, and understand what’s at play and what to do with the information they find.

There are ways to design data so that more important numbers are bigger and more prominent than less important details. People know to scroll down a Web page for more fine-grained details. At ProPublica, we design things to move readers through levels of abstraction from the most general, national case to the most local example.

Do you recruit programmers to do DDJ? Or teach journalists?

Both. But culture matters a lot, too. People with the right mindset, who feel valued for their editorial judgment and creativity, and who are given real responsibility over their work, will learn whatever they need to learn in order to get a project done. The people on my team focus on telling great journalistic stories and don’t let not knowing how to do something stop them from doing so. They learn whatever skills, techniques and expertise they need to learn.

In terms of journalists learning how to program, I think there are some myths about what “programming” means. It doesn’t have to mean a computer science degree and it doesn’t have to mean what Google does. I know journalists who make incredibly complex scrapers for their reporting work who will tell you they don’t know how to program. Really, making tools to automate tasks is what a programmer does. There’s no magic threshold you have to pass between programmer and not-programmer.

Of course, there is a difference between knowing how to code and being a computer scientist. If you’ve learned about algorithmic efficiency and can express it mathematically, and if you’ve studied how compilers work, all under the guidance of a person who knows the subject very well in an academic environment, you’ve got skills that will help you write better, faster, more efficient code. That’s different than learning how to use a high-level programming language to get a task done.

Much of what we do in newsrooms is on deadline and meant to be put behind a caching system that makes efficient code much less important, so computer science is not a prerequisite for being a great newsroom coder. In newsrooms, most of us rely on frameworks like Rails or Django that already make great low-level programming decisions anyway.

Are there journalists picking those DDJ skills up?

Yes, it’s happening, and the pace is accelerating. A few years ago the NICAR conference was a few hundred people. This year it was almost 1,000 people. Next year, it will be even bigger.

On every desk in the newsroom, reporters are starting to understand that if you don’t know how to understand and manipulate data, someone who can will be faster than you. Can you imagine a sports reporter who doesn’t know what an on-base percentage is? Or doesn’t know how to calculate it himself? You can now ask a version of that question for almost every beat.

There are more and more reporters who want to have their own data and to analyze it themselves. Take for example my colleague, Charlie Ornstein. In addition to being a Pulitzer Prize winner, he’s one of the most sophisticated data reporters anywhere. He pores over new and insanely complex data sets himself. He has hit the edge of Access’s abilities and is switching to SQL Server. His being able to work and find stories inside data independently is hugely important for the work he does.

There will always be a place for great interviewers, or the eagle-eyed reporter who finds an amazing story in a footnote on page 412 of a regulatory disclosure. But here comes another kind of journalist, one with data skills that will sustain whole new branches of reporting.