Aron Pilhofer on data journalism, culture and going digital
When it comes to computer-assisted reporting and the ways media companies are using technology, there are few people in the U.S.A. as knowledgeable as Aron Pilhofer. He runs a newsroom team at The New York Times that combines journalism, social media, technology and analytics, co-founded the open source DocumentCloud.org project, is a two-time grantee of the Knight News Challenge, and co-founded Hacks & Hackers, a network of people focused on applying development and digital innovation to the method and practice of journalism. Happily, part of my research into data journalism’s past, present and future has been interviewing people like Pilhofer, given the insight that those talks offer for debugging debates about “what it all means.” He was kind enough to talk to me earlier this month. Our interview follows, lightly edited for clarity and content, with [bracketed] additions for context.
How do you feel about the term “data journalism” supplanting computer-assisted reporting (CAR)?
When someone can tell me what is meant by “data journalism,” maybe I would start to feel strongly about it.
I think there’s a lack of specificity, with various definitions. It tends to be almost geographically based. In Europe, when you talk about data journalism, you’re almost always talking about data visualization. In the United States, it’s sometimes data visualization, sometimes old school computer-assisted reporting.
While I prefer the term data journalism, because it’s much less goofy [than computer-assisted reporting], I think there’s a lack of precision. You do need to define your terms. The way I see it, it is a continuum where the work that Phil Meyer, Barlett and Steele were doing 30 years ago [continues] all the way to today, with people like Sarah Cohen and John Keefe, all sharing kind of the same elements. You treat data as a source [in your reporting].
What’s happening with the market for data journalists and your ability to hire for these skills?
In some ways it’s easier. In others, it’s harder today. There’s way more competition now. We’re losing people to really good newsrooms. We are not the only game in town, which we used to be. There was a time when there was us and there was the Washington Post, and that was kind of it.
What are you working on now that’s new and potentially important?
We just started a newsroom analytics team. The kinds of projects we’re doing there are entirely editorial. They are not tied to advertising at all.
Right now, many newsrooms are stupid about the way they publish. They’re tied to a legacy model, which means that some of the most impactful journalism will be published online on Saturday afternoon, to go into print on Sunday. You could not pick a time when your audience is less engaged. It will sit on the homepage, and then sit overnight, and then on Sunday a homepage editor will decide it’s been there too long or decide to freshen the page, and move it lower.
I feel strongly, and now there is a growing consensus, that we should make decisions like that based upon data. Who will the audience be for a particular piece of content? Who are they? What do they read? That will lead to a very different approach to being a publishing enterprise.
Knowing our target audience will dictate an entirely different rollout strategy. We will go from a “publish” to a “launch.” It will also lead us in a direction that is inevitable, where we decouple the legacy model from the digital. At what point do you decide that your digital audience is as important as, or more important than, print?
This sounds similar to the approach that many online outlets are pursuing.
[The Interactive News] team can build just about anything now to scale to a ridiculous amount of traffic, tying into every New York Times system. That isn’t the problem anymore. We can make everything work [from a technical standpoint] on David Leonhardt’s project, which is our answer to 538, but it still may not find an audience.
This is a product build, where we take a particular flavor of journalism and find an audience. We find a way for the audience that would want that to find it. It is really hard to think about when you really only know one tune: Your homepage. It is really powerful, but that alone isn’t going to do it. How does that change what we’re building? How can we consistently get that audience to return?
Building one-off interactives isn’t that important. When you’re starting to build persistent features, like what John Keefe has done with his Cicada Project, or Scott Klein has built with Dollars for Docs, you’ve got to think about these things more deeply. Who in the newsroom is better positioned than a data journalist to do that?
How many data journalists do you have on staff at the Times?
It depends on your definition; we could be anywhere from 5 to 50. We have a computer-assisted reporting team, which is 5-6 people. We have a graphics desk, which is probably 15 primarily or largely dedicated to digital. On my team we have 21 developers. Then there’s our research and development department, and design team.
Is there anyone you’d call a computational journalist?
Maybe Chase Davis. Amanda Cox is a statistician by training. Sarah Cohen was a statistician before she went into CAR. We have data scientists on the business side. R&D has a couple, like Mike Dewar, who used to be at Bitly. These are people who are applying data science techniques to actual journalism, stories, infographics and data visualizations.
Would you agree with an estimate of several hundred data journalists currently working in the USA?
Absolutely. NICAR has 850 people registered, with a healthy walkup expected. [The final attendance at NICAR 2014 was 997 people.] Five years ago, the conference was on life support, with maybe 250 people. Now, this number of people showing up has changed it a lot, I think for the better. It has become the “must-go” conference for folks who are doing what my team does, for the John Keefes & Scott Kleins of the world.
Is there a mismatch between the supply and demand for people with the skills you’ve referenced?
It’s true. I have two openings now.
What was your path to the profession?
I was a political reporter, but always used data in my reporting. I just started doing it in college. I just started messing around. I had a history professor who was not well known then. Now, he’s borderline famous from doing quantitative methods in history. He’d do statistical sampling of historical census data that had just been paper records before that. Suddenly, you could do queries on the 1930 Census. You were not just basing a historic analysis on papers or on interviews with people, or what you could glean from anecdotes. You were looking at data. It was incredible.
That’s not that different from what a data journalist does, on the CAR side. Instead of a person, you’re using data as a source. Over time, I shifted from being a reporter who does CAR to being a specialist at the Center for Public Integrity to a CAR editor at the New York Times and then started this team.
How did you start learning to program?
I can thank an IRS story on 527 committees, which were then the campaign finance loophole du jour. They were previously unregulated and Congress, in its wisdom, put the IRS in charge of regulating them. It was idiotic. The IRS is not a disclosure agency. They put together the world’s worst disclosure website. There was basic data there, but you couldn’t aggregate it or access it in a meaningful way. It would have taken thousands of mouse clicks to get all of it.
I talked to a public information officer, after they denied my FOIA request for the database underlying the site. He said it was all on the website.
So, I created the world’s worst Web scraper in PHP. It ran from the browser. I didn’t know the command line well.
Is “Silent Partners” still on the Web?
Parts of it are long gone, though bits remain at the Center for Public Integrity. What you won’t find is the massive searchable database. We did what the IRS should have done. We took all the paper filings and got a grant to do data entry. We sent them to a company in Virginia. We spent $80,000 to create what was then the only searchable database of political donor contributions. It’s completely out of date now. The Center for Responsive Politics has been continuing to do this.
I discovered that I really enjoyed the coding part in addition to reporting. The art of it. That’s how I ended up shifting into my current job.
Have you seen more coders move into data journalism, or journalists learn how to code?
I’ve seen far, far more move from the outside in, from non-journalism roles.
Do you have any sense of why that might be?
Journalism is one of the few professions that not only tolerates general innumeracy but celebrates it. I still hear journalists who are proud of it, even celebrating that they can’t do math, even though programming is about logic. It’s hard to get a journalist to open up a spreadsheet, much less open up a command line. It is just not something that they, in general, hold to be an important skill.
It’s baffling to me. Look at The Sun-Sentinel, which just won another Pulitzer for a story on speeding cops that you could only do with data analysis. You would think you wouldn’t have to make the case that this is core to what journalists should know.
It’s a cultural problem. There is still far too much tolerance for anecdotal evidence as the foundation for news stories.
So this is endemic?
I don’t know how to solve it. Look at NICAR being around as long as it has. Early on, they had the naive belief that if you could train enough people, they could make the organization irrelevant. Now, when you look back, it’s hilarious. Obviously, that’s never going to happen for practical purposes. I don’t think we’re anywhere near the point where you could say, given enough training and time, that you would not need a specialist in the newsroom. We’re so far away from that.
Are there cultures where this is changing? Maybe ProPublica?
It’s as far along as it is because of Scott Klein. It took years before they put Jeff Larson on news stories. That just happened this year. There are newsrooms making this a significant project. Look at the L.A. Times, or WNYC. I think John Keefe is a fricking genius. I wish I were doing the work he is.
What others would you highlight?
Given time, given urgency, we will forge something new from the old models. Given how much time we have had, I would have hoped we’d be further along. Maybe I’m just impatient. When do you treat digital as your primary platform?
We are launching three subscription products this year. If all goes well, we will have more subscriptions on pure digital than in print [at the end of 2014]. We have to think about where the eyeballs are. From the perspective of the newsroom, over time, we have to think primarily digital. That’s the cultural change that isn’t happening fast enough.
There needs to be a strategy where all the things we considered “nice to have” in a newsroom (from analytics to coders to designers) are, all of a sudden, building our core product. Text only takes you a certain distance in the digital sphere. That’s the part that I’m excited about building.
[Image Credit: Knight Foundation]