Debugging the backlash to data journalism

While the craft and context that underlie “data journalism” are well known to anyone who knows the history of computer-assisted reporting (CAR), the term itself is a much more recent creation.

This past week, data journalism broke into the hurly burly of mainstream discourse, with the predictable cycle of hype and then backlash, for two reasons:

1) The launch of Nate Silver’s FiveThirtyEight, where he explicitly laid out his vision for data journalism in a manifesto on “what the fox knows.” He groups statistical analysis, data visualization, computer programming and “data-literate reporting” under the rubric of data journalism.

2) A story in USA Today on the booming market for data journalists and the scoops and audiences they create and enable. The “news nerd job” openings at both born-digital and traditional media institutions show clear demand across the industry.

There are several points that I think are worth making in light of these two stories.

First, if you’re new to this discussion, Mark Coddington has curated the best reviews, comments and critiques in his excellent weekly digest of the news at the Nieman Journalism Lab. The summary ranges from assessments of the quality of 538’s stories to criticism of Nate Silver’s detachment, or even of “data journalism” itself, and to questions about the notion of journalists venturing into empirical projects at all. If you want more context, start there.

Second, it’s great to see the topic of data journalism getting its moment in the sun, even if some of the reactions to Silver’s effort may mistake the man or his vision for the whole practice. Part of the backlash has something to do with high expectations for Silver’s effort. FiveThirtyEight is a new, experimental media venture in which a smart guy has been empowered to try to build something that can find signal in the noise (so to speak) for readers. I’m more than willing to give the site and its founder more time to find its feet.

Third, while FiveThirtyEight is new, as are various other startups or ventures within media companies, data journalism and its practice are not, and neither are critiques of its practices or of programming in journalism generally. There are powerful new digital tools and platforms. If we broaden the debate to include screeds asserting that journalists don’t have to know how to code, it’s much easier to find a backlash, along with apt responses about the importance of courses in journalism school or digital literacy, grounded in the importance of looking ahead to the future of digital media, not its ink-stained past.

Fourth, a critical backlash against computers, coding and databases in the media isn’t new. As readers of this blog certainly know, data journalism’s historic antecedent, computer-assisted reporting, has long since been recognized as an important journalistic discipline, as my colleague Susan McGregor highlighted last year in Columbia Journalism Review.

Critics have been assessing the credibility of CAR for years. If you take a longer view, database-driven journalism has been with us since journalists first started using mainframes, arriving in most newsrooms in a broad sense over two decades ago.

The idea of “computer-assisted reporting” now feels dated, though, inherited from a time when computers were still a novelty in newsrooms. There’s probably not a single reporter or editor working in a newsroom in the United States or Europe today who isn’t using a computer in the course of journalism.

Many members of the media may use several of them over the course of the day, from the powerful handheld computers we call smartphones, to laptops and desktops crunching away at analysis or transformations, to servers and cloud storage for processing big data at Internet scale.

After investigating the subject for many months, I think it’s fair to say that powerful new tools and increased sophistication differentiate the CAR of decades ago from the way data journalism is being practiced today.

While I’ve loosely defined data journalism as “gathering, cleaning, organizing, analyzing, visualizing and publishing data to support the creation of acts of journalism,” a more succinct definition might be the “application of data science to journalism.”

Other observers might suggest that data journalism involves applying the scientific method or social science and statistical analysis to journalism. Philip Meyer called the latter “precision journalism” in the 1970s.

2014 was the year that I saw the worm really turn on the use of the term “data journalism,” from its adoption by David Kaplan, a pillar of the investigative journalism community, to its use as self-identification by dozens of attendees at the annual conference of the National Institute for Computer-Assisted Reporting (NICAR), where nearly a thousand journalists from 20 countries gathered in Baltimore to teach, learn and connect. Its younger attendees use titles like “data editor,” “data reporter” or “database reporter.”

The NICAR conference has grown by leaps and bounds since its first iteration, two decades ago, tripling in size in just the past four years. That rapid expansion is happening for good reason: the strong, clear market demand for data journalists at both born-digital and traditional media outlets that I mentioned earlier.

The size of NICAR 2014 may have given some long-time observers pause, in terms of the effect upon the vibrant community that has grown around it for years or the focus on tools.

“I’m a little worried that NICAR has gotten too big, like SXSW, and that it will lose its soul,” said Matt Waite, a professor of practice at the College of Journalism and Communications at the University of Nebraska, in an interview. “I don’t think it’s likely.”

Fifth, there is something important happening around the emergence of data journalism. I think that the packed hallways and sessions at NICAR accurately reflect what’s happening in the industry.

“Five years ago, this kind of thing was still seen in a lot of places at best as a curiosity, and at worst as something threatening or frivolous,” said Chase Davis, assistant editor for interactive news at the New York Times, in an interview.

“Some newsrooms got it, but most data journalists I knew still had to beg, borrow and steal for simple things like access to servers. Solid programming practices were unheard of. Version control? What’s that? If newsroom developers today saw Matt Waite’s code when he first launched PolitiFact, their faces would melt like Raiders of the Lost Ark.

“Now, our team at the Times runs dozens of servers. Being able to code is table stakes. Reporters are talking about machine frickin’ learning, and newsroom devs are inventing pieces of software that power huge chunks of the web.”

What’s happening today does have some genuinely interesting novelty to it, from the use of Amazon’s cloud to the maturation of various open source tools that have been funded by the Knight Foundation, like the Overview Project, DocumentCloud and the PANDA Project, and of free or open source tools like Google Spreadsheets, Fusion Tables and OpenRefine.

These are still relatively new and powerful tools, which will both justify excitement about their applications and prompt understandable skepticism about what difference they will make if a majority of practicing journalists aren’t quite ready to use them yet.

One broader challenge that the adoption of “data journalism” has created in mainstream discourse is that it may then be divorced from the long history that has come before, as Los Angeles Times data editor Ben Welsh reminded this year’s NICAR conference in a brilliant lightning talk.

Whatever we call it, if you look around the globe, the growing importance of data journalism is now clear, given the explosion in data creation. Data and journalism have become deeply intertwined, with increased prominence.

To make sense of the data deluge, journalists today need to be more numerate, technically literate and logical. They need to be able to add context, fact-check sources, and weave in narrative, interrogating data just as a reporter would skeptically interview human sources for hidden influences and biases.

If you read Anthony DeBarros’ post on CAR and data journalism in 2010, you’d be connected to the past, but it’s fair to guess that most people who read Nate Silver’s magnum opus on FiveThirtyEight’s approach to data journalism had not. In 3500 words or so, Silver didn’t link to DeBarros, Philip Meyer, or a single organization that’s been practicing, researching or expanding data journalism in the past decade, perhaps the most fertile time for the practice in history.

Journalists have been gathering data and analyzing it for many decades, integrating it into their stories and broadcasts in tables, charts and graphics, like a box score that compares the on-base percentage for baseball players at a given position over time. Data is a critical component to substantiating various aspects of a story, as it’s woven into the way that the story was investigated and reported.

There have been reporters going to libraries, agencies, city halls and courts to find public records about nursing homes, taxes, and campaign finance spending for decades. The difference today is that in addition to digging through dusty file cabinets in court basements, they might be scraping a website, or pulling data from an API that New York Times news developer Derek Willis made, because he’s the sort of person who doesn’t want to have to repeat a process every time and will make data available to all, where possible.

Number-crunching enables Pulitzer Prize-winning stories like the one on speeding cops in Florida that Welsh referenced in his NICAR talk, or The Los Angeles Times’ analysis of ambulance response times. That investigation showed the public and the state something important: the quality of the data used to analyze performance was poor because the fire stations weren’t logging it well.

The current criticism of data journalism is a tiny subset of a broader backlash against the hype around “big data,” a term that has grown in use in recent years, adopted all the way up to President Obama in the White House. Professional pundits and critics will always jump on opportunities to puncture hype. (They have families and data plans to support too, after all.)

I may even have inadvertently participated in creating hype around “data journalism” myself over the years, although I maintain that my interest and coverage have always been grounded in my belief that its importance has grown because of bigger macro trends in society. The number of sensors and mobile devices that are going to come online in the next couple of years is going to exponentially expand the amount of data available to interrogate. As predictive policing or “personalized redlining” become real, intrusive forces in the lives of Americans, data journalism will become a crucial democratic bulwark against the increased power of algorithms in society.

That puts a huge premium upon the media having the capacity to do this kind of work, and upon editors hiring people who can do it. They should: data journalism is creating both scoops and audiences. It’s also a fine reason to highlight that demand and to celebrate the role that NICAR and data journalism MOOCs have in training an expanding tribe, along with the willingness of the people who have gone before to help others who want to learn.

I expect to see more mainstream pushback regarding data journalism from members of the media who are highly proficient at interviewing, writing and editing, but perhaps less so with other skills that are now part of the reporter’s modern toolkit, like video, social media, Web development or mobile reporting. Professional pundits who don’t ground their assertions in history or science may not fare quite as well in this world. Researchers who blog, by contrast, will. As more sources for expert, data-driven analysis of law, science, medicine or technology go direct online, opinion journalists without deep subject matter expertise are going to have to recalibrate.

It’s possible that there could also be a (much smaller) backlash from long-time practitioners who observe too much of a focus on the tools at NICAR.

“I’m concerned that it’s become too focused on data, and not enough on journalism,” said Waite. “There used to be much more on stories, with a focus on beats. People would talk about how they reported out stories, not technology. The number of panels about algorithm design are growing, and the number of story panels are shrinking. They’re not as well attended. That’s a reflection of the wishes of the attendees, but it troubles me.”

There may also be people who push back against the meaning of “data journalist” being diluted, though I doubt we’ll see much of it. People at the top of the profession have serious technical chops, which enable them to do much more than download a .csv file and make it into an infographic. These folks are proficient in Python, R and other programming languages, able to pursue scraping, cleaning and interrogation of huge data sets with complicated statistical analyses. At the edges of that gradient, there is computational journalism, although that is a specialty that doesn’t seem to exist outside of the academy.

Every one of the data journalists I’ve met over the years, however, cared a lot more about good code, clean data and beautiful design than the semantics of what to call them, or defending their professional turf.

Of the 997 NICAR attendees, how many were students, investigative reporters and editors who had shown up for the first time to learn these skills? If you told me a majority, I wouldn’t be surprised.

My sense was that in 2014, the unprecedented number of people who came had internalized the message that data journalism was important and that they needed to know how to do some of these things, or at least know what they are. They want to know what forking code on GitHub means, or at least what GitHub is and how people use it.

I don’t mean to knock the digital literacy of the NICAR attendees, as my sense was that it is higher than at any other gathering of journalists in the world, but it’s easy for people to forget that there’s a significant portion of the public for whom these concepts are novel.

I think that’s true of the new media industry too, in which digital literacy and numeracy are perhaps not what they could be. There’s now more pressure on people in the industry to learn more, and for those who want to enter it to have more basic data skills. That’s driven some changes in the NICAR program.

“The temptation is that NICAR will become all about code-sharing,” said Waite. “That would lose the value-add, which is how the code relates to journalism. What’s different, versus programming or Web development?”

This reflects a common dividing line I’ve seen between people in the business world: the “suits” versus hoodies, jeans versus khakis, or MBAs versus developers. Today, the world of the “news hacker” is being democratized (a good thing), so there’s always going to be a little bit of discomfort around something that has stretched from being a smaller, self-identifying tribe into something bigger.

I expect that the backlash within the NICAR community to its expanded ranks and role in the industry will be minimal, leaving people room to work, collaborate, learn and teach. We’d be better off focused on the journalism itself, from storytelling to rigorous fact checking, and a bit less focused upon the tools, however new and shiny some may be.

“I’m not overly pessimistic about NICAR — quite the opposite,” said Waite, “but this focus on the data part of data journalism and less on the journalism part of data journalism is a nagging worry in the back of my head.”

That’s not to say that the technology isn’t worth considering or covering, as I have for years. We have huge amounts of data going online today, more than we ever had before, and media have access to much more powerful personal machines and cloud computing to process it.

Even with the new tech, they’re still doing something old: practicing journalism! The approach may start to look a bit more scientific, over time. An editor might float an assertion or hypothesis about something new in the world, and then assign an investigative journalist to go find out whether it’s true or not. To do that, you need to go find data, evidence and knowledge about it. To prove to skeptical readers that the conclusion is sound, the data journalist may need to show his or her work, from the data sources to the process used to transform and present them.

It now feels cliched to say it in 2014, but in this context transparency may be the new objectivity. The latter concept is not one that has much traction in the scientific community, where observer effects and experimenter bias are well-known phenomena. Studies and results that can’t be reproduced are regarded with skepticism for a reason.

Such thinking about the scientific method and journalism isn’t new, nor is its practice by journalists around the country who have pioneered the craft of data journalism with much less fanfare than FiveThirtyEight.

“As we all know, there’s a lot of data out there,” said Ben Welsh, editor of the Los Angeles Times Data Desk, “and, as anyone who works with it knows, most of it is crap. The projects I’m most proud of have taken large, ugly datasets and refined them into something worth knowing: a nut graf in an investigative story or a data-driven app that gives the reader some new insight into the world around them.”

The graphic atop this post comes from that Data Desk. While you can’t see the work that created the image here, it’s online if you want to look for it: The Los Angeles Times released both the code and data behind the open source maps of California’s emergency medical agencies it published in the series.

Moreover, it wasn’t the first time. As Welsh wrote, the Data Desk has “previously written about the technical methods used to conduct [the] investigation, released the base layer created for an interactive map of response times and contributed the location of LAFD’s 106 fire station to the Open Street Map.”

This is what an open source newsroom that practices open data journalism looks like. It’s not just applying statistics and social science to polls and publishing data visualizations. If FiveThirtyEight, Vox, The New York Times’ Upshot or other outlets want to publish data journalism and build out the newsroom stack, that’s the high bar that’s been set. (Update: I was heartened to learn that FiveThirtyEight has a GitHub account.) In sharing not only its code but its data, the Los Angeles Times also set a notable example for the practice of open journalism in the 21st century.

I don’t know about you, but I think that’s a much more compelling vision for what data journalism is and how it has been, is being and could be applied in the 21st century than the fox’s tale.

Postscript: Good news: 538 is both listening and acting.