The Curious Journalist’s Guide to Data
On March 24, the Tow Center launched “The Curious Journalist’s Guide to Data” – a research project led by Tow Fellow Jonathan Stray. The launch event, held at Columbia Journalism School, featured a presentation of the report followed by a panel discussion with Meredith Broussard, Assistant Professor at the Arthur L. Carter Journalism Institute of New York University; Mark Hansen, Director of the David and Helen Gurley Brown Institute for Media Innovation; and Scott Klein, Assistant Managing Editor at ProPublica.
This is a book about the principles behind data journalism. Not what visualization software to use and how to scrape a website, but the fundamental ideas that underlie the human use of data. This isn’t “how to use data” but “how data works.”
This gets into some of the mathy parts of statistics, but also the difficulty of taking a census of race and the cognitive psychology of probabilities. It traces where data comes from, what journalists do with it, and where it goes after—and tries to understand the possibilities and limitations. Data journalism is as interdisciplinary as it gets, which can make it difficult to assemble all the pieces you need. This is one attempt. This is a technical book, and uses standard technical language, but all mathematical concepts are explained through pictures and examples rather than formulas.
The life of data has three parts: quantification, analysis, and communication. Quantification is the process that creates data. Analysis involves rearranging the data or combining it with other information to produce new knowledge. And none of this is useful without communicating the result.
Quantification is a problem without a home. Although physicists study measurement extensively, physical theory doesn’t say much about how to quantify things like “educational attainment” or even “unemployment.” There are deep philosophical issues here, but the most useful question to a journalist is simply, how was this data created? Data is useful because it represents the world, but we can only understand data if we correctly understand how it came to be. Representation through data is never perfect: all data has error. Randomly sampled surveys are both a powerful quantification technique and the prototype for all measurement error, so this report explains where the margin of error comes from and what it means – from first principles, using pictures.
All data analysis is really data interpretation, which requires much more than math. Data needs context to mean anything at all: Imagine if someone gave you a spreadsheet with no column names. Each data set could be the source of many different stories, and there is no objective theory that tells us which true stories are the best. But the stories still have to be true, which is where data journalism relies on established statistical principles. The theory of statistics solves several problems: accounting for the possibility that the pattern you see in the data was purely a fluke, reasoning from incomplete and conflicting information, and attempting to isolate causes. Stats has been taught as something mysterious, but it’s not. The analysis chapter centers on a single problem – asking if an earlier bar closing time really did reduce assaults in a downtown neighborhood – and traces through the entire process of analysis by explaining the statistical principles invoked at each step, building up to the state-of-the-art methods of Bayesian inference and causal graphs.
A story isn’t isn’t finished until you’ve communicated your results. Data visualization works because it relies on the biology of human visual perception, just as all data communication relies on human cognitive processing. People tend to overestimate small risks and underestimate large risks; examples leave a much stronger impression than statistics; and data about some will, unconsciously, come to represent all, no matter how well you warn that your sample doesn’t generalize. If you’re not aware of these issues you can leave people with skewed impressions or reinforce harmful stereotypes. The journalist isn’t only responsible for what they put in the story, but what ends up in the mind of the audience.
This report brings together many fields to explore where data comes from, how to analyze it, and how to communicate your results. It uses examples from journalism to explain everything from Bayesian statistics to the neurobiology of data visualization, all in plain language with lots of illustrations. Some of these ideas are thousands of years old, some were developed only a decade ago, and all of them have come together to create the 21st century practice of data journalism.