This is the first of a series of essays by Jonathan Stray to help data journalists see the processes they must master. Anybody can read a graph — that’s the point of publishing it — but journalists who are committed to producing great work need to look deeper. Stray’s essays will guide journalists towards answering four crucial questions: Where did the data come from? How was it analyzed? What is a reader going to believe when they see the story, and what can they do with that knowledge? Although data has just recently exploded into every corner of society, data journalism draws from ideas and traditions that go back hundreds of years.
This is a graph of the U.S. unemployment rate over the last ten years. There is a whole world just beneath the surface of this image, an intricate web of people and ideas and actions.
It’s clear that a lot of people lost their jobs after the 2008 financial crash. You can read this chart and say how many: the unemployment rate went up by about five percentage points, which works out to roughly eight million people out of work. This is a very ordinary, very reasonable way of talking about this data, exactly the sort of thing that should pop into your head when you see this image. The data journalist needs to look deeper.
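Reading a headcount off a rate chart is simple arithmetic, but it depends on a number the chart itself does not show: the size of the labor force. A minimal sketch, using round illustrative figures rather than official statistics:

```python
# Converting a change in the unemployment rate into a headcount.
# The figures below are round, illustrative numbers, not official data.

labor_force = 154_000_000   # rough U.S. civilian labor force, circa 2009
rate_before = 0.05          # ~5% unemployment before the crash
rate_after = 0.10           # ~10% near the peak

# A change in a rate becomes a number of people only when
# multiplied by the population it is a rate *of*.
newly_unemployed = (rate_after - rate_before) * labor_force
print(f"{newly_unemployed / 1e6:.1f} million more people out of work")
```

The same five-point rise would mean a very different headcount in a country with a labor force of 10 million, which is why rates and counts should never be used interchangeably.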
What is this little squiggly line, where did it come from, and why do we think it’s so important? This trace is not the detached, professional abstraction it appears to be. There was much craft in its making; the crash was real enough but the graph is an invention. Yet it represents something very real, if you’re someone looking for work. Graphs like this can tell us what is happening, as a society, and suggest what to do. Journalists use charts like this all the time to understand and illustrate stories.
How does this work? Why do we invest this collection of numbers with such authority, and should we?
The journalist who works with data has to answer these questions. The broadest answers come from an intellectual tradition that predates the written word, beginning with the concept of counting. Yet each story demands specific answers, and you can’t do good data journalism without answering: why this chart and this data? How is an unemployment chart any better, or different, than walking into the street and asking people about their post-crash lives? Is data really any use at all, here?
This simple chart is not as simple as it seems, when you look closely.
My goal is to help you do good data work. But you can’t prove that a piece of data journalism is correct in the same way that you can prove a mathematical theorem is correct. Journalism starts and ends in the world of human experience, and math is just one part in the middle. Still it seems that some journalism uses data better than others, and there are many things you could do with data that are widely recognized as mistaken or deceptive. Within these constraints there is endless space for creation.
The principles of data work go back to the dawn of civilization: the very first writing systems were used for accounting, long before they were sophisticated enough to express language.[i] At that time the rules of addition must have seemed incredibly arcane (in base 60, at first!) and it must have been a powerful trick to be able to tell in advance how many stones you would need for a building. There is no doubt that numbers, like words, are a type of practical magic. But you already know how to count. I want to talk about some of the main ideas that were developed during the Enlightenment, then massively refined and expanded in the 20th century, with modern statistics and computers.
And so I’ve been collecting pieces, trying to understand what I can take from other fields, hoping to use data thoughtfully and effectively in my journalism work. I suspect that what I have left to learn is a lot more than what I can say now. But I’ve come to appreciate certain ideas, cherished principles from other traditions.
I’ve tried to organize the things that can be said about using data in journalism into four parts: quantification, analysis, communication, and action. These are roughly the stages of any data-based story. I don’t think anyone will be surprised to hear that data journalism includes analysis and communication. But I don’t find that nearly enough. A data story necessarily stretches back through time to the creation of the data, and hopefully it also stretches forward to the point where someone finds it helpful in deciding what to do.
Data journalism begins with quantification, and questions about quantification. Data is not something that exists in nature. Unemployed people are a very different thing than unemployment data! What is the process that turns one into the other? To put it another way: what is counted and how?
Who is unemployed? There are at least six different ways that the U.S. government counts, which give rise to data sets labeled U1 to U6.[ii] The official unemployment rate (it is, in fact, designated the “official” rate) is known as U3. But U3 does not count people who have given up looking for a job, as U4 does, or people who hold part-time jobs because they can’t get full-time jobs, as U6 does.
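The difference between these measures is a difference in who goes into the numerator and the denominator. A sketch of how U3 and U6 diverge, using invented round counts (in millions) chosen only to show the shape of the formulas:

```python
# How the official (U3) and broadest (U6) unemployment measures differ.
# All counts are made-up round numbers, in millions, for illustration only.

labor_force = 155.0     # employed plus officially unemployed
unemployed = 9.0        # jobless and actively searched in the last 4 weeks
marginal = 2.0          # want work, searched in the past year but not recently
part_time_econ = 6.0    # part-time only because full-time work is unavailable

# U3 counts only the officially unemployed.
u3 = unemployed / labor_force

# U6 adds the marginally attached and the involuntarily part-time,
# and widens the denominator to include the marginally attached.
u6 = (unemployed + marginal + part_time_econ) / (labor_force + marginal)

print(f"U3: {u3:.1%}")
print(f"U6: {u6:.1%}")
```

With these invented counts, U6 comes out nearly twice U3. Which rate a story quotes is an editorial choice about who counts as unemployed.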
And this says nothing of how these statistics are actually tabulated. No one asks every single American about their employment status every single month. The official numbers are not “raw” counts but are derived from other data through a vast and sophisticated ongoing estimation process. Because they are estimates, unemployment figures carry statistical error, far more than is generally realized, which makes most stories about short-term increases or decreases meaningless.[iii]
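The scale of that error can be sketched with a back-of-envelope calculation: the margin of error for a rate estimated from a sample. The sample size below is a stand-in for a survey on the order of the Current Population Survey, and the real estimation procedure is far more sophisticated than this simple formula.

```python
import math

# Back-of-envelope sampling error for a rate estimated from a survey.
# The standard error of a sample proportion p with sample size n is
# sqrt(p * (1 - p) / n); the real survey methodology is more complex.

p = 0.058          # estimated unemployment rate
n = 60_000         # sampled units (illustrative, not the actual design)

se = math.sqrt(p * (1 - p) / n)
moe = 1.96 * se    # half-width of a ~95% confidence interval

print(f"estimate {p:.1%} ± {moe:.2%}")
```

Even under these optimistic assumptions the uncertainty is on the order of two tenths of a percentage point, which is why a month-to-month move of a tenth of a point is indistinguishable from noise.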
There is some complex relationship between the idea conveyed by the words “unemployment rate” and the process that produces a particular set of numbers.
Normally all of this is backstage, hidden behind the chart. This is so for all data. Data is created; it is a record, a document, an artifact, dripping with meaning and circumstance. Something specific happened in the creation of every item of data, every digit and bit of stored information. A machine recorded a number at some point on some medium, or a particular human on a particular day made a judgment that some aspect of the world was this and not that, and marked a 0 or a 1. Even before that, someone had to decide that some sort of information was worth recording, had to conceive of the categories and meanings and ways of measurement, and had to set up the whole apparatus of data production.
Data production is an astonishing process involving humans, machines, ideas, and reality. It is social, physical, and particular. I’m going to call this whole process “quantification,” a word which I’ll use to include everything from the conception of quantities all the way through to their concrete measurement and recording.
If quantification turns the world into data, analysis turns data into knowledge. Here is where data journalism comes closest to reproducible science, and leans heavily on math, statistics and logic. There are rules here, and we want those rules: it is hard to forgive arithmetic errors or a reporter’s confused causality. Journalists have a duty to get this sort of thing right, so data journalism demands deep and specific technical knowledge.
Suppose you want to know whether the unemployment rate is affected by, say, tax policy. You might compare the unemployment rates of countries with different tax rates. The logic is appealing, but a simple comparison is flawed: a great many things can and do affect the unemployment rate, so it’s difficult to isolate the effect of taxes alone. Even so, there are statistical techniques that can help you estimate what the unemployment rate would have been if all factors other than tax policy were the same between countries. We’re now talking about imaginary worlds, derived from the real one through force of logic. That’s a tricky thing, not always possible, and not always defensible even when formally possible. Fortunately we have hundreds of years of guidance to help us.
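One of the simplest such techniques is regression adjustment: include the confounding factor in the model and read off the tax effect with that factor held constant. A toy sketch on entirely synthetic data, constructed so that a confounding shock, not taxes, drives unemployment:

```python
import numpy as np

# A toy sketch of "holding other factors constant." The data is synthetic:
# by construction, a shock correlated with tax rates drives unemployment,
# while taxes themselves have no real effect.

rng = np.random.default_rng(0)
n = 200
tax = rng.uniform(0.2, 0.5, n)                     # tax rate by country
shock = 5 * (tax - 0.35) + rng.normal(0, 0.2, n)   # confounder, correlated with taxes
unemployment = 0.06 + 0.03 * shock + rng.normal(0, 0.005, n)

# Naive: regress unemployment on taxes alone; the slope absorbs the shock.
naive_slope = np.polyfit(tax, unemployment, 1)[0]

# Adjusted: include the confounder, so the tax coefficient is estimated
# with the shock held constant.
X = np.column_stack([np.ones(n), tax, shock])
coef, *_ = np.linalg.lstsq(X, unemployment, rcond=None)

print(f"naive tax effect:    {naive_slope:+.3f}")
print(f"adjusted tax effect: {coef[1]:+.3f}")
```

The naive slope comes out large and positive while the adjusted coefficient sits near zero, which is the whole point: the simple comparison attributes the shock’s effect to taxes. Of course, this only works if you know which confounders to include, which is exactly where the judgment lies.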
Journalists are not economists, of course. They’re not really specialists of any kind, if journalism is all they have studied and practiced. We already have economists, epidemiologists, criminologists, climatologists, on and on. But a data journalist needs to understand the methods of any field they touch, or they will be unable to tell good work from bad. They won’t know which analyses are worth repeating. Even worse, they will not understand which data matters, or why. And increasingly journalists are attempting their own analyses, when they discover that the knowledge they want does not yet exist. There is no avoiding quantitative methods.
Many people have some sort of negative reaction to the idea of studying statistics. Perhaps they had bad experiences with math in school. I find this a little sad, though I don’t blame you if you feel this way. Statistics in particular is often taught badly, using an outdated curriculum that is neither sensitive to the needs of the non-specialist nor particularly attuned to the wide availability of computing.[iv] We can do better. This isn’t a statistics course, but I’ll try to point out the specific ideas that are most relevant to data work in journalism. And they are such beautiful ideas!
The best way to learn quantitative methods is to get your hands dirty taking the machines apart. To do good data journalism work, or even to recognize good data journalism work, you need the grime of statistical method under your fingernails. That only comes from practice, but I can point to a few fundamentals, big ideas like distributions, models, causation and prediction. All of this knowledge is standard stuff, part of our shared heritage, but it can be remarkably difficult to find a description of how it all fits together.
The result of all of this work is something presented to the world, an act of communication. This is required of journalism. It’s one of the things that makes journalism different from research or scholarship or intelligence or science, or any field that produces knowledge but doesn’t feel the compulsion to shout it from the rooftops.
Communication always depends on the audience. The journalist doesn’t publish their story into a vacuum, but into human minds and human societies. A story includes an unemployment chart because it is a better way of communicating changes in the unemployment rate than a table of numbers. And that is true because human eyes and brains process visual information in a certain way. Your visual system is attuned to the orientation of lines, which allows you to perceive trends without conscious effort. What a marvelous inborn ability!
Communication starts with the senses and moves ever deeper into consciousness. We know quite a lot about how minds work with data. Raw numbers are difficult to interpret without comparisons, which leads to all sorts of normalization formulas. Variation tends to get collapsed into stereotypes, and uncertainty tends to be ignored as we look for patterns and simplifications. Risk is personal and subjective, but there are sensible ways to compare and communicate odds.
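The need for comparison is why raw counts so often mislead: a count becomes interpretable only against a base. A small sketch with invented figures, where the raw counts and the normalized rates point in opposite directions:

```python
# Raw counts mislead without a base for comparison; normalizing to a
# rate makes places of different sizes comparable. All figures invented.

regions = {
    # name: (unemployed, labor_force)
    "Big City":   (120_000, 2_000_000),
    "Small Town": (4_000, 40_000),
}

for name, (unemployed, labor_force) in regions.items():
    rate = unemployed / labor_force
    print(f"{name}: {unemployed:,} unemployed ({rate:.0%})")
```

By raw count the big city looks far worse off; by rate the small town does. Which framing a story leads with shapes what the reader takes away.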
But more than these technical concerns is the question of what is being said about whom. Journalism is supposed to reflect society back to itself for the benefit of us all, but who is the “we” in the data? Certain people are excluded from any count, and the astonishing variation of life is necessarily abstracted away into a useful fiction of uniformity. A vast social media data set seems like it ought to tell us deep truths about society, but it cannot say anything about the people who do not post, or the things they don’t post about. It does not speak for all. The unemployment rate reduces each voice to a single interchangeable bit: are you looking for work, yes/no?
An act of data journalism is a representation of reality that relies on stereotypes to fill in the lives behind the numbers. By stereotypes I mean our everyday understanding of people we’ve never met and situations we’ve always been lucky enough to avoid. Regardless of whether our image of “unemployed person” is positive or negative, we have to draw on this image to bring meaning to the idea of an unemployment rate. What the audience understands when they look at the data depends on what they already believe. Data can demolish or reinforce stereotypes. So it is not enough for data to be presented “accurately.” We have to ask what the recipient will end up believing about the world, and about the people represented by the data. Often, data is best communicated by connecting it to specific human stories that bring life and depth to the numbers.
We’re not quite done. I adore curiosity, and learning for the pleasure of knowing. But that’s not enough for journalism, which is supposed to inform an active democracy. Knowing the unemployment rate is interesting. Much better is knowing that a specific plan would plausibly create jobs. This is the type of knowledge that allows us to shape our future.
What good is journalism that never touches action? Action is not only essential, it is a powerfully clarifying perspective. Asking what someone could want to do is a question that will ripple through all the stages of your work, if you let it.
Data cannot tell us what to do, but it can sometimes tell us about consequences. The 20th century saw great advances in our understanding of causality and prediction. Prediction is the queen of knowledge; it is knowledge of the future. Prediction can give us instrumental knowledge: the knowledge of how to bring the world from the way it is now to the way we want it to be.
But prediction is very hard. Most things can’t be predicted well, for fundamental reasons such as lack of data, intrinsic randomness, free will, or chaos theory. There are profound limits to what we can know about the future. Yet where prediction is possible, there is convincing evidence that data is essential. Purely qualitative methods, no matter how sophisticated, just don’t seem to be as accurate. (The best methods are mixed.) Data is an irreplaceable resource for journalism that asks what will happen, what should be done, or how best to do it.
Predictions hide everywhere in data work. A claim to generalization is also a claim to prediction, and prediction is also one of the very best ways we have of validating our knowledge. This is the logic of testing and “falsification” in the scientific method. There might be many stories that match the data we have now, but only true stories — stories that match the world — can match the data that is yet to exist.
But don’t believe for a second that all we need to do is run the equations forward and read off what to do. We’ve seen that broken dream before. At an individual level, the ancient desire for universal quantification can be a harmless fantasy, even an inspiration for the creation of new and powerful abstractions. At a societal level, utopian technocratic visions have been uniformly disastrous. A fully quantified social order is an insult to freedom, and there are good reasons to suspect that such a system can never really work.[v] Questions of action can hone and refine our data journalism, but actual action — making a choice and doing — requires practical knowledge, wisdom, and creativity. The use of data in journalism, like the use of data in society, will always involve artistry.
Quantification produces data from the world. Analysis finds a story in the data. Communication is where the story leaps to the mind of the audience. The audience acts on the world.
All of this is implicit in every use of data in journalism. All of it is just below the surface of an unemployment chart in the news, to say nothing of the complex visualizations that journalists now create routinely. Data journalism depends on what we have decided to count, the techniques used to interpret those counts, how we have decided to show the results, and what happens after we do. And then the world changes, and we report again. The data journalist sees this rich web of people, ideas, and action behind every number.
[i] Denise Schmandt-Besserat, “Tokens and Writing: The Cognitive Development,” SCRIPTA 1 (2009): 145–154. http://sites.utexas.edu/dsb/files/2014/01/TokensWriting_the_Cognitive_Development.pdf
[ii] Bureau of Labor Statistics, “Table A-15. Alternative Measures of Labor Underutilization.” http://www.bls.gov/news.release/empsit.t15.htm
[iii] For a nice visualization of how the error in unemployment rates can lead to incorrect interpretations, see “How Not to Be Misled by the Jobs Report,” New York Times, May 1, 2014. http://www.nytimes.com/2014/05/02/upshot/how-not-to-be-misled-by-the-jobs-report.html
[iv] George Cobb, “The Introductory Statistics Course: A Ptolemaic Curriculum.” http://escholarship.org/uc/item/6hb3k0nz