
The Data Journalist’s Eye, An Introduction


This is the first of a series of essays by Jonathan Stray to help data journalists see the processes they must master. Anybody can read a graph — that’s the point of publishing it — but journalists who are committed to producing great work need to look deeper. Stray’s essays will guide journalists towards answering four crucial questions: Where did the data come from? How was it analyzed? What is a reader going to believe when they see the story, and what can they do with that knowledge? Although data has just recently exploded into every corner of society, data journalism draws from ideas and traditions that go back hundreds of years. 

This is a graph of the U.S. unemployment rate over the last ten years. There is a whole world just beneath the surface of this image, an intricate web of people and ideas and actions.

The US unemployment rate from 2004 to 2014

It’s clear that a lot of people lost their jobs after the 2008 financial crash. You can read this chart and say how many: the unemployment rate went up by five percentage points, which means 15 million people. This is a very ordinary, very reasonable way of talking about this data, exactly the sort of thing that should pop into your head when you see this image. But the data journalist needs to look deeper.

 

What is this little squiggly line, where did it come from, and why do we think it’s so important? This trace is not the detached, professional abstraction it appears to be. There was much craft in its making; the crash was real enough but the graph is an invention. Yet it represents something very real, if you’re someone looking for work. Graphs like this can tell us what is happening, as a society, and suggest what to do. Journalists use charts like this all the time to understand and illustrate stories.

 

How does this work? Why do we invest this collection of numbers with such authority, and should we?

The journalist who works with data has to answer these questions. The broadest answers come from an intellectual tradition that predates the written word, beginning with the concept of counting. Yet each story demands specific answers, and you can’t do good data journalism without answering: why this chart and this data? How is an unemployment chart any better, or different, than walking into the street and asking people about their post-crash lives? Is data really any use at all here?

This simple chart is not as simple as it seems, when you look closely.

My goal is to help you do good data work. But you can’t prove that a piece of data journalism is correct in the same way that you can prove a mathematical theorem is correct. Journalism starts and ends in the world of human experience, and math is just one part in the middle. Still it seems that some journalism uses data better than others, and there are many things you could do with data that are widely recognized as mistaken or deceptive. Within these constraints there is endless space for creation.

The principles of data work go back to the dawn of civilization: the very first writing systems were used for accounting, long before they were sophisticated enough to express language.[i] At that time the rules of addition must have seemed incredibly arcane (in base 60, at first!) and it must have been a powerful trick to be able to tell in advance how many stones you would need for a building. There is no doubt that numbers, like words, are a type of practical magic. But you already know how to count. I want to talk about some of the main ideas that were developed during the Enlightenment, then massively refined and expanded in the 20th century, with modern statistics and computers.

And so I’ve been collecting pieces, trying to understand what I can take from other fields, hoping to use data thoughtfully and effectively in my journalism work. I suspect that what I have left to learn is a lot more than what I can say now. But I’ve come to appreciate certain ideas, cherished principles from other traditions.

I’ve tried to organize the things that can be said about using data in journalism into four parts: quantification, analysis, communication, and action. These are roughly the stages of any data-based story. I don’t think anyone will be surprised to hear that data journalism includes analysis and communication. But I don’t find that nearly enough. A data story necessarily stretches back through time to the creation of the data, and hopefully it also stretches forward to the point where someone finds it helpful in deciding what to do.

Data journalism begins with quantification, and questions about quantification. Data is not something that exists in nature. Unemployed people are a very different thing than unemployment data! What is the process that turns one into the other? To put it another way: what is counted and how?

Who is unemployed? There are at least six different ways that the U.S. government counts, which give rise to data sets labeled U1 to U6.[ii] The official unemployment rate (it’s officially called the “official” rate) is known as U3. But U3 does not count people who gave up looking for a job, as U4 does, or people who hold part-time jobs because they can’t get a full-time job, as U6 does.

And this says nothing of how these statistics are actually tabulated. No one goes around asking every single American about their employment status every single month. The official numbers are not “raw” counts but must be derived from other data in a vast and sophisticated ongoing estimation process. Unemployment figures, being estimates, have statistical estimation error, far more than is generally realized. This makes most stories about short-term increases or decreases irrelevant.[iii]
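To make the point concrete, here is a minimal sketch in Python of the check a reporter might run before writing up a one-month change. The ±0.2 percentage-point margin is illustrative, chosen for the example rather than taken from any particular BLS release.

    # Is a one-month change in the unemployment rate bigger than the survey's
    # sampling error? The 0.2 percentage-point margin here is illustrative.
    def change_is_meaningful(rate_prev, rate_now, margin=0.2):
        """Return True only if the observed change exceeds the margin of error."""
        return abs(rate_now - rate_prev) > margin

    print(change_is_meaningful(6.7, 6.6))  # False: within the noise, not a story
    print(change_is_meaningful(6.7, 6.1))  # True: larger than the margin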

There is some complex relationship between the idea conveyed by the words “unemployment rate” and the process that produces a particular set of numbers.

Normally all of this is backstage, hidden behind the chart. This is so for all data. Data is created; it is a record, a document, an artifact, dripping with meaning and circumstance. Something specific happened in the creation of every item of data, every digit and bit of stored information. A machine recorded a number at some point on some medium, or a particular human on a particular day made a judgment that some aspect of the world was this and not that, and marked a 0 or a 1. Even before that, someone had to decide that some sort of information was worth recording, had to conceive of the categories and meanings and ways of measurement, and had to set up the whole apparatus of data production.

Data production is an astonishing process involving humans, machines, ideas, and reality. It is social, physical,  and particular. I’m going to call this whole process “quantification,” a word which I’ll use to include everything from the conception of quantities all the way through to their concrete measurement and recording.

If quantification turns the world into data, analysis turns data into knowledge. Here is where data journalism comes closest to reproducible science, and leans heavily on math, statistics and logic. There are rules here, and we want those rules: it is hard to forgive arithmetic errors or a reporter’s confused causality. Journalists have a duty to get this sort of thing right, so data journalism demands deep and specific technical knowledge.

Suppose you want to know if the unemployment rate is affected by, say, tax policy. You might compare the unemployment rates of countries with different tax rates. The logic is sound in principle, but a simple comparison will mislead. A great many things can and do affect the unemployment rate, so it’s difficult to isolate just the effect of taxes. Even so, there are statistical techniques that can help you estimate what the unemployment rate would have been if all factors other than tax policy were the same between countries. We’re now talking about imaginary worlds, derived from the real through force of logic. That’s a tricky thing, not always possible, and not always defensible even when formally possible. Fortunately we have hundreds of years of guidance to help us.
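To illustrate the idea (and only the idea, not any particular economist’s model), here is a minimal sketch that uses ordinary least squares on invented country figures. The variable names and numbers are made up, and the pandas and statsmodels packages are assumed to be available.

    # Controlling for a confounder with a regression instead of a naive
    # two-country comparison. All figures below are invented for illustration.
    import pandas as pd
    import statsmodels.formula.api as smf

    countries = pd.DataFrame({
        "unemployment": [5.1, 7.4, 9.8, 4.6, 6.9, 8.2],
        "tax_rate":     [30, 45, 40, 25, 35, 50],
        "gdp_growth":   [2.5, 0.8, -1.2, 3.1, 1.5, 0.4],  # the confounder
    })

    # The tax_rate coefficient is estimated holding gdp_growth constant,
    # a crude stand-in for "all other factors being the same."
    model = smf.ols("unemployment ~ tax_rate + gdp_growth", data=countries).fit()
    print(model.params)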

Journalists are not economists, of course. They’re not really specialists of any kind, if journalism is all they have studied and practiced. We already have economists, epidemiologists, criminologists, climatologists, on and on. But a data journalist needs to understand the methods of any field they touch or they will be unable to tell good work from bad. They won’t know which analyses are worth repeating. Even worse, they will not understand which data matters, or why. And increasingly journalists are attempting their own analyses, when they discover that the knowledge they want does not yet exist. There is no avoiding quantitative methods.

Many people have an uneasy reaction to the idea of studying statistics. Perhaps they had bad experiences with math in school. I find this a little sad, though I don’t blame you if you feel this way. Statistics in particular is often taught badly, using an outdated curriculum that is neither sensitive to the needs of the non-specialist nor particularly attuned to the wide availability of computing.[iv] We can do better. This isn’t a statistics course, but I’ll try to point out the specific ideas that are most relevant to data work in journalism. And they are such beautiful ideas!

The best way to learn quantitative methods is to get your hands dirty taking the machines apart. To do good data journalism work, or even to recognize good data journalism work, you need the grime of statistical method under your fingernails. That only comes from practice, but I can point to a few fundamentals, big ideas like distributions, models, causation and prediction. All of this knowledge is standard stuff, part of our shared heritage, but it can be remarkably difficult to find a description of how it all fits together.

The result of all of this work is something presented to the world, an act of communication. This is required of journalism. It’s one of the things that makes journalism different from research or scholarship or intelligence or science, or any field that produces knowledge but doesn’t feel the compulsion to shout it from the rooftops.

Communication always depends on the audience. The journalist doesn’t publish their story into a vacuum, but into human minds and human societies. A story includes an unemployment chart because it is a better way of communicating changes in the unemployment rate than a table of numbers. And that is true because human eyes and brains process visual information in a certain way. Your visual system is attuned to the orientation of lines, which allows you to perceive trends without conscious effort. What a marvelous inborn ability!

Communication starts with the senses and moves ever deeper into consciousness. We know quite a lot about how minds work with data. Raw numbers are difficult to interpret without comparisons, which leads to all sorts of normalization formulas. Variation tends to get collapsed into stereotypes, and uncertainty tends to be ignored as we look for patterns and simplifications. Risk is personal and subjective, but there are sensible ways to compare and communicate odds.
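As a small illustration of why comparisons matter, here is a sketch, with invented figures, of the most common normalization: converting raw counts into a rate per 100,000 people.

    # A raw count means little without a comparison; normalizing to a rate
    # per 100,000 residents makes two places comparable. Figures are invented.
    incidents = {"City A": 1200, "City B": 950}
    population = {"City A": 2_400_000, "City B": 640_000}

    for city in incidents:
        rate = incidents[city] / population[city] * 100_000
        print(f"{city}: {incidents[city]} incidents, {rate:.0f} per 100,000")
    # City A has more incidents, but City B has the far higher rate.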

But more than these technical concerns is the question of what is being said about whom. Journalism is supposed to reflect society back to itself for the benefit of us all, but who is the “we” in the data? Certain people are excluded from any count, and the astonishing variation of life is necessarily abstracted away into a useful fiction of uniformity. A vast social media data set seems like it ought to tell us deep truths about society, but it cannot say anything about the people who do not post, or the things they don’t post about. It does not speak for all. The unemployment rate reduces each voice to a single interchangeable bit: are you looking for work, yes/no?

An act of data journalism is a representation of reality that relies on stereotypes to fill in the lives behind the numbers. By stereotypes I mean our everyday understanding of people we’ve never met and situations we’ve always been lucky enough to avoid. Regardless of whether our image of “unemployed person” is positive or negative, we have to draw on this image to bring meaning to the idea of an unemployment rate. What the audience understands when they look at the data depends on what they already believe. Data can demolish or reinforce stereotypes.  So it is not enough for data to be presented “accurately.” We have to ask what the recipient will end up believing about the world, and about the people represented by the data. Often, data is best communicated by connecting it to specific human stories that bring life and depth to the numbers.

We’re not quite done. I adore curiosity, and learning for the pleasure of knowing. But that’s not enough for journalism, which is supposed to inform an active democracy. Knowing the unemployment rate is interesting. Much better is knowing that a specific plan would plausibly create jobs. This is the type of knowledge that allows us to shape our future.

What good is journalism that never touches action? Action is not only essential, it is a powerfully clarifying perspective. Asking what someone could want to do is a question that will ripple through all the stages of your work, if you let it.

Data cannot tell us what to do, but it can sometimes tell us about consequences. The 20th century saw great advances in our understanding of causality and prediction. Prediction is the queen of knowledge; it is knowledge of the future. Prediction can give us instrumental knowledge: the knowledge of how to bring the world from the way it is now to the way we want it to be.

But prediction is very hard. Most things can’t be predicted well, for fundamental reasons such as lack of data, intrinsic randomness, free will, or chaos theory. There are profound limits to what we can know about the future. Yet where prediction is possible, there is convincing evidence that data is essential. Purely qualitative methods, no matter how sophisticated, just don’t seem to be as accurate. (The best methods are mixed.) Data is an irreplaceable resource for journalism that asks what will happen, what should be done, or how best to do it.

Predictions hide everywhere in data work. A claim to generalization is also a claim to prediction, and prediction is also one of the very best ways we have of validating our knowledge. This is the logic of testing and “falsification” in the scientific method. There might be many stories that match the data we have now, but only true stories — stories that match the world — can match the data that is yet to exist.

But don’t believe for a second that all we need to do is run the equations forward and read off what to do. We’ve seen that broken dream before. At an individual level, the ancient desire for universal quantification can be a harmless fantasy, even an inspiration for the creation of new and powerful abstractions. At a societal level, utopian technocratic visions have been uniformly disastrous. A fully quantified social order is an insult to freedom, and there are good reasons to suspect that such a system can never really work.[v] Questions of action can hone and refine our data journalism, but actual action — making a choice and doing — requires practical knowledge, wisdom, and creativity. The use of data in journalism, like the use of data in society, will always involve artistry.

Quantification produces data from the world. Analysis finds a story in the data. Communication is where the story leaps to the mind of the audience. The audience acts on the world.

The Data Journalism Cycle: Quantification, Analysis, Communication, Action

 

All of this is implicit in every use of data in journalism. All of it is just below the surface of an unemployment chart in the news, to say nothing of the complex visualizations that journalists now create routinely. Data journalism depends on what we have decided to count, the techniques used to interpret those counts, how we have decided to show the results, and what happens after we do. And then the world changes, and we report again. The data journalist sees this rich web of people, ideas, and action behind every number.

Endnotes

[i] Denise Schmandt-Besserat. Tokens and Writing: The Cognitive Development. SCRIPTA 1 (2009): 145-154. http://sites.utexas.edu/dsb/files/2014/01/TokensWriting_the_Cognitive_Development.pdf

[ii] Table A-15. Alternative measures of labor underutilization. http://www.bls.gov/news.release/empsit.t15.htm.

[iii] A nice visualization of how the error in unemployment rates can lead to incorrect interpretations is How Not to Be Misled by the Jobs Report. New York Times, May 1, 2014. http://www.nytimes.com/2014/05/02/upshot/how-not-to-be-misled-by-the-jobs-report.html

[iv] George Cobb. The Introductory Statistics Course: a Ptolemaic Curriculum. http://escholarship.org/uc/item/6hb3k0nz.

[v] see for example James C. Scott. Seeing Like a State. Yale University Press, 1998.


Hyper-compensation: Ted Nelson and the impact of journalism


NewsLynx is a Tow Center research project and platform aimed at better understanding the impact of news. It is conducted by Tow Fellows Brian Abelson, Stijn DeBrouwere & Michael Keller.

“If you want to make an apple pie from scratch, you must first invent the universe.” — Carl Sagan

Before you can begin to measure impact, you need to first know who’s talking about you. While analytics platforms provide referrers, social media sites track reposts, and media monitoring tools follow mentions, these services are often incomplete and come with a price. Why is it that, on the internet — the most interconnected medium in history — tracking linkages between content is so difficult?

The simple answer is that the web wasn’t built to be *fully* connected, per se. It’s an idiosyncratic, labyrinthine garden of forking paths with no way to navigate from one page to pages that reference it.

We’ve spent the last few months thinking about and building an analytics platform called NewsLynx, which aims to help newsrooms better capture the quantitative and qualitative effects of their work. Many of our features are aimed at giving newsrooms a better sense of who is talking about their work. This seemingly simple feature, understanding the links among web pages, has taken up the majority of our time. The obstacle turns out to be a shortcoming in the fundamental architecture of the web. Yet without that shortcoming, the web might never have succeeded.

The creator of the web, Tim Berners-Lee, didn’t provide a means for contextual links in the specification for HTML. The world wide web wasn’t the only idea for networking computers, however. Over 50 years ago, an early figure in computing had a different vision of the web, one that would have made the construction of NewsLynx a lot easier today, if not completely unnecessary.

Around 1960, a man named Ted Nelson came up with an idea for a structure of linking pieces of information in a two-way fashion. Whereas links on the web today just point one way — to the place you want to go — pages on Nelson’s internet would have a “What links here?” capability, so you would know all the websites that point to your page.

And if you were dreaming up the ideal information web, this structure makes complete sense: why not make the most connections possible? As Borges writes, “I thought of a labyrinth of labyrinths, of one sinuous spreading labyrinth that would encompass the past and the future and in some way involve the stars.”
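To make the contrast concrete, here is a minimal sketch of what a two-way link store could look like. It is an illustration of the idea only, not Xanadu’s actual design; the URLs are placeholders.

    # A two-way link store: registering a link also records the backlink,
    # so "what links here?" becomes a cheap lookup.
    from collections import defaultdict

    class LinkIndex:
        def __init__(self):
            self.outbound = defaultdict(set)  # page -> pages it links to
            self.inbound = defaultdict(set)   # page -> pages that link to it

        def add_link(self, source, target):
            self.outbound[source].add(target)
            self.inbound[target].add(source)  # the reverse link today's web lacks

        def what_links_here(self, page):
            return self.inbound[page]

    index = LinkIndex()
    index.add_link("blog.example.com/roundup", "newsroom.example.com/investigation")
    print(index.what_links_here("newsroom.example.com/investigation"))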

Nelson called his project Xanadu, but it had the misfortune of being both extremely ahead of its time and incredibly late to the game. Project Xanadu’s first and somewhat cryptic release debuted this year, over 50 years after it was first conceived.

In the meantime, Berners-Lee put forward HTML, with its one-way links, in the early ’90s, and it took off into what we know today. One of the reasons for the web’s success is its extremely informal, ad hoc functionality: anyone can put up an HTML page without hooking into, or caring about, a more elaborate system. Compared to Xanadu, what we use today is the quick and dirty implementation of a potentially much richer, but also much harder to maintain, ecosystem.

Two-way linking would ease not only impact research but also a number of other problems on the web. In his latest book “Who Owns the Future?”, Jaron Lanier discusses two-way linking as a potential solution to copyright infringement and a host of other web maladies. His logic is that if you could always know who is linking where, then you could create a system of micropayments to make sure authors get proper credit. His idea has its own caveats, but it shows the systems that two-way linking might enable. Chapter Seven of Lanier’s book discusses some of the other reasons Nelson’s idea never took off.

The desire for two-way links has not gone away, however. In fact, the *lack* of two-way links is an interesting lens through which to view the current tech environment. By creating a central server that catalogs and makes sense of the one-way web, Google adds value with its ability to make the internet seem more like Project Xanadu. If two-way links existed, you wouldn’t need all of the features of Google Analytics. People could implement their own search engines with their own page rank algorithms based on publicly available citation information.

The inefficiency of one-way links left a hole at the center of the web for a powerful player to step in and play librarian. As a result, if you want to know how your content lives online, you have to go shopping for analytics. To effectively monitor the life of an article, newsrooms currently use a host of services from trackbacks and Google Alerts to Twitter searches and ad hoc scanning. Short link services break web links even further. Instead of one canonical URL for a page, you can have a bit.ly, t.co, j.mp or thousands of other custom domains.
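One small, practical piece of this problem is collapsing shortened URLs back to a canonical address by following their redirects. The sketch below assumes the Python requests package; the short link shown is a placeholder, not a real URL.

    # Collapse a short link (bit.ly, t.co, j.mp, ...) back to a canonical URL
    # by following its redirects. Some shorteners reject HEAD requests, in
    # which case a GET works the same way.
    import requests

    def canonical_url(url, timeout=10):
        """Follow redirects and return the final URL a short link points to."""
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        return resp.url

    print(canonical_url("https://bit.ly/example-short-link"))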

NewsLynx doesn’t have the power of Google. But we have been working on a core feature that would leverage Google features and other two-way link surfacing techniques to make monitoring the life of an article much easier: we’re calling them “recipes”, for now (#branding suggestions welcome). In NewsLynx, you’ll add these “recipes” to the system and it will alert you to all pending mentions in one filterable display. If a citation is important, you can assign it to an article or to your organization more generally. We also have a few built-in recipes to get you started.
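As a purely illustrative sketch, and not NewsLynx’s actual code or API, a recipe might boil down to something like this: take candidate pages surfaced by alerts, trackbacks, or Twitter searches, and keep the ones that actually reference your article. The URLs are placeholders and the requests package is assumed.

    # Purely illustrative: a "recipe" reduced to its simplest form.
    import requests

    ARTICLE_URL = "https://news.example.com/2014/big-investigation"  # placeholder

    def mentions_article(candidate_url, article_url=ARTICLE_URL):
        """Return True if the candidate page links to or mentions the article."""
        try:
            html = requests.get(candidate_url, timeout=10).text
        except requests.RequestException:
            return False
        return article_url in html

    candidates = ["https://blog.example.org/weekly-links", "https://example.net/reblog"]
    citations = [url for url in candidates if mentions_article(url)]
    print(citations)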

We’re excited to get this tool into the hands of news sites and see how it helps them better understand their place in the world wide web. As we prepare to launch the platform in the next month or so, check back here for any updates.


Think about data from the beginning of the story, says Cheryl Phillips


“Stories can be told in many different ways,” said Cheryl Phillips. “A sidebar that may once have been a 12-inch text piece is now a timeline, or a map.”

Phillips, an award-winning investigative journalist, will begin teaching students how to treat data as a source this fall, when she starts a new gig as a lecturer at Stanford’s graduate school of journalism, helping to open up Stanford’s new Computational Journalism Lab.

“Cheryl Phillips brings an outstanding mix of experience in data journalism and investigative work to our program. Students and faculty here are eager to start working with her to push forward the evolving field of computational journalism,” said Jay Hamilton, Hearst Professor of Communication and Director of the Stanford Journalism Program, in a statement. “Her emphasis on accountability reporting and interest in using data to lower the costs of discovering stories will help our journalism students learn how to uncover stories that currently go untold in public affairs reporting.”


I interviewed Phillips about her career, which has included important  reporting on the nonprofit and philanthropy world, her plans for teaching at Stanford, data journalism, j-schools and teaching digital skills, and the challenges that newsrooms face today and in the future.

What is a day in your life like now?

I’m the data innovation editor at The Seattle Times. Essentially, I work with data for stories and help coordinate data-related efforts, such as working with reporters, graphics folks, and others on news apps and visualizations. I also have looked at some of our systems and processes and suggested new, more time-effective methods for us.

I’ve been at The Seattle Times since 2002. I started as a data-focused reporter on the investigations team, then became deputy investigations editor, then data enterprise editor. I also worked on the metro desk and edited a team of reporters. I currently work in the digital/online department, but really work across all the departments. I also helped train the newsroom when we moved to a new content management system about a year or so ago. I am trying to wrap up a couple of story-related projects, and do some data journalism newsroom training before I start at Stanford in the fall.

How did you get started in data journalism? Did you earn any special degrees or certificates?

I remember taking a class (outside of the journalism department) while in college. The subject purported to be about learning how personal computers worked but, aside from a textbook that showed photos of a personal computer, we really just learned how to write if/then loops on a mainframe.

I got my first taste of data journalism at the Fort Worth Star-Telegram. That’s where I did my first story using any kind of computer for something other than putting words on a screen. I had gotten the ownership agreement for the Texas Rangers, which included a somewhat complex formula. I kept doing the math on my calculator and screwed it up each time. Finally, I called up a friend of mine who was a CPA, and she taught me Lotus 1-2-3.

My real start in computer-assisted reporting came in 1995, when I was on loan to USA TODAY. I was fortunate enough to land in the enterprise department with the data editors, and Phil Meyer was there as a consultant. By the end of five months, I could use spreadsheets, Paradox (for DOS!) and SPSS. What a great education. I followed that up by joining IRE and attending the NICAR conference. I’ve missed very few since then and also done some of NICAR’s specialized training on stats and maps.

I have no special degrees or certificates, but I have taken some online courses in R, Python, etc.

Did you have any mentors? Who? What were the most important resources they shared with you?

Phil Meyer is amazing, and such a great teacher. He taught me statistics, but also taught me about how to think about data. Sarah Cohen and Aron Pilhofer of the New York Times, and Jennifer LaFleur of CIR. Paul Overberg at USA TODAY. They have all helped me over the years.

NICAR is an incredible world, full of data journalists and journalist-programmers who are willing to help others out. It’s a great family.

On the investigative journalism front, Jim Neff and David Boardman are fantastic editors and great at asking vital questions.

What does your personal data journalism “stack” look like? What tools could you not live without?

I’m a firm believer in the power of the spreadsheet. So much of what journalists do on a daily basis can be made easier and more effective by just using a spreadsheet.

I use OpenRefine,  CometDocs, Tabula, AP Overview and Document Cloud. I use MySQL with Navicat. I still use Access. I’m a recent convert to R, but also use SPSS. I use ESRI for mapping, but am interested in exploring other options also. I use Google Fusion Tables as well.

Most of my work has been on the more traditional CAR front, but I’ve been learning Python for scraping projects.

What are the foundational skills that someone needs to practice data journalism?

In many ways, the same foundational skills you need for any kind of journalism.

Curiosity, for one. Journalists need to think about stories from a mindset that includes data from the very beginning, such as when a reporter talks to a source, or a government official. If an official mentions statistics, don’t just ask for a summary report, but ask for the underlying data — and for that same data over time. The editors of those reporters need to do the same thing. Think about the possibilities if you had more information and could analyze and view it in different ways.

Second, be open to learning any skill sets that will help tell the story. I got into data journalism because I discovered stories I would not be able to tell if I didn’t obtain and analyze data. We all know journalists don’t like to take someone’s word for something — data journalism just takes that to the next level.

Third, in terms of technical skills, learn how to use a spreadsheet, at a bare minimum. Really, one tool leads to another. Once you know how a spreadsheet works, you are more open to using OpenRefine to clean and standardize that data, or learning a language for scraping data, or another program that will help with finding connections.

What classes will you be teaching at Stanford, and how?

I will be teaching several courses, including a data journalism class focusing on relational data, basic statistics and mapping. I also will be teaching an investigative reporting class focusing on investigative reporting tools.

In general, I want to make sure the students are telling stories from data that they analyze. They should be not only learning the technical stack, but how to apply the technical knowledge to real-world journalism. I am hoping to create some partnerships with newsrooms as well.

Where do you turn to keep your skills updated or learn new things?

IRE and NICAR and all the folks involved there. I also try to learn from our producers at The Seattle Times, who come in knowing way more than I did when I started in journalism. I try to follow smart people on Twitter and other social media.

I like to reach out to folks about what they are doing. I think reaching out and connecting with folks outside of journalism is a great way to make sure we are aware of other new tools, developments, etc.

What are the biggest challenges that newsrooms face in meeting the demand for people with these skills and experience?

Newsrooms are often still structured into silos, so reporters just report and write. They may hand their data off to a graphics desk, but they don’t necessarily analyze or visualize data themselves. Producers produce, but don’t write, even though they may enjoy that and be good at it, too.

Some of this is by necessity, but it makes it harder to learn new skills — and some of these skills are really useful. A reporter who knows how to visualize data may also be able to look at it in a different way when reporting the story out, too. So, building collaborative teams is important, as is providing time for folks to try out other skills.

Are journalism schools training people properly? What will you do differently?

I think it’s no secret that a lot of change is starting to take place in schools.

Cindy Royal had an interesting piece about platforms just the other day. In general, I think my answer here is similar to the biggest challenge for newsrooms: We need to take a more integrated approach. Classrooms and their teachers should collaborate on work.

So, for example, a multimedia class produces the visualizations and videos that go with the stories being written in another class. (Yes, Stanford already does this.)

Data journalism should not be just one class out of a curriculum, but infused throughout a curriculum. Every type of journalist can learn data-related skills that will help them, whether they end up as a copy editor, a reporter, a front-line editor or a graphics artist.

What data journalism project are you the most proud of working on or creating?

I have been asked this question before and can never answer it well. My last story is always the one I’m most proud of, unless it’s the one I’m about to publish.

That said, as an editor at The Seattle Times, I worked with Jennifer LaFleur (then with ProPublica) on a project tracking the reasons behind foreclosures, a deep dive into the driving factors in several cities.

When I was a reporter, I was lucky enough to get to work with Ken Armstrong on our court secrecy project in 2006, which changed state practice. I also led the reporting effort on problems with airport security. Both of those used small data sets, which we built ourselves, but told important stories.

I can think of even more stories that weren’t data projects per se, but which used data in the reporting in critical ways. The recent Oso mudslide coverage is an example of where we used mapping data and landslide data to effectively tell the story of the impact of the slide on the victims and of how the potentially disastrous consequences had been ignored over time.

What data journalism project created by someone else do you most admire?

Too many to count. There has been so much great work done. ProPublica’s Dollars for Docs was fantastic not only for its stories, but also for the way they shared the data and the way newsrooms from across the country could tap into the work. Last year, the Milwaukee Journal Sentinel’s project, Deadly Delays, was such important work.

How has the environment for doing this kind of work changed in the past five years?

It’s much more integrated into new immersive storytelling platforms. There is a recognition that stories can be told in many different ways. A sidebar that may once have been a 12-inch text piece is now a timeline, or a map.

I think there are many more team collaborations, with the developers, designers and reporters and CAR specialists working together from the outset. We need a lot more of this.

What’s different about practicing data journalism today, versus 10 years ago? What about teaching it?

There are more tools, with more coming every day. A few are great, and a lot aspire to be great and some of those will probably get there.

The really fantastic thing about the change is that it’s relatively easy to contribute to the development of a tool that will help journalism, even just as a beta tester.

There are more tech folk interested in helping make journalism better. We’re becoming a less insular world, and that’s a good thing.

Why are data journalism and “news apps” important, in the context of the contemporary digital environment for information?

News apps help tell important stories. It’s the same reason narrative is important.

It always should boil down to that: “does this tool, language, or app help tell a story?” If the answer is “yes,” and you think the story could be worth the effort, then the tool is important too.

What’s the one thing people always get wrong when they talk about data journalism?

I think I’ll have to punt on this one. As you have pointed out, data journalism is a big umbrella term for many different things — precision journalism, computer-assisted reporting, computational journalism, news apps, etc. — so it’s easy to have a different idea as to what it means.



Treat data as a source and then open it to the public, says Momi Peralta


Long before data journalism entered the mainstream discourse, La Nacion was pushing the boundaries of what was possible in Argentina, a country without a freedom of information law. If you look back into La Nacion’s efforts to go online and start to treat data as a source, you’ll find Angélica “Momi” Peralta Ramos (@momiperalta), the multimedia development manager who originally launched LaNacion.com in the 1990s and now manages its data journalism efforts.

Ramos contends that data-driven innovation is an antidote to budget crises in newsrooms. Her perspective is grounded in experience: Peralta’s team at La Nacion is using data journalism to challenge a FOIA-free culture in Argentina, opening up data for reporting and reuse and holding government accountable. This spring, I interviewed her about her work and perspective. Her answers follow, lightly edited for clarity.

You’re a computer scientist and MBA. How did you end up in journalism?

Years ago, I fell in love with the concept of the Internet. It is the synthesis of what I’d studied: information technology applied to communications. Now, with the opportunity of data journalism, I think there is a new convergence: the extraction and sharing of knowledge through collaboration using technology. I’m curious about everything and love to discover things.

How did your technical and business perspective inform how you approached LaNacion.com and La Nacion Data?

In terms of organization, it helped to consider traditional business areas like sales, marketing, customer service, business intelligence, and of course technology and a newsroom for content.

At first, I believed in the unlimited possibilities of technology applied to publishing online, and the power of the net to distribute content. Content was free to access, and free access became the norm. As consumers embraced it, there was a demand and a market, and when there is a market there are business opportunities, although with a much more fragmented competitive environment.

The same model applies now to data journalism. Building content from data or data platforms must evolve to an economy of scale in which the cost of producing [huge amounts of] content in one single effort tends to zero.

What examples of data-driven journalism should the public know about at La Nacion?

Linked below is a selection of 2013 projects. Some of them are finalists in the 2014 Data Journalism Awards! Please watch the videos inside the posts, as they explain how we managed to extract, transform, build and open data in every case.

How do you see digital publishing, the Internet and data journalism in South America, or globally? What about your peers?

I can’t speak to everyone else’s view, but I think we all see it the same way: as both a big challenge and an opportunity.

From then on, it’s a matter of being willing to do things. The technology is there, the talent is everywhere, the people who make a difference are the ones you have to gather.

As the context is different in every country and there are obstacles, you have to become a problem solver and be creative, but never stop. For example, if there are language barriers, translate. If there is no open data, start by doing it yourself. If technology is expensive, check first for free versions. Most are enough to do everything you need.

What are the most common tools applied to data journalism at La Nacion?

Collaborative tools. Google Docs, spreadsheets, Open Refine, Junar’s open data platform, Tableau Public for interactive graphs, and now Javascript or D3.js for reusable interactive graphs tied to updated datasets. We love tools that don’t need a developer every time to create interactive content. These are end users’ tools.

Developers are the best for “build once, use many times” kinds of content, developing tools, news applications and for creative problem solving.

What are the basic tools and foundational skills that data journalists need?

First, searching. Using advanced search techniques, in countries like ours, you find there is more on the Deep Web than on the surface.

Then scraping, converting data from PDFs, structuring datasets, and analyzing data. Then, learning to publish in open data formats.

Last, but not least: socializing and sharing your work.

Data journalists need a tolerance for frustration and the ability to reinvent themselves and self-motivate. Embrace technology. Don’t be afraid to experiment with tools, and learn to ask for help: teamwork is fun.

How do you and your staff keep your skills updated and learn?

We self-teach for free, thanks to the net. We look at best practices and inspiration from others’ cases, then, whenever we can, we go for assistance at conferences such as NICAR, ISOJ or ONA, and follow them online. If there are local trainings, we attend. We went to introductory two-day courses for ArcGIS and QlikView (business intelligence software) just to learn the possibilities of these technologies.

We taught ourselves Tableau. An interactive designer and I took two days off in a Starbucks with the training videos. Then she learned more in an advanced course.

We love webinars and MOOCs, like the Knight Center’s or the EJR’s data journalism MOOC.

We design internal trainings. We have a data journalism training program, now starting our 4th edition, with five days of full-time learning for groups of journalists and designers in our newsroom. We also design Excel courses for analyzing and designing data sets (DIY Data!) and, thanks to our Knight-Mozilla OpenNews fellows, we have customized workshops like CartoDB and introductions to D3.js.

We go to hackathons and meetups — nearly every meetup in Buenos Aires. We interact with experts and with journalists and learn a lot there, working in teams.

What are the biggest challenges La Nacion faces in practicing data journalism? What’s changed since 2011, in terms of the environment?

The context. To take just one example, consider the inflation scandal in Argentina. Even The Economist removed our [national] figures from its indicators page. Media that reported private indicators were treated as opposition by the government, which withdrew most official advertising from those outlets, fined private consultants who calculated consumer price indices that differed from the official one, pressed private consumer associations to stop measuring prices and releasing price indexes, and so on.

Regarding official advertising between 2009 and 2013, we managed to build a dataset. We found out that 50% went to 10 media groups, the ones closest to the government. In the last period, a hairdresser (stylist) received more advertising money than the largest newspapers in Argentina. Here’s how we built and analyzed this dataset.

Last year, independent media suffered an ad ban, as reported in The Wall Street Journal: “Argentina imposes ad ban, businesses said.”

Argentina is ranked 106 out of 177 in Transparency International’s Corruption Perceptions Index. We are still without a Freedom of Information law.

Regarding open data from governments, there are some initiatives. The most advanced is the City of Buenos Aires open data portal, but there are also national, provincial and municipal initiatives starting to publish useful information, and even open data.

Perhaps the best change is that we now have a big hacktivism community of transparency activists, NGOs, journalists and academic experts who are ready to share knowledge for data problem solving, as needed or in hackathons.

Our dream is for everyone to understand data as a public service, not only to enhance accountability but to enhance our quality of life.

What’s different about your work today, versus 1995, when LaNacion.com went online?

In 1995, we were alone. Everything was new and hard to sell. There was a small audience. Producing content was static, still in two dimensions, perhaps including a picture in .jpg form, and feedback came through e-mail.

Now there is a huge audience, a crowded competitive environment, and things move faster than ever in terms of formats, technologies, businesses and creative uses by audiences. Every day, there are challenges and opportunities to engage where audiences are, and give them something different or useful to remember us and come back.

Why are data journalism and news apps important?

Both move public information closer to the people and literally put data in citizens’ hands.

News apps are great for telling stories and localizing your data, but we need more efforts to humanize and explain data. [We should] make datasets famous, put them at the center of a conversation among experts first, and among the general public afterwards.

If we report on data, and we open data while reporting, then others can reuse and build another layer of knowledge on top of it. There are risks, if you have the traditional business mindset, but in an open world there is more to win than to lose by opening up.

This is not only a data revolution. It is an open innovation revolution around knowledge. Media must help open data, especially in countries with difficult access to information.

How do Freedom of Information laws relate to data journalism?

FOI laws are vital for journalism, but more vital for citizens in general, for the justice system, and for politicians, businesses or investors making decisions. Anyone can republish information, if she can get it, but there are requests for information that get no response at all.

What about open government in general? How does the open data movement relate to data journalism?

The open government movement is happening. We must be ready to receive and process open data, and then tell all the stories hidden in datasets that now may seem raw or distant.

To begin with, it would be useful to have data on open contracts, statements of assets and salaries of public officials, and ways to follow the money and compare, so people can help monitor government accountability. Although we dream of open data formats, we will take PDFs over print copies.

The open data movement and hacktivism can accelerate the application of technology to ingest large sets of documents, complex documents or large volumes of structured data. This will accelerate and help journalism extract and tell better stories, but also bring tons of information to light, so everyone can see it, process it and keep governments accountable.

The way to go for us now is to use data for journalism and then open that data. We are building blocks of knowledge and, at the same time, putting this data closer to the people, the experts and the ones who can do better work than we can to extract another story or detect pockets of corruption.

It makes lots of sense for us to make the effort of typing, building datasets, cleaning, converting and sharing data in open formats, even organizing our own ‘datafest’ to expose data to experts.

Open data will help in the fight against corruption. That is a real need, as here corruption is killing people.


Data skills make you a better journalist, says ProPublica’s Sisi Wei


I’ve found that the best antidote to a decade of discussion about the “future of news” is to talk to the young journalists who are building it. Sisi Wei’s award-winning journalism shows exactly what that looks like, in practice. Just browse her projects or code repositories on GitHub. Listening to her lightning talk at the 2014 NICAR conference on how ProPublica reverse engineered the Sina Weibo API to analyze censorship was one of many high points of the conference for me.

Wei, a news applications developer at ProPublica, was formerly a graphics editor at The Washington Post. She is also the co-founder of “Code with me,” a programming workshop for journalists. Our interview about her work and her view of the industry follows.

Where do you work now? What is a day in your life like?

I currently work at ProPublica, on the News Applications Team. We make interactive graphics and news apps; think of projects like 3D flood maps and Dollars for Docs.

At ProPublica, no one has a specific responsibility like design, backend development, data analysis, etc. Instead, people on the team tend to do the whole stack from beginning to end. When we need help, or don’t understand something, we ask our teammates. And of course, we’re constantly working alongside reporters and editors outside of the team as well. When someone’s app is deploying soon, we all pitch in to help take things off his/her plate.

On a given day, I could be calling sources and doing interviews, searching for a specific dataset, cleaning data, making my own data, analyzing it, coming up with the best way to visualize it, or programming an interactive graphic or news app. And of course, I could also be buried beneath interview notes and writing an article.

How did you get started in data journalism? Did you get any special degrees or certificates? What quantitative skills did you have?

I got started in college when I began making interactive graphics for North by Northwestern. I was a journalism/philosophy/legal studies major, so I can safely say that I had no special degrees or qualifications for data journalism.

The closest formal training I got was an “Introduction to Statistics” course my senior year, which I wish I’d taken earlier. I also had a solid math background for a non-major. The last college math course I took was on advanced linear algebra and multivariable calculus. Not that I’ve used either of those skills in my work just yet.

Did you have any mentors? Who? What were the most important resources they shared with you?

Too many to list. So, here’s just a sample of all the amazing people who I’ve been lucky to consider mentors in the past few years, and one of the many things they’ve all taught me.
Tom Giratikanon showed me that journalists could use programming to tell stories and exposed me to ActionScript and how programming works. Kat Downs taught me not to let the story be overshadowed by design or fancy interaction, and Wilson Andrews showed me how a pro handles making live interactive graphics for election night. Todd Lindeman taught me how to better visualize data and how to really take advantage of Adobe Illustrator. Lakshmi Ketineni and Michelle Chen honed my javascript and really taught me SQL and PHP.

Now at ProPublica, my teammates are my mentors. Here is where I learned Ruby on Rails, how news app development really works and how to handle large databases, first with ActiveRecord and now with ElasticSearch (which I am still working on learning).

What does your personal data journalism “stack” look like? What tools could you not live without?

  • Sublime Text, whose multiple selection feature is the trump card that makes it impossible for me to switch to anything else. If you haven’t used multiple selection, stop what you’re doing and go check it out.
  • The Terminal, for deploying and using Git or just testing out small bits of code in Ruby or Python.
  • Chrome, to debug my code.
  • The Internet, for the answers to all of my questions.

What are the foundational skills that someone needs to practice data journalism?

An insatiable appetite to get to the bottom of something, and the willingness to learn any tool to help you find the answers you’re looking for. In that process, you’ll by necessity learn programming skills, or data analysis skills. Both are important. But without knowing what questions to ask, or what you’re trying to accomplish, neither of those skills will help you.

Where should people who want to learn start?

In terms of programming, just pick a project, make it simple, make it happen and then finish it. Like Jennifer DeWalt did when she made 180 websites in 180 days.

Regarding data analysis, if you’re still in school, take more classes in statistics. If you’re not in school, NICAR offers CAR boot camps, or you can search for materials online, such as this book that teaches statistics to programmers.

Where do you turn to keep your skills updated or learn new things?

I don’t have a frequent cache of websites that I revisit to learn things. I simply figure out what I want to learn, or what problem I’m trying to solve, and use the Internet to find what I need to know.

For example, I’m currently trying to figure out which Javascript library or game engine can best enable me to create newsgames. I started out knowing close to nothing about the subject. Ten minutes of searching later, I had detailed comparisons between game engines, demos and reviews of gaming Javascript libraries, as well as wonderful tips from indie game developers for any rookies looking to get started.

What are the biggest challenges that newsrooms face in meeting the demand for people with these skills and experience? Are schools training people properly?

There are two major pipelines for newsrooms to recruit people with these skills. The first is to recruit journalists who have programming and/or data analysis experience. The second is to recruit programmers or data analysts to come into journalism.

The latter, I think, is much harder than the former, though the Knight-Mozilla OpenNews Fellowship is doing a great job of doing this. Schools are getting better at teaching students data journalism skills, but not at a high enough rate. I often see open job positions, but I rarely see students or professionals with the right skills and experience unable to find a job.

The lack of students, however, is a problem that starts before college. When high school students are applying for journalism school, they expect to go into print or radio or TV news. They don’t expect to learn how to code, or practice data analysis. I think one of the largest challenges is how to change this expectation at an earlier stage.

All of that said, I do have one wish that I would like journalism schools to fulfill: I wish that no j-school ever reinforces or finds acceptable, actively or passively, the stereotype that journalists are bad at math. All it takes is one professor who shrugs off a math error, adding to this stereotype, for the idea to pass on to one of his or her students. Let’s be clear: journalists do not come with a math disability.

What data journalism project created by someone else do you most admire?

I actually want to highlight a project called Vax, which was not built by journalists, but deploys the same principles as data journalism and has the same goals of educating the reader.

Vax is a game that teaches students both how epidemics spread, as well as prevention techniques. It was created originally to help students taking a Coursera MOOC on Epidemics really engage with the topic. I think it’s accomplished that in spades. Not only are users hooked right from the beginning, the game allows you to experience for yourself how people are interconnected, and how those who refuse vaccinations affect the process.

How has the environment for doing this kind of work changed in the past five years?

Since I only entered the field three years ago in 2011, all I can say is this: Data journalism is gaining momentum.

Our techniques are becoming more sophisticated and we’re learning from our mistakes. We’re constantly improving, building new tools and making it easier and more accessible to do common tasks. I don’t want to predict anything grand, but I think the environment is only going to get better.

Is data journalism the same thing as computer-assisted reporting or computational journalism? Why or why not?

To me, data journalism has become the umbrella term that includes anyone who works in data, journalism and programming. (And yes, executing functions in Excel or writing SQL queries is both data and programming.)

Why are data journalism and “news apps” important, in the context of the contemporary digital environment for information?

Philip Meyer, who wrote “Precision Journalism,” answers the first part of this question with his entire book, which I would recommend any aspiring data journalist read immediately. He says:

“Read any of the popular journals of media criticism and you will find a long litany of repeated complaints about modern journalism. It misses important stories, is too dependent on press releases, is easily manipulated by politicians and special interests, and does not communicate what it does know in an effective manner. All of these complaints are justified. Their cause is not so much a lack of energy or talent or dedication to truth, as the critics sometimes imply, but a simple lag in the application of information science — a body of knowledge — to the daunting problems of reporting the news in a time of information overload.”

Data journalism allows journalists to point to the raw data and ask questions, as well as question the very conclusions we are given. It allows us to use social science techniques to illuminate stories that might otherwise be hidden in plain sight.

News apps specifically allow users to search for what’s most relevant to them in a large dataset, and give individual readers the power to discover how a large, national story relates to them. If the story is that doctors have been receiving payments from pharmaceutical companies, news apps let you search to see if your doctor has as well.

What’s the one thing people always get wrong when they talk about data journalism?

That it’s new, or just a phase the journalism industry is going through.

Data journalism has been around since the 1970s (if not earlier), and it is not going to go away, because the skills involved are core to being a better journalist, and to making your story relatable to millions of users online.

Just imagine, if a source told you that 2+2=18, would you believe that statement? The more likely scenario is that you’d question your source about why he or she would say something so blatantly wrong, because you know how to do math, and you know that 2+2=4. Analyzing raw data can result in a similar question to a source, except this time you can ask, “Why does your data say X, but you say Y?”

Isn’t that a core skill every journalist should have?

How It's Made, Research

What’s the Upshot? A promising data-driven approach to the news.

3

This morning, The New York Times officially launched its long-awaited data-driven news site, “The Upshot.”

David Leonhardt, the site’s managing editor, introduced The Upshot in a long note posted to Facebook and then to nytimes.com this morning, explaining how the site aspires to help readers navigate the news.

Leonhardt shared two reasons for The Upshot’s launch. First, to help people understand the news better:

“We believe we can help readers get to that level of understanding by writing in a direct, plain-spoken way, the same voice we might use when writing an email to a friend. We’ll be conversational without being dumbed down. We will build on the excellent journalism The New York Times is already producing, by helping readers make connections among different stories and understand how those stories fit together. We will not hesitate to make analytical judgments about why something has happened and what is likely to happen in the future. We’ll tell you how we came to those judgments — and invite you to come to your own conclusions.”

Second, to make the most of the opportunity afforded by the growth of the Internet and the explosion of data creation:

“Data-based reporting used to be mostly a tool for investigative journalists who could spend months sorting through reams of statistics to emerge with an exclusive story. But the world now produces so much data, and personal computers can analyze it so quickly, that data-based reporting deserves to be a big part of the daily news cycle.

“One of our highest priorities will be unearthing data sets — and analyzing existing ones — in ways that illuminate and explain the news. Our first day of material, both political and economic, should give you a sense of what we hope to do with data. As with our written articles, we aspire to present our data in the clearest, most engaging way possible. A graphic can often accomplish that goal better than prose. Luckily, we work alongside The Times’s graphics department, some of the most talented data-visualization specialists in the country. It’s no accident that the same people who created the interactive dialect quiz, the deficit puzzle and the rent-vs-buy calculator will be working on The Upshot.”

The third goal, left unsaid by Leonhardt, is the strategic interest the New York Times has in creating a media entity that generates public interest and draws the massive audience that Nate Silver’s (now departed) FiveThirtyEight blog did, as the 2014 midterm elections draw near. In the fall of 2012, 20% of the visitors to the sixth-most-trafficked website in the world were checking out 538, and many of them were coming specifically for it.

First impressions

My aesthetic impressions of The Upshot have been overwhelmingly positive: the site looks great on a smartphone, tablet or laptop, and it loads quickly. I also like the placement of each columnist’s Twitter handle below his or her headshot and the smooth integration of social sharing tools.

My impressions of the site’s substance were similarly positive: it led off with a strong story on the American middle class and income inequality based upon public data, an analysis of affirmative action polling, a data-rich overview of how the environment has changed in the 44 years since the first Earth Day, a look at what good marathons and bad investments have in common, a short item on how some startups are approaching regulated industries, political field notes from Washington, and a simple data visualization of Pew Internet data that correlates an appreciation for Internet freedom with Internet use. Whew! The feature that many political junkies will appreciate most, however, is a clever, engaging interactive that forecasts the outcome of the 2014 election in the U.S. Senate.

A commitment to showing their work

What really made me sit up and take notice of The Upshot, however, was the editorial decision to share how they found the income data at LIS, to link to the dataset, and to share both the methodology behind the forecasting model and the code for it on GitHub. That is precisely the model of open data journalism that embodies the best of the craft as it is practiced in 2014, and it sets a high standard right out of the gate, both for future interactives at The Upshot and for other sites that might seek to compete with its predictions. They even include those estimates. Notably, FiveThirtyEight is now practicing a more open form of data journalism as well, “showing their work.”

Early reviews

I’m not alone in my positive first impressions of the content, presentation and strategy of the Times’ new site: over at the Guardian Datablog, James Ball published an interesting analysis of data journalism, as seen through the initial forays of The Upshot, FiveThirtyEight and Vox, the “explanatory journalism” site that Ezra Klein, Melissa Bell and Matt Yglesias, among others, launched this spring.

Ball’s whole post is worth reading, particularly his points about audience, diversity and personalization, but the part I think is most important with respect to data journalism is the one I’ve made above, about being open about the difficult, complicated process of reporting on data as a source:

Doing original research on data is hard: it’s the core of scientific analysis, and that’s why academics have to go through peer-review to get their figures, methods and approaches double-checked. Journalism is meant to be about transparency, and so should hold itself to this standard – at the very least.

This standard is especially true for data-driven journalism, but, sadly, it’s not always lived up to: Nate Silver (for understandable reasons) won’t release how his model works, while FivethirtyEight hasn’t released the figures or work behind some of their most high-profile articles.

That’s a shame, and a missed opportunity: sharing this stuff is good, accountable journalism, and gives the world a chance to find more stories or angles that a writer might have missed.

Counter-intuitively, old media is doing better at this than the startups: The Upshot has released the code driving its forecasting model, as well as the data on its launch inequality article. And the Guardian has at least tried to release the raw data behind its data-driven journalism since our Datablog launched five years ago.

Ball may have contributed to some category confusion by including Vox in his analysis of this new crop of data journalism startups, and he’s not alone: Mathew Ingram also groups Vox together with The Upshot and 538 in his post on “explanatory journalism.”

Both could certainly be forgiven, given that Leonhardt’s introduction expressed a goal of helping readers understand the news and that Nate Silver has made explanation an explicit component of his approach to data-driven journalism. The waters around what to call the product of these startups are considerably muddied at this point.

Hopefully, over time, those semantic waters will clear and reveal accurate, truthful and trustworthy journalism. Whatever we call them, there’s plenty of room for all of these new entrants to thrive, if they inform the public and build audiences.

“I think all of these sites are going to succeed,” said Leonhardt, in an interview with Capital New York. “There is much more demand for this kind of journalism right now than there is supply.”

In an interview with Digiday, Leonhardt further emphasized this view:

“I don’t think this is about a competition between these sites to see which will emerge victorious,” he said. “There is more than enough room for any site that is providing journalism of this kind to succeed. Given there’s a hunger for conversational journalism and database journalism, as long you’re giving people reporting that’s good, you’re going to succeed.”

How It's Made, Research, Tips & Tutorials

Oakland Police Beat applies data-driven investigative journalism in California

8

One of the explicit connections I’ve made over the years lies between data-driven investigative journalism and government or corporate accountability. In debugging the backlash to data journalism, I highlighted the work of The Los Angeles Times Data Desk, which has analyzed government performance data for accountability, among other notable projects. I could also have pointed to the Chicago Sun-Times, which applied data-driven investigative methods to determine that the City of Chicago’s 911 dispatch times vary widely depending on where you live, publishing an interactive map online for context, or to a Pulitzer Prize-winning story on speeding cops in Florida.


This week, there’s a new experiment in applying data journalism to local government accountability in Oakland, California, where the Oakland Police Beat has gone online. The nonprofit website, which is part of Oakland Local and The Center for Media Change and funded by The Ethics and Excellence in Journalism Foundation and The Fund for Investigative Journalism, was co-founded by Susan Mernit and Abraham Hyatt, the former managing editor of ReadWrite. (Disclosure: Hyatt edited my posts there.)

Oakland Police Beat is squarely aimed at shining sunlight on the practices of Oakland’s law enforcement officers. Their first story out of the gate pulled no punches, finding that Oakland’s most decorated officers were responsible for a high number of brutality lawsuits and shootings.

The site also demonstrated two important practices that deserve to become standard in data journalism: explaining the methodology behind their analysis, including source notes, and (eventually) publishing the data behind the investigation. 

To learn more about why Oakland Police Beat did that, how they’ve approached their work and what the long game is, I contacted Hyatt. Our interview follows, lightly edited and hyperlinked for context. Any [bracketed] comments are my own.

So, what exactly did you launch? What’s the goal?

Hyatt: We launched a news site and a database with 25 years worth of data about individual Oakland Police Department (OPD) officers who have been involved in shootings and misconduct lawsuits.

Oakland journalists usually focus (and rightfully so) on the city’s violent crime rate and the latest problems with the OPD. We started this project by asking whether we could create a comprehensive picture of the officers with the most violent behavior, the behavior that explains why the OPD is where it is today. We started requesting records and tracking down information. That eventually became the database. It’s the first time anyone in Oakland has created a resource like this.

What makes this “data-driven journalism?”

Hyatt: We started with the data and let it guide the course of the entire project. The stories we’ve written all came from the data.

Why is sharing the data behind the work important?

Hyatt: Sharing is critical. Sharing, not traffic, is the metric I’m using to gauge our success, although traffic certainly is fun to watch, too. That’s the main reason that we’re allowing people to download all of our data. (The settlement database will be available for download next week.)

How will journalists, activists, and data nerds use it over time? That’s going to be the indicator of how important this work was.

[Like ProPublica, Oakland Police Beat is encouraging reuse. The site says that “You’re welcome to republish our stories and use our data for free. We publish our stories under an Attribution-NonCommercial-ShareAlike 4.0 License.”]

Where do you get the data?

Hyatt: All of it came from city and court documents. Some of it came as .CSV files, some as PDFs that we had to scrape.
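[To make those two formats concrete, here is a minimal, hypothetical sketch of what that ingestion step can look like in Python. The file names and columns are invented, and pdfplumber is just one of several libraries for pulling text out of PDFs; Hyatt doesn’t name the tools he used for this step.]

```python
# Hypothetical sketch of reading the two source formats described above.
# File names are invented; pdfplumber is one common choice for PDF extraction.
import csv
import pdfplumber

# CSV files can be read directly.
with open("settlements.csv", newline="") as f:
    settlements = list(csv.DictReader(f))

# PDFs have to be "scraped": pull the raw text out page by page, then dig
# through it for officer names and allegations, by hand or with rules.
with pdfplumber.open("court_filing.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)

print(f"{len(settlements)} settlement rows; {len(text)} characters of court text")
```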

How much time and effort did it take to ingest, clean, structure and present?

Hyatt: Almost all of the court docs had to be human-read. It was a laborious process of digging to find officer names and what the allegations were. Combining city settlement data records and court docs took close to five months. Then, we discovered that the city’s data had flaws and that took another couple of months to resolve.

Some of the data was surprisingly easy to get. I didn’t expect the City Attorney’s office to be so forthcoming with information. Other stuff was surprisingly difficult. The OPD refused to give us awards data before 2007. They claim that they didn’t keep that data on individual officers before then. I know that’s completely false, but we’re a tiny project. We don’t have the resources to take them to court over it. Our tools were very simple.

Did you pay for it?

Hyatt: We used PACER a ton. The bill was close to $900 by the time we were done. We mainly worked out of spreadsheets. I had a handful of command line tools that I used to clean and process data. I ran a virtual machine so that I could use some Linux-based tools as well. I heart Open Refine. We experimented with using Git for version control on stories we were writing.
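[For readers wondering what “command line tools” to clean and process data can mean in practice, here is a small, hypothetical sketch of one such cleanup step: normalizing officer names so the same person isn’t counted twice. It illustrates the general technique, not Oakland Police Beat’s actual code; OpenRefine’s clustering features do similar work interactively.]

```python
#!/usr/bin/env python3
# Hypothetical command-line cleaning step: normalize officer names in a CSV so
# spelling variants collapse to one form. Column and file names are invented.
# Usage: python clean_names.py raw_settlements.csv cleaned_settlements.csv
import csv
import sys

def normalize(name: str) -> str:
    # Trim whitespace, collapse runs of spaces, and title-case the name,
    # so "SMITH,  JOHN " and "smith, john" both become "Smith, John".
    return " ".join(name.split()).title()

def main(in_path: str, out_path: str) -> None:
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row["officer_name"] = normalize(row["officer_name"])  # hypothetical column
            writer.writerow(row)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```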

A used chemical agent grenade found on the streets in downtown Oakland following Occupy demonstrations in 2011. Photo by Eric K Arnold.

Will you be publishing data and methodology as you go along?

Hyatt: The methodology post covers all of our stories. We’ll continue to publish stories, as well as some data sets that we got along the way that we decided not to put into our main dataset, like several hundred city attorney reports about the settled cases.

What’s the funding or revenue model for the site? Where will this be in one year? Or 5?

Hyatt: Everyone wants grant-funded journalism startups to be sustainable, but, so often, they start strong and then peter out when resources run dry.

Instead of following that model, I knew from the start that this was going to be a phased project. We had some great grants that got us started, but I didn’t know what the funding picture was going to look like once we started running stories. So, I tried to turn that limitation into a strength.

We’re publishing eight weeks worth of stories and data. We’re going to cram as much awesome into those weeks as we can and then, if needed, we can step away and let this project stand on its own.

With that said, we’re already looking for funding for a second phase (which will focus on teens and the OPD). When we get it, we’ll use this current data as a springboard for Phase 2.

Could this approach be extended to other cities?

Hyatt: The OPD and its problems are pretty unique in the USA. This was successful because there was so much stuff to work with in Oakland. I don’t think our mentality for creating and building this project was unique.