How It's Made, Research

Hyper-compensation: Ted Nelson and the impact of journalism


NewsLynx is a Tow Center research project and platform aimed at better understanding the impact of news. It is conducted by Tow Fellows Brian Abelson, Stijn DeBrouwere & Michael Keller.

“If you want to make an apple pie from scratch, you must first invent the universe.” — Carl Sagan

Before you can begin to measure impact, you need to first know who’s talking about you. While analytics platforms provide referrers, social media sites track reposts, and media monitoring tools follow mentions, these services are often incomplete and come with a price. Why is it that, on the internet — the most interconnected medium in history — tracking linkages between content is so difficult?

The simple answer is that the web wasn’t built to be *fully* connected, per se. It’s an idiosyncratic, labyrinthine garden of forking paths with no way to navigate from one page to pages that reference it.

We’ve spent the last few months thinking about and building an analytics platform called NewsLynx, which aims to help newsrooms better capture the quantitative and qualitative effects of their work. Many of our features are aimed at giving newsrooms a better sense of who is talking about their work. This seemingly simple feature, understanding the links among web pages, has taken up the majority of our time. The obstacle turns out to be a shortcoming in the fundamental architecture of the web. Without that shortcoming, however, the web might never have succeeded.

The creator of the web, Tim Berners-Lee, didn’t provide a means for contextual links in the specification for HTML. The world wide web wasn’t the only idea for networking computers, however. Over 50 years ago, an early figure in computing had a different vision of the web – a vision that would have made the construction of NewsLynx a lot easier today, if not completely unnecessary.

Around 1960, a man named Ted Nelson came up with an idea for a structure of linking pieces of information in a two-way fashion. Whereas links on the web today just point one way — to the place you want to go — pages on Nelson’s internet would have a “What links here?” capability, so you would know all the websites that point to your page.

And if you were dreaming up the ideal information web, this structure makes complete sense: why not make the most connections possible? As Borges writes, “I thought of a labyrinth of labyrinths, of one sinuous spreading labyrinth that would encompass the past and the future and in some way involve the stars.”

Nelson called his project Xanadu, but it had the misfortune of being both extremely ahead of its time and incredibly late to the game. Project Xanadu’s first and somewhat cryptic release debuted this year: over 50 years after it was first conceived.

In the meantime, Berners-Lee put forward HTML, with its one-way links, in the early ’90s, and it took off into what we know today. And one of the reasons for the web’s success is its extremely informal, ad-hoc functionality: anyone can put up an HTML page without hooking into, or caring about, a more elaborate system. Compared to Xanadu, what we use today is the quick and dirty implementation of a potentially much richer, but also much harder to maintain, ecosystem.

Two-way linking would not only make impact research easier but also help address a number of other problems on the web. In his latest book, “Who Owns the Future?”, Jaron Lanier discusses two-way linking as a potential solution to copyright infringement and a host of other web maladies. His logic is that if you could always know who is linking where, then you could create a system of micropayments to make sure authors get proper credit. His idea has its own caveats, but it shows the kinds of systems that two-way linking might enable. Chapter Seven of Lanier’s book discusses some of the other reasons Nelson’s idea never took off.

The desire for two-way links has not gone away, however. In fact, the *lack* of two-way links is an interesting lens through which to view the current tech environment. By creating a central server that catalogs and makes sense of the one-way web, Google adds value with its ability to make the internet seem more like Project Xanadu. If two-way links existed, you wouldn’t need all of the features of Google Analytics. People could implement their own search engines with their own page rank algorithms based on publicly available citation information.
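To make the idea concrete, here is a minimal sketch of a two-way link index as a data structure — my own illustration, not code from Xanadu or NewsLynx, and the URLs are made up. Every link is recorded in both directions, so “What links here?” becomes a simple lookup instead of a crawl of the entire web.

```python
from collections import defaultdict

class TwoWayLinkIndex:
    """Toy bidirectional link index: records each link in both directions."""

    def __init__(self):
        self.outbound = defaultdict(set)  # page -> pages it links to
        self.inbound = defaultdict(set)   # page -> pages that link to it

    def add_link(self, source, target):
        # One call updates both directions, which is exactly what HTML never required.
        self.outbound[source].add(target)
        self.inbound[target].add(source)

    def what_links_here(self, page):
        # On today's one-way web, answering this requires crawling or a search engine.
        return self.inbound[page]

# Hypothetical example pages:
index = TwoWayLinkIndex()
index.add_link("blog.example.com/review", "newsroom.example.org/investigation")
print(index.what_links_here("newsroom.example.org/investigation"))
# {'blog.example.com/review'}
```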

The inefficiency of one-way links left a hole at the center of the web for a powerful player to step in and play librarian. As a result, if you want to know how your content lives online, you have to go shopping for analytics. To effectively monitor the life of an article, newsrooms currently use a host of services from trackbacks and Google Alerts to Twitter searches and ad hoc scanning. Short link services break web links even further. Instead of one canonical URL for a page, you can have a bit.ly, t.co, j.mp or thousands of other custom domains.
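As a rough illustration of the short-link problem (a sketch of my own, not NewsLynx code, using the requests and BeautifulSoup libraries), getting back to something canonical usually means following the redirect chain and then preferring the page’s own rel="canonical" tag, if it declares one:

```python
import requests
from bs4 import BeautifulSoup

def resolve_canonical(url):
    """Follow redirects from a short link, then prefer the page's declared canonical URL."""
    response = requests.get(url, allow_redirects=True, timeout=10)
    final_url = response.url  # where bit.ly, t.co, j.mp, etc. ultimately land

    soup = BeautifulSoup(response.text, "html.parser")
    canonical_tag = soup.find("link", rel="canonical")
    if canonical_tag and canonical_tag.get("href"):
        return canonical_tag["href"]
    return final_url

# Hypothetical usage: many different short links can collapse to one article URL.
# print(resolve_canonical("https://bit.ly/example"))
```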

NewsLynx doesn’t have the power of Google. But we have been working on a core feature that would leverage Google features and other two-way link surfacing techniques to make monitoring the life of an article much easier: we’re calling them “recipes,” for now (#branding suggestions welcome). In NewsLynx, you’ll add these “recipes” to the system and it will alert you to all pending mentions in one filterable display. If a citation is important, you can assign it to an article or to your organization more generally. We also have a few built-in recipes to get you started.
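To give a flavor of what one kind of “recipe” might boil down to, here is a hedged sketch of my own (not actual NewsLynx code; the feed URL is a placeholder and feedparser is simply one library that can do this): poll a Google Alerts RSS feed set up for your organization and surface each citation that hasn’t been seen before.

```python
import feedparser

def google_alert_recipe(feed_url, seen_links):
    """Poll a Google Alerts RSS feed and yield citations we haven't surfaced yet."""
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        if entry.link not in seen_links:
            seen_links.add(entry.link)
            yield {"title": entry.title, "url": entry.link, "source": "google-alert"}

# Hypothetical usage: the feed URL comes from a Google Alert you create for your newsroom.
seen = set()
for mention in google_alert_recipe("https://www.google.com/alerts/feeds/EXAMPLE", seen):
    print(mention["title"], mention["url"])
```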

We’re excited to get this tool into the hands of news sites and see how it helps them better understand their place in the world wide web. As we prepare to launch the platform in the next month or so, check back here for any updates.

How It's Made, Research, Tips & Tutorials

Think about data from the beginning of the story, says Cheryl Phillips


“Stories can be told in many different ways,” said Cheryl Phillips. “A sidebar that may once have been a 12-inch text piece is now a timeline, or a map.”

Phillips, an award-winning investigative journalist, will start teaching students how to treat data as a source this fall, when she begins a new gig as a lecturer in Stanford’s graduate journalism program, helping to open Stanford’s new Computational Journalism Lab.

“Cheryl Phillips brings an outstanding mix of experience in data journalism and investigative work to our program. Students and faculty here are eager to start working with her to push forward the evolving field of computational journalism,” said Jay Hamilton, Hearst Professor of Communication and Director of the Stanford Journalism Program, in a statement. “Her emphasis on accountability reporting and interest in using data to lower the costs of discovering stories will help our journalism students learn how to uncover stories that currently go untold in public affairs reporting.”


I interviewed Phillips about her career, which has included important  reporting on the nonprofit and philanthropy world, her plans for teaching at Stanford, data journalism, j-schools and teaching digital skills, and the challenges that newsrooms face today and in the future.

What is a day in your life like now?

I’m the data innovation editor at The Seattle Times. Essentially, I work with data for stories and help coordinate data-related efforts, such as working with reporters, graphics folks, and others on news apps and visualizations. I also have looked at some of our systems and processes and suggested new, more time-effective methods for us.

I’ve been at The Seattle Times since 2002. I started as a data-focused reporter on the investigations team, then became deputy investigations editor, then data enterprise editor. I also worked on the metro desk and edited a team of reporters. I currently work in the digital/online department, but really work across all the departments. I also helped train the newsroom when we moved to a new content management system about a year or so ago. I am trying to wrap up a couple of story-related projects, and do some data journalism newsroom training before I start at Stanford in the fall.

How did you get started in data journalism? Did you earn any special degrees or certificates?

I remember taking a class (outside of the journalism department) while in college. The subject purported to be about learning how personal computers worked but, aside from a textbook that showed photos of a personal computer, we really just learned how to write if-then loops on a mainframe.

I got my first taste of data journalism at the Fort Worth Star-Telegram. That’s where I did my first story using any kind of computer for something other than putting words on a screen. I had gotten the ownership agreement for the Texas Rangers, which included a somewhat complex formula. I kept doing the math on my calculator and screwed it up each time. Finally, I called up a friend of mine who was a CPA, and she taught me Lotus 1-2-3.

My real start in computer-assisted reporting came in 1995, when I was on loan to USA TODAY. I was fortunate enough to land in the enterprise department with the data editors, and Phil Meyer was there as a consultant. By the end of five months, I could use spreadsheets, Paradox (for DOS!) and SPSS. What a great education. I followed that up by joining IRE and attending the NICAR conference. I’ve missed very few since then and also done some of NICAR’s specialized training on stats and maps.

I have no special degrees or certificates, but I have taken some online courses in R, Python, etc.

Did you have any mentors? Who? What were the most important resources they shared with you?

Phil Meyer is amazing, and such a great teacher. He taught me statistics, but also taught me about how to think about data. Sara Cohen and Aron Pilhofer of the New York Times, and Jennifer LaFleur of CIR. Paul Overberg at USA TODAY. They have all helped me over the years.

NICAR is an incredible world, full of data journalists and journalist-programmers who are willing to help others out. It’s a great family.

On the investigative journalism front, Jim Neff and David Boardman are fantastic editors and great at asking vital questions.

What does your personal data journalism “stack” look like? What tools could you not live without?

I’m a firm believer in the power of the spreadsheet. So much of what journalists do on a daily basis can be made easier and more effective by just using a spreadsheet.

I use OpenRefine,  CometDocs, Tabula, AP Overview and Document Cloud. I use MySQL with Navicat. I still use Access. I’m a recent convert to R, but also use SPSS. I use ESRI for mapping, but am interested in exploring other options also. I use Google Fusion Tables as well.

Most of my work has been on the more traditional CAR front, but I’ve been learning Python for scraping projects.

What are the foundational skills that someone needs to practice data journalism?

In many ways, the same foundational skills you need for any kind of journalism.

Curiosity, for one. Journalists need to think about stories from a mindset that includes data from the very beginning, such as when a reporter talks to a source, or a government official. If an official mentions statistics, don’t just ask for a summary report, but ask for the underlying data — and for that same data over time. The editors of those reporters need to do the same thing. Think about the possibilities if you had more information and could analyze and view it in different ways.

Second, be open to learning any skill sets that will help tell the story. I got into data journalism because I discovered stories I would not be able to tell if I didn’t obtain and analyze data. We all know journalists don’t like to take someone’s word for something — data journalism just takes that to the next level.

Third, in terms of technical skills, learn how to use a spreadsheet, at a bare minimum. Really, one tool leads to another. Once you know how a spreadsheet works, you are more open to using OpenRefine to clean and standardize that data, or learning a language for scraping data, or another program that will help with finding connections.

What classes will you be teaching at Stanford, and how?

I will be teaching several courses, including a data journalism class focusing on relational data, basic statistics and mapping. I also will be teaching an investigative reporting class focusing on investigative reporting tools.

In general, I want to make sure the students are telling stories from data that they analyze. They should be not only learning the technical stack, but how to apply the technical knowledge to real-world journalism. I am hoping to create some partnerships with newsrooms as well.

Where do you turn to keep your skills updated or learn new things?

IRE and NICAR and all the folks involved there. I also try to learn from our producers at The Seattle Times, who come in knowing way more than I did when I started in journalism. I try to follow smart people on Twitter and other social media.

I like to reach out to folks about what they are doing. I think reaching out and connecting with folks outside of journalism is a great way to make sure we are aware of other new tools, developments, etc.

What are the biggest challenges that newsrooms face in meeting the demand for people with these skills and experience?

Newsrooms are often still structured into silos, so reporters just report and write. They may hand their data off to a graphics desk, but they don’t necessarily analyze or visualize data themselves. Producers produce, but don’t write, even though they may enjoy that and be good at it, too.

Some of this is by necessity, but it makes it harder to learn new skills — and some of these skills are really useful. A reporter who knows how to visualize data may also be able to look at it in a different way when reporting the story out, too. So, building collaborative teams is important, as is providing time for folks to try out other skills.

Are journalism schools training people properly? What will you do differently?

I think it’s no secret that a lot of change is starting to take place in schools.

Cindy Royal had an interesting piece about platforms just the other day. In general, I think my answer here is similar to the biggest challenge for newsrooms: We need to take a more integrated approach. Classrooms and their teachers should collaborate on work.

So, for example, a multimedia class produces the visualizations and videos that go with the stories being written in another class. (Yes, Stanford already does this.)

Data journalism should not be just one class out of a curriculum, but infused throughout a curriculum. Every type of journalist can learn data-related skills that will help them, whether they end up as a copy editor, a reporter, a front-line editor or a graphics artist.

What data journalism project are you the most proud of working on or creating?

I have been asked this question before and can never answer it well. My last story is always the one I’m most proud of, unless it’s the one I’m about to publish.

That said, as an editor at The Seattle Times, I worked with Jennifer LaFleur (then with ProPublica) on a project tracking the reasons behind foreclosures, a deep dive into the driving factors in several cities.

When I was a reporter, I was lucky enough to get to work with Ken Armstrong on our court secrecy project in 2006, which changed state practice. I also led the reporting effort on problems with airport security. Both of those used small data sets, which we built ourselves, but told important stories.

I can think of even more stories that weren’t data projects per se, but which used data in the reporting in critical ways. The recent Oso mudslide coverage is an example of where we used mapping data and landslide data to effectively tell the story of the impact of the slide on the victims and of how the potentially disastrous consequences had been ignored over time.

What data journalism project created by someone else do you most admire?

Too many to count. There has been so much great work done. ProPublica’s Dollars for Docs was fantastic not only for its stories, but for the way they shared the data and the way newsrooms from across the country could tap into the work. Last year, the Milwaukee Journal Sentinel’s project, Deadly Delays, was such important work.

How has the environment for doing this kind of work changed in the past five years?

It’s much more integrated into new immersive storytelling platforms. There is a recognition that stories can be told in many different ways. A sidebar that may once have been a 12-inch text piece is now a timeline, or a map.

I think there are many more team collaborations, with the developers, designers and reporters and CAR specialists working together from the outset. We need a lot more of this.

What’s different about practicing data journalism today, versus 10 years ago? What about teaching it?

There are more tools, with more coming every day. A few are great, and a lot aspire to be great and some of those will probably get there.

The really fantastic thing about the change is that it’s relatively easy to contribute to the development of a tool that will help journalism, even just as a beta tester.

There are more tech folk interested in helping make journalism better. We’re becoming a less insular world, and that’s a good thing.

Why are data journalism and “news apps” important, in the context of the contemporary digital environment for information?

News apps help tell important stories. It’s the same reason narrative is important.

It always should boil down to that: “does this tool, language, or app help tell a story?” If the answer is “yes,” and you think the story could be worth the effort, then the tool is important too.

What’s the one thing people always get wrong when they talk about data journalism?

I think I’ll have to punt on this one. As you have pointed out, data journalism is a big umbrella term for many different things — precision journalism, computer-assisted reporting, computational journalism, news apps, etc. — so it’s easy to have a different idea as to what it means.

[IMAGE CREDIT: University of Washington]

How It's Made, Research, Tips & Tutorials

Treat data as a source and then open it to the public, says Momi Peralta


Long before data journalism entered the mainstream discourse, La Nacion was pushing the boundaries of what was possible in Argentina, a country without a freedom of information law. If you look back into La Nacion’s efforts to go online and start to treat data as a source, you’ll find Angélica “Momi” Peralta Ramos (@momiperalta), the multimedia development manager who originally launched LaNacion.com in the 1990s and now manages its data journalism efforts.

Peralta Ramos contends that data-driven innovation is an antidote to budget crises in newsrooms. Her perspective is grounded in experience: Peralta’s team at La Nacion is using data journalism to challenge a FOIA-free culture in Argentina, opening up data for reporting and reuse, and holding the government accountable. This spring, I interviewed her about her work and perspective. Her answers follow, lightly edited for clarity.

You’re a computer scientist and MBA. How did you end up in journalism?

Years ago, I fell in love with the concept of the Internet. It is the synthesis of what I’d studied: information technology applied to communications. Now, with the opportunity of data journalism, I think there is a new convergence: the extraction and sharing of knowledge through collaboration using technology. I’m curious about everything and love to discover things.

How did your technical and business perspective inform how you approached LaNacion.com and La Nacion Data?

In terms of organization, it helped to consider traditional business areas like sales, marketing, customer service, business intelligence, and of course technology and a newsroom for content.

At first, I believed in the unlimited possibilities of technology applied to publishing online, and the power of the net to distribute content. Content was free to access, and free became the norm. As consumers embraced it, there was a demand and a market, and when there is a market there are business opportunities, although with a much more fragmented competitive environment.

The same model applies now to data journalism. Building content from data or data platforms must evolve to an economy of scale in which the cost of producing [huge amounts of] content in one single effort tends to zero.

What examples of data-driven journalism should the public know about at La Nacion?

Linked below is a selection of 2013 projects. Some of them are finalists in the 2014 Data Journalism Awards! Please watch the videos inside the posts, where we explain how we managed to extract, transform, build and open data in every case.

How do you see digital publishing, the Internet and data journalism in South America or globally? What about your peers?

I can’t speak to everyone else’s view, but I think we all see it the same way, as both a big challenge and an opportunity.

From then on, it’s a matter of being willing to do things. The technology is there, the talent is everywhere, the people who make a difference are the ones you have to gather.

As the context is different in every country and there are obstacles, you have to become a problem solver and be creative, but never stop. For example, if there are language barriers, translate. If there is no open data, start by doing it yourself. If technology is expensive, check first for free versions. Most are enough to do everything you need.

What are the most common tools applied to data journalism at La Nacion?

Collaborative tools. Google Docs, spreadsheets, Open Refine, Junar’s open data platform, Tableau Public for interactive graphs, and now JavaScript or D3.js for reusable interactive graphs tied to updated datasets. We love tools that don’t need a developer every time to create interactive content. These are end-user tools.

Developers are best for “build once, use many times” kinds of content, for developing tools and news applications, and for creative problem solving.

What are the basic tools and foundational skills that data journalists need?

First, searching. Using advanced search techniques, in countries like ours, you find there is more on the Deep Web than on the surface.

Then scraping, converting data from PDFs, structuring datasets, and analyzing data. Then, learning to publish in open data formats.

Last, but not least: socializing and sharing your work.

Data journalists need a tolerance for frustration and the ability to reinvent and motivate themselves. Embrace technology. Don’t be afraid to experiment with tools, and learn to ask for help: teamwork is fun.

How do you and your staff keep your skills updated and learn?

We self-teach for free, thanks to the net. We look at best practices and draw inspiration from others’ work; then, whenever we can, we attend conferences such as NICAR, ISOJ or ONA and follow them online. If there are local trainings, we attend. We went to introductory two-day courses for ArcGIS and QlikView (business intelligence software) just to learn the possibilities of these technologies.

We taught ourselves Tableau. An interactive designer and I took two days off in a Starbucks with the training videos. Then she learned more in an advanced course.

We love webinars and MOOCs, like the Knight Center’s or the EJR’s data journalism MOOC.

We design internal trainings. We have a data journalism training program, now starting our 4th edition, with five days of full-time learning for groups of journalists and designers in our newsroom. We also design Excel courses for analyzing and designing data sets (DIY Data!) and, thanks to our Knight-Mozilla OpenNews fellows, we have customized workshops, like CartoDB sessions and introductions to D3.js.

We go to hackathons and meetups — nearly every meetup in Buenos Aires. We interact with experts and with journalists and learn a lot there, working in teams.

What are the biggest challenges La Nacion faces in practicing data journalism? What’s changed since 2011, in terms of the environment?

The context. To take just one example, consider the inflation scandal in Argentina. Even The Economist removed our [national] figures from its indicators page. Media that reported private indicators were considered opposition by the government, which took away most official advertising from those media, fined private consultants who calculated consumer price indices different from the official one, pressured private consumer associations to stop measuring prices and releasing price indexes, and so on.

Regarding official advertising between 2009 and 2013, we managed to build a dataset. We found out that 50% went to 10 media groups, the ones closest to the government. In the last period, a hairdresser (stylist) received more advertising money than the largest newspapers in Argentina. Here’s how we built and analyzed this dataset.

Last year, independent media suffered an ad ban, as reported in The Wall Street Journal: “Argentina imposes ad ban, businesses said.”

Argentina is ranked 106th out of 177 in Transparency International’s Corruption Perceptions Index. We are still without a Freedom of Information law.

Regarding open data from governments, there are some initiatives. One of the more advanced is the City of Buenos Aires open data portal, but there are also national, provincial and municipal initiatives starting to publish useful information, and even open data.

Perhaps the best change is that we now have a big hacktivist community of transparency activists, NGOs, journalists and academic experts who are ready to share knowledge for data problem solving, as needed or in hackathons.

Our dream is for everyone to understand data as a public service, not only to enhance accountability but to enhance our quality of life.

What’s different about your work today, versus 1995, when LaNacion.com went online?

In 1995, we were alone. Everything was new and hard to sell. There was a small audience. Producing content was static, still in two dimensions, perhaps including a picture in .jpg form, and feedback came through e-mail.

Now there is a huge audience, a crowded competitive environment, and things move faster than ever in terms of formats, technologies, businesses and creative uses by audiences. Every day, there are challenges and opportunities to engage where audiences are, and give them something different or useful to remember us and come back.

Why are data journalism and news apps important?

Both move public information closer to the people and literally put data in citizens’ hands.

News apps are great for telling stories and localizing your data, but we need more effort to humanize data and explain data. [We should] make datasets famous, putting them at the center of a conversation among experts first, and the general public afterwards.

If we report on data, and we open data while reporting, then others can reuse and build another layer of knowledge on top of it. There are risks, if you have the traditional business mindset, but in an open world there is more to win than to lose by opening up.

This is not only a data revolution. It is an open innovation revolution around knowledge. Media must help open data, especially in countries with difficult access to information.

How do Freedom of Information laws relate to data journalism?

FOI laws are vital for journalism, but even more vital for citizens in general, for the justice system, and for politicians, businesses or investors making decisions. Anyone can republish information, if she can get it, but there are requests for information that get no response at all.

What about open government in general? How does the open data movement relate to data journalism?

The open government movement is happening. We must be ready to receive and process open data, and then tell all the stories hidden in datasets that now may seem raw or distant.

To begin with, it would be useful to have data on open contracts, statements of assets and salaries of public officials, and ways to follow the money and compare, so people can help monitor government accountability. Although we dream of open data formats, we will take PDFs over receiving print copies.

The open data movement and hacktivism can accelerate the application of technology to ingest large sets of documents, complex documents or large volumes of structured data. This will accelerate and help journalism extract and tell better stories, but also bring tons of information to light, so everyone can see, process and keep governments accountable.

The way to go for us now is to use data for journalism and then open that data. We are building blocks of knowledge and, at the same time, putting this data closer to the people, the experts and the ones who can do better work than we can to extract another story or detect spots of corruption.

It makes lots of sense for us to make the effort of typing, building datasets, cleaning, converting and sharing data in open formats, even organizing our own ‘datafest’ to expose data to experts.

Open data will help in the fight against corruption. That is a real need, as here corruption is killing people.

How It's Made, Research, Tips & Tutorials

Data skills make you a better journalist, says ProPublica’s Sisi Wei


I’ve found that the best antidote to a decade of discussion about the “future of news” is to talk to the young journalists who are building it. Sisi Wei’s award-winning journalism shows exactly what that looks like, in practice. Just browse her projects or code repositories on Github. Listening to her lightning talk at the 2014 NICAR conference on how ProPublica reverse-engineered the Sina Weibo API to analyze censorship was one of many high points of the conference for me.

Wei, a news applications developer at ProPublica, was formerly a graphics editor at The Washington Post. She is also the co-founder of “Code with me,” a programming workshop for journalists. Our interview about her work and her view of the industry follows.

Where do you work now? What is a day in your life like?

I currently work at ProPublica, on the News Applications Team. We make interactive graphics and news apps; think of projects like 3D flood maps and Dollars for Docs.

At ProPublica, no one has a specific responsibility like design, backend development, data analysis, etc. Instead, people on the team tend to do the whole stack from beginning to end. When we need help, or don’t understand something, we ask our teammates. And of course, we’re constantly working alongside reporters and editors outside of the team as well. When someone’s app is deploying soon, we all pitch in to help take things off his/her plate.

On a given day, I could be calling sources and doing interviews, searching for a specific dataset, cleaning data, making my own data, analyzing it, coming up with the best way to visualize it, or programming an interactive graphic or news app. And of course, I could also be buried beneath interview notes and writing an article.

How did you get started in data journalism? Did you get any special degrees or certificates? What quantitative skills did you have?

I got started in college when I began making interactive graphics for North by Northwestern. I was a journalism/philosophy/legal studies major, so I can safely say that I had no special degrees or qualifications for data journalism.

The closest formal training I got was an “Introduction to Statistics” course my senior year, which I wish I’d taken earlier. I also had a solid math background for a non-major. The last college math course I took was on advanced linear algebra and multivariable calculus. Not that I’ve used either of those skills in my work just yet.

Did you have any mentors? Who? What were the most important resources they shared with you?

Too many to list. So, here’s just a sample of all the amazing people who I’ve been lucky to consider mentors in the past few years, and one of the many things they’ve all taught me.
Tom Giratikanon showed me that journalists could use programming to tell stories and exposed me to ActionScript and how programming works. Kat Downs taught me not to let the story be overshadowed by design or fancy interaction, and Wilson Andrews showed me how a pro handles making live interactive graphics for election night. Todd Lindeman taught me how to better visualize data and how to really take advantage of Adobe Illustrator. Lakshmi Ketineni and Michelle Chen honed my javascript and really taught me SQL and PHP.

Now at ProPublica, my teammates are my mentors. Here is where I learned Ruby on Rails, how news app development really works and how to handle large databases with first ActiveRecord and now ElasticSearch (which I am still working on learning).

What does your personal data journalism “stack” look like? What tools could you not live without?

  • Sublime Text, whose multiple selection feature is the trump card that makes it impossible for me to switch to anything else. If you haven’t used multiple selection, stop what you’re doing and go check it out.
  • The Terminal, for deploying and using Git or just testing out small bits of code in Ruby or Python.
  • Chrome, to debug my code.
  • The Internet, for the answers to all of my questions.

What are the foundational skills that someone needs to practice data journalism?

An insatiable appetite to get to the bottom of something, and the willingness to learn any tool to help you find the answers you’re looking for. In that process, you’ll by necessity learn programming skills, or data analysis skills. Both are important, but without knowing what questions to ask, or what you’re trying to accomplish, neither of those skills will help you.

Where should people who want to learn start?

In terms of programming, just pick a project, make it simple, make it happen and then finish it. Like Jennifer DeWalt did when she made 180 websites in 180 days.

Regarding data analysis, if you’re still in school, take more classes in statistics. If you’re not in school, NICAR offers CAR boot camps, or you can search for materials online, such as this book that teaches statistics to programmers.

Where do you turn to keep your skills updated or learn new things?

I don’t have a frequent cache of websites that I revisit to learn things. I simply figure out what I want to learn, or what problem I’m trying to solve, and use the Internet to find what I need to know.

For example, I’m currently trying to figure out which Javascript library or game engine can best enable me to create newsgames. I started out knowing close to nothing about the subject. Ten minutes of searching later, I had detailed comparisons between game engines, demos and reviews of gaming Javascript libraries, as well as wonderful tips from indie game developers for any rookies looking to get started.

What are the biggest challenges that newsrooms face in meeting the demand for people with these skills and experience? Are schools training people properly?

There are two major pipelines for newsrooms to recruit people with these skills. The first is to recruit journalists who have programming and/or data analysis experience. The second is to recruit programmers or data analysts to come into journalism.

The latter, I think, is much harder than the former, though the Knight-Mozilla OpenNews Fellowship is doing a great job of it. Schools are getting better at teaching students data journalism skills, but not at a high enough rate. I often see open job positions, but I rarely see students or professionals with the right skills and experience unable to find a job.

The lack of students, however, is a problem that starts before college. When high school students are applying for journalism school, they expect to go into print or radio or TV news. They don’t expect to learn how to code, or practice data analysis. I think one of the largest challenges is how to change this expectation at an earlier stage.

All of that said, I do have one wish that I would like journalism schools to fulfill: I wish that no j-school ever reinforces or finds acceptable, actively or passively, the stereotype that journalists are bad at math. All it takes is one professor who shrugs off a math error to add to this stereotype and pass the idea on to his or her students. Let’s be clear: Journalists do not come with a math disability.

What data journalism project created by someone else do you most admire?

I actually want to highlight a project called Vax, which was not built by journalists, but deploys the same principles as data journalism and has the same goals of educating the reader.

Vax is a game that teaches students both how epidemics spread and how to prevent them. It was created originally to help students taking a Coursera MOOC on epidemics really engage with the topic. I think it’s accomplished that in spades. Not only are users hooked right from the beginning, but the game also allows you to experience for yourself how people are interconnected, and how those who refuse vaccinations affect the process.

How has the environment for doing this kind of work changed in the past five years?

Since I only entered the field three years ago in 2011, all I can say is this: Data journalism is gaining momentum.

Our techniques are becoming more sophisticated and we’re learning from our mistakes. We’re constantly improving, building new tools and making it easier and more accessible to do common tasks. I don’t want to predict anything grand, but I think the environment is only going to get better.

Is data journalism the same thing as computer-assisted reporting or computational journalism? Why or why not?

To me, data journalism has become the umbrella term that includes anyone who works in data, journalism and programming. (And yes, executing functions in Excel or writing SQL queries is both data and programming.)

Why are data journalism and “news apps” important, in the context of the contemporary digital environment for information?

Philip Meyer, who wrote “Precision Journalism,” answers the first part of this question with his entire book, which I would recommend any aspiring data journalist read immediately. He says:

“Read any of the popular journals of media criticism and you will find a long litany of repeated complaints about modern journalism. It misses important stories, is too dependent on press releases, is easily manipulated by politicians and special interests, and does not communicate what it does know in an effective manner. All of these complaints are justified. Their cause is not so much a lack of energy or talent or dedication to truth, as the critics sometimes imply, but a simple lag in the application of information science — a body of knowledge — to the daunting problems of reporting the news in a time of information overload.”

Data journalism allows journalists to point to the raw data and ask questions, as well as question the very conclusions we are given. It allows us to use social science techniques to illuminate stories that might otherwise be hidden in plain sight.

News apps specifically allow users to search for what’s most relevant to them in a large dataset, and give individual readers the power to discover how a large, national story relates to them. If the story is that doctors have been receiving payments from pharmaceutical companies, news apps let you search to see if your doctor has as well.

What’s the one thing people always get wrong when they talk about data journalism?

That it’s new, or just a phase the journalism industry is going through.

Data journalism has been around since the 1970s (if not earlier), and it is not going to go away, because the skills involved are core to being a better journalist, and to making your story relatable to millions of users online.

Just imagine, if a source told you that 2+2=18, would you believe that statement? The more likely scenario is that you’d question your source about why he or she would say something so blatantly wrong, because you know how to do math, and you know that 2+2=4. Analyzing raw data can result in a similar question to a source, except this time you can ask, “Why does your data say X, but you say Y?”

Isn’t that a core skill every journalist should have?

How It's Made, Research

What’s the Upshot? A promising data-driven approach to the news.


This morning, The New York Times officially launched its long-awaited data-driven news site, “The Upshot.”

David Leonhardt, the site’s managing editor, introduced The Upshot in a long note posted to Facebook and then to NYTimes.com this morning, explaining how the site aspires to help readers navigate the news.

Leonhardt shared two reasons for The Upshot’s launch. First, to help people understand the news better:

“We believe we can help readers get to that level of understanding by writing in a direct, plain-spoken way, the same voice we might use when writing an email to a friend. We’ll be conversational without being dumbed down. We will build on the excellent journalism The New York Times is already producing, by helping readers make connections among different stories and understand how those stories fit together. We will not hesitate to make analytical judgments about why something has happened and what is likely to happen in the future. We’ll tell you how we came to those judgments — and invite you to come to your own conclusions.”

Second, to make the most of the opportunity afforded by the growth of the Internet and the explosion of data creation.

Data-based reporting used to be mostly a tool for investigative journalists who could spend months sorting through reams of statistics to emerge with an exclusive story. But the world now produces so much data, and personal computers can analyze it so quickly, that data-based reporting deserves to be a big part of the daily news cycle.

One of our highest priorities will be unearthing data sets — and analyzing existing ones — in ways that illuminate and explain the news. Our first day of material, both political and economic, should give you a sense of what we hope to do with data. As with our written articles, we aspire to present our data in the clearest, most engaging way possible. A graphic can often accomplish that goal better than prose. Luckily, we work alongside The Times’s graphics department, some of the most talented data-visualization specialists in the country. It’s no accident that the same people who created the interactive dialect quiz, the deficit puzzle and the rent-vs-buy calculator will be working on The Upshot.

The third goal, left unsaid by Leonhardt, is the strategic interest The New York Times has in creating a media entity that generates public interest and draws the massive audience that Nate Silver’s (now departed) FiveThirtyEight blog did, as the 2014 midterm elections draw near. In the fall of 2012, 20% of the visitors to the sixth-most-trafficked website in the world were checking out 538, and many of them were coming specifically for it.

First impressions

My aesthetic impressions of The Upshot have been overwhelmingly positive: the site looks great on a smartphone, tablet or laptop, and loads quickly. I also like how each columnist’s Twitter handle is located below their headshot, and I appreciate the smooth integration of social sharing tools.

My impressions of the site’s substance were similarly positive: the site led off with a strong story on the American middle class and income inequality based upon public data, an analysis of affirmative action polling, a data-rich overview of how the environment has changed in the 44 years since the first Earth Day, a look at what good marathons and bad investments have in common, a short item on how some startups are approaching regulated industries, political field notes from Washington and a simple data visualization of Pew Internet data that correlates an appreciation for Internet freedom with Internet use. Whew! The feature that many political junkies will appreciate most, however, is a clever, engaging interactive that forecasts the outcome of the 2014 election in the U.S. Senate.

A commitment to showing their work

What really made me sit up and take notice of The Upshot, however, was the editorial decision to share how they found the income data at LIS, link to the dataset, and share both the methodology behind the forecasting model and the code for it on Github. That is precisely the model for open data journalism that embodies the best of the craft, as it is practiced in 2014, and it sets a high standard right out of the gate for future interactives at The Upshot and for other sites that might seek to compete with its predictions. They even include those competing estimates. Notably, FiveThirtyEight is now practicing a more open form of data journalism as well, “showing their work.”

 

Early reviews

I’m not alone in my positive first impressions of the content, presentation and strategy of the Times’ new site: over at the Guardian Datablog, James Ball published an interesting analysis of data journalism, as seen through the initial forays of The Upshot, FiveThirtyEight and Vox, the “explanatory journalism” site that Ezra Klein, Melissa Bell and Matt Yglesias, among others, launched this spring.

Ball’s whole post is worth reading, particularly with respect to his points about audience, diversity and personalization, but the part I think matters most for data journalism is the one I’ve made above, regarding being open about the difficult, complicated process of reporting on data as a source:

Doing original research on data is hard: it’s the core of scientific analysis, and that’s why academics have to go through peer-review to get their figures, methods and approaches double-checked. Journalism is meant to be about transparency, and so should hold itself to this standard – at the very least.

This standard is especially true for data-driven journalism, but, sadly, it’s not always lived up to: Nate Silver (for understandable reasons) won’t release how his model works, while FivethirtyEight hasn’t released the figures or work behind some of their most high-profile articles.

That’s a shame, and a missed opportunity: sharing this stuff is good, accountable journalism, and gives the world a chance to find more stories or angles that a writer might have missed.

Counter-intuitively, old media is doing better at this than the startups: The Upshot has released the code driving its forecasting model, as well as the data on its launch inequality article. And the Guardian has at least tried to release the raw data behind its data-driven journalism since our Datablog launched five years ago.

Ball may have contributed to some category confusion by including Vox in his analysis of this new crop of data journalism startups, and he’s not alone: Mathew Ingram also groups Vox together with The Upshot and 538 in his post on “explanatory journalism.”

Both could certainly be forgiven, given that Leonhardt’s introduction expressed a goal of helping readers understand the news and that Nate Silver has made explanation an explicit component of his approach to data-driven journalism. The waters around what to call the product of these startups are considerably muddied at this point.

Hopefully, over time, those semantic waters will clear and reveal accurate, truthful and trustworthy journalism. Whatever we call them, there’s plenty of room for all of these new entrants to thrive, if they inform the public and build audiences.

“I think all of these sites are going to succeed,” said Leonhardt, in an interview with Capital New York. “There is much more demand for this kind of journalism right now than there is supply.”

In an interview with Digiday, Leonhardt further emphasized this view:

“I don’t think this is about a competition between these sites to see which will emerge victorious,” he said. “There is more than enough room for any site that is providing journalism of this kind to succeed. Given there’s a hunger for conversational journalism and database journalism, as long you’re giving people reporting that’s good, you’re going to succeed.”

How It's Made, Research, Tips & Tutorials

Oakland Police Beat applies data-driven investigative journalism in California


One of the explicit connections I’ve made over the years lies between data-driven investigative journalism and government or corporate accountability. In debugging the backlash to data journalism, I highlighted the work of The Los Angeles Times Data Desk, which has analyzed government performance data for accountability, among other notable projects. I could also have pointed to the Chicago Sun-Times, which applied data-driven investigative methods to determine  that the City of Chicago’s 911 dispatch times vary widely depending on where you live, publishing an interactive map online for context, or to a Pulitzer Prize-winning story on speeding cops in Florida.


This week, there’s a new experiment in applying data journalism  to local government accountability in Oakland, California, where the Oakland Police Beat has gone online. The nonprofit website, which is part of Oakland Local and The Center for Media Change and funded by The Ethics and Excellence in Journalism Foundation and The Fund for Investigative Journalism, was co-founded by Susan Mernit and Abraham Hyatt, the former managing editor of ReadWrite. (Disclosure: Hyatt edited my posts there.)

Oakland Police Beat is squarely aimed at shining sunlight on the practices of Oakland’s law enforcement officers. Their first story out of the gate pulled no punches, finding that Oakland’s most decorated officers were responsible for a high number of brutality lawsuits and shootings.

The site also demonstrated two important practices that deserve to become standard in data journalism: explaining the methodology behind their analysis, including source notes, and (eventually) publishing the data behind the investigation. 

To learn more about why Oakland Police Beat did that, how they’ve approached their work and what the long game is, I contacted Hyatt. Our interview follows, lightly edited and hyperlinked for context. Any [bracketed] comments are my own.

So, what exactly did you launch? What’s the goal?

Hyatt: We launched a news site and a database with 25 years worth of data about individual Oakland Police Department (OPD) officers who have been involved in shootings and misconduct lawsuits.

Oakland journalists usually focus (and rightfully so) on the city’s violent crime rate and the latest problems with the OPD. We started this project by asking if we could create a comprehensive picture of the officers with the most violent behavior, which is why the OPD is where it is today. We started requesting records and tracking down information. That eventually became the database. It’s the first time anyone in Oakland has created a resource like this.

What makes this “data-driven journalism?”

Hyatt: We started with the data and let it guide the course of the entire project. The stories we’ve written all came from the data.

Why is sharing the data behind the work important?

Hyatt: Sharing is critical. Sharing, not traffic, is the metric I’m using to gauge our success, although traffic certainly is fun to watch, too. That’s the main reason that we’re allowing people to download all of our data. (The settlement database will be available for download next week.)

How will journalists, activists, and data nerds use it over time? That’s going to be the indicator of how important this work was.

[Like ProPublica, Oakland Police Beat is encouraging reuse. The site says that "You’re welcome to republish our stories and use our data for free. We publish our stories under an Attribution-NonCommercial-ShareAlike 4.0 License."]

Where do you get the data?

Hyatt: All of it came from city and court documents. Some of it came as .CSV files, some as PDFs that we had to scrape.

How much time and effort did it take to ingest, clean, structure and present?

Hyatt: Almost all of the court docs had to be human-read. It was a laborious process of digging to find officer names and what the allegations were. Combining city settlement data records and court docs took close to five months. Then, we discovered that the city’s data had flaws and that took another couple of months to resolve.

Some of the data was surprisingly easy to get. I didn’t expect the City Attorney’s office to be so forthcoming with information. Other stuff was surprisingly difficult. The OPD refused to give us awards data before 2007. They claim that they didn’t keep that data on individual officers before then. I know that’s completely false, but we’re a tiny project. We don’t have the resources to take them to court over it. Our tools were very simple.

Did you pay for it?

Hyatt: We used PACER a ton. The bill was close to $900 by the time we were done. We mainly worked out of spreadsheets. I had a handful of command line tools that I used to clean and process data. I ran a virtual machine so that I could use some Linux-based tools as well. I heart Open Refine. We experimented with using Git for version control on stories we were writing.
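For readers curious what that kind of cleanup step can look like, here is a small sketch of my own in Python with pandas — the file and column names are hypothetical, not Oakland Police Beat’s actual schema or tooling — showing how a settlements CSV might be standardized and summarized before reporting from it:

```python
import pandas as pd

# Hypothetical file and columns; the real settlement data may be structured differently.
settlements = pd.read_csv("settlements.csv")  # e.g. officer_name, case_number, amount

# Normalize officer names so "SMITH, JOHN " and "Smith, John" count as the same person.
settlements["officer_name"] = settlements["officer_name"].str.strip().str.title()

# Quick sanity check: total settlement dollars per officer, largest first.
totals = (
    settlements.groupby("officer_name")["amount"]
    .sum()
    .sort_values(ascending=False)
)
print(totals.head(10))
```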

[IMAGE: A used chemical agent grenade found on the streets in downtown Oakland following Occupy demonstrations in 2011. Photo by Eric K Arnold.]

Will you be publishing data and methodology as you go along?

Hyatt: The methodology post covers all of our stories. We’ll continue to publish stories, as well as some data sets that we got along the way that we decided not to put into our main dataset, like several hundred city attorney reports about the settled cases.

What’s the funding or revenue model for the site? Where will this be in one year? Or 5?

Hyatt: Everyone wants grant-funded journalism startups to be sustainable, but, so often, they start strong and then peter out when resources run dry.

Instead of following that model, I knew from the start that this was going to be a phased project. We had some great grants that got us started, but I didn’t know what the funding picture was going to look like once we started running stories. So, I tried to turn that limitation into a strength.

We’re publishing eight weeks worth of stories and data. We’re going to cram as much awesome into those weeks as we can and then, if needed, we can step away and let this project stand on its own.

With that said, we’re already looking for funding for a second phase (which will focus on teens and the OPD). When we get it, we’ll use this current data as a springboard for Phase 2.

Could this approach be extended to other cities?

Hyatt: The OPD and its problems are pretty unique in the USA. This was successful because there was so much stuff to work with in Oakland. I don’t think our mentality for creating and building this project was unique.

How It's Made, Research

Debugging the backlash to data journalism


While the craft and context that underlies “data journalism” is well-known to anyone who knows the history of computer-assisted reporting (CAR), the term itself is a much more recent creation.

This past week, data journalism broke into the hurly-burly of mainstream discourse, with the predictable cycle of hype and then backlash, for two reasons:

1) The launch of Nate Silver’s FiveThirtyEight, where he explicitly laid out his vision for data journalism in a manifesto on “what the fox knows.” He groups statistical analysis, data visualization, computer programming and “data-literate reporting” under the rubric of data journalism.

2) A story in USA Today on the booming market for data journalists and the scoops and audiences they create and enable. The “news nerd job” openings at both born-digital and traditional media institutions show clear demand across the industry.

There are several points that I think are worth making in light of these two stories.

First, if you’re new to this discussion, Mark Coddington has curated the best reviews, comments and critiques of FiveThirtyEight.com in his excellent weekly digest of the news at the Nieman Journalism Lab. The summary ranges from the quality of 538’s stories to criticism of Nate Silver’s detachment, or even of “data journalism” itself, and to questions about the notion of journalists venturing into empirical projects at all. If you want more context, start there.

Second, it’s great to see the topic of data journalism getting its moment in the sun, even if some of the reactions to Silver’s effort may mistake the man or his vision for the whole practice. Part of the backlash has something to do with high expectations for Silver’s effort. FiveThirtyEight is a new, experimental media venture in which a smart guy has been empowered to try to build something that can find signal in the noise (so to speak) for readers. I’m more than willing to give the site and its founder more time to find its feet.

Third, while FiveThirtyEight is new, as are various other startups and ventures within media companies, data journalism itself is not, and neither are critiques of its practice or of programming in journalism generally. There are powerful new digital tools and platforms. If we broaden the debate to include screeds asserting that journalists don’t have to know how to code, it’s much easier to find a backlash, along with apt responses about the value of coding courses in journalism school and of digital literacy, grounded in looking ahead to the future of digital media, not its ink-stained past.

Fourth, a critical backlash against computers, coding and databases in the media isn’t new. As readers of this blog certainly know, data journalism’s historic antecedent, computer-assisted reporting, has long since been recognized as an important journalistic discipline, as my colleague Susan McGregor highlighted last year in Columbia Journalism Review.

Critics have been assessing the credibility of CAR for years. If you take a longer view, database-driven journalism has been with us since journalists first started using mainframes, and it arrived in most newsrooms, in a broad sense, over two decades ago.

The idea of “computer-assisted reporting” now feels dated, though, inherited from a time when computers were still a novelty in newsrooms. There’s probably not a single reporter or editor working in a newsroom in the United States or Europe today who isn’t using a computer in the course of journalism.

Many members of the media use several of them over the course of a day, from the powerful handheld computers we call smartphones, to the laptops and desktops crunching away at analysis or transformations, to the servers and cloud storage used for processing big data at Internet scale.

Having investigated the subject for many months, I think it’s fair to say that powerful new tools and increased sophistication differentiate the CAR of decades ago from the way data journalism is practiced today.

While I’ve loosely defined data journalism as “gathering, cleaning, organizing, analyzing, visualizing and publishing data to support the creation of acts of journalism,” a more succinct definition might be the “application of data science to journalism.”

Other observers might suggest that data journalism involves applying the scientific method or social science and statistical analysis to journalism. Philip Meyer called the latter “precision journalism” in the 1970s.

2014 was the year I saw the worm really turn on the use of the term “data journalism,” from its adoption by David Kaplan, a pillar of the investigative journalism community, to its use as self-identification by dozens of attendees at the annual conference of the National Institute for Computer-Assisted Reporting (NICAR), where nearly a thousand journalists from 20 countries gathered in Baltimore to teach, learn and connect. Its younger attendees use titles like “data editor,” “data reporter” or “database reporter.”

The NICAR conference has grown by leaps and bounds since its first iteration two decades ago, tripling in size in just the past four years. That rapid expansion is happening for good reason: the strong, clear market demand for data journalists at both traditional and born-digital media outlets that I mentioned earlier.

The size of NICAR 2014 may have given some long-time observers pause, whether over the effect on the vibrant community that has grown around the conference for years or over its increasing focus on tools.

“I’m a little worried that NICAR has gotten too big, like SXSW, and that it will lose its soul,” said Matt Waite, a professor of practice at the College of Journalism and Communications at the University of Nebraska, in an interview. “I don’t think it’s likely.”

Fifth, there is something important happening around the emergence of data journalism. The packed hallways and sessions at NICAR accurately reflected what’s happening in the industry.

“Five years ago, this kind of thing was still seen in a lot of places at best as a curiosity, and at worst as something threatening or frivolous,” said Chase Davis, assistant editor for interactive news at the New York Times, in an interview.

“Some newsrooms got it, but most data journalists I knew still had to beg, borrow and steal for simple things like access to servers. Solid programming practices were unheard of. Version control? What’s that? If newsroom developers today saw Matt Waite’s code when he first launched PolitiFact, their faces would melt like Raiders of the Lost Ark.

Now, our team at the Times runs dozens of servers. Being able to code is table stakes. Reporters are talking about machine frickin’ learning, and newsroom devs are inventing pieces of software that power huge chunks of the web.”

What’s happening today does have some genuinely interesting novelty to it, from the use of Amazon’s cloud to the maturation of various open source tools that have been funded by the Knight Foundation, like the Overview Project, DocumentCloud and the PANDA Project, or free or open source tools like Google Spreadsheets, Fusion Tables and OpenRefine.

These are still relatively new and powerful tools, which both justifies excitement about their applications and prompts understandable skepticism about what difference they will make if a majority of practicing journalists aren’t quite ready to use them yet.

One broader challenge created by the mainstream adoption of “data journalism” is that the term may become divorced from the long history that came before it, as Los Angeles Times data editor Ben Welsh reminded this year’s NICAR conference in a brilliant lightning talk.

Whatever we call it, if you look around the globe, the growing importance of data journalism is now clear, given the explosion in data creation. Data and journalism have become deeply intertwined, and more prominent together.

To make sense of the data deluge, journalists today need to be more numerate, technically literate and logical. They need to be able to add context, fact-check sources, and weave in narrative, interrogating data just as a reporter would skeptically interview human sources for hidden influences and biases.

If you read Anthony DeBarros’ 2010 post on CAR and data journalism, you’re connected to that past, but it’s fair to guess that most people who read Nate Silver’s magnum opus on FiveThirtyEight’s approach to data journalism have not. In 3,500 words or so, Silver didn’t link to DeBarros, Philip Meyer, or a single organization that has been practicing, researching or expanding data journalism over the past decade, perhaps the most fertile time for the practice in history.

Journalists have been gathering data and analyzing it for many decades, integrating it into their stories and broadcasts in tables, charts and graphics, like a box score that compares the on-base percentages of baseball players at a given position over time. Data is a critical component in substantiating various aspects of a story, woven into the way the story was investigated and reported.

There have been reporters going to libraries, agencies, city halls and courts to find public records about nursing homes, taxes, and campaign finance spending for decades. The difference today is that in addition to digging through dusty file cabinets in court basements, they might be scraping a website, or pulling data from an API that New York Times news developer Derek Willis made, because he’s the sort of person who doesn’t want to have to repeat a process every time and will make data available to all, where possible.

Number-crunching enables Pulitzer Prize-winning stories like the one on speeding cops in Florida that Welsh referenced in his NICAR talk, or The Los Angeles Times’ analysis of ambulance response times. That investigation showed the public and the state something important: the data used to analyze performance was of poor quality because fire stations weren’t logging it well.

The current criticism of data journalism is a tiny subset of a broader backlash against the hype around “big data,” a term whose use has grown in recent years, adopted all the way up to President Obama in the White House. Professional pundits and critics will always jump on opportunities to puncture hype. (They have families and data plans to support too, after all.)

I may even have inadvertently participated in creating hype around “data journalism” myself over the years, although I maintain that my interest and coverage have always been grounded in my belief that its importance has grown because of bigger macro trends in society. The sensors and mobile devices coming online in the next couple of years will exponentially expand the amount of data available to interrogate. As predictive policing or “personalized redlining” become real, intrusive forces in the lives of Americans, data journalism will become a crucial democratic bulwark against the increased power of algorithms in society.

That puts a huge premium on the media having the capacity to do this kind of work, and on editors hiring people who can. They should: data journalism is creating both scoops and audiences. It’s also a fine reason to focus on highlighting that demand, and to celebrate the role that NICAR and data journalism MOOCs play in training an expanding tribe, along with the willingness of the people who have gone before to help others who want to learn.

I expect to see more mainstream pushback on data journalism from members of the media who are highly proficient at interviewing, writing and editing, but perhaps less so with other skills that are now part of the reporter’s modern toolkit, like video, social media, Web development or mobile reporting. Professional pundits who don’t ground their assertions in history or science may not fare quite as well in this world. Researchers who blog, by contrast, will. As more sources of expert, data-driven analysis of law, science, medicine or technology go direct to readers online, opinion journalists without deep subject matter expertise are going to have to recalibrate.

It’s possible that there could also be a (much smaller) backlash from long-time practitioners who see too much of a focus on tools at NICAR.

“I’m concerned that it’s become too focused on data, and not enough on journalism,” said Waite. “There used to be much more on stories, with a focus on beats. People would talk about how they reported out stories, not technology. The number of panels about algorithm design are growing, and the number of story panels are shrinking. They’re not as well attended. That’s a reflection of the wishes of the attendees, but it troubles me.”

There may also be some pushback against the meaning of “data journalist” being diluted, though I doubt we’ll see much of it. People at the top of the profession have serious technical chops that enable them to do much more than download a .csv file and turn it into an infographic. These folks are proficient in Python, R and other programming languages, able to scrape, clean and interrogate huge data sets with complicated statistical analyses. At the edges of that gradient lies computational journalism, although that specialty doesn’t yet seem to exist much outside of the academy.

Every one of the data journalists I’ve met over the years, however, cared a lot more about good code, clean data and beautiful design than the semantics of what to call them, or defending their professional turf.

Of the 997 NICAR attendees, how many were students, investigative reporters and editors who had shown up for the first time to learn these skills? If you told me a majority, I wouldn’t be surprised.

My sense was that in 2014, the unprecedented number of people who came had internalized the message that data journalism is important and that they needed to know how to do some of these things, or at least know what they are. They want to know what forking code on GitHub means, or at least what GitHub is and how people use it.

I don’t mean to knock the digital literacy of the NICAR attendees; my sense is that it’s higher than at any other gathering of journalists in the world. But it’s easy to forget that there’s a significant portion of the public for whom these concepts are novel.

I think that’s true of the new media industry too, in which digital literacy and numeracy are perhaps not what they could be. There’s now more pressure on people in the industry to learn more, and for those who want to enter it to have more basic data skills. That’s driven some changes in the NICAR program.

“The temptation is that NICAR will become all about code-sharing,” said Waite. “That would lose the value-add, which is how the code relates to journalism. What’s different, versus programming or Web development?”

This reflects a common dividing line I’ve seen in the business world: the “suits” versus the hoodies, jeans versus khakis, MBAs versus developers. Today, the world of the “news hacker” is being democratized — a good thing — so there’s always going to be a little discomfort around something that has stretched from a small, self-identifying tribe into something bigger.

I expect that the backlash within the NICAR community to its expanded ranks and role in the industry will be minimal, leaving people room to work, collaborate, learn and teach. We’d be better off focused on the journalism itself, from storytelling to rigorous fact checking, and a bit less focused upon the tools, however new and shiny some may be.

“I’m not overly pessimistic about NICAR — quite the opposite,” said Waite, “but this focus on the data part of data journalism and less on the journalism part of data journalism is a nagging worry in the back of my head.”

That’s not to say that the technology isn’t worth considering or covering, as I have for years. We have huge amounts of data going online today, more than we ever had before, and media have access to much more powerful personal machines and cloud computing to process it.

Even with the new tech, they’re still doing something old: practicing journalism! The approach may start to look a bit more scientific over time. An editor might float an assertion or hypothesis about something new in the world, and then assign an investigative journalist to find out whether it’s true. To do that, you need to go find data, evidence and knowledge about it. To prove to skeptical readers that the conclusion is sound, the data journalist may need to show his or her work, from the data sources to the process used to transform and present them.

It now feels cliched to say it in 2014, but in this context transparency may be the new objectivity. The latter concept is not one that has much traction in the scientific community, where observer effects and experimenter bias are well-known phenomena. Studies and results that can’t be reproduced are regarded with skepticism for a reason.

Such thinking about the scientific method and journalism isn’t new, nor is its practice by journalists around the country who have pioneered the craft of data journalism with much less fanfare than FiveThirtyEight.

“As we all know, there’s a lot of data out there,” said Ben Welsh, editor of the Los Angeles Times Data Desk, “and, as anyone who works with it knows, most of it is crap. The projects I’m most proud of have taken large, ugly datasets and refined them into something worth knowing: a nut graf in an investigative story or a data-driven app that gives the reader some new insight into the world around them.”

The graphic atop this post comes from that Data Desk. You can’t see the work that created the image here, but it’s online if you want to look for it: The Los Angeles Times released both the code and data behind the open source maps of California’s emergency medical agencies it published in the series.

Moreover, it wasn’t the first time. As Welsh wrote, the Data Desk has “previously written about the technical methods used to conduct [the] investigation, released the base layer created for an interactive map of response times and contributed the location of LAFD’s 106 fire stations to OpenStreetMap.”

This is what an open source newsroom that practices open data journalism looks like. It’s not just applying statistics and social science to polls and publishing data visualizations. If FiveThirtyEight, Vox, The New York Times’ Upshot or other outlets want to publish data journalism and build out the newsroom stack, that’s the high bar that’s been set. (Update: I was heartened to learn that FiveThirtyEight has a GitHub account.) In sharing not only its code but its data, the Los Angeles Times also set a notable example for the practice of open journalism in the 21st century.

I don’t know about you, but I think that’s a much more compelling vision for what data journalism is and how it has been, is being and could be applied in the 21st century than the fox’s tale.

Postscript: Good news: 538 is both listening and acting.

Between the Spreadsheets, How It's Made

Of scripts, scraping and quizzes: how data journalism creates scoops and audiences


As last year drew to a close, Scott Klein, a senior editor of news applications at ProPublica, made a slam-dunk prediction: “in 2014, you will be scooped by a reporter who knows how to program.”

While the veracity of his statement had already been borne out by numerous examples, including the ones linked in his post, two fascinating stories published in the month since then demonstrate just how creatively a good idea and a clever script can be applied — and a third highlights why the New York Times is investing in data-driven journalism and journalists in the year ahead.

Tweaking Twitter data

One of those stories went online just two days after Klein’s post was published, weeks before the new year began. Jon Bruner, a former colleague and data journalist turned conference co-chair at O’Reilly Media, decided to apply his programming skills to Twitter, randomly sampling about 400,000 accounts over time. The evidence he gathered showed that amongst the active Twitter accounts he measured, the median account has 61 followers and follows 177 users.

[Figure: histogram of follower counts among the sampled Twitter accounts]

“If you’ve got a thousand followers, you’re at the 96th percentile of active Twitter users,” he noted at Radar. This data also enabled Bruner to make a widely-cited (and tweeted!) conclusion: Twitter is “more a consumption medium than a conversational one–an only-somewhat-democratized successor to broadcast television, in which a handful of people wield enormous influence and everyone else chatters with a few friends on living-room couches.”

How did he do it? Python, R and MySQL.

“Every few minutes, a Python script that I wrote generated a fresh list of 300 random numbers between zero and 1.9 billion and asked Twitter’s API to return basic information for the corresponding accounts,” wrote Bruner. “I logged the results–including empty results when an ID number didn’t correspond to any account–in a MySQL table and let the script run on a cronjob for 32 days. I’ve only included accounts created before September 2013 in my analysis in order to avoid under-sampling accounts that were created during the period of data collection.”
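
Bruner didn’t publish his script, but a minimal sketch of the approach he describes might look like the code below. It assumes Twitter’s v1.1 users/lookup endpoint (which accepts up to 100 IDs per request), a set of OAuth credentials, and a SQLite table standing in for the MySQL table he used; every name here is illustrative rather than taken from his code.

```python
# Sketch of Bruner's sampling approach: pick random IDs, ask Twitter's API
# which of them correspond to real accounts, and log the results.
# Assumptions: Twitter API v1.1 users/lookup, OAuth1 credentials, and SQLite
# standing in for the MySQL table described in the post.
import random
import sqlite3

import requests
from requests_oauthlib import OAuth1

LOOKUP_URL = "https://api.twitter.com/1.1/users/lookup.json"
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

db = sqlite3.connect("twitter_sample.db")
db.execute("""CREATE TABLE IF NOT EXISTS accounts (
    sampled_id INTEGER, found INTEGER, followers INTEGER,
    friends INTEGER, created_at TEXT)""")

# One batch per cron run: 300 random IDs in the 0-1.9 billion range.
sample = random.sample(range(0, 1_900_000_000), 300)

for i in range(0, len(sample), 100):  # the lookup endpoint takes at most 100 IDs
    batch = sample[i:i + 100]
    resp = requests.get(LOOKUP_URL, auth=auth,
                        params={"user_id": ",".join(str(x) for x in batch)})
    found = {u["id"]: u for u in resp.json()} if resp.status_code == 200 else {}
    for user_id in batch:
        u = found.get(user_id)
        # Log empty results too, so the sample stays a true random sample.
        db.execute("INSERT INTO accounts VALUES (?, ?, ?, ?, ?)",
                   (user_id,
                    1 if u else 0,
                    u["followers_count"] if u else None,
                    u["friends_count"] if u else None,
                    u["created_at"] if u else None))
db.commit()
db.close()
```

Run on a cron job for a month, a script along these lines would accumulate the kind of random sample Bruner then analyzed in R.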

A reporter who didn’t approach researching the dynamics of Twitter this way, by contrast, would be left to attempt the Herculean task of clicking through and logging attributes for 400,000 accounts.

That’s a heavy lift that would strain the capacity of even the best-staffed media intern departments on the planet to deliver in a summer. Bruner, by contrast, told us something we didn’t know and backed it up with evidence he gathered. If you contrast his approach with commentators who make observations about Twitter without data or much experience, it’s easy to score one for the data journalist.

Reverse engineering how Netflix reverse engineered Hollywood

[Image: the automatic Netflix genre generator]

Alexis Madrigal showed the accuracy of Klein’s prediction right out of the gate when he published a fascinating story on how Netflix reverse engineered Hollywood on January 2.

If you’ve ever browsed through Netflix’s immense catalog, you’ve probably noticed the remarkable number of personalized genres that exist there. Curious sorts might wonder how many genres there are, how Netflix classifies them and how those recommendations that come sliding in are computed.

One approach to that would be to watch a lot of movies and television shows and track how the experience changes, a narrative style familiar to many newspaper column readers. Another would be for a reporter to ask Netflix for an interview about these genres and consult industry experts on “big data.” Whatever choice the journalist made, it would need to advance the story.

As Madrigal observed in his post, assembling a comprehensive list of Netflix microgenres “seemed like a fun story, though one that would require some fresh thinking, as many other people had done versions of it.”

Madrigal’s initial exploration of Netflix’s database of genres, as evidenced by the sequential numbering in the uniform resource locators (URLs) in his Web browser, taught him three things: there were a LOT of them, they were organized in a way he didn’t understand, and manually exploring them wasn’t going to work.

You can probably guess what came next: Madrigal figured out a way to scrape the data he needed.

“I’d been playing with an expensive piece of software called UBot Studio that lets you easily write scripts for automating things on the web,” he wrote. “Mostly, it seems to be deployed by low-level spammers and scammers, but I decided to use it to incrementally go through each of the Netflix genres and copy them to a file. After some troubleshooting and help from [Georgia Tech Professor Ian] Bogost, the bot got up and running and simply copied and pasted from URL after URL, essentially replicating a human doing the work. It took nearly a day of constantly running a little Asus laptop in the corner of our kitchen to grab it all.”
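
UBot Studio is a commercial point-and-click tool, but the same incremental walk through numbered genre pages can be sketched in a few lines of Python. The URL pattern and the title parsing below are illustrative assumptions, not Netflix’s actual markup or Madrigal’s script.

```python
# Sketch of the incremental-URL scraping approach Madrigal describes:
# step through sequentially numbered genre pages and record each title.
# The URL pattern and the <title> parsing are illustrative assumptions,
# not Netflix's real markup.
import csv
import time

import requests
from bs4 import BeautifulSoup

GENRE_URL = "http://movies.netflix.com/WiGenre?agid={}"  # hypothetical pattern

with open("genres.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["genre_id", "genre_name"])
    for genre_id in range(1, 100_000):           # walk the ID space incrementally
        resp = requests.get(GENRE_URL.format(genre_id))
        if resp.status_code != 200:
            continue                             # gaps in the numbering are expected
        soup = BeautifulSoup(resp.text, "html.parser")
        if soup.title and soup.title.string:
            writer.writerow([genre_id, soup.title.string.strip()])
        time.sleep(1)                            # be polite; this is why it took a day
```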

What he found was staggering: 76,897 genres.  Then, Madrigal did two other things that were really interesting.

First, he and Bogost built the automatic genre generator that now sits atop his article in The Atlantic, giving users something to play with when they visited. That sort of interactive would not have been possible in print, nor without collecting and organizing all of that data.

Second, he contacted Netflix public relations about what he and Bogost had found, and the company offered him an interview with Todd Yellin, the vice president of product at Netflix who had created the system. The subsequent interview Madrigal conducted provided him, and us, his readers, with much more insight into what’s going on behind the scenes. For instance, Yellin explained to him that “the underlying tagging data isn’t just used to create genres, but also to increase the level of personalization in all the movies a user is shown. So, if Netflix knows you love Action Adventure movies with high romantic ratings (on their 1-5 scale), it might show you that kind of movie, without ever saying, ‘Romantic Action Adventure Movies.’”

The interview also enabled Madrigal to make a more existential observation: “The vexing, remarkable conclusion is that when companies combine human intelligence and machine intelligence, some things happen that we cannot understand.”

Without the data he collected and created, it’s hard to see how Madrigal or anyone else could have published this feature.

That is, of course, exactly what Scott Klein highlighted in his prediction: “Scraping websites, cleaning data, and querying Excel-breaking data sets are enormously useful ways to get great stories,” he wrote. “If you don’t know how to write software to help you acquire and analyze data, there will always be a limit to the size of stories you can get by yourself.”

Digging into dialect

The most visited New York Times story of 2013 was not an article: it was a news application. Specifically, it was an interactive feature called “How Y’all, Youse, and You Guys Talk,” by Josh Katz and Wilson Andrews.

[Image: a question from the New York Times dialect quiz]

While it wasn’t a scoop, it does tell us something important about how media organizations can use the Web to go beyond print. As Robinson Meyer pointed out at The Atlantic, the application didn’t go live until December 21, which means it generated all of those clicks (25 per user) in just eleven days.

The popularity of the news app becomes even more interesting if you consider that it was created by an intern: Katz hadn’t joined the New York Times full-time when he worked on it. As Ryan Graff reported for the Knight Lab, in March 2010 Katz was a graduate student in the Department of Statistics at North Carolina State University. (He’s since signed on to work on the forthcoming data-driven journalism venture.)

Katz made several heat maps using data from the Harvard Dialect Survey and posted them online. That attracted the attention of the Times and led to an internship. Once ensconced at the Old Grey Lady, he created a dialect quiz to verify and update the data, then tested 140 questions on some 350,000 people to determine which were the most telling. With that data in hand, Katz worked with graphics editor Wilson Andrews to create the quiz that’s still online today.
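
Katz hasn’t published the scoring behind the quiz, so the sketch below only illustrates one plausible way to rank candidate questions by how “telling” they are: measure how strongly answers vary by region, for example with a chi-squared statistic on each question’s answer-by-region table. The column names, input file and statistic are assumptions for the sake of the example, not his method.

```python
# One illustrative way to rank dialect-survey questions by how strongly their
# answers vary with region -- a sketch, not Katz's actual method.
# Assumes a CSV of responses with columns: respondent_id, region, question, answer.
import pandas as pd
from scipy.stats import chi2_contingency

responses = pd.read_csv("dialect_responses.csv")  # hypothetical input file

scores = {}
for question, group in responses.groupby("question"):
    # Contingency table: answer choices (rows) by region (columns).
    table = pd.crosstab(group["answer"], group["region"])
    if table.shape[0] < 2 or table.shape[1] < 2:
        continue  # a question everyone answers the same way tells us nothing
    chi2, p, dof, _ = chi2_contingency(table)
    scores[question] = chi2 / dof  # normalize so questions with more options compare fairly

ranked = pd.Series(scores).sort_values(ascending=False)
print(ranked.head(25))  # the most regionally "telling" questions
```

Questions whose answers barely change from region to region score low and can be cut; the ones that score highest are the candidates worth keeping in a shorter quiz.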

What this tells us about data-driven journalism is not just that there is demand in newsrooms for skills in D3, R and statistics: it’s that there’s a huge demand for the news applications that people with those skills can create. Such news apps can find or even create massive audiences online, an outcome that should be of considerable interest to the publishers who run the media companies that deploy them.

On to the year ahead

All of these stories should cast doubt on the contention that data-driven journalism is a “bogus meme,” fated to sit beside “hyperlocal” or blogging as would-be saviors of journalism. There are several reasons not to fall into this way of thinking.

First, journalism will survive the death or diminishment of its institutions, regardless of the flavor of the moment. (This subject has been studied and analyzed at great depth in the Tow Center’s report on post-industrial journalism.)

Second, data-driven journalism may be a relatively new term, but it’s the evolutionary descendant of a much older practice: computer-assisted reporting. Moreover, journalists have been using statistics and tables for many decades. While interactive features, news applications, collaborative coding on open source platforms, open data and data analyses that leverage machine learning and cloud computing are all new additions to this landscape, many practitioners of computer-assisted reporting have been doing the underlying work for decades. Expect more examples from Nate Silver’s rebooted FiveThirtyEight in the months ahead.

Finally, the examples I’ve given show how compelling, fascinating stories can be created by one or two journalists coding scripts and building databases, including automating the work of data collection or cleaning.

That last point is crucial: by automating tasks, one data journalist can increase the capacity of those they work with in a newsroom and create databases that may be used for future reporting. That’s one reason (among many) that ProPublica can win Pulitzer Prizes without employing hundreds of staff.

Announcements, How It's Made, Tips & Tutorials

News App and Data Guides from ProPublica


Coding the news now has a manifesto. ProPublica’s developers launched a series of news application guides, including a coding manifesto, this morning. The guides, which all live on GitHub, are intended to give insight into the programming ethos of the non-profit investigative journalism outfit. As the manifesto says, “We’re not making any general statements about anything beyond the environment we know: Doing journalism on deadline with code.”

Scott Klein, Jeff Larson and Jennifer LaFleur wrote the guides, which include a news app style guide, a data checklist and a design guide. These resources add to the growing body of material shared by the community of news application developers, many of whom actively blog about and share their working processes.

Read all the guides here.

How It's Made

How it’s Made: Google Hangouts


There is a basic how-to guide to starting a Google Hangout at the bottom of the post.

The New York Times recently held Google Hangouts with voters, two during each political convention. Times columnists Frank Bruni, Gail Collins and Charles Blow moderated four conversations: on the economy, on bipartisanship, on whether there is a Republican war on women, and with voters who had switched from supporting Obama in 2008 to supporting Romney in 2012. Voters talked about their lives and needs, about their struggles, financial and otherwise, and about what they want to happen during the next presidential term.