How It's Made, Research

What’s the Upshot? A promising data-driven approach to the news.


This morning, The New York Times officially launched its long-awaited data-driven news site, “The Upshot.”

David Leonhardt, the site’s managing editor, introduced The Upshot in a long note posted to Facebook and then to nytimes.com this morning, explaining how the site aspires to help readers navigate the news.

Leonhardt shared two reasons for The Upshot’s launch. First, help people to understand the news better:

“We believe we can help readers get to that level of understanding by writing in a direct, plain-spoken way, the same voice we might use when writing an email to a friend. We’ll be conversational without being dumbed down. We will build on the excellent journalism The New York Times is already producing, by helping readers make connections among different stories and understand how those stories fit together. We will not hesitate to make analytical judgments about why something has happened and what is likely to happen in the future. We’ll tell you how we came to those judgments — and invite you to come to your own conclusions.”

Second, make the most of the opportunity afforded by the growth of the Internet and the explosion of data creation.

Data-based reporting used to be mostly a tool for investigative journalists who could spend months sorting through reams of statistics to emerge with an exclusive story. But the world now produces so much data, and personal computers can analyze it so quickly, that data-based reporting deserves to be a big part of the daily news cycle.

One of our highest priorities will be unearthing data sets — and analyzing existing ones — in ways that illuminate and explain the news. Our first day of material, both political and economic, should give you a sense of what we hope to do with data. As with our written articles, we aspire to present our data in the clearest, most engaging way possible. A graphic can often accomplish that goal better than prose. Luckily, we work alongside The Times’s graphics department, some of the most talented data-visualization specialists in the country. It’s no accident that the same people who created the interactive dialect quiz, the deficit puzzle and the rent-vs-buy calculator will be working on The Upshot.

The third goal, left unsaid by Leonhardt, is the strategic interest The New York Times has in creating a media entity that generates public interest and draws the massive audience that Nate Silver’s (now departed) FiveThirtyEight blog did, as the 2014 midterm elections draw near. In the fall of 2012, 20% of the visitors to the sixth-most-trafficked website in the world were checking out 538. Many were coming specifically for 538.

First impressions

My aesthetic impressions of The Upshot have been overwhelmingly positive: the site looks great on a smartphone, tablet or laptop, and loads quickly. I also like how each columnist’s Twitter handle is located below their headshot and the smooth integration of social sharing tools.

My impressions of the site’s substance were similarly positive: the site led off with a strong story on the American middle class and income inequality based upon public data, an analysis of affirmative action polling, a data-rich overview of how the environment has changed in the 44 years since the first Earth Day, a look at what good marathons and bad investments have in common, a short item on how some startups are approaching regulated industries, political field notes from Washington and a simple data visualization of Pew Internet data that correlates an appreciation for Internet freedom with Internet use. Whew! The feature that many political junkies will appreciate most, however, is a clever, engaging interactive that forecasts the outcome of the 2014 election in the U.S. Senate.

A commitment to showing their work

What really made me sit up and take notice of The Upshot, however, were the editorial decisions to share how they found the income data at LIS, link to the dataset, and share both the methodology behind the forecasting model and the code for it on Github. That is precisely the model for open data journalism that embodies the best of the craft, as it is practiced in 2014, and sets a high standard right out of the gate for future interactives at The Upshot and for other sites that might seek to compete with its predictions. They even include those estimates. Notably, FiveThirtyEight is now practicing a more open form of data journalism as well, “showing their work.”


Early reviews

I’m not alone in positive first impressions of the content, presentation and strategy of the Times’ new site: over at the Guardian Datablog, James Ball published an interesting analysis of data journalism, as seen through the initial foray of The Upshot, FiveThirtyEight and Vox, the “explanatory journalism” site Ezra Klein, Melissa Bell and Matt Yglesias, among others, launched this spring.

Ball’s whole post is worth reading, particularly with respect to his points about audience, diversity and personalization, but the part I think is particularly important with respect to data journalism is the one I’ve made above, regarding being open about the difficult, complicated process of reporting on data as a source:

Doing original research on data is hard: it’s the core of scientific analysis, and that’s why academics have to go through peer-review to get their figures, methods and approaches double-checked. Journalism is meant to be about transparency, and so should hold itself to this standard – at the very least.

This standard is especially true for data-driven journalism, but, sadly, it’s not always lived up to: Nate Silver (for understandable reasons) won’t release how his model works, while FiveThirtyEight hasn’t released the figures or work behind some of their most high-profile articles.

That’s a shame, and a missed opportunity: sharing this stuff is good, accountable journalism, and gives the world a chance to find more stories or angles that a writer might have missed.

Counter-intuitively, old media is doing better at this than the startups: The Upshot has released the code driving its forecasting model, as well as the data on its launch inequality article. And the Guardian has at least tried to release the raw data behind its data-driven journalism since our Datablog launched five years ago.

Ball may have contributed to some category confusion by including Vox in his analysis of this new crop of data journalism startups, and he’s not alone: Mathew Ingram also groups Vox together with The Upshot and 538 in his post on “explanatory journalism.”

Both could certainly be forgiven, given that Leonhardt’s introduction expressed a goal to help readers understand and given Nate Silver’s explicit focus upon explanation as a component of his approach to data-driven journalism. The waters around what to call the product of these startups are considerably muddied at this point.

Hopefully, over time, those semantic waters clarify and reveal accurate, truthful and trustworthy journalism. Whatever we call them, there’s plenty of room for all of these new entrants to thrive, if they inform the public and build audiences. 

“I think all of these sites are going to succeed,” said Leonhardt, in an interview with Capital New York. “There is much more demand for this kind of journalism right now than there is supply.”

In an interview with Digiday, Leonhardt further emphasized this view:

“I don’t think this is about a competition between these sites to see which will emerge victorious,” he said. “There is more than enough room for any site that is providing journalism of this kind to succeed. Given there’s a hunger for conversational journalism and database journalism, as long as you’re giving people reporting that’s good, you’re going to succeed.”

How It's Made, Research, Tips & Tutorials

Oakland Police Beat applies data-driven investigative journalism in California


One of the explicit connections I’ve made over the years lies between data-driven investigative journalism and government or corporate accountability. In debugging the backlash to data journalism, I highlighted the work of The Los Angeles Times Data Desk, which has analyzed government performance data for accountability, among other notable projects. I could also have pointed to the Chicago Sun-Times, which applied data-driven investigative methods to determine that the City of Chicago’s 911 dispatch times vary widely depending on where you live, publishing an interactive map online for context, or to a Pulitzer Prize-winning story on speeding cops in Florida.


This week, there’s a new experiment in applying data journalism to local government accountability in Oakland, California, where the Oakland Police Beat has gone online. The nonprofit website, which is part of Oakland Local and The Center for Media Change and funded by The Ethics and Excellence in Journalism Foundation and The Fund for Investigative Journalism, was co-founded by Susan Mernit and Abraham Hyatt, the former managing editor of ReadWrite. (Disclosure: Hyatt edited my posts there.)

Oakland Police Beat is squarely aimed at shining sunlight on the practices of Oakland’s law enforcement officers. Their first story out of the gate pulled no punches, finding that Oakland’s most decorated officers were responsible for a high number of brutality lawsuits and shootings.

The site also demonstrated two important practices that deserve to become standard in data journalism: explaining the methodology behind their analysis, including source notes, and (eventually) publishing the data behind the investigation. 

To learn more about why Oakland Police Beat did that, how they’ve approached their work and what the long game is, I contacted Hyatt. Our interview follows, lightly edited and hyperlinked for context. Any [bracketed] comments are my own.

So, what exactly did you launch? What’s the goal?

Hyatt: We launched a news site and a database with 25 years worth of data about individual Oakland Police Department (OPD) officers who have been involved in shootings and misconduct lawsuits.

Oakland journalists usually focus (and rightfully so) on the city’s violent crime rate and the latest problems with the OPD. We started this project by asking if we could create a comprehensive picture of the officers with the most violent behavior, which is why the OPD is where it is today. We started requesting records and tracking down information. That eventually became the database. It’s the first time anyone in Oakland has created a resource like this.

What makes this “data-driven journalism?”

Hyatt: We started with the data and let it guide the course of the entire project. The stories we’ve written all came from the data.

Why is sharing the data behind the work important?

Hyatt: Sharing is critical. Sharing, not traffic, is the metric I’m using to gauge our success, although traffic certainly is fun to watch, too. That’s the main reason that we’re allowing people to download all of our data. (The settlement database will be available for download next week.)

How will journalists, activists, and data nerds use it over time? That’s going to be the indicator of how important this work was.

[Like ProPublica, Oakland Police Beat is encouraging reuse. The site says that "You’re welcome to republish our stories and use our data for free. We publish our stories under an Attribution-NonCommercial-ShareAlike 4.0 License."]

Where do you get the data?

Hyatt: All of it came from city and court documents. Some of it came as .CSV files, some as PDFs that we had to scrape.

How much time and effort did it take to ingest, clean, structure and present?

Hyatt: Almost all of the court docs had to be human-read. It was a laborious process of digging to find officer names and what the allegations were. Combining city settlement data records and court docs took close to five months. Then, we discovered that the city’s data had flaws and that took another couple of months to resolve.

Some of the data was surprisingly easy to get. I didn’t expect the City Attorney’s office to be so forthcoming with information. Other stuff was surprisingly difficult. The OPD refused to give us awards data before 2007. They claim that they didn’t keep that data on individual officers before then. I know that’s completely false, but we’re a tiny project. We don’t have the resources to take them to court over it. Our tools were very simple.

Did you pay for it?

Hyatt: We used PACER a ton. The bill was close to $900 by the time we were done. We mainly worked out of spreadsheets. I had a handful of command line tools that I used to clean and process data. I ran a virtual machine so that I could use some Linux-based tools as well. I heart Open Refine. We experimented with using Git for version control on stories we were writing.
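
Hyatt doesn't describe his scripts, so what follows is only a rough sketch of the kind of cleaning and combining step he mentions, written in Python with the standard library. The sample rows, the column names (case_number, officer_name, settlement_amount, allegation) and the helper functions are all hypothetical, invented for illustration; they are not Oakland Police Beat's data or code. The idea is to normalize officer names, join city settlement records to court records on a shared case number, and list the cases that appear in only one source, which is one way flaws in a dataset can surface.

```python
import csv
import io

# Hypothetical sample rows; the real records and column names are assumptions.
SETTLEMENTS_CSV = """case_number,officer_name,settlement_amount
C-01, SMITH  JOHN ,50000
C-02,Doe Jane,12000
C-03,Roe Alex,7500
"""

COURT_DOCS_CSV = """case_number,officer_name,allegation
C-01,Smith John,excessive force
C-02,Doe Jane,wrongful arrest
C-04,Poe Sam,excessive force
"""


def normalize_name(name):
    """Collapse whitespace and case so ' SMITH  JOHN ' matches 'Smith John'."""
    return " ".join(name.split()).title()


def load_rows(text):
    """Parse CSV text into dicts, cleaning the officer name on the way in."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["officer_name"] = normalize_name(row["officer_name"])
        rows.append(row)
    return rows


def combine(settlements, court_docs):
    """Join on case number; return merged rows plus cases seen in only one source."""
    docs_by_case = {row["case_number"]: row for row in court_docs}
    merged, unmatched = [], []
    for settlement in settlements:
        doc = docs_by_case.pop(settlement["case_number"], None)
        if doc is None:
            # Settlement with no matching court record.
            unmatched.append(settlement["case_number"])
        else:
            merged.append({**settlement, "allegation": doc["allegation"]})
    # Whatever is left are court cases with no settlement record.
    unmatched.extend(docs_by_case)
    return merged, sorted(unmatched)


merged, unmatched = combine(load_rows(SETTLEMENTS_CSV), load_rows(COURT_DOCS_CSV))
for row in merged:
    print(row["case_number"], row["officer_name"], row["allegation"])
print("Needs a human check:", unmatched)
```

With these made-up rows, two cases match and two (one from each source) are flagged for a person to check by hand, which is roughly where the laborious reading Hyatt describes would begin.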

A used chemical agent grenade found on the streets in downtown Oakland following Occupy demonstrations in 2011. Photo by Eric K Arnold.

Will you be publishing data and methodology as you go along?

Hyatt: The methodology post covers all of our stories. We’ll continue to publish stories, as well as some data sets that we got along the way that we decided not to put into our main dataset, like several hundred city attorney reports about the settled cases.

What’s the funding or revenue model for the site? Where will this be in one year? Or 5?

Hyatt: Everyone wants grant-funded journalism startups to be sustainable, but, so often, they start strong and then peter out when resources run dry.

Instead of following that model, I knew from the start that this was going to be a phased project. We had some great grants that got us started, but I didn’t know what the funding picture was going to look like once we started running stories. So, I tried to turn that limitation into a strength.

We’re publishing eight weeks worth of stories and data. We’re going to cram as much awesome into those weeks as we can and then, if needed, we can step away and let this project stand on its own.

With that said, we’re already looking for funding for a second phase (which will focus on teens and the OPD). When we get it, we’ll use this current data as a springboard for Phase 2.

Could this approach be extended to other cities?

Hyatt: The OPD and its problems are pretty unique in the USA. This was successful because there was so much stuff to work with in Oakland. I don’t think our mentality for creating and building this project was unique.

How It's Made, Research

Debugging the backlash to data journalism


While the craft and context that underlie “data journalism” are well-known to anyone who knows the history of computer-assisted reporting (CAR), the term itself is a much more recent creation.

This past week, data journalism broke into the hurly burly of mainstream discourse, with the predictable cycle of hype and then backlash, for two reasons:

1) The launch of Nate Silver’s FiveThirtyEight, where he explicitly laid out his vision for data journalism in a manifesto on “what the fox knows.” He groups statistical analysis, data visualization, computer programming and “data-literate reporting” under the rubric of data journalism.

2) A story in USA Today on the booming market for data journalists and the scoops and audiences they create and enable. The “news nerd job” openings at both born-digital and traditional media institutions show clear demand across the industry.

There are several points that I think are worth making in light of these two stories.

First, if you’re new to this discussion, Mark Coddington has curated the best reviews, comments and critiques in his excellent weekly digest of the news at the Nieman Journalism Lab. The summary ranges from the quality of 538’s stories to criticism of Nate Silver’s detachment, or even of “data journalism” itself, and on to questions about the notion of journalists venturing into empirical projects at all. If you want more context, start there.

Second, it’s great to see the topic of data journalism getting its moment in the sun, even if some of the reactions to Silver’s effort may mistake the man or his vision for the whole practice. Part of the backlash has something to do with high expectations for Silver’s effort. FiveThirtyEight is a new, experimental media venture in which a smart guy has been empowered to try to build something that can find signal in the noise (so to speak) for readers. I’m more than willing to give the site and its founder more time to find its feet.

Third, while FiveThirtyEight is new, as are various other startups or ventures within media companies, data journalism and its practice are not new, nor are critiques of its practices or of programming in journalism generally. There are powerful new digital tools and platforms. If we broaden the debate to include screeds asserting that journalists don’t have to know how to code, it’s much easier to find a backlash, along with apt responses about the importance of courses in journalism school or digital literacy, grounded in the importance of looking ahead to the future of digital media, not its ink-stained past.

Fourth, a critical backlash against computers, coding and databases in the media isn’t new. As readers of this blog certainly know, data journalism’s historic antecedent, computer-assisted reporting, has long since been recognized as an important journalistic discipline, as my colleague Susan McGregor highlighted last year in Columbia Journalism Review.

Critics have been assessing the credibility of CAR for years. If you take a longer view, database-driven journalism has been with us since journalists first started using mainframes, arriving in most newsrooms in a broad sense over two decades ago.

The idea of “computer-assisted reporting” now feels dated, though, inherited from a time when computers were still a novelty in newsrooms. There’s probably not a single reporter or editor working in a newsroom in the United States or Europe today who isn’t using a computer in the course of journalism.

Many members of the media may use several of them over the course of the day, from the powerful handheld computers we call smartphones to laptops and desktops, crunching away at analysis or transformations, or servers and cloud storage, for processing big data at Internet scale.

After investigating the subject for many months, I think it’s fair to say that powerful new tools and increased sophistication differentiate the CAR of decades ago from the way data journalism is being practiced today.

While I’ve loosely defined data journalism as “gathering, cleaning, organizing, analyzing, visualizing and publishing data to support the creation of acts of journalism,” a more succinct definition might be the “application of data science to journalism.”

Other observers might suggest that data journalism involves applying the scientific method or social science and statistical analysis to journalism. Philip Meyer called the latter “precision journalism” in the 1970s.

2014 was the year that I saw the worm really turn on the use of the term “data journalism,” from its adoption by David Kaplan, a pillar of the investigative journalism community, to its use as self-identification by dozens of attendees at the annual conference of the National Institute for Computer-Assisted Reporting (NICAR), where nearly a thousand journalists from 20 countries gathered in Baltimore to teach, learn and connect. Its younger attendees use titles like “data editor,” “data reporter” or “database reporter.”

The NICAR conference has grown by leaps and bounds since its first iteration, two decades ago, tripling in size in just the past four years. That rapid expansion is happening for good reason: the strong, clear market demand for data journalists at both born-digital and traditional media outlets that I mentioned earlier.

The size of NICAR 2014 may have given some long-time observers pause, in terms of the effect upon the vibrant community that has grown around it for years or the focus on tools.

“I’m a little worried that NICAR has gotten too big, like SXSW, and that it will lose its soul,” said Matt Waite, a professor of practice at the College of Journalism and Communications at the University of Nebraska, in an interview. “I don’t think it’s likely.”

Fifth, there is something important happening around the emergence of data journalism. I thought that the packed hallways and sessions at NICAR accurately reflected what’s happening in the industry.

“Five years ago, this kind of thing was still seen in a lot of places at best as a curiosity, and at worst as something threatening or frivolous,” said Chase Davis, assistant editor for interactive news at the New York Times, in an interview.

“Some newsrooms got it, but most data journalists I knew still had to beg, borrow and steal for simple things like access to servers. Solid programming practices were unheard of. Version control? What’s that? If newsroom developers today saw Matt Waite’s code when he first launched PolitiFact, their faces would melt like Raiders of the Lost Ark.

“Now, our team at the Times runs dozens of servers. Being able to code is table stakes. Reporters are talking about machine frickin’ learning, and newsroom devs are inventing pieces of software that power huge chunks of the web.”

What’s happening today does have some genuinely interesting novelty to it, from the use of Amazon’s cloud to the maturation of various open source tools that have been funded by the Knight Foundation, like the Overview Project, Document Cloud, the PANDA Project, or free or open source tools like Google Spreadsheets, Fusion Tables, and Open Refine.

These are still relatively new and powerful tools, which will both justify excitement about their applications and prompt understandable skepticism about what difference they will make if a majority of practicing journalists aren’t quite ready to use them yet.

One broader challenge that the adoption of “data journalism” has created in mainstream discourse is that it may then be divorced from the long history that has come before, as Los Angeles Times data editor Ben Welsh reminded this year’s NICAR conference in a brilliant lightning talk.

Whatever we call it, if you look around the globe, the growing importance of data journalism is now clear, given the explosion in data creation. Data and journalism have become deeply intertwined, with increased prominence.

To make sense of the data deluge, journalists today need to be more numerate, technically literate and logical. They need to be able to add context, fact-check sources, and weave in narrative, interrogating data just as a reporter would skeptically interview human sources for hidden influences and biases.

If you read Anthony DeBarros’ post on CAR and data journalism in 2010, you’d be connected to the past, but it’s fair to guess that most people who read Nate Silver’s magnum opus on FiveThirtyEight’s approach to data journalism had not. In 3500 words or so, Silver didn’t link to DeBarros, Philip Meyer, or a single organization that’s been practicing, researching or expanding data journalism in the past decade, perhaps the most fertile time for the practice in history.

Journalists have been gathering data and analyzing it for many decades, integrating it into their stories and broadcasts in tables, charts and graphics, like a box score that compares the on-base percentage for baseball players at a given position over time. Data is a critical component to substantiating various aspects of a story, as it’s woven into the way that the story was investigated and reported.

There have been reporters going to libraries, agencies, city halls and courts to find public records about nursing homes, taxes, and campaign finance spending for decades. The difference today is that in addition to digging through dusty file cabinets in court basements, they might be scraping a website, or pulling data from an API that New York Times news developer Derek Willis made, because he’s the sort of person who doesn’t want to have to repeat a process every time and will make data available to all, where possible.

Number-crunching enables Pulitzer Prize-winning stories like the one on speeding cops in Florida that Welsh referenced in his NICAR talk, or The Los Angeles Times’ analysis of ambulance response times. That investigation showed the public and the state something important: the quality of the data used to analyze performance was poor because the fire stations weren’t logging it well.

The current criticism of data journalism is a tiny subset of a broader backlash against the hype around “big data,” a term which has grown in use in recent years, adopted all the way up to President Obama in the White House. Professional pundits and critics will always jump on opportunities to puncture hype. (They have families and data plans to support too, after all.)

I may even have inadvertently participated in creating hype around “data journalism” myself over the years, although I maintain that my interest and coverage have always been grounded in my belief that its importance has grown because of bigger macro trends in society. The number of sensors and mobile devices that are going to come online in the next couple of years is going to exponentially expand the amount of data available to interrogate. As predictive policing or “personalized redlining” become real, intrusive forces in the lives of Americans, data journalism will become a crucial democratic bulwark against the increased power of algorithms in society.

That puts a huge premium upon the media having the capacity to do this kind of work, and upon editors hiring people who can do it. They should: data journalism is creating both scoops and audiences. It’s also a fine reason to be focused on highlighting that demand and to celebrate the role that NICAR and data journalism MOOCs have in training an expanding tribe, along with the willingness of the people who have gone before to help others who want to learn.

I expect to see more mainstream pushback regarding data journalism from members of the media who are highly proficient at interviewing, writing and editing, but perhaps less so with other skills that are now part of the reporter’s modern toolkit, like video, social media, Web development or mobile reporting. Professional pundits who don’t ground their assertions in history or science may not fare quite as well, in this world. Researchers who blog, by contrast, will. As more sources for expert, data-driven analysis of law, science, medicine or technology go direct online, opinion journalists without deep subject matter expertise are going to have to recalibrate.

It’s possible that there could also be a (much smaller) backlash from long-time practitioners that observe too much of a focus on the tools at NICAR.

“I’m concerned that it’s become too focused on data, and not enough on journalism,” said Waite. “There used to be much more on stories, with a focus on beats. People would talk about how they reported out stories, not technology. The number of panels about algorithm design are growing, and the number of story panels are shrinking. They’re not as well attended. That’s a reflection of the wishes of the attendees, but it troubles me.”

There may also be people who push back against the meaning of “data journalist” being diluted, though I doubt we’ll see much of it. People at the top of the profession have serious technical chops, which enable them to do much more than download a .csv file and make it into an infographic. These folks are proficient in Python, R and other programming languages, able to pursue scraping, cleaning and interrogation of huge data sets with complicated statistical analyses. At the edges of that gradient, there is computational journalism, although that is a specialty that doesn’t seem to exist outside of the academy.

Every one of the data journalists I’ve met over the years, however, cared a lot more about good code, clean data and beautiful design than the semantics of what to call them, or defending their professional turf.

Of the 997 NICAR attendees, how many were students, investigative reporters and editors who had shown up for the first time to learn these skills? If you told me a majority, I wouldn’t be surprised.

My sense was that in 2014, the unprecedented number of people who came had internalized the message that data journalism was important and they need to know how to do some of these things, or at least know what they are. They want to know what forking code on Github means, or at least what Github is and how people use it.

I don’t mean to knock the digital literacy of the NICAR attendees, as my sense was that it is higher than any other gathering of journalists in the world, but it’s easy for people to forget that there’s a significant portion of the public for whom these concepts are novel.

I think that’s true of the new media industry too, in which digital literacy and numeracy are perhaps not what they could be. There’s now more pressure on people in the industry to learn more, and for those who want to enter it to have more basic data skills. That’s driven some changes in the NICAR program.

“The temptation is that NICAR will become all about code-sharing,” said Waite. “That would lose the value-add, which is how the code relates to journalism. What’s different, versus programming or Web development?”

This reflects a common dividing line I’ve seen between people in the business world: the “suits” versus hoodies, jeans versus khakis, or MBAs versus developers. Today, the world of the “news hacker” is being democratized, which is a good thing, so there’s always going to be a little bit of discomfort around something that stretches from being a smaller tribe that self-identifies into something bigger.

I expect that the backlash within the NICAR community to its expanded ranks and role in the industry will be minimal, leaving people room to work, collaborate, learn and teach. We’d be better off focused on the journalism itself, from storytelling to rigorous fact checking, and a bit less focused upon the tools, however new and shiny some may be.

“I’m not overly pessimistic about NICAR — quite the opposite,” said Waite, “but this focus on the data part of data journalism and less on the journalism part of data journalism is a nagging worry in the back of my head.”

That’s not to say that the technology isn’t worth considering or covering, as I have for years. We have huge amounts of data going online today, more than we ever had before, and media have access to much more powerful personal machines and cloud computing to process it.

Even with the new tech, they’re still doing something old: practicing journalism! The approach may start to look a bit more scientific, over time. An editor might float an assertion or hypothesis about something new in the world, and then assign an investigative journalist to go find out whether it’s true or not. To do that, you need to go find data, evidence and knowledge about it. To prove to skeptical readers that the conclusion is sound, the data journalist may need to show his or her work, from the data sources to the process used to transform and present them.

It now feels cliched to say it in 2014, but in this context transparency may be the new objectivity. The latter concept is not one that has much traction in the scientific community, where observer effects and experimenter bias are well-known phenomena. Studies and results that can’t be reproduced are regarded with skepticism for a reason.

Such thinking about the scientific method and journalism isn’t new, nor is its practice by journalists around the country who have pioneered the craft of data journalism with much less fanfare than FiveThirtyEight.

“As we all know, there’s a lot of data out there,” said Ben Welsh, editor of the Los Angeles Times Data Desk, “and, as anyone who works with it knows, most of it is crap. The projects I’m most proud of have taken large, ugly datasets and refined them into something worth knowing: a nut graf in an investigative story or a data-driven app that gives the reader some new insight into the world around them.”

The graphic atop this post comes from that Data Desk. While you can’t see the work that created the image, it’s online if you want to look for it: The Los Angeles Times released both the code and data behind the open source maps of California’s emergency medical agencies it published in the series.

Moreover, it wasn’t the first time. As Welsh wrote, the Data Desk has “previously written about the technical methods used to conduct [the] investigation, released the base layer created for an interactive map of response times and contributed the location of LAFD’s 106 fire station to the Open Street Map.”

This is what an open source newsroom that practices open data journalism looks like. It’s not just applying statistics and social science to polls and publishing data visualizations. If FiveThirtyEight, Vox, The New York Times Uptake or other outlets want to publish data journalism and build out the newsroom stack, that’s the high bar that’s been set. (Update: I was heartened to learn that FiveThirtyEight has a Github account.) In sharing not only its code but its data, the Los Angeles Times also set a notable example for the practice of open journalism in the 21st century.

I don’t know about you, but I think that’s a much more compelling vision for what data journalism is and how it has been, is being and could be applied in the 21st century than the fox’s tale.

Postscript: Good news: 538 is both listening and acting.

Between the Spreadsheets, How It's Made

Of scripts, scraping and quizzes: how data journalism creates scoops and audiences


As last year drew to a close, Scott Klein, a senior editor of news applications at ProPublica, made a slam-dunk prediction: “in 2014, you will be scooped by a reporter who knows how to program.”

While the veracity of his statement had already been shown in numerous examples, including the ones linked in his post, two fascinating stories published in the month since his post demonstrate just how creatively a good idea and a clever script can be applied — and a third highlights why the New York Times is investing in data-driven journalism and journalists in the year ahead.

Tweaking Twitter data

One of those stories went online just two days after Klein’s post was published, weeks before the new year began. Jon Bruner, a former colleague and data journalist turned conference co-chair at O’Reilly Media, decided to apply his programming skills to Twitter, randomly sampling about 400,000 accounts over time. The evidence he gathered showed that amongst the active Twitter accounts he measured, the median account has 61 followers and follows 177 users.



“If you’ve got a thousand followers, you’re at the 96th percentile of active Twitter users,” he noted at Radar. This data also enabled Bruner to make a widely-cited (and tweeted!) conclusion: Twitter is “more a consumption medium than a conversational one–an only-somewhat-democratized successor to broadcast television, in which a handful of people wield enormous influence and everyone else chatters with a few friends on living-room couches.”

How did he do it? Python, R and MySQL.

“Every few minutes, a Python script that I wrote generated a fresh list of 300 random numbers between zero and 1.9 billion and asked Twitter’s API to return basic information for the corresponding accounts,” wrote Bruner. “I logged the results–including empty results when an ID number didn’t correspond to any account–in a MySQL table and let the script run on a cronjob for 32 days. I’ve only included accounts created before September 2013 in my analysis in order to avoid under-sampling accounts that were created during the period of data collection.”
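The sampling loop Bruner describes can be sketched roughly as follows. This is a minimal illustration, not his script: `lookup_accounts` is a hypothetical stand-in for the Twitter API call, the table layout is assumed, and SQLite is swapped in for MySQL so the example runs without a database server. A scheduler such as cron would call `sample_once` every few minutes.

```python
import random
import sqlite3

MAX_ID = 1_900_000_000  # upper bound on account IDs quoted by Bruner
BATCH_SIZE = 300        # random IDs requested per run


def lookup_accounts(ids):
    """Hypothetical stand-in for the Twitter API request.

    Returns {account_id: (followers, following)} for IDs that exist.
    The values here are fabricated placeholders, not real Twitter data.
    """
    return {i: (i % 500, i % 900) for i in ids if i % 3 == 0}


def make_db(path=":memory:"):
    """Create the (assumed) results table; SQLite stands in for MySQL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples "
        "(account_id INTEGER, followers INTEGER, following INTEGER)"
    )
    return conn


def sample_once(conn, rng, lookup=lookup_accounts):
    """Draw one batch of random IDs and log every result, including misses."""
    ids = [rng.randint(0, MAX_ID) for _ in range(BATCH_SIZE)]
    found = lookup(ids)
    rows = []
    for account_id in ids:
        # IDs with no account behind them are logged with NULL counts
        followers, following = found.get(account_id, (None, None))
        rows.append((account_id, followers, following))
    conn.executemany(
        "INSERT INTO samples (account_id, followers, following) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return sum(1 for row in rows if row[1] is not None)
```

Logging the empty results matters: the share of IDs that match no account is what lets a sample like this be scaled up to an estimate for the whole ID range.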

A reporter who didn’t approach researching the dynamics of Twitter this way, by contrast, would be left to try the Herculean task of clicking through and logging attributes for 400,000 accounts.

That’s a heavy lift that would strain the capacity of the most well-staffed media intern departments on the planet to deliver upon in a summer. Bruner, by contrast, told us something we didn’t know and backed it up with evidence he gathered. If you contrast his approach to commentators who make observations about Twitter without data or much experience, it’s easy to score one for the data journalist.

Reverse engineering how Netflix reverse engineered Hollywood


Alexis Madrigal showed the accuracy of Klein’s prediction right out of the gate when he published a fascinating story on how Netflix reverse engineered Hollywood on January 2.

If you’ve ever browsed through Netflix’s immense catalog, you probably have noticed the remarkable number of personalized genres that exist there. Curious sorts might wonder how many genres there are, how Netflix classifies them and how those recommendations that come sliding in are computed.

One approach to that would be to watch a lot of movies and television shows and track how the experience changes, a narrative style familiar to many newspaper column readers. Another would be for a reporter to ask Netflix for an interview about these genres and consult industry experts on “big data.” Whatever choice the journalist made, it would need to advance the story.

As Madrigal observed in his post, assembling a comprehensive list of Netflix microgenres “seemed like a fun story, though one that would require some fresh thinking, as many other people had done versions of it.”

Madrigal’s initial exploration of Netflix’s database of genres, as evidenced by sequential numbering in the uniform resource locators (URLs) in his Web browser, taught him three things: there were a LOT of them, organized in a way he didn’t understand, and manually exploring them wasn’t going to work.

You can probably guess what came next: Madrigal figured out a way to scrape the data he needed.

“I’d been playing with an expensive piece of software called UBot Studio that lets you easily write scripts for automating things on the web,” he wrote. “Mostly, it seems to be deployed by low-level spammers and scammers, but I decided to use it to incrementally go through each of the Netflix genres and copy them to a file. After some troubleshooting and help from [Georgia Tech Professor Ian] Bogost, the bot got up and running and simply copied and pasted from URL after URL, essentially replicating a human doing the work. It took nearly a day of constantly running a little Asus laptop in the corner of our kitchen to grab it all.”
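Madrigal used UBot Studio, not code like this, but the same idea of walking through sequentially numbered pages can be sketched in a few lines of Python. Everything here is illustrative: `fetch_title` is a hypothetical function that would load one genre page and return its heading, and the sample pages below are placeholders rather than real Netflix genres.

```python
import csv
import io


def crawl_genres(fetch_title, start, stop):
    """Visit genre IDs in order and keep the ones that return a name."""
    found = []
    for genre_id in range(start, stop):
        title = fetch_title(genre_id)
        if title:  # skip IDs with no genre behind them
            found.append((genre_id, title.strip()))
    return found


def write_csv(rows, handle):
    """Save the collected (id, title) pairs to a file-like object."""
    writer = csv.writer(handle)
    writer.writerow(["genre_id", "title"])
    writer.writerows(rows)


# Placeholder pages standing in for real responses (assumed, not real data)
FAKE_PAGES = {1: "Example Genre A", 3: " Example Genre B ", 4: ""}


def fake_fetch(genre_id):
    """Hypothetical fetcher: returns None when an ID has no page."""
    return FAKE_PAGES.get(genre_id)


rows = crawl_genres(fake_fetch, 0, 6)
buffer = io.StringIO()
write_csv(rows, buffer)
```

A real fetcher would also need to pause between requests and cope with errors, which is part of why the original run took nearly a day.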

What he found was staggering: 76,897 genres.  Then, Madrigal did two other things that were really interesting.

First, he and Bogost built the automatic genre generator that now sits atop his article in The Atlantic, giving users something to play with when they visited. That sort of interactive would not be possible in print nor without collecting and organizing all of that data.

Second, he contacted Netflix public relations about what they had found, and they offered him an interview with Todd Yellin, the vice president of product at Netflix who had created Netflix’s system. The subsequent interview Madrigal scored and conducted provided him and us, his dear readers, with much more insight into what’s going on behind the scenes. For instance, Yellin explained to him that “the underlying tagging data isn’t just used to create genres, but also to increase the level of personalization in all the movies a user is shown. So, if Netflix knows you love Action Adventure movies with high romantic ratings (on their 1-5 scale), it might show you that kind of movie, without ever saying, ‘Romantic Action Adventure Movies.’”

The interview also enabled Madrigal to make a more existential observation: “The vexing, remarkable conclusion is that when companies combine human intelligence and machine intelligence, some things happen that we cannot understand.”

Without the data he collected and created, it’s hard to see how Madrigal or how anyone else would have been able to publish this feature.

That is, of course, exactly what Scott Klein highlighted in his prediction: “Scraping websites, cleaning data, and querying Excel-breaking data sets are enormously useful ways to get great stories,” he wrote. “If you don’t know how to write software to help you acquire and analyze data, there will always be a limit to the size of stories you can get by yourself.”

Digging into dialect

The most visited New York Times story of 2013 was not an article: it was a news application. Specifically, it was an interactive feature called “How Y’all, Youse, and You Guys Talk,” by Josh Katz and Wilson Andrews.


While it wasn’t a scoop, it does tell us something important about how media organizations can use the Web to go beyond print. As Robinson Meyer pointed out at The Atlantic, the application didn’t go live until December 21, which means it generated all of those clicks (25 per user) in just eleven days.

The popularity of the news app becomes even more interesting if you consider that it was created by an intern: Katz hadn’t joined the New York Times full-time when he worked on it. As Ryan Graff reported for the Knight Lab, in March 2010 Katz was a graduate student in the Department of Statistics at North Carolina State University. (He’s since signed on to work on the forthcoming data-driven journalism venture.)

Katz made several heat maps using data from the Harvard Dialect Survey and posted them online. That attracted the attention of the Times and led him to an internship. Once ensconced at the Old Grey Lady, he created a dialect quiz to verify and update the data. He then tested 140 questions on some 350,000 people to determine the most-telling questions. With that data in hand, Katz worked with graphics editor Wilson Andrews to create the quiz that’s still online today.

What this tells us about data-driven journalism is not just that there is a demand for skills in D3, R and statistics in newsrooms: it’s that there’s a huge demand for the news applications those who possess them can create. Such news apps can find or even create massive audiences online, an outcome that should be of considerable interest to the publishers that run the media companies that deploy them.

On to the year ahead

All of these stories should cast doubt on the contention that data-driven journalism is a “bogus meme,” fated to sit beside “hyperlocal” or blogging as saviors of journalism. There are several reasons not to fall into this way of thinking.

First, journalism will survive the death or diminishment of its institutions, regardless of the flavor of the moment. (This subject has been studied and analyzed at great depth in the Tow Center’s report on post-industrial journalism.)

Second, data-driven journalism may be a relatively new term but it’s the evolutionary descendant of a much older practice: computer-assisted reporting. Moreover, journalists have been using statistics and tables for many decades. While interactive features, news applications, collaborative coding on open source platforms, open data and data analyses that leverage machine-learning and cloud computing are all new additions to this landscape, many practitioners of computer-assisted reporting have been doing it for decades. Expect new ones from Nate Silver’s rebooted FiveThirtyEight project in the months ahead.

Finally, the examples I’ve given show how compelling, fascinating stories can be created by one or two journalists coding scripts and building databases, including automating the work of data collection or cleaning.

That last point is crucial: by automating tasks, one data journalist can increase the capacity of those they work with in a newsroom and create databases that may be used for future reporting. That’s one reason (among many) that ProPublica can win Pulitzer Prizes without employing hundreds of staff.

Announcements, How It's Made, Tips & Tutorials

News App and Data Guides from ProPublica


Coding the news now has a manifesto. ProPublica’s developers launched a series of news application guides, including a coding manifesto, this morning. The guides, which all live on GitHub, are intended to give insight into the programming ethos of the non-profit investigative journalism outfit. As the manifesto says, “We’re not making any general statements about anything beyond the environment we know: Doing journalism on deadline with code.”

Scott Klein, Jeff Larson and Jennifer LaFleur wrote the guides, which include a news app style guide, a data check-list and a design guide. These resources add to the ever-growing community of news application developers, many of whom are actively blogging about and sharing their working processes.

Read all the guides here.

How It's Made

How it’s Made: Google Hangouts


There is a basic how-to guide to starting a Google Hangout at the bottom of the post.

The New York Times recently held Google Hangouts with voters, two during each political convention. Times columnists Frank Bruni, Gail Collins and Charles Blow moderated four conversations: about the economy, about bipartisanship, about whether there is a Republican war on women, and with voters who had switched from supporting Obama in 2008 to supporting Romney in 2012. Voters talked about their lives and needs, about their struggles, financial and otherwise, and about what they want to happen during the next presidential term.

How It's Made

How it’s made: Stop-and-frisk stepper graphic


The other week, the New York World published a data reporting project with the Guardian examining the NYPD’s controversial Stop, Question and Frisk policy. Last year NYPD Commissioner Kelly issued an order to curtail low-level marijuana arrests following stop-and-frisks. WNYC had previously reported that the NYPD manufactured such arrests by ordering people to remove marijuana from their pockets and then charging them for the more serious crime of possession in public view.

Our investigation found that marijuana arrests actually rose after Kelly’s order. But finding that story involved diving into the data.

Thanks to a recent lawsuit, the NYPD releases a database each year of every single “stop-and-frisk” that officers make. Unfortunately, the database is so big it can’t easily be opened in Excel and the data also requires some serious “cleaning” to be usable.

To address these issues, we analyzed the data using the open source statistics program R, which can handle data cleaning, interrogation, and visualization in one program. Because R lets you type in commands that apply across multiple files, it removes the need for switching among Excel windows. Through the sqldf extension package, R also supports the SQL-like queries that make more complex database systems so powerful.

Because we were interested in when certain types of stop-and-frisk incidents had taken place, we used R to split the day, month, and year of each incident’s date field into individual columns. This set the data up for the next step in our analysis, which was to count up how many marijuana arrests occurred each month.

Using SQL queries, we were able to group and count the data by month and crime type. We focused our searches on marijuana possession (which in the NYPD data was spelled “marihuana”).
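The two steps above (splitting each date into parts, then grouping and counting by month) can be sketched as follows. The original analysis was done in R with sqldf; this is a Python stand-in using the built-in sqlite3 module. The table layout, column names and the `MM/DD/YYYY` date format are assumptions for illustration, not the NYPD file's actual schema.

```python
import sqlite3
from datetime import datetime


def load_stops(records):
    """Load (date_text, crime) pairs, splitting each date into columns.

    Assumes dates look like 'MM/DD/YYYY'; the real file may differ.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stops "
        "(year INTEGER, month INTEGER, day INTEGER, crime TEXT)"
    )
    for date_text, crime in records:
        parsed = datetime.strptime(date_text, "%m/%d/%Y")
        conn.execute(
            "INSERT INTO stops VALUES (?, ?, ?, ?)",
            (parsed.year, parsed.month, parsed.day, crime),
        )
    return conn


def monthly_counts(conn, keyword="MARIHUANA"):
    """Count incidents per month whose crime text contains the keyword."""
    query = """
        SELECT year, month, COUNT(*)
        FROM stops
        WHERE UPPER(crime) LIKE ?
        GROUP BY year, month
        ORDER BY year, month
    """
    return conn.execute(query, (f"%{keyword}%",)).fetchall()
```

Searching on the database's own spelling ("marihuana") is the detail that matters here: a query for "marijuana" would run without error and silently return nothing.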

We ran a number of queries to see month-to-month trends and also compared across years to see how 2011 compared with data going back to 2008. This gave us valuable context because stops actually dropped in November and December of 2011, but not as much as they did in those same months in prior years. If Kelly’s order impacted officer behavior, we should have seen relatively dramatic decreases during those months, but found only slight declines. This context was vital to our story, and explaining why the 2011 drop was not significant was a high priority for our final visualization.

We also ran queries comparing arrests to stops as well as isolating specific precincts. However, only a few of these queries yielded results that were worthy of inclusion in the final interactive.

Visualizing Part 1:
In order to find the trends mentioned above, though, we first had to visualize our query results, which R can do, too. An extension package for R called ggplot2 will generate high-quality, customizable line graphs that could be used directly for print graphics. However, we wanted ours to be interactive, which required some additional work.

Visualizing Part 2:
SVG (Scalable Vector Graphics) is a type of graphic that is drawn dynamically on a computer screen, which means that it can be highlighted, clicked, rolled over, or animated in ways that .jpg, .gif and .png files can’t. The ggplot2 graphics can be converted to SVG, and then published to the web using a JavaScript library called Raphaël. Although this requires some copying-and-pasting, the clean, dynamic graphics it produces are worth it.

Putting it all together
To better tell the story, we compiled four sets of charts that we incorporated into a so-called “stepper graphic.” Thanks to the newsapps team at ProPublica, there is a great open-source library for building these graphics. Turning my four charts into four different “slides” was as easy as creating a function for each of them and then copying in their Raphaël code. The stepper graphic library took care of numbering and transitions. We built the grid and axes with standard HTML and CSS, and made label fades using simple jQuery fadeIn() and fadeOut() methods.

Finally, once we confirmed we were running the story with the Guardian, we adjusted the styles to make sure it would mesh well with their design. So we made the months lowercase, the font Georgia, and the line fuchsia – perhaps the most important part.

How It's Made

How It’s Made: Tow Center/ScraperWiki DataCamp Winning Entry

The analysis in progress

In early February, the Tow Center hosted a Journalism Data Camp with Knight News Challenge winner ScraperWiki, which provides tools and training to journalists working with difficult data. The goal of the camp was to bring together journalists and computer scientists to make data more accessible, analyze it, and create stories around the theme of “New York Accountability”. A group of journalism school students attended the event to gain experience with data journalism. Marc Georges, one of the students who was part of the winning team, describes how his group’s project was developed.

Attendees at the event started out by forming groups and identifying a data source to mine for stories. Our group consisted of current journalism school students Curtis Skinner, Eddie Small, Isha Sonni, Trinna Leong, Keldy Ortiz, Salim Essaid, Sara Alvi and myself, as well as recent graduate and New York World fellow Michael Keller, and GSAS statistics student Brian Abelson.

Salim pitched the group a project focused on stop-and-frisks in Shia communities in New York City. A recent AP report showed that in 2006, the NYPD had recommended increased surveillance of Shia communities and had identified nine mosques for possible surveillance. Our team wanted to know if this recommendation had resulted in an increase in stop-and-frisks of Middle-Eastern New Yorkers and if anything in the data would tell us whether or not police actually targeted these communities. Here’s what we learned in trying to put together this story:

Data is Dirty
Just because data is available doesn’t mean you’ll be able to use it quickly and easily. Our first main challenge was accessing and cleaning data on stop-and-frisks in New York City. The NYPD makes this data available on their website but there’s a ton of it–400,000 cells of Excel values for every year.

Curtis, Eddie, Isha, Trinna, Keldy and Sarah researched and collected our data but one of the most basic issues we ran into was trying to determine how many stop and frisks affected people of Middle-Eastern descent. Although the NYPD tracks the race of those it stops, Middle-Eastern people are categorized as whites so it was not possible to isolate that ethnic group directly. As a workaround, we considered using census data to find predominately Middle-Eastern neighborhoods, but ran into the same issue. After reviewing the information we did have, we came up with the idea of using proximity to the mosques mentioned in the AP report as a marker for ethnicity. We thought it was fair to assume the closer a stop was to a mosque, the more likely the person being stopped was Middle-Eastern. We decided to look at a radius of 900ft, the average length of a New York City block.

Once we were able to isolate our data set, we realized that working with such large amounts of data wasn’t feasible without some type of automation. The coders at the event were really helpful in writing a script that scraped the data for the variables we needed. That let us isolate the key aspects of the stop-and-frisks we wanted to use and move forward in mapping our data.

Mapping is Hard
One of our main goals in the project was comparing the incidences of stop-and-frisks near these 9 mosques with other areas in New York City. To do that, we needed to be able to map our cleaned data. Sounds simple enough, right?

Our initial map of stop-and-frisks for 2006, color-coded by race.


Creating our maps turned out to be one of the most difficult and time-consuming aspects of our weekend. Our main problem was that our location data for our stop-and-frisks was in a format which the NYPD uses called State Plane while the location data for our mosques was in longitude and latitude. Brian Abelson, a graduate student at Columbia pursuing a degree in Quantitative Methods in the Social Sciences, saved the day by converting our data and then mapping it.

Brian used a mapping tool called ArcGIS to solve the conversion problem so we could view the stop-and-frisks and mosques on the same map. Brian, Michael and I then used a program called R to further scrape the data and isolate points by specific variables, like race. Brian then used ArcGIS to isolate all the points within a 900ft radius of a mosque so we could see how the rates of stop-and-frisks changed over time. Mike Dewar, one of the coders at the event, was also very helpful in nailing down our approach to identifying stop-and-frisks near mosques. Mike wrote an algorithm for us to measure the distance between any one point and a mosque. We didn’t end up using Mike’s algorithm but talking the problem over with him and discussing various approaches was a great help in tackling the larger issue of working with such a large data set.
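For readers curious what a point-to-mosque distance check might look like, here is a minimal sketch. It is not Mike's algorithm or the ArcGIS workflow: it uses the standard haversine formula and assumes every location has already been converted to latitude and longitude (the State Plane conversion is not shown). The function names and the sample coordinates are made up for illustration.

```python
import math

EARTH_RADIUS_FT = 20_902_231  # mean Earth radius (about 6,371 km) in feet


def distance_ft(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in feet."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_FT * math.asin(math.sqrt(a))


def stops_near(stops, mosques, radius_ft=900):
    """Keep the stops that fall within radius_ft of any mosque.

    Both arguments are lists of (latitude, longitude) pairs.
    """
    return [
        stop for stop in stops
        if any(distance_ft(stop[0], stop[1], m[0], m[1]) <= radius_ft
               for m in mosques)
    ]


# Made-up coordinates: one stop about a block away, one much farther north
mosques = [(40.7000, -74.0000)]
stops = [(40.7010, -74.0000), (40.7100, -74.0000)]
nearby = stops_near(stops, mosques)
```

Checking every stop against every mosque is slow for a full year of data but simple to verify, which is a reasonable trade for a weekend project with only nine mosques.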

An initial map of stop-and-frisks for 2006.  White dots show the number of stops, red dots the number of arrests.

Analyzing Data leads to More Analyzing Data
Once we were able to map our data, we could see how many stop-and-frisks occurred near these nine mosques in 2006. When we compared the first three months of the year, before the recommendation for surveillance was made, with the next nine months, we did see a small increase. However, to know if this is statistically relevant or markedly different from stop-and-frisks in other areas of New York, we have to do a lot more research and analysis.
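Because the two periods are different lengths (three months versus nine), raw totals can't be compared directly; a per-month average is one simple way to put them on the same footing. The sketch below shows that arithmetic under assumed inputs. The function names are hypothetical, and the counts passed in would come from the mapped data.

```python
def monthly_rate(counts_by_month, months):
    """Average number of stops per month over the given months."""
    months = list(months)
    return sum(counts_by_month.get(m, 0) for m in months) / len(months)


def before_after(counts_by_month, cutoff_month=3):
    """Compare the average monthly rate up to the cutoff with the rest.

    Returns (rate_before, rate_after, percent_change); percent_change is
    None when there were no stops before the cutoff.
    """
    before = monthly_rate(counts_by_month, range(1, cutoff_month + 1))
    after = monthly_rate(counts_by_month, range(cutoff_month + 1, 13))
    change = (after - before) / before * 100 if before else None
    return before, after, change
```

A difference in these averages is only a starting point: it says nothing about seasonal patterns or citywide trends, which is why a control group is still needed.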

Stop-and-frisks for two mosques prior to March 2006

Stop-and-frisks for two mosques after March 2006

Jeremy Baron of Wikimedia New York City and Thomas Levine of ScraperWiki, two coders from the event, worked with us after the event to help automate our workflow. Jeremy wrote a script which aggregated our data and put it into a database while Thomas helped us throughout the process in fixing and checking our SQL and JavaScript code.

Our next step has been designing the right control group against which to test our data, and we received great feedback after the event on the best way to do so. One surprising thing has been how rich our data set is. Further analysis may show that stop-and-frisks near these mosques weren’t unusual, but as we continue working with the data, it has already given us ideas for more stories we can work on.

How It's Made, World

How It’s Made: Overseas Investigations


In the fall of 2010, students from the Stabile Center for Investigative Journalism started what became a yearlong investigation into the multi-billion dollar deals of The China International Fund (CIF), a Hong Kong-based company with investments in African oil, diamonds and minerals.

Published on the cover of the Chinese business magazine Caixin and by iWatch News, the story details how CIF’s network of more than 64 companies struck opaque deals with African leaders such as Robert Mugabe in Zimbabwe and Eduardo dos Santos in Angola.  They followed CIF to Guinea, where the company is connected to the country’s former military regime.  

In addition to analyzing international oil sales, locating public documents in foreign countries, and finding sources overseas, two students went to Guinea to investigate CIF’s investments on the ground.

Below, the four reporters describe their methods.

Understanding the World of Oil Sales - Himanshu Ojha

Oil trading is as complex as it is lucrative, and comes in many forms—only some of which actually involve the sale of oil for cash. We knew that CIF’s sister company, China Sonangol, was purchasing Angolan oil and selling to the Chinese national oil company Sinopec. After some digging, we decided that we wanted to focus on Angolan crude oil exports to China. But first we had to find the data.

The US Energy Information Administration’s website offered a brief, accessible introduction to the energy industry, with separate pages for major oil producers and consumers.

For specifics, we went to Comtrade -  a free website that tracks the import and export data of commodities. It’s a complicated, Internet Explorer-specific interface, so it takes some fiddling. To use it you need to know four things:

  1. The code of the commodity that you’re researching.  In our case this was HS2709 – crude oil from petroleum.
  2. The country reporting the statistic.  For us this was China’s import figures sourced from its customs office.
  3. The “partner” – i.e. the other country. Angola in our case.
  4. The time frame.  Comtrade provides annual data, which we searched starting in 2003 – when the first CIF-related company was incorporated.

Using Comtrade’s shortcut query, we were able to generate figures showing the annual dollar value of Chinese imports of crude oil from Angola.

For monthly figures, we went to TradeMap.  Though it is a pay service, they do offer a trial version. They also often provide their services free to NGOs.

Though we would have liked to look at Angolan exports to China to see if they matched the Chinese import data, Comtrade does not hold export figures from Angola, and oil export figures from Angola’s government were too aggregated for our purposes.

The data that we got from Comtrade gave us some context for China Sonangol’s oil sales to China.  Whenever we found specific information regarding a sale, we were able to estimate what percentage it represented of the overall oil trade between Angola and China.
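The four query inputs listed earlier, and the percentage estimate described here, can be written down compactly. This is only a sketch for keeping the parameters and the arithmetic straight: `TradeQuery` and `share_of_trade` are hypothetical names, nothing here calls Comtrade, and the end year of the range is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TradeQuery:
    """The four things a Comtrade search needs, per the list above."""
    commodity_code: str  # HS code of the commodity
    reporter: str        # country reporting the statistic
    partner: str         # the other country in the trade
    years: tuple         # annual periods to request


def share_of_trade(sale_value_usd, total_imports_usd):
    """Percentage of total imports that a single sale represents."""
    if total_imports_usd <= 0:
        raise ValueError("total imports must be positive")
    return sale_value_usd / total_imports_usd * 100


# 2003 comes from the text; ending the range at 2010 is an assumption
query = TradeQuery(
    commodity_code="2709",
    reporter="China",
    partner="Angola",
    years=tuple(range(2003, 2011)),
)
```

The sale and the import total need to cover the same year and be valued the same way for the percentage to mean anything.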

Finding Public Documents in Foreign Countries - Beth Morrissey

When we started researching CIF, we pulled together all the news articles, NGO reports, and government research we could find.  After combing through these documents we had dozens of names of companies and people connected to CIF, but we knew very little about each company and person.  For the companies, we wanted to know about their directors and how long they had been incorporated.  For the people, we wanted to know about their previous work experience and what role they performed for CIF.

Step 1: Finding Hong Kong Public Records
Because both public and private companies are required to keep their records on file with Hong Kong’s corporate registry, we started by searching that database for the names of every company we knew was connected to CIF.  This gave us the names of directors, the address, and the date of incorporation for each company.

We were also interested in court records related to CIF-connected companies and people, but Hong Kong does not have a public online database of court records.  Instead we found a pay-for-use database called D-Law, which has a large cache of Hong Kong corporate records. We used D-Law to check the court records for the name of every person and company connected to CIF.

Step 2: Cross Referencing
To make sure we weren’t missing anything, we then compared information we found in the court records and the corporate registry.  For example, if a court record mentioned the name of a new person, we would then run his or her name through the corporate registry.  If that person was the director of any Hong Kong companies, we’d also run the company names through the D-Law database.

Finally, we took all of the addresses listed in the corporate registry and court records and ran them through Hong Kong’s land registry to find out who owned the buildings mentioned in the documents.
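The cross-referencing routine described in this step is essentially a search that keeps expanding until no new names turn up. We did it by hand, but the logic can be sketched in code. The two lookup functions are hypothetical placeholders for manual searches of the corporate registry and the court records database; neither corresponds to a real API.

```python
from collections import deque


def cross_reference(seed_names, search_registry, search_court_records):
    """Follow every new name through both sources until none remain.

    Each search function takes a name and returns the related people
    and companies it turns up (hypothetical stand-ins for manual lookups).
    """
    seen = set(seed_names)
    queue = deque(seed_names)
    while queue:
        name = queue.popleft()
        related = list(search_registry(name)) + list(search_court_records(name))
        for other in related:
            if other not in seen:  # only chase names we haven't checked
                seen.add(other)
                queue.append(other)
    return seen


# Placeholder records, not real findings
REGISTRY = {"Company A": ["Person X"], "Person X": ["Company B"]}
COURT = {"Company B": ["Person Y"]}

names = cross_reference(
    ["Company A"],
    lambda name: REGISTRY.get(name, []),
    lambda name: COURT.get(name, []),
)
```

Keeping a record of every name already checked is what stops the process from looping forever when two companies share the same directors.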

Step 3: Filling in the Gaps
Not all of the CIF-related companies were in the Hong Kong corporate registry, however, so we knew that they had to be incorporated in other countries.

We used the Investigative Dashboard to locate the corporate registries of places like Bermuda and the Cayman Islands.   When our other research didn’t indicate the location of a company, we used trial and error, running the names of CIF-related companies and people through the corporate registries of likely countries to see if we could find any documents.

Step 4: Keeping Track of It All
We used Document Cloud to keep track of our documents and share them with the members of our team.  Document Cloud is an online service that allows you to upload PDFs, JPEGs, and other documents and share them with other people who have an account.  It also converts all uploaded documents to text, making them searchable. Though Document Cloud accounts are only available upon request, services like Evernote, Scribd, and Dropbox perform similar functions.

Finding Sources Overseas - Laura Rena Murray

One way of digging up information about a company is by contacting their competitors.  As part of my research, I went looking for the CEO of a company that was competing with CIF.

After calling the company’s main office several times, however, it became clear that no one intended to speak with the press nor would they pass along my messages.  After calling several of his former offices and companies looking for information, I realized that the CEO was better known by his middle name.

Doing a search using his middle name, I was able to find and confirm his family residence in the US, which ultimately led me to his private consulting firm.  When I called, the phone rang through to voicemail, but the message included his London cell phone number.  I used Skype  to call his London number and he immediately answered.

Another way to find sources is to identify shareholders or directors who are particularly active in your target company.  CIF has a director who held high-ranking positions in state-owned companies. I had more luck getting him on the phone, in part because he was harder to track down and had not been contacted yet by other reporters.

Finally, when looking for information about foreign companies or individuals, search in the native language of your sources; the same goes for email communication.  Web searches for the CIF directors and their companies yielded skimpy results in English. By using Chinese characters to search for information, I was able to track down a lot more background information and up-to-date contact details.  In this situation, Google Translate is your friend. Do not rely on it when writing emails, however. Find a native speaker to help, and then use Google Translate to decipher the replies.

How to Plan a Trip Overseas - Patrick Martin-Ménard

When investigating another country, there is only so much one can do over the phone or online.  Visiting a place, even for just a few days, can make a significant difference in the documents you obtain and the information your sources may be willing to give you.

That is not to say, however, that you should simply jump on a plane and go on a “fishing” trip.  Careful advance planning is required to make sure that the time and resources you invest have a good chance of yielding results.

Contact Sources Long in Advance
You want to know what you’re doing and who you’re going to be talking to before you arrive.  Otherwise, you’ll be wasting valuable time on the ground looking for sources.

Find a Fixer
A fixer is someone who helps you with the logistics of the trip and works with you on the ground, helping you make contact with sources and organizing travel arrangements. Fixer fees vary greatly from place to place, in part depending on the level of danger and difficulty involved. You’ll also want to hire a driver, so you don’t have to negotiate the local roads on your own. In your negotiations with your fixer and your driver, make sure that they will be with you at all times during the day for the duration of the trip. Be sure to establish specific working hours and arrangements for extra time. For good fixer and driver recommendations, contact foreign correspondents who have worked in the area.

Stay Safe
This may seem obvious, but it’s important to think about your own safety above all.  Make sure you are aware of the risks associated with the subject you’re investigating well in advance.  Sensitivities differ from place to place, and you don’t want to jeopardize your story or yourself by discussing controversial issues too openly.

How It's Made

How It’s Made: NY World POPS Map


In October, The New York World began a collaboration with WNYC and the Brian Lehrer Show to crowdsource ratings of New York City’s several hundred privately-owned public spaces. To facilitate and catalog this process, The New York World’s Michael Keller developed an interactive map of these spaces, which served as both a guide and a repository for audience contributions. Below, he discusses the process and technologies used to create this piece. Complete coverage of the project can be found here.

1. Where did the idea come from?

The Occupy Wall Street protests in late September were probably what brought the term “privately owned public space” to most New Yorkers’ attention. These spaces, like Zuccotti Park, were created from land deals made between developers and the city: if developers ceded some land for public use such as a plaza, they could build taller skyscrapers or get other incentives potentially worth millions of dollars. For my part, though, the sudden notoriety of Zuccotti Park reminded me of a file I had seen a few months earlier on the NYC Datamine entitled “Privately Owned Public Spaces,” or POPS.

I mentioned this find to Yolanne Almanzar, the New York World’s public space reporter, who was looking into other POPS and had learned that a 2008 Manhattan Community Board 6 survey found many such spaces in midtown were actually closed to the public or in poor condition.

After locating and converting the Datamine Access file to a spreadsheet, I found it contained the addresses of all 391 POPS in the city. From there we had a question to answer: were landlords living up to their end of the deal in providing usable public space? Our editors Alyssa Katz and Amanda Hickman pitched the idea to editors at WNYC that week as a collaborative crowdsourcing project. They liked it and we went from there.

2. Who worked on it?

Working with WNYC’s map wizard John Keefe, I used Google Fusion Tables to map the addresses from the Datamine and then used JavaScript to add features like the progress bar, address finder, and GPS locator.

Once the map was developed, Yolanne and I worked with the WNYC producers Jody Avirgan and Paige Cowett to figure out what specific questions we wanted New York World readers and WNYC listeners to answer. Yolanne also went on the Brian Lehrer Show to discuss the project twice, at the beginning and end of the month-long crowdsourcing phase.

Once we started getting responses, Yolanne and I used Twitter and email to follow up on interesting sites.

3. How long did it take?

The initial map took a few days to conceptualize and code, but the whole project is still ongoing. A crowdsourcing project differs from a traditional story because you spend a great deal of time planning and reviewing information before you have enough of it to know what the story is.

We stopped accepting new reader reports at the beginning of November and are now pulling building records to see exactly what these sites were supposed to offer to the public and what the developers got in return. For me, it’s a great way to work because you’re in touch with readers, and you’re both invested in the project and anxious to see what comes of it.

4. What processes or technologies were used?

We used Google Fusion Tables to map the POPS, but first I had to format and verify the data.

I used Microsoft Excel’s “=CONCATENATE()” function to add “New York, NY” to the addresses so that Fusion Tables could properly geocode them — convert them to latitude and longitude so that they can be mapped.
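For readers who would rather script this step than do it in a spreadsheet, here is a minimal sketch of the same idea in JavaScript. The helper name withCity and the second sample address are made up for illustration; only “1 Liberty Plaza” and the “New York, NY” suffix come from the project.

```javascript
// Hypothetical helper: does roughly what =CONCATENATE(A2, ", New York, NY")
// does in a spreadsheet cell.
function withCity(address, suffix = "New York, NY") {
  const trimmed = address.trim();
  // Skip rows that already end with the suffix so it is not appended twice.
  if (trimmed.toLowerCase().endsWith(suffix.toLowerCase())) {
    return trimmed;
  }
  return `${trimmed}, ${suffix}`;
}

// "200 Example Street" is a made-up address used only for illustration.
const geocodable = ["1 Liberty Plaza", " 200 Example Street "].map((a) => withCity(a));
console.log(geocodable);
```

The check for an existing suffix is a small safeguard so the function can be run on the same column more than once.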

Oddly, Zuccotti Park at 1 Liberty Plaza didn’t geocode correctly and was placed in the middle of the South Tower at the WTC memorial. ProPublica’s Al Shaw pointed out that sometimes Google doesn’t work nicely with addresses that start with a single-digit number. By changing the “1” to a “One” it geocoded properly. To be safe, I manually spelled out all of the single-digit addresses. (This glitch also comes up using normal Google Maps. Type in “1 Liberty Plaza New York, NY” and see where it drops the pin.)

We spot-checked a few other locations and manually geocoded their positions where necessary. Out of 391 entries, about a dozen were incorrectly placed by Fusion Tables. Reader feedback helped us find a few, too.
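I made this change by hand, but the same workaround could be scripted. The sketch below is one possible approach, not the code used for the map; the function name spellOutLeadingDigit and the word list are assumptions introduced for illustration.

```javascript
// Hypothetical lookup table for the ten single digits.
const DIGIT_WORDS = [
  "Zero", "One", "Two", "Three", "Four",
  "Five", "Six", "Seven", "Eight", "Nine",
];

function spellOutLeadingDigit(address) {
  // Match a lone digit at the start that is followed by whitespace,
  // so "10 Main Street" or "1-3 Main Street" are left untouched.
  // Any leading whitespace is dropped as part of the match.
  return address.replace(/^\s*(\d)(?=\s)/, (match, digit) => DIGIT_WORDS[Number(digit)]);
}

console.log(spellOutLeadingDigit("1 Liberty Plaza"));
```

Addresses that begin with two or more digits pass through unchanged, which matches the observation that only the single-digit ones caused trouble.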

One of the big challenges with this dataset was getting the infowindows — the little info bubbles that appear when you click — to appear in a standardized way since the data weren’t consistent: some spaces contained a building name and an address whereas others just showed an address.

To get around this, I made two new columns in my Excel spreadsheet called displayName and displayAddress. displayName would contain the site name if it had one, or the address if it didn’t. displayAddress would have an address if a site had a name, otherwise I left it empty. Because Fusion Tables won’t allow JavaScript in infowindows, I used CSS to style the site name and address as elements of an unordered list with each line styled to


While the default infowindow layout puts break tags at the end of each line, my list layout meant that if displayAddress were blank it wouldn’t leave an awkward empty line in the text.
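The rule for filling the two display columns can be sketched in a few lines of JavaScript. I built the columns in Excel, so this is only an illustration of the logic; the input field names name and address are assumptions about how a row might be represented, while displayName and displayAddress are the column names described above.

```javascript
// Hypothetical row shape: { name: string, address: string }.
function displayFields(site) {
  const name = (site.name || "").trim();
  const address = (site.address || "").trim();
  // A named site shows its name first and its address second;
  // an unnamed site promotes the address and leaves the second line empty.
  return name
    ? { displayName: name, displayAddress: address }
    : { displayName: address, displayAddress: "" };
}

console.log(displayFields({ name: "Zuccotti Park", address: "One Liberty Plaza" }));
console.log(displayFields({ name: "", address: "One Liberty Plaza" }));
```

Because displayAddress is an empty string rather than a missing value, the infowindow template can reference both columns for every row.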

I also collapsed additional details of each site into a new column and used that for the infowindow so it would be consistent for all of the entries.

To create the progress bar for the map, I used the Fusion Tables SQL API to query our Fusion Table and count the number of rows that have been marked as visited.
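Once the count of visited entries comes back, turning it into a bar width is simple arithmetic. The sketch below covers only that last step and leaves out the query itself; the function name progress and the sample count of 98 are made up for illustration, and 391 is the total number of entries mentioned earlier.

```javascript
// Hypothetical helper: converts a visited count into a percentage and a label.
function progress(visitedCount, totalCount) {
  if (totalCount <= 0) {
    return { percent: 0, label: "0 of 0 visited" };
  }
  // Clamp so a bad count can never push the bar below 0% or past 100%.
  const clamped = Math.min(Math.max(visitedCount, 0), totalCount);
  const percent = Math.round((clamped / totalCount) * 100);
  return { percent, label: `${clamped} of ${totalCount} visited` };
}

// 98 is an invented count, not a figure from the project.
console.log(progress(98, 391));
```

In a page, the percent value would typically be assigned to the width of the bar element, for example as a CSS percentage.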

Through the Google Maps API documentation, I found that you can relatively easily add GPS functionality to a site with JavaScript. I added this feature to help people use the map when they were out reporting on these spaces. And similar to the address finder, I see it as a way the story can be personalized for people who want to find spaces near where they live or work.

For the address locator, I borrowed some JavaScript code from John Keefe at WNYC.
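To give a sense of what can be done with a reader’s location once the browser provides it, here is a rough sketch that finds the closest space to a given point using the haversine formula. It is not the code from the map: the function names, the row shape ({ name, lat, lng }) and the sample coordinates are all assumptions made up for illustration.

```javascript
const EARTH_RADIUS_KM = 6371;
const toRadians = (degrees) => (degrees * Math.PI) / 180;

// Great-circle distance between two { lat, lng } points (haversine formula).
function distanceKm(a, b) {
  const dLat = toRadians(b.lat - a.lat);
  const dLng = toRadians(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Returns { space, km } for the closest entry, or null for an empty list.
function nearestSpace(position, spaces) {
  let best = null;
  for (const space of spaces) {
    const km = distanceKm(position, space);
    if (best === null || km < best.km) {
      best = { space, km };
    }
  }
  return best;
}

// Invented sample points, roughly in Manhattan, for illustration only.
const sampleSpaces = [
  { name: "Space A", lat: 40.709, lng: -74.011 },
  { name: "Space B", lat: 40.755, lng: -73.975 },
];
console.log(nearestSpace({ lat: 40.71, lng: -74.01 }, sampleSpaces));

// In a browser, the position would come from the standard Geolocation API:
// navigator.geolocation.getCurrentPosition((pos) => {
//   nearestSpace({ lat: pos.coords.latitude, lng: pos.coords.longitude }, sampleSpaces);
// });
```

The geolocation call is left as a comment so the sketch also runs outside a browser.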

Because mobile was a big concern for the map, I made a second stylesheet that showed only the address finder and the GPS button when it was viewed on a mobile device. Though creating the second stylesheet wasn’t difficult, making sure it was showing up correctly across devices required a bit of a hack.

For most devices, I could detect the mobile device using:

media='only screen and (max-device-width:480px)'

For iPad or horizontal views on those devices I added:

media='only screen and (min-device-width:768px) and (max-device-width:1024px)'

To make sure those dogged infowindows didn’t show up super small on mobile devices, Keefe sent along this piece of code to go in the <head> of my page to scale the viewport appropriately:

<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0" />
<meta name="apple-mobile-web-app-capable" content="yes" />

That did the trick.

5. What was learned from it?

We used Google Forms to get the reports from our readers and listeners, which was a good solution in that WNYC was familiar with it and it’s easily embeddable. But we spent a lot of time curating reader responses so it would have been much nicer to input our responses into a database whose results we could query and sort. The Google Form inputs our results into a spreadsheet that we had to then manually arrange and create lists from, which was a time-consuming process.
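As a rough idea of the kind of querying and sorting we were missing, the sketch below groups exported form rows by address and orders the sites by how many reports each received. The function name groupBySite and the field names address and comment are assumptions for illustration, not the actual columns of our form, and the sample rows are invented.

```javascript
// Hypothetical row shape: { address: string, comment: string }.
function groupBySite(rows) {
  const bySite = new Map();
  for (const row of rows) {
    // Normalize case and whitespace so small typing differences still match.
    const key = row.address.trim().toLowerCase();
    if (!bySite.has(key)) {
      bySite.set(key, { address: row.address.trim(), reports: [] });
    }
    bySite.get(key).reports.push(row);
  }
  // Most-reported sites first.
  return [...bySite.values()].sort((a, b) => b.reports.length - a.reports.length);
}

// Invented sample rows for illustration only.
const sampleRows = [
  { address: "200 Example Street", comment: "Gate locked at noon" },
  { address: "One Liberty Plaza", comment: "Open, plenty of seating" },
  { address: "one liberty plaza ", comment: "Crowded" },
];
console.log(groupBySite(sampleRows));
```

Matching on a normalized address is a simplification; a real setup would more likely key each report to a stable site ID.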

It would also have been nice to have a more robust wrapper for our fusion table so that we could easily pull in comments or media from our reader-submitted database and display them alongside the map. This would have meant designing a flexible layout to handle multiple content types as well as a backend database to categorize this information, but this was far beyond what our timeframe allowed.

A recurring lesson on this and other projects has been that designing user interfaces is much like the scene in Through the Looking-Glass where Alice is running with the Red Queen. Alice runs as fast as she can but she doesn’t move from where she stands. The Red Queen says: “Now, here, you see, it takes all the running you can do to keep in the same place.” When starting with messy data, it takes a lot of thinking just to get an interface that works without giving the reader a headache.