How It’s Made: Tow Center/ScraperWiki DataCamp Winning Entry

The analysis in progress

In early February, the Tow Center hosted a Journalism Data Camp with Knight News Challenge winner ScraperWiki, which provides tools and training to journalists working with difficult data. The goal of the camp was to bring together journalists and computer scientists to make data more accessible, analyze it, and create stories around the theme of “New York Accountability”. A group of journalism school students attended the event to gain experience with data journalism. Marc Georges, one of the students who was part of the winning team, describes how his group’s project was developed.

Attendees at the event started out by forming groups and identifying a data source to mine for stories. Our group consisted of current journalism school students Curtis Skinner, Eddie Small, Isha Sonni, Trinna Leong, Keldy Ortiz, Salim Essaid, Sara Alvi and myself, as well as recent graduate and New York World fellow Michael Keller, and GSAS statistics student Brian Abelson.

Salim pitched the group a project focused on stop-and-frisks in Shia communities in New York City. A recent AP report showed that in 2006, the NYPD had recommended increased surveillance of Shia communities and had identified nine mosques for possible surveillance. Our team wanted to know if this recommendation had resulted in an increase in stop-and-frisks of Middle-Eastern New Yorkers and if anything in the data would tell us whether or not police actually targeted these communities. Here’s what we learned in trying to put together this story:

Data is Dirty
Just because data is available doesn’t mean you’ll be able to use it quickly and easily. Our first major challenge was accessing and cleaning data on stop-and-frisks in New York City. The NYPD makes this data available on its website, but there’s a ton of it: 400,000 cells of Excel values for every year.

Curtis, Eddie, Isha, Trinna, Keldy and Sara researched and collected our data, but one of the most basic issues we ran into was trying to determine how many stop-and-frisks affected people of Middle-Eastern descent. Although the NYPD tracks the race of those it stops, Middle-Eastern people are categorized as white, so it was not possible to isolate that ethnic group directly. As a workaround, we considered using census data to find predominantly Middle-Eastern neighborhoods, but ran into the same issue. After reviewing the information we did have, we came up with the idea of using proximity to the mosques mentioned in the AP report as a marker for ethnicity. We thought it was fair to assume that the closer a stop was to a mosque, the more likely the person being stopped was Middle-Eastern. We decided to look at a radius of 900 feet, the average length of a New York City block.

Once we were able to isolate our data set, we realized that working with such large amounts of data wasn’t feasible without some type of automation. The coders at the event were really helpful in writing a script that scraped the data for the variables we needed. That let us isolate the key aspects of the stop-and-frisks we wanted to use and move forward in mapping our data.
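The extraction step can be sketched in a few lines. This is a minimal illustration, not the script written at the event: it assumes the spreadsheet has been exported to CSV, and the column names below are invented stand-ins for the NYPD’s actual field names.

```python
import csv
import io

# Illustrative sample rows; the real export has hundreds of thousands
# of rows and many more columns (the field names here are made up).
SAMPLE = """datestop,pct,race,xcoord,ycoord,arstmade
2006-01-15,19,W,1004300,249600,N
2006-05-02,70,B,987400,210200,Y
"""

# Keep only the variables the analysis needs, dropping the rest.
KEEP = ["datestop", "race", "xcoord", "ycoord", "arstmade"]

def extract(fileobj):
    """Read stop-and-frisk rows and keep only the KEEP columns."""
    reader = csv.DictReader(fileobj)
    return [{k: row[k] for k in KEEP} for row in reader]

rows = extract(io.StringIO(SAMPLE))
print(len(rows))        # 2
print(rows[0]["race"])  # W
```

The same idea scales to the full yearly files: read once, keep the handful of columns you actually need, and every later step gets faster.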

Mapping is Hard
One of our main goals in the project was comparing the incidence of stop-and-frisks near these nine mosques with that in other areas of New York City. To do that, we needed to be able to map our cleaned data. Sounds simple enough, right?

Our initial map of stop-and-frisks for 2006, color-coded by race.

Creating our maps turned out to be one of the most difficult and time-consuming aspects of our weekend. Our main problem was that the location data for our stop-and-frisks was in a coordinate system the NYPD uses called State Plane, while the location data for our mosques was in longitude and latitude. Brian Abelson, a graduate student at Columbia pursuing a degree in Quantitative Methods in the Social Sciences, saved the day by converting our data and then mapping it.

Brian used a mapping tool called ArcGIS to solve the conversion problem so we could view the stop-and-frisks and mosques on the same map. Brian, Michael and I then used the statistical language R to further filter the data and isolate points by specific variables, like race. Brian then used ArcGIS to isolate all the points within a 900-foot radius of a mosque so we could see how the rates of stop-and-frisks changed over time. Mike Dewar, one of the coders at the event, was also very helpful in nailing down our approach to identifying stop-and-frisks near mosques. Mike wrote an algorithm for us to measure the distance between any one point and a mosque. We didn’t end up using Mike’s algorithm, but talking the problem over with him and discussing various approaches was a great help in tackling the larger issue of working with such a large data set.
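The core of the radius check is simpler than it sounds once both datasets are in the same projection. New York’s State Plane coordinates are measured in feet, so a plain Euclidean distance works. This is a hedged sketch of the general technique, not Mike’s actual algorithm, and the coordinates are made up for illustration:

```python
import math

RADIUS_FT = 900  # roughly the length of one New York City block

def within_radius(stop, mosques, radius=RADIUS_FT):
    """Return True if a stop (x, y), in State Plane feet,
    falls within `radius` feet of any mosque."""
    sx, sy = stop
    return any(math.hypot(sx - mx, sy - my) <= radius
               for mx, my in mosques)

# One illustrative mosque location, in State Plane feet.
mosques = [(1004000.0, 249000.0)]

print(within_radius((1004500.0, 249400.0), mosques))  # True  (~640 ft away)
print(within_radius((1010000.0, 249000.0), mosques))  # False (~6,000 ft away)
```

With only nine mosques, checking every stop against each one is fast enough; for many more reference points you would reach for a spatial index instead.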

An initial map of stop-and-frisks for 2006. White dots show the number of stops, red dots the number of arrests.

Analyzing Data leads to More Analyzing Data
Once we were able to map our data, we could see how many stop-and-frisks occurred near these nine mosques in 2006. When we compared the first three months of the year, before the recommendation for surveillance was made, with the next nine months, we did see a small increase. However, to know if this increase is statistically significant or markedly different from stop-and-frisk patterns in other areas of New York, we have to do a lot more research and analysis.
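The basic before/after comparison reduces to counting stops on either side of the recommendation date and normalizing by the length of each period. A toy sketch of that arithmetic, with invented dates rather than the real 2006 data:

```python
from datetime import date

# Toy stop dates near one mosque; the real analysis used the full 2006 set.
stops = [date(2006, 1, 10), date(2006, 2, 3),
         date(2006, 4, 20), date(2006, 5, 5), date(2006, 6, 1),
         date(2006, 8, 15), date(2006, 9, 9),
         date(2006, 11, 2), date(2006, 12, 30)]

cutoff = date(2006, 4, 1)  # after the March 2006 recommendation
before = sum(1 for d in stops if d < cutoff)
after = len(stops) - before

# Normalize to per-month rates: 3 months before, 9 months after.
rate_before = before / 3
rate_after = after / 9

print(before, after)                       # 2 7
print(round(rate_after / rate_before, 2))  # 1.17, i.e. a small increase
```

A ratio alone can’t establish significance, which is why the story needed a control group: the same calculation for comparable areas without mosques, so that citywide trends in stop-and-frisks don’t get mistaken for targeting.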

Stop-and-frisks for two mosques prior to March 2006

Stop-and-frisks for two mosques after March 2006

Jeremy Baron of Wikimedia New York City and Thomas Levine of ScraperWiki, two coders from the event, worked with us after the event to help automate our workflow. Jeremy wrote a script that aggregated our data and put it into a database, while Thomas helped us throughout the process by fixing and checking our SQL and JavaScript code.
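That database-backed aggregation might look roughly like the following. This is a sketch, not Jeremy’s actual script: the schema and SQLite backend are assumptions made for illustration.

```python
import sqlite3

# An in-memory database stands in for the project's real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stops (month TEXT, near_mosque INTEGER)")
conn.executemany("INSERT INTO stops VALUES (?, ?)", [
    ("2006-01", 1), ("2006-01", 0), ("2006-02", 1),
    ("2006-05", 1), ("2006-05", 1), ("2006-06", 0),
])

# Monthly counts of stops flagged as near a mosque.
rows = conn.execute("""
    SELECT month, SUM(near_mosque) AS near
    FROM stops
    GROUP BY month
    ORDER BY month
""").fetchall()

print(rows)  # [('2006-01', 1), ('2006-02', 1), ('2006-05', 2), ('2006-06', 0)]
```

Once the stops live in a database, each new question becomes a query rather than another pass over 400,000 spreadsheet cells.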

Our next step has been designing the right control group against which to test our data, and we received great feedback after the event on the best way to do so. One surprising thing has been how rich our data set is. Further analysis may show that stop-and-frisks near these mosques weren’t unusual, but as we continue working with the data, it has already given us ideas for more stories we can work on.