CU Community, The Tow Center

What can journalism learn from computer science?

Journalism needs an algorithm. That’s not to say machines should replace reporters, but that reporters should be thinking more like machines: systematically. From computer programs that automate news stories to data-driven narratives and mobile app development, journalism’s relationship with computer science is becoming ever more involved. Integrating technology into journalism, however, doesn’t simply mean installing Excel on newsroom computers, or teaching journalism students basic HTML and CSS. Applying core computing concepts to reporting and storytelling can not only improve journalists’ production efficiency, but also shape their narratives.

CU Community

Columbia University and Reuters to work on Advanced Data Visualization Project

Columbia University and Thomson Reuters announced the launch of the Advanced Data Visualization Project (ADVP) based at Columbia’s Graduate School of Architecture, Planning and Preservation (GSAPP). The initiative, sponsored by Thomson Reuters, will facilitate research into data visualization and its implications for academia and industry in a world increasingly awash with data.

Read the full Reuters press release here.

(Photo: AP /Diane Bondareff)

Past Events

Event: Journalism and Technology Breakfast

The Tow Center hosted its inaugural Journalism and Technology Breakfast on Wednesday 30 May at Soho House. Journalists and tech entrepreneurs gathered at the swanky Chelsea members’ club to discuss the interplay of digital innovation and journalism over artisan granola and baked goods. The event, moderated by Tow Director Emily Bell, is to be the first in a twice-yearly series that aims to plug Columbia Journalism School further into the New York tech community. In his opening remarks the Dean of the Journalism School, Nicholas Lemann, said the event was in keeping with the school’s move towards further engagement with the digital journalism world.

The first speaker, John Borthwick, CEO and founder of betaworks, spoke of the changing landscape of technology and its impact on journalism. Borthwick said that when he founded the new media investment company four years ago, he did so “outside of the noise” around current media-tech startups. Borthwick described betaworks as a company rather than a fund; a position that allows the organization to participate in the development of its investment projects without becoming trapped in the politics of legacy organizations. Speaking in relation to one of the recipient companies of betaworks’ investment, bitly, Borthwick emphasized the importance of data in the newsroom. “The data layer is a shadow because it’s part of how we live; it’s there but usually not observed”, he said.

Since its launch last year, the New York World has focused on producing heavily data-driven stories about government accountability in New York. Editor Alyssa Katz introduced the work of two members of her team that particularly demonstrated the role of data in finding stories. Via video link, Michael Keller presented a four-part interactive, Our Future Selves. Keller was unable to attend the breakfast because he was in Paris receiving the second place prize for the project at the Global Editors Network International Data Journalism Awards. The piece, originally produced for Columbia Journalism School’s News21 workshop, was published by the Washington Post. It uses census data, collected and analyzed by Keller and his partners on the project, Jason Alcorn and Emily Liedel, to show the effect of an aging population.

Alice Brennan went on to explain another project she produced with Keller and other members of the New York World team. Using NYPD stop-and-frisk data, the New York World worked on a series of stories about stops around the city and the demographics behind the figures. Brennan said the biggest challenge the team faced was the state of the data, which took three weeks to clean and required interrogation of 117 columns of data.

CEO and co-founder of BuzzFeed, Jonah Peretti, and editor-in-chief, Ben Smith, closed the event with a discussion that built on Borthwick’s remarks about the changing nature of the web. Peretti described the BuzzFeed homepage as “a place to share”, catering to the shift in the behavior of internet users. Smith went on to explain the impact of Twitter and how it has changed the way people converge on the social web. “The beast wandered off to tweet. People were no longer hitting refresh on their RSS feeds anymore,” he said.

Like Borthwick, Peretti and Smith acknowledged the importance of data in the newsroom. Web publication is not only a cheaper production option than print, they said, but also gives editors and journalists a clearer picture of their audience. There has been a fetishization of the news which lacked engagement with the larger picture. The initial excitement generated by the gizmos has now faded and journalists and developers have arrived at a place to think more critically.

Past Events

Reconstruction of International Journalism

The Tow Center for Digital Journalism is hosting an evening panel discussion at the Columbia Graduate School of Journalism.

“The Reconstruction of International Journalism: Changes in Large Newsrooms”

Tuesday, May 22 from 5-7pm
Columbia Graduate School of Journalism, Stabile Student Center — main level, turn left as you enter the building.

Chair: C.W. Anderson (CUNY)

Panelists:

Caitlin B Petre (New York University): “Interviewing the Interviewer: The Challenges and Opportunities of Questioning Journalists”

Nikki Usher (George Washington University): “Ethnography in a Time of Big Newsroom Uncertainty”

Valerie Belair-Gagnon (City University, London): “Beyond the Physicality of the BBC Newsroom(s)”

Respondent: Michael Schudson (Columbia)

How It's Made

How it’s made: Stop-and-frisk stepper graphic

The other week, the New York World published a data reporting project with the Guardian examining the NYPD’s controversial Stop, Question and Frisk policy. Last year NYPD Commissioner Kelly issued an order to curtail low-level marijuana arrests following stop-and-frisks. WNYC had previously reported that the NYPD manufactured such arrests by ordering people to remove marijuana from their pockets and then charging them for the more serious crime of possession in public view.

Our investigation found that marijuana arrests actually rose after Kelly’s order. But finding that story involved diving into the data.

Thanks to a recent lawsuit, the NYPD releases a database each year of every single “stop-and-frisk” that officers make. Unfortunately, the database is so big it can’t easily be opened in Excel and the data also requires some serious “cleaning” to be usable.

To address these issues, we analyzed the data using the open source statistics program R, which can handle data cleaning, interrogation, and visualization in one program. Because R lets you type in commands that apply across multiple files, it removes the need for switching among Excel windows. Through the sqldf extension package, R also supports the SQL-style queries that make more complex database systems so powerful.

Cleaning
Because we were interested in when certain types of stop-and-frisk incidents had taken place, we used R to split the day, month, and year of each incident’s date field into individual columns. This set the data up for the next step in our analysis, which was to count up how many marijuana arrests occurred each month.
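For readers who want to try the same cleaning step outside R, here is a minimal sketch in Python. It is not the code used for the story: the field name `datestop`, the MM/DD/YYYY format, and the sample rows are all assumptions made for illustration, and the real NYPD files may store dates differently.

```python
from datetime import datetime


def split_date(record, field="datestop", fmt="%m/%d/%Y"):
    """Return a copy of the record with day, month and year columns added.

    The field name and date format are assumptions for this sketch.
    """
    parsed = datetime.strptime(record[field], fmt)
    return {**record, "day": parsed.day, "month": parsed.month, "year": parsed.year}


# Two made-up rows standing in for the real data.
rows = [
    {"datestop": "11/03/2011", "crime": "MARIHUANA"},
    {"datestop": "12/17/2011", "crime": "ROBBERY"},
]
cleaned = [split_date(row) for row in rows]
```

With separate `month` and `year` columns in place, counting incidents per month becomes a simple grouping operation.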

Querying
Using SQL queries, we were able to group and count the data by month and crime type. We focused our searches on marijuana possession (which in the NYPD data was spelled “marihuana”).
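The kind of query described above can be sketched with Python's built-in sqlite3 module in place of R and sqldf. The table name, column names and rows below are hypothetical; only the "marihuana" spelling comes from the NYPD data as described in this post.

```python
import sqlite3

# In-memory database with a hypothetical, simplified "stops" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stops (year INTEGER, month INTEGER, crime TEXT)")
conn.executemany(
    "INSERT INTO stops VALUES (?, ?, ?)",
    [
        (2011, 10, "CRIMINAL POSSESSION OF MARIHUANA"),
        (2011, 10, "ROBBERY"),
        (2011, 11, "CRIMINAL POSSESSION OF MARIHUANA"),
        (2011, 11, "CRIMINAL POSSESSION OF MARIHUANA"),
        (2011, 11, "CRIMINAL POSSESSION OF MARIJUANA"),  # modern spelling: not matched
    ],
)

# Group and count by month and crime type, keeping only the NYPD spelling.
query = """
    SELECT year, month, crime, COUNT(*) AS n
    FROM stops
    WHERE crime LIKE '%MARIHUANA%'
    GROUP BY year, month, crime
    ORDER BY year, month
"""
monthly_counts = conn.execute(query).fetchall()
```

The row spelled "MARIJUANA" is deliberately left out of the result, which shows why checking the spelling used in the source data matters before counting.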

We ran a number of queries to see month-to-month trends and also compared across years to see how 2011 compared with the data going back to 2008. This gave us valuable context because stops actually dropped in November and December of 2011, but not as much as they did in those same months in prior years. If Kelly’s order had affected officer behavior, we should have seen relatively dramatic decreases during those months, but we found only slight declines. This context was vital to our story, and explaining why the 2011 drop was not significant was a high priority for our final visualization.
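The reasoning in that comparison is simple arithmetic: a year-end drop only means something if it is larger than the drop seen in other years. The Python sketch below uses invented counts, purely to show the calculation; they are not figures from the NYPD data.

```python
def pct_change(before, after):
    """Percentage change from one count to another."""
    return (after - before) * 100.0 / before


# Invented counts for illustration only: {year: (October stops, December stops)}
counts = {2010: (1000, 700), 2011: (1000, 900)}

seasonal_change = {
    year: pct_change(october, december)
    for year, (october, december) in counts.items()
}
```

In this made-up example the 2011 decline (10 percent) is smaller than the 2010 decline (30 percent), so it would not count as evidence that anything changed.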

We also ran queries comparing arrests to stops as well as isolating specific precincts. However, only a few of these queries yielded results that were worthy of inclusion in the final interactive.

Visualizing Part 1:
In order to find the trends mentioned above, though, we first had to visualize our query results, which R can do, too. An extension package for R called ggplot2 will generate high-quality, customizable line graphs that could be used directly for print graphics. However, we wanted ours to be interactive, which required some additional work.

Visualizing Part 2:
SVG (Scalable Vector Graphics) is a type of graphic that is drawn dynamically on a computer screen, which means that it can be highlighted, clicked, rolled over, or animated in ways that .jpg, .gif and .png files can’t. The ggplot2 graphics can be converted to SVG, and then published to the web using a JavaScript library called Raphaël. Although this requires some copying-and-pasting, the clean, dynamic graphics it produces are worth it.
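To make the idea of a drawn-on-demand graphic concrete, here is a small Python sketch that writes a line chart as SVG markup. It does not use ggplot2 or Raphaël and is not the workflow described above; the function name, dimensions and sample values are assumptions for illustration.

```python
def line_chart_svg(values, width=300, height=100, stroke="fuchsia"):
    """Return a minimal SVG document that draws the values as one polyline."""
    top = max(values)
    step = width / (len(values) - 1)
    # SVG's y axis points down, so larger values get smaller y coordinates.
    points = " ".join(
        f"{i * step:.1f},{height - v / top * height:.1f}"
        for i, v in enumerate(values)
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<polyline points="{points}" fill="none" stroke="{stroke}"/>'
        "</svg>"
    )


svg = line_chart_svg([50, 100, 75])
```

Because the result is plain text describing shapes rather than a grid of pixels, a browser script can later find the line and restyle or animate it.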

Putting it all together
To better tell the story, we compiled four sets of charts that we incorporated into a so-called “stepper graphic.” Thanks to the newsapps team at ProPublica, there is a great open-source library (http://www.propublica.org/nerds/item/anatomy-of-a-stepper-graphic) for building these graphics. Turning my four charts into four different “slides” was as easy as creating a function for each of them and then copying in their Raphaël code. The stepper graphic library took care of numbering and transitions. We built the grid and axes with standard HTML and CSS, and made label fades using simple jQuery fadeIn() and fadeOut() methods.

Finally, once we confirmed we were running the story with the Guardian, we adjusted the styles to make sure it would mesh well with their design. So we made the months lowercase, the font Georgia, and the line fuchsia, which was perhaps the most important part.

Past Events

Event: Doing Data Journalism

The archived video of this event can also be accessed here.

A panel of six of journalism’s movers and shakers convened at Columbia Journalism School on March 28 to debate the current state of data-centered reporting and interactive visualizations.

The panel, moderated by Columbia professor Susan McGregor, tackled numerous issues surrounding data journalism, but their first hurdle was to simply define data journalism — a type of storytelling quickly gaining momentum in newsrooms.

“Data journalism is just journalism,” said Julia Angwin, Wall Street Journal’s technology editor.

Angwin likened the collection of data sets to the age-old process of conducting interviews. The difference is that the technology available today allows journalists to examine data sets more exhaustively, beyond the limits of interviews and common knowledge, she added.

Angwin has worked on projects like “What They Know” which examined the tangled nature of online privacy. Her team secured data using code forensics, which she said helps “break stories” and “expand journalism.”

Jo Craven McGinty, projects editor for Computer Assisted Reporting at The New York Times, called data journalism “documents reporting on steroids,” implying data journalism allows journalists to dive into larger and more complicated data sets with the help of database systems and spreadsheets.

Scott Klein, editor of news applications at ProPublica, said the field of data journalism should also recognize the potential of “news applications” which weigh the presentation of data as greatly as its gathering, reporting and analysis.

Web scraping for journalism

Blog post on data scraping for ProPublica's "Dollars for Docs." Photo: Rani Molla.

ProPublica projects like Dollars for Docs, which examined doctor payoffs from drug companies using data scraped from pharmaceutical websites, allow users to search for their own doctors and view any payments they received.

Klein said this type of user interface was a key component of data journalism: “It can tell your personal story…and how it matters to you.”

But the use of data or technology in storytelling does not change the inherent concepts of journalism, Klein added.

“This is journalism that is native to the web, but it’s still just journalism,” he said. “The rules all still apply, the methodology is the same, the rigor is the same…the editorial judgement is all the same.”

It is the concept of data, though, that might need restructuring, according to Aron Pilhofer, editor of interactive news at The New York Times. Tools like Document Cloud (which he and Klein helped develop) allow even plain text documents to become data, by enriching them with metadata and providing search functionality.

The panel then turned to a discussion about which comes first: the data or the story. The panelists unanimously agreed the story idea almost always leads to the data research.

But data analysis rarely — almost never, according to Mo Tamman, a Reuters data journalist — yields the expected results. It almost always reroutes the story to an unanticipated conclusion.

Angwin stressed this kind of journalism can be thought of as “testing hypotheses.” It is ultimately using data to verify or rethink story ideas.

Tamman added it is crucial to bring in outside experts almost immediately and “suck their brains dry” in order to better understand, authenticate and contextualize the meaning of the data.

But as the burgeoning practice of data journalism expands, newsrooms must adapt, according to Angwin. Newsrooms are currently “allergic to margins of error,” she said, and they must learn to cope with results that cannot be verified 100 percent, a typical situation when dealing with large data sets.

Newsrooms must also become more math friendly and data literate — something McGinty says can mean simply knowing when data and documents can successfully augment one’s storytelling.

Embracing data journalism may even support new business models, such as the new joint venture from Reuters and The New York Times data teams which will offer “white glove” Olympics coverage, Pilhofer said. However, even small changes in newsrooms — like seating data teams together — can be essential in fostering innovative thinking among the staffers.

And most importantly, Tamman said, newsrooms and journalists doing data driven journalism must incorporate into their reporting practice the process of finding a story’s “empirical spine.”

The fundamentals of this process rely on using data analysis to develop the story’s hypothesis, and then allowing the reporting to “flow” from that analysis or “spine.” This process contrasts with the practice of many journalists, panelists said, who only look to data after substantively completing their story — sometimes to discover that that story is completely inconsistent with the data.

Technical consultant and privacy expert Ashkan Soltani says this issue can be addressed in part by having a reporter seek out qualitative interviews while a data team independently looks into the quantitative data, thereby simultaneously obtaining both sides of the story.

“You can come together and ask ‘Do they confirm each other or have different findings?’,” he said. “That can then merge together to form the spine of the story.”

Angwin says another solution could stem from news organizations collecting their own data sets.

“Data itself is political,” Angwin said, referring to the choices and processes involved in gathering data. If news organizations amass their own data, she said, it could help reporters find the data that best addresses the questions raised by their qualitative reporting, something existing data sets are not always sufficient to do.

The panelists also debated the best platform to convey a data driven story, but ultimately felt nuance can be expressed in graphic visualizations just as well as in long-form narratives or news apps.

It is the journalistic backbone and purpose of such pieces — which use data intelligently and appropriately — that truly makes them data journalism. Visualizations or data sets without these qualities don’t deserve the title.

As Pilhofer put it: “If you aren’t telling a story in the presentation piece or approaching it with a journalistic intent, then you’re wasting everyone’s time.”

Featured image by Rani Molla.

Past Events

“Adapt or Die” – Data at Yale

Data Journalism is a rapidly advancing and exciting new field in journalism. And like any new field, there are questions about effective data journalism – ranging from best practices and tools to standards and ethical conundrums.
The Data Journalism Conference organized by the Information Society Project at the Yale Law School on Mar 9th was an attempt to get industry practitioners, experts, veteran journalists and lawyers to discuss the best practices followed by news teams at major news organizations, as well as to debate the implications of the questions that data journalism raises.

Panel 1:  Data Journalism Forms and Practices
This panel kicked off with an intriguing statement. Amanda Cox, a Graphics Editor at The New York Times, challenged the term “data journalism,” stating that if it is a real form of journalism, then it doesn’t need the qualifier “data.” She then gave a brief overview of successful projects at the Times, along with the elements that made them successful. For example, the Budget Puzzle: You Fix the Budget allows readers to virtually control the nation’s finances, while The Jobless Rate for People Like You drives home the point that “Not all groups have felt the recession equally.”

First Panel: Amanda Cox, Reginald Chua, Katharine Jarmul, Dafna Linzer and Simon Ferrari

Reginald Chua, editor of Data and Innovation at Thomson Reuters, spoke on the evolution of data journalism as well as the paradigm shift it requires for legacy news organizations. He also addressed issues of privacy, immediacy and the availability of data.
Katharine Jarmul, lead developer at Loud3r, was emphatic about fostering a spirit of collaboration between editorial and IT teams in order to deliver effective data journalism pieces. “Data is no fun without a story,” she said. She also put in perspective the capabilities and limitations of the developer-journalist communities and laid to rest a few myths about them.
Dafna Linzer, a Senior Investigative Reporter at ProPublica, walked the audience through the project “Presidential Pardons Heavily Favor Whites.” She spoke of everything from the challenges of putting this project together (there was no data!) to the time and effort invested in it (almost a year).
Simon Ferrari, a game designer for journalists, is working on an exciting research area: newsgames. He spoke of it as the next frontier, where simulation is key, just like in a video game. The basic premise of his presentation was that stories about current events, infographics and the like can be translated into video games in order to engage audiences better.
The ensuing discussion included debates about open-sourcing code (some organizations do, others don’t), how smaller news organizations can still do great data journalism (using open source tools, publicly available data), how data can be verified and annotating infographics to direct the audience’s attention. Another important issue discussed was the slow turnaround of data that makes it difficult to accompany breaking news items with data journalism pieces.

Panel 2: The Influence of Data on News Work
The second panel focused on problems brought up by these new data journalism tools and practices. What responsibility do journalists have to the data? What about the confidentiality and privacy concerns related to the data? How reliable is the data released by government?
C.W. Anderson, assistant professor of Media Culture at the College of Staten Island, presented “Teaching a great many numbers and pieces of paper to speak clearly – the long history of data journalism.” He pointed out that it’s the journalists’ job to make their sources “talk.” Brian Boyer, news applications editor at the Chicago Tribune and creator of the PANDA Project, gave a witty and thought-provoking talk about data journalism as craft.

Brian Boyer

Hannah Fairfield, graphics director at The Washington Post, discussed how her team worked with many political journalists to develop The U.S. Congress Votes Database. Fairfield also pointed out that many times we ourselves become the data, as illustrated in the projects Mapping the census and Is life getting better or worse?
After that, Matt Stiles, a data journalist from NPR, talked about the inevitability of the new trends in data journalism. While some journalists are still writing traditional “he said, she said” stories instead of studying data, he suggested that the empirical nature of data journalism reveals patterns and empowers the audience. “Adapt or die,” said Stiles.
Finally, author and journalist Steven Waldman addressed the other side of the question: where does data come from? Since much data comes from government, it takes a lot of effort to clean up. Is data journalism about these “pyrrhic” victories? What does the rise of data journalism say about the need for systems of open, transparent government? The audience and lecturers discussed these questions along with the economic value of data in the digital era and user engagement.

Past Events

Cyberscholars Gather at Columbia

The Cyberscholar Working Group, a monthly gathering of researchers from Harvard, MIT, Yale and Columbia, met at Columbia’s Journalism School on March 27th in an event sponsored by the Tow Center. Here’s a roundup of the presentations:

John Kelly, “Analyzing Russian Social Media”

Kelly’s talk was based on a series of reports he’s working on for the Berkman Center for Internet and Society at Harvard on network mapping of social media in Russian. The maps show what people are paying attention to and what they care about online. Kelly has mapped social media in twelve languages and found different network topologies (subject clusters) for all of them. In Russia, Kelly found the most popular forms of social media were those that combined blogging with Facebook-like networking tools. He says most discourse about politics, culture and public life was happening on LiveJournal. The other most active blogging segment in Russia is “instrumental blogging,” or blogging for hire—people who are paid to endorse certain products or viewpoints. Another study by Kelly on Twitter use in Russia found substantial SEO-driven activity by bots and others. (Image courtesy of Berkman Center)

David Thaw, “Comparing Management-Based Regulation and Prescriptive Legislation: How to Improve Information Security Through Regulation”

Thaw says information security failures are widespread, and the private sector bears much of the responsibility for addressing them. Few regulations are in place to incentivize better security, which is expensive. But Thaw says laws have started to catch up. He categorizes cybersecurity two ways: prescriptive-based legislation and management-based regulations. Prescriptive-based legislation includes security breach notification laws, which require organizations to notify people who are affected by a breach. Management-based regulation involves organizations themselves in the rulemaking process. Thaw says organizations are told they must develop plans to meet particular aspirational goals—it’s up to them to determine how to meet those goals. Thaw says the optimal outcome is a mix of prescriptive-based legislation and management-based regulation.

Shlomit Yanisky-Ravid, “Traditional Knowledge – Culture Expression and Access to Knowledge: The Open Questions”

Yanisky-Ravid discussed IP-related challenges for “traditional knowledge,” or information shared in a given community that is part of and sustains its culture, and is not considered to belong to any individual person. Conflicts can arise when traditional knowledge is used in a way that results in financial gain, such as when Israeli singer Idan Raichel incorporated traditional Ethiopian music into his songs. The question is, how do you determine who should benefit? Is it a state, an ethnic group or tribe, or a region? Another question to answer is why should anyone be compensated for the use of traditional knowledge? Yanisky-Ravid suggests that a solution lies in moral rights, including the right of attribution, the right not to have traditional knowledge modified without permission, and the right not to have it used in a manner that discredits traditional knowledge holders. Yanisky-Ravid says traditional knowledge could be treated like trademarks, tracked in an international registry.

Harris Chen, “The Future Criminal Investigation in the Digital Age”

Chen, a prosecutor in Taiwan, says criminal investigations are increasingly being digitized. Police are using Facebook photos and GPS, for example, to find criminals. Police are also seizing domain names—even in investigations where the person who actually committed wrongdoing is unknown. Chen also says social network sites are being used in investigations—he cited a crime ring kingpin who was captured after his son posted photos of his family at a vacation home. In the future, Chen says he expects police will gather even more evidence online, and get more cooperation from the operators of websites, social networks and cloud servers. He also expects governments to recruit more geeks as special agents. Future challenges associated with these types of investigations will include the need to balance discovery with privacy rights.

The next Cyberscholar Working Group will take place in April at Yale (date TBA). For more information, contact Kate Fink.

Past Events

Cyber Scholars and Doing Data Journalism

In March the Tow Center will sponsor three events related to major issues in digital journalism.

March 27th – CyberScholar Working Group Forum
On March 27th, we will host this month’s Cyberscholar Working Group, a forum for fellows and affiliates of MIT, Yale Law School Information Society Project, Columbia University, and the Berkman Center for Internet & Society at Harvard University to discuss their ongoing research. This month’s gathering will take place at Columbia University’s Graduate School of Journalism from 6-9 p.m. in Room 107B. Presentations will include:

Harris Chen, Harvard: “The Future Criminal Investigation in the Digital Age”
Shlomit Yanisky-Ravid, Yale: “Traditional Knowledge – Culture Expression and Access to Knowledge: The Open Questions”
David Thaw, Yale: “Comparing Management-Based Regulation and Prescriptive Legislation: How to Improve Information Security Through Regulation”
John Kelly, Columbia/Harvard: “Analyzing Russian Social Media”

Details and RSVP: http://cwgmar2012.eventbrite.com/
Contact: Kate Fink, kaf2155@columbia.edu

March 28th – Doing Data Journalism: It’s Not Just Numbers

Data journalism is quickly becoming one of the hottest topics in the industry – but what exactly is it, and what tools, teams and techniques are necessary for doing it well?

On March 28th, the Tow Center for Digital Journalism will host several of data journalism’s most prominent innovators and practitioners for a discussion about the possibilities and pitfalls of this evolving field. We hope you will join us at Columbia Journalism School from 6 – 7:30pm to hear their perspectives and join the discussion.

Panelists include:

Julia Angwin, tech editor for The Wall Street Journal
Jo Craven McGinty, projects editor for Computer Assisted Reporting at The New York Times
Scott Klein, editor of News Applications at ProPublica
Aron Pilhofer, editor of Interactive News at The New York Times
Ashkan Soltani, technical consultant and privacy expert
Mo Tamman, award-winning data journalist at Reuters

The event is free and open to the public.

Details and RSVP: http://cujtowdoingdatajournalism.eventbrite.com/
Contact: Susan E. McGregor, sem2196@columbia.edu

Finally, on March 6 the Tow Center partnered with the New York World to sponsor a unique panel discussion about government accountability and transparency in the age of digital records. The event brought together panelists from government and non-profit sectors to help elucidate the goals and challenges of open government initiatives. The lively discussion addressed how legislative wording can have significant impact on the types of records that are made publicly available, and what the opportunities are for further development in this area.

Panelists included:

Philip Ashlock, OpenPlans
Andrew Hoppin, New Amsterdam Ideas and former Chief Information Officer, New York State Senate
Amy Ngai, Sunlight Foundation
New York City Council Member Gale Brewer (invited)
Michael Powell, The New York Times
Moderated by Alex Howard, Government 2.0 correspondent, O’Reilly Radar

How It's Made

How It’s Made: Tow Center/ScraperWiki DataCamp Winning Entry

The analysis in progress

In early February, the Tow Center hosted a Journalism Data Camp with Knight News Challenge winner ScraperWiki, which provides tools and training to journalists working with difficult data. The goal of the camp was to bring together journalists and computer scientists to make data more accessible, analyze it, and create stories around the theme of “New York Accountability”. A group of journalism school students attended the event to gain experience with data journalism. Marc Georges, one of the students who was part of the winning team, describes how his group’s project was developed.


Attendees at the event started out by forming groups and identifying a data source to mine for stories. Our group consisted of current journalism school students Curtis Skinner, Eddie Small, Isha Sonni, Trinna Leong, Keldy Ortiz, Salim Essaid, Sara Alvi and myself, as well as recent graduate and New York World fellow Michael Keller, and GSAS statistics student Brian Abelson.

Salim pitched the group a project focused on stop-and-frisks in Shia communities in New York City. A recent AP report showed that in 2006, the NYPD had recommended increased surveillance of Shia communities and had identified nine mosques for possible surveillance. Our team wanted to know if this recommendation had resulted in an increase in stop-and-frisks of Middle-Eastern New Yorkers and if anything in the data would tell us whether or not police actually targeted these communities. Here’s what we learned in trying to put together this story:

Data is Dirty
Just because data is available doesn’t mean you’ll be able to use it quickly and easily. Our first main challenge was accessing and cleaning data on stop-and-frisks in New York City. The NYPD makes this data available on their website but there’s a ton of it–400,000 cells of Excel values for every year.

Curtis, Eddie, Isha, Trinna, Keldy and Sarah researched and collected our data, but one of the most basic issues we ran into was trying to determine how many stop-and-frisks affected people of Middle-Eastern descent. Although the NYPD tracks the race of those it stops, Middle-Eastern people are categorized as white, so it was not possible to isolate that ethnic group directly. As a workaround, we considered using census data to find predominately Middle-Eastern neighborhoods, but ran into the same issue. After reviewing the information we did have, we came up with the idea of using proximity to the mosques mentioned in the AP report as a marker for ethnicity. We thought it was fair to assume the closer a stop was to a mosque, the more likely the person being stopped was Middle-Eastern. We decided to look at a radius of 900 ft, the average length of a New York City block.

Once we were able to isolate our data set, we realized that working with such large amounts of data wasn’t feasible without some type of automation. The coders at the event were really helpful in writing a script that scraped the data for the variables we needed. That let us isolate the key aspects of the stop-and-frisks we wanted to use and move forward in mapping our data.

Mapping is Hard
One of our main goals in the project was comparing the incidences of stop-and-frisks near these 9 mosques with other areas in New York City. To do that, we needed to be able to map our cleaned data. Sounds simple enough, right?

Our initial map of stop-and-frisks for 2006, color-coded by race.

Creating our maps turned out to be one of the most difficult and time-consuming aspects of our weekend. Our main problem was that the location data for our stop-and-frisks was in a format the NYPD uses called State Plane, while the location data for our mosques was in longitude and latitude. Brian Abelson, a graduate student at Columbia pursuing a degree in Quantitative Methods in the Social Sciences, saved the day by converting our data and then mapping it.

Brian used a mapping tool called ArcGIS to solve the conversion problem so we could view the stop-and-frisks and mosques on the same map. Brian, Michael and I then used a program called R to further scrape the data and isolate points by specific variables, like race. Brian then used ArcGIS to isolate all the points within a 900 ft radius of a mosque so we could see how the rates of stop-and-frisks changed over time. Mike Dewar, one of the coders at the event, was also very helpful in nailing down our approach to identifying stop-and-frisks near mosques. Mike wrote an algorithm for us to measure the distance between any one point and a mosque. We didn’t end up using Mike’s algorithm, but talking the problem over with him and discussing various approaches was a great help in tackling the larger issue of working with such a large data set.
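A radius check like the one described above can be sketched in a few lines of Python. This is neither the ArcGIS workflow nor Mike's algorithm. It assumes that both the stops and the mosques have already been converted to the same flat coordinate system measured in feet, so that straight-line distance applies, and all coordinates below are made up.

```python
import math

RADIUS_FT = 900  # the block-length radius chosen for the project


def within_radius(stop, mosque, radius=RADIUS_FT):
    """True if the stop lies within radius feet of the mosque.

    Both points are (x, y) pairs in the same planar system, in feet.
    """
    return math.hypot(stop[0] - mosque[0], stop[1] - mosque[1]) <= radius


# Hypothetical coordinates for illustration only.
mosque = (1000000.0, 200000.0)
stops = [(1000300.0, 200400.0), (1000900.0, 200900.0)]

nearby = [stop for stop in stops if within_radius(stop, mosque)]
```

The first stop is 500 ft away and is kept; the second is roughly 1,273 ft away and is dropped. Points stored as longitude and latitude would need converting first, because degrees are not a unit of distance.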

An initial map of stop-and-frisks for 2006.  White dots show the number of stops, red dots the number of arrests.

Analyzing Data leads to More Analyzing Data
Once we were able to map our data, we could see how many stop-and-frisks occurred near these nine mosques in 2006. When we compared the first three months of the year, before the recommendation for surveillance was made, with the next nine months, we did see a small increase. However, to know if this is statistically significant or markedly different from stop-and-frisks in other areas of New York, we have to do a lot more research and analysis.
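One step worth spelling out: a three-month period and a nine-month period can only be compared fairly as rates, not as raw totals. The Python sketch below shows the arithmetic with invented counts; they are not our results.

```python
def monthly_rate(total_stops, months):
    """Average number of stops per month over a period."""
    return total_stops / months


# Invented totals for illustration only.
before = monthly_rate(30, 3)   # January to March
after = monthly_rate(108, 9)   # April to December

increase_pct = (after - before) * 100.0 / before
```

In this made-up case the raw totals (30 versus 108) look dramatic, but the monthly rate rises only from 10 to 12, a 20 percent increase. Whether a change of that size is meaningful still depends on a control group and a proper significance test.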

Stop-and-frisks for two mosques prior to March 2006

Stop-and-frisks for two mosques after March 2006

Jeremy Baron of Wikimedia New York City and Thomas Levine of ScraperWiki, two coders from the event, worked with us after the event to help automate our workflow. Jeremy wrote a script which aggregated our data and put it into a database, while Thomas helped us throughout the process in fixing and checking our SQL and JavaScript code.

Our next step has been designing the right control group against which to test our data, and we received great feedback after the event on the best way to do so. One surprising thing has been how rich our data set is. Further analysis may show that stop-and-frisks near these mosques weren’t unusual, but as we continue working with the data, it has already given us ideas for more stories we can work on.