How it’s made: Stop-and-frisk stepper graphic
The other week, the New York World published a data reporting project with the Guardian examining the NYPD’s controversial Stop, Question and Frisk policy. Last year NYPD Commissioner Kelly issued an order to curtail low-level marijuana arrests following stop-and-frisks. WNYC had previously reported that the NYPD manufactured such arrests by ordering people to remove marijuana from their pockets and then charging them for the more serious crime of possesion in public view.
Our investigation found that marijuana arrests actually rose after Kelly’s order. But finding that story involved diving into the data.
Thanks to a recent lawsuit, the NYPD releases a database each year of every single “stop-and-frisk” that officers make. Unfortunately, the database is so big it can’t easily be opened in Excel and the data also requires some serious “cleaning” to be usable.
To address these issues, we analyzed the data using the open source statistics program R, which can handle data cleaning, interrogation, and visualization in one program. Because R lets you type in commands that apply across multiple files, it removes the need for switching among Excel windows. R also supports the SQL-like queries through the sqldf extension package, which makes more complex database systems so powerful.
Because we were interested in when certain types of stop-and-frisk incidents had taken place, we used R to split the day, month, and year of each incident’s date field into individual columns. This set the data up for the next step in our analysis, which was to count up how many marijuana arrests occurred each month.
Using SQL queries, we were able to group and count the data by month and crime type. We focused our searches on marijuana possession (which in the NYPD data was spelled “marihuana”).
We ran a number of queries to see month-to-month trends and also compared across years to see how 2011 compared to data to 2008. This gave us valuable context because stops actually dropped in November and December of 2011, but not as much as they did in those same months in prior years. If Kelly’s order impacted officer behavior, we should have seen relatively dramatic decreases during those months, but found only slight declines. This context was vital to our story, and explaining why the 2011 drop was not significant was a high priority for our final visualization.
We also ran queries comparing arrests to stops as well as isolating specific precincts. However, only a few of these queries yielded results that were worthy of inclusion in the final interactive.
Visualizing Part 1:
In order to find the trends mentioned above, though, we first had visualize our query results, which R can do, too. An extension package for R called ggplot2 will generate high-quality, customizable line graphs that could be used directly for print graphics. However, we wanted ours to be interactive, which required some additional work.
Visualizing Part 2:
Putting it all together
To better tell the story, we compiled four sets of charts that we incorporated into a so-called “stepper graphic.” Thanks to the newsapps team at ProPublica, there is a great open-source library (http://www.propublica.org/nerds/item/anatomy-of-a-stepper-graphic) for building these graphics. Turning my four charts into four different “slides” was as easy as creating a function for each of them and then copying in their Raphaël code. The stepper graphic library took care of numbering and transitions. We built the grid and axes with standard HTML and CSS, and made label fades using simple jQuery fadeIn() and fadeOut() methods.
Finally, once we confirmed we were running the story with the Guardian, we adjusted the styles to make sure it would mesh well with their design. So we made the months lowercase, the font Georgia, and the line fuschia – perhaps the most important part.