Computational Campaign Coverage with the PollyVote: The data

Our goal within the research project Computational Campaign Coverage is to (a) develop a fully automated news platform for covering forecasts of the 2016 US presidential election and (b) analyze how people perceive the quality of the automated news content. For creating the automated news, we rely on forecasting data provided by the PollyVote project. As described in a previous post, the PollyVote has successfully predicted elections since 2004 by applying evidence-based forecasting principles. In particular, the PollyVote applies the principle of combining by averaging forecasts from six different methods, namely polls, prediction markets, expert judgment, citizen forecasts, econometric models, and index models.
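The combining principle can be illustrated with a minimal sketch: each of the six component methods produces a forecast (here, a hypothetical share of the two-party popular vote), and the combined PollyVote is simply their unweighted average. The numbers below are invented for illustration only.

```python
# Hypothetical component forecasts of the two-party popular vote share.
# In practice, each component is itself an average of many individual
# forecasts (e.g., individual polls or model predictions).
component_forecasts = {
    "polls": 51.2,
    "prediction_markets": 50.8,
    "expert_judgment": 51.5,
    "citizen_forecasts": 50.9,
    "econometric_models": 51.1,
    "index_models": 51.9,
}

# Combining by averaging: the PollyVote is the unweighted mean
# across the six component methods.
pollyvote = sum(component_forecasts.values()) / len(component_forecasts)
print(round(pollyvote, 2))  # → 51.23
```

Averaging across methods with different data and different biases is what makes the combined forecast robust: individual errors tend to cancel out.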

In order to generate automated news from these data, the first step is to ensure that the underlying data are available and of high quality. That is, the data should be accurate and complete. This blog post describes our efforts in gathering these data and transferring them to a format that can be used to automatically generate news.

The PollyVote method and the underlying data are published in peer-reviewed scientific journals and are thus fully transparent and publicly available. Since the PollyVote incorporates all available forecasts in the combination, the dataset is quite extensive. For example, the data that were used to predict the 2012 election include nearly 8,000 individual daily forecasts (e.g., individual polls or model predictions). Note, however, that this figure only refers to predictions at the national (popular vote) level. If one also includes forecasts at the state level, which is our goal for the 2016 election, the dataset grows dramatically. Needless to say, this situation perfectly meets the conditions under which automation is most useful: if (a) there are good data available and (b) a large number of routine news stories need to be written.

For generating automated news stories, we collaborate with the German company AX Semantics, which is responsible for developing the underlying algorithms. Therefore, a first challenge within our project was to develop an interface through which AX Semantics can automatically obtain the PollyVote data in a structured (i.e., machine-readable) format. To allow for this possibility, project member Mario Haim developed an API, which contains both historical and 2016 forecast data for the combined PollyVote as well as its components at the national and the state level. However, access to the API is not limited to our project partners. Instead, in an effort to make our procedures fully transparent, we decided to make all data publicly available and free to use under the MIT license. Interested users may obtain data through a specific URL, and a dedicated API call generator allows for specifying an exact request. Details on the data as well as instructions for how to obtain them can be found here. Also, note that this is work in progress. Please write to us if you find any errors in the data.
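As a rough sketch of how such an API request might be assembled and its machine-readable response consumed, consider the following. The base URL, parameter names, and response fields below are purely illustrative assumptions, not the actual PollyVote endpoint; the real request format is documented by the API call generator mentioned above.

```python
import json
from urllib.parse import urlencode

# Illustrative base URL and parameters (assumptions, not the real API).
BASE_URL = "https://example.org/pollyvote/api"
params = {"election": "2016", "level": "national", "component": "combined"}
request_url = BASE_URL + "?" + urlencode(params)

# A structured (machine-readable) JSON response might look like this;
# the values and field names here are invented for illustration.
sample_response = '[{"date": "2016-09-01", "component": "combined", "forecast": 51.3}]'
forecasts = json.loads(sample_response)
print(forecasts[0]["forecast"])  # → 51.3
```

A structured format like this is what allows the text-generation algorithms to pick out the latest combined forecast, compare it with earlier values, and turn the differences into sentences automatically.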

In the next post, I will describe our approach for generating automated news articles, some of which have already been published in both English and German. Note, however, that we are still early in the process, and the quality of the texts will continue to improve. Yet, we decided to start publishing right away so that users can track how the texts have improved over time.