Computational campaign coverage based on

Algorithms for automatically generating stories from machine-readable data are shaking up the news industry, not least since the Associated Press, one of the world’s largest and most reputable news organizations, started to automate the production and publication of quarterly earnings reports. Once developed, such algorithms can create thousands of routine news stories for a particular topic, usually faster, cheaper, and with fewer errors than any human journalist ever could. In the recently published Guide to Automated Journalism, I summarized the status quo of automated news generation, raised key questions for future research, and discussed potential implications for journalists, news consumers, media outlets, and society at large.

Despite its potential, the technology is in an early market phase. Automated news generation is still limited to routine and repetitive topics for which (a) clean and accurate data are available, (b) the stories merely summarize facts, and therefore (c) leave little room for uncertainty and interpretation. Popular examples include recaps of lower league sports events, financial news, crime reports or weather forecasts. For such topics, research finds little difference in people’s relative perception of human-written and automated news. Also, due to the low-involvement nature of these topics, readers may be less concerned about issues regarding algorithmic transparency and accountability.

But what if the stories cover a high-involvement topic that also involves uncertainty? How do users perceive automated news for such topics and to what extent are they interested in how the underlying algorithms work? To study these questions, our research team recently embarked on a project called Computational Campaign Coverage. This project aims to study the creation and consumption of automated news for forecasts of this year’s U.S. presidential election.

This is the first in a series of blog posts that aim to describe this research. This initial post describes the project that provides the forecast data from which we will generate automated news stories. This project, which I have been involved with since 2007, is called the PollyVote.

The PollyVote was founded in 2004 to demonstrate advances in forecasting methodology for the high-profile application of U.S. presidential election forecasting. Across the last three elections, PollyVote’s final forecast missed the election outcome on average by only about half a percentage point. In comparison, the respective error of the final Gallup poll was more than three times higher. The performance is even more impressive for long-term forecasts, when polls are only of limited value. Since 2004, the PollyVote has correctly predicted the election winner months in advance, and more accurately than any other method.

This performance is possible because the PollyVote strictly applies evidence-based forecasting principles. In particular, the PollyVote relies on the principle of combining forecasts, which has a long history in the forecasting literature and is well-established as a powerful method to reduce forecast error. Combining forecasts increases accuracy because the approach, first, allows for including more information and, second, cancels out the bias of individual methods. While combining is useful whenever more than one forecast is available, the approach is particularly valuable if (1) many forecasts from evidence-based methods are available, (2) the forecasts draw upon different methods and data, and (3) there is uncertainty about which method is most accurate.
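To make the error-reduction argument concrete, here is a minimal sketch in Python. The forecast values and the outcome are hypothetical numbers chosen for illustration, not PollyVote data, and the function names are my own. The sketch compares the error of the simple average with the typical (mean absolute) error of the individual forecasts. When individual forecasts fall on both sides of the true value, their errors partly cancel, so the combined forecast can never do worse than the typical individual forecast.

```python
from statistics import mean

def combined_error(forecasts, actual):
    """Absolute error of the simple (equal-weight) average of all forecasts."""
    return abs(mean(forecasts) - actual)

def typical_error(forecasts, actual):
    """Mean absolute error of the individual forecasts."""
    return mean(abs(f - actual) for f in forecasts)

# Hypothetical two-party vote-share forecasts (in percent) from four methods,
# and a hypothetical election outcome. These are NOT real PollyVote numbers.
forecasts = [50.5, 53.0, 49.0, 52.5]
actual = 51.0

print("error of combined forecast:", combined_error(forecasts, actual))
print("typical individual error:  ", typical_error(forecasts, actual))
```

In this made-up example, two forecasts are too high and two are too low, so the average lands close to the outcome even though no single forecast does. If all forecasts erred in the same direction, the combined error would simply equal the typical error, which is why drawing on different methods and data matters.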

These conditions apply to election forecasting. First, there are many evidence-based methods for predicting election outcomes. While most people may think of polls as the dominant method to forecast elections, asking people for whom they are going to vote is actually among the least useful methods, except shortly before Election Day. For example, one usually gets much more accurate forecasts by obtaining people’s expectations rather than their intentions. This can be done by simply asking experts (or even regular citizens) who they think is going to win, or letting people bet on the election outcome and using the resulting odds as forecasts.

Another useful approach is to develop quantitative forecasting models based on theories of voting and electoral behavior. For example, many political economy models rely on the idea of retrospective voting, which assumes that voters reward the incumbent party for good, in particular economic, performance and punish it otherwise. Other models assume voters to think prospectively, for example, by assessing which candidate would do a better job in handling the issues or leading the country.

Since all these methods and models rely on different data, election forecasting meets the second condition for when combining is most beneficial. Finally, the third condition also holds: in most situations it is difficult to determine a priori which method will provide the best forecast, particularly if the election is still far away. The reason is that every election is held in a different context and has its idiosyncrasies. Therefore, methods that worked well in the past may not necessarily work well in the future. For example, while prediction (or betting) markets were among the most accurate methods for forecasting the U.S. presidential elections from 1992 to 2008, they did not do well in 2012.

Figure 1: PollyVote method of combining forecasts

As shown in Figure 1, the PollyVote harnesses the benefits of combining under these ideal conditions by averaging forecasts within and across different methods, namely polls, prediction markets, expert judgment, citizen forecasts, and quantitative models. The current forecast predicts the Democrats to win 52.0% of the national popular two-party vote, compared to 48.0% for the Republicans. Interestingly, there is some disagreement among the component methods. While four component methods (i.e., polls, index models, prediction markets, and experts) all favor the Democrats, econometric models have the Republicans in the lead. This reveals another important benefit of the PollyVote: its educational aspect. By collecting and aggregating forecasts from different evidence-based methods, the platform provides a valuable resource for those interested in election forecasting, allowing readers to learn about different methods and to compare their forecasts.
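The two-step averaging described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the method names follow the list in this post, but every number is hypothetical, the function name is my own, and I assume equal weights at both steps. The sketch first averages the forecasts within each method and then averages the resulting method-level forecasts across methods.

```python
from statistics import mean

def combine_within_and_across(forecasts_by_method):
    """Average within each method, then average the method means.

    Assumes equal weights at both steps. Returns (combined, per_method_means).
    """
    within = {
        method: mean(values)
        for method, values in forecasts_by_method.items()
        if values  # skip methods that currently have no forecast
    }
    combined = mean(list(within.values()))
    return combined, within

# Hypothetical Democratic two-party vote-share forecasts (in percent).
# These numbers are invented for illustration and are NOT PollyVote data.
example = {
    "polls": [50.0, 52.0],
    "prediction markets": [54.0],
    "expert judgment": [52.0, 53.0, 51.0],
    "citizen forecasts": [53.0],
    "quantitative models": [48.0, 50.0],
}

combined, within = combine_within_and_across(example)
for method, value in within.items():
    print(f"{method}: {value:.1f}")
print(f"combined forecast: {combined:.1f}")
```

One consequence of averaging within methods first is that each method counts equally in the final forecast, no matter how many individual forecasts it contributes. A method with dozens of polls therefore does not drown out a method with a single market price.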

In the next blog post, I will provide more details about the PollyVote data. For those who cannot wait, all data that we use in our research are publicly available.