Trust the polls, just don't trust their sample

Nobody expected what happened on November 8th, 2016. In the aftermath of the U.S. presidential election, pollsters drew a lot of criticism, since almost all polls had shown Clinton as the big favorite. Almost no pollsters got the election result right. What went wrong?

Sampling bias

One of the most probable causes is that the samples the polls used were off: they missed a certain demographic and therefore underestimated Trump’s chances. In methodological terms, this was a severe case of sampling bias. The hypothesis seems plausible given that one of the few pollsters that did come up with an accurate prediction, the RAND Corporation, actually tackled this issue. RAND is able to do something few of us can, but all of us should: draw a proper random sample. They took the telephone book of the USA and randomly selected 6,000 telephone numbers. Each number was called, and the person who answered was asked to participate in an ongoing online survey about the election.

So, you might say, this is what all pollsters do, right? Now here's the kicker: if for some reason the people called had no internet, RAND would provide internet access for them. Although costly, this strategy ensured that their sample was closer to a proper random sample than those of other pollsters. In addition, using an online survey as opposed to a telephone survey eliminated another source of bias, namely social desirability: “I’m voting for Trump but I’m not going to tell you that over the phone.” RAND’s random sample indicated a clear win for Trump: it captured the responses of a specific subgroup of people who would normally have been left out of the sample, but who overwhelmingly voted for Trump.
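To make the effect of a missed demographic concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the population size, the share of voters without internet, and the support levels in each group are invented numbers, not RAND's data or real election figures. It compares a poll that can only reach online voters with one that samples from everyone.

```python
import random


def build_population(n, offline_share, support_online, support_offline, rng):
    """Create n voters as (is_online, supports_candidate) tuples.

    All shares are made-up illustration values, not real survey data.
    """
    population = []
    for _ in range(n):
        is_online = rng.random() >= offline_share
        support = support_online if is_online else support_offline
        population.append((is_online, rng.random() < support))
    return population


def poll(population, sample_size, online_only, rng):
    """Estimate the candidate's vote share from a simple random sample.

    With online_only=True, voters without internet can never be drawn,
    which mimics a poll that misses a whole demographic.
    """
    pool = [v for v in population if v[0]] if online_only else population
    sample = rng.sample(pool, sample_size)
    return sum(supports for _, supports in sample) / sample_size


rng = random.Random(2016)  # fixed seed so the run is reproducible
population = build_population(
    n=100_000,
    offline_share=0.20,    # assumed share of voters without internet
    support_online=0.45,   # assumed support among online voters
    support_offline=0.75,  # assumed support among offline voters
    rng=rng,
)

true_share = sum(supports for _, supports in population) / len(population)
biased_estimate = poll(population, 6000, online_only=True, rng=rng)
random_estimate = poll(population, 6000, online_only=False, rng=rng)

print(f"true share:             {true_share:.3f}")
print(f"online-only poll:       {biased_estimate:.3f}")
print(f"proper random sample:   {random_estimate:.3f}")
```

Under these assumed numbers, the online-only poll lands near the support level of the online group alone, however many people are interviewed, while the sample drawn from everyone stays close to the true share. A larger sample does not repair a biased sampling frame.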

Include uncertainty

Having the means to provide internet for your participants if they don’t have it is of course a luxury. So if we don’t have the means to take a random sample in this way, how can we get around this? One option is to incorporate uncertainty into your prediction model. This is the approach of the FiveThirtyEight model. FiveThirtyEight uses a complex model to aggregate and weight the information from almost all national polls conducted in the USA. But in addition to the outcomes of the polls, it also incorporates information on the quality of these polls: how well did the pollster predict previous elections, and what was its sampling strategy, a telephone or an internet survey? The model even incorporates information on the correlation between states: if one state votes for Clinton, a demographically similar state is likely to vote the same way.

All in all, the most beautiful thing about this model is that it includes uncertainty: it incorporates the fact that the outcome of a poll might be wrong. In statistical jargon, this is an errors-in-variables model. FiveThirtyEight runs the model 10,000 times, and for each run it counts the number of electoral votes predicted for each candidate; the share of runs in which a candidate wins is that candidate’s probability of winning. In the end, the model predicted a 71.4% chance of a Clinton win. But before you start yelling that this prediction was wrong: the confidence intervals of the predicted outcomes show that a Trump win was well within limits. The model was not wrong; real life just caught up with it, and something that had a 28.6% chance of happening (which is, after all, a reasonable chance) did indeed happen.
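The "run it 10,000 times and count electoral votes" idea can be sketched in a few lines. This is not FiveThirtyEight's model, only a toy version under loud assumptions: the five "states", their electoral votes, and their polled leads are hypothetical, poll error is assumed to be normally distributed, and the correlation between states is approximated by one shared national error added to every state.

```python
import random

# Hypothetical toy map: (name, electoral votes, polled lead of candidate A in points).
# The electoral votes are chosen to add up to 538; none of this is real data.
STATES = [
    ("Safe A", 200, 12.0),
    ("Safe B", 180, -12.0),
    ("Swing 1", 60, 2.5),
    ("Swing 2", 50, 1.5),
    ("Swing 3", 48, 1.0),
]
VOTES_TO_WIN = 270


def simulate_once(rng, national_sd, state_sd):
    """Return candidate A's electoral votes in one simulated election."""
    # One error shared by all states: if the polls are off, they tend to be
    # off in the same direction everywhere (assumed way to model correlation).
    national_error = rng.gauss(0, national_sd)
    votes_a = 0
    for _name, electoral_votes, polled_lead in STATES:
        state_error = rng.gauss(0, state_sd)  # independent per-state poll error
        if polled_lead + national_error + state_error > 0:
            votes_a += electoral_votes
    return votes_a


def win_probability(runs=10_000, national_sd=3.0, state_sd=3.0, seed=0):
    """Share of simulated elections in which candidate A reaches 270 votes."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    wins = sum(
        simulate_once(rng, national_sd, state_sd) >= VOTES_TO_WIN
        for _ in range(runs)
    )
    return wins / runs


p_a = win_probability()
print(f"Candidate A wins in {p_a:.1%} of simulated elections")
print(f"Candidate B wins in {1 - p_a:.1%} of simulated elections")
```

Candidate A leads in every swing state here, yet the simulation does not hand A a certain win: once the polls are allowed to be wrong, the trailing candidate keeps a real chance. Setting both `national_sd` and `state_sd` to 0 turns the polls into certainties and gives A a 100% win probability, which is exactly the overconfidence the uncertainty is meant to prevent.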

Upcoming Dutch election

Given the unexpected, but in hindsight not so unexpected, outcomes of the last ‘big’ votes, the Brexit referendum and the U.S. election, I’m really curious, and maybe a bit anxious, about how the upcoming Dutch election will turn out. (With sampling bias in mind, allowing people to vote directly online seems like a very bad idea.) Hopefully the pollsters will be a bit more open about the uncertainty of their predictions. But until then: I’ll trust the polls; I just won’t trust their sample.