Below, I give a brief and very basic introduction to the main concepts involved in the polling process, to set the stage for the main point I would like to make in the last few paragraphs. If you are familiar with statistics and probability theory, feel free to skip to those last few paragraphs, where I explain the main point of this post.
Let’s assume we have an election, for a single position, between nominees from two parties, party \(A\) and party \(B\), and denote these nominees by \(n_A\) for party \(A\) and \(n_B\) for party \(B\). Let’s also denote, by \(V_A\), the set of all voters who plan on voting for \(n_A\), and by \(V_B\), the set of all voters who plan on voting for \(n_B\). One can imagine that members of \(V_A\) and \(V_B\) can be from different demographics, with different backgrounds, levels of education, and/or work experience, and even with different reasons of their own for their decision to vote for the specific candidate. Nevertheless, all members of each set have one thing in common, and that is they are planning on voting for that specific candidate. Let’s denote these two common properties by \(v(x) = n_A\) and \(v(x) = n_B\), to say that person \(x\) is planning on voting for \(n_A\) or \(n_B\). With this definition, we can more formally define the sets \(V_A\) and \(V_B\) as follows:
\(V_A = \{x \in V: v(x) = n_A\}\) ,
\(V_B = \{x \in V: v(x) = n_B\}\) ,
where \(V\) is the set of all potential voters in the upcoming election, i.e., eligible voters who are actually planning on voting in the upcoming election. We also denote, by \(|V_A|\) and \(|V_B|\), the size of each of these sets, i.e., the number of the members of each set. Since we are assuming that each person can vote for only one of the two candidates, the sets \(V_A\) and \(V_B\) have no intersection:
\(V_A \cap V_B = \emptyset\) .
Furthermore, for simplicity we assume that the domain of the function \(v(x)\) above is the entire set \(V\), and that it takes only the two values \(n_A\) and \(n_B\) (no write-in candidates, etc.). As a result, the union of the two sets \(V_A\) and \(V_B\) is \(V\):
\(V_A \cup V_B = V\) .
If we define \(\alpha_V = {|V_A| \over |V|}\) and \(\beta_V = {|V_B| \over |V|}\), the goal of election polling is to find a reliable estimate of \(\alpha_V\) (or, equivalently, of \(\beta_V\), since \(\alpha_V + \beta_V = 1\)).
The above deterministic formulation of the problem, while it may seem more natural, actually makes calculations more difficult. A simpler formulation treats the \(\alpha_V\) and \(\beta_V\) values as probabilities, i.e.,
\(\alpha_V = \text{Prob}\{v(x) = n_A\}\) ,
\(\beta_V = \text{Prob}\{v(x) = n_B\}\) .
In this formulation, we have \(\text{E}\{|V_A|\} = \alpha_V |V|\) and \(\text{E}\{|V_B|\} = \beta_V |V|\).
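As a rough illustration of this probabilistic view, the following sketch draws each potential voter's preference independently with probability \(\alpha_V\) and compares the resulting \(|V_A|\) with its expected value \(\alpha_V |V|\). The function name `simulate_electorate`, the electorate size, and the value 0.52 are assumptions chosen only for the example.

```python
import random

def simulate_electorate(num_voters, alpha_v, seed=0):
    """Draw each voter's preference independently: "A" with probability alpha_v, else "B"."""
    rng = random.Random(seed)
    return ["A" if rng.random() < alpha_v else "B" for _ in range(num_voters)]

# Hypothetical values, chosen only for illustration.
alpha_v = 0.52
electorate = simulate_electorate(100_000, alpha_v)

size_v_a = electorate.count("A")           # realized |V_A|
expected_v_a = alpha_v * len(electorate)   # E{|V_A|} = alpha_V * |V|
print(f"|V_A| = {size_v_a}, expected = {expected_v_a:.0f}")
```

The realized count will generally differ slightly from the expected value, and the relative difference shrinks as \(|V|\) grows.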
Now, let’s consider an election poll that, among other questions, also asks the participants which of the two candidates they are planning to vote for in the upcoming election. Let’s denote, by \(P\), the set of all participants in the poll (where \(P \subset V\)), and by \(P_A\) and \(P_B\), the sets of poll participants who respond to that question according to the letter used in the naming of the set, e.g., \(P_A\) is the set of poll participants who indicated that they were planning on voting for \(n_A\):
\(P_A = \{x \in P: v(x) = n_A\} = P \cap V_A\) ,
\(P_B = \{x \in P: v(x) = n_B\} = P \cap V_B\) .
Note that we are assuming that all poll participants are among the potential voters, and do answer the specific question regarding their choice between candidates \(n_A\) and \(n_B\), and that they are truthful in their answers. With these assumptions, we have
\(P_A \cap P_B = \emptyset\) ,
\(P_A \cup P_B = P\) ,
and if we define \(\alpha_P = {|P_A| \over |P|}\) and \(\beta_P = {|P_B| \over |P|}\), one might assume \(\alpha_P\) to be a good estimate of \(\alpha_V\). This may indeed be the case, under certain conditions. The accuracy of the estimate depends on several factors, among which are the poll size (\(|P|\)) and the sampling method. For example, we may assume uniform and independent sampling of the potential voter population, i.e., potential voters are selected independently of each other and with the same probability. Note, however, that the sampling process is itself a two-step process: (1) outreach and (2) participation. The first step is typically the one that the pollster can control, e.g., they may choose phone numbers according to a uniform distribution over the set of phone numbers of all potential voters. Let’s say the pollster reaches out to each potential voter with probability \(p_r\) in this step. (In practice, pollsters may partition the target population into several demographic sectors and address the non-homogeneous nature of political preferences among different demographics by assigning different weights when sampling and/or when processing the data of each demographic, but we ignore that process here, for simplicity.) The second step, however, is not usually under the pollster’s control. It depends on characteristics that are specific to the individual voters. If, only for the moment, we assume that the probability of participation is the same for all potential voters, say \(p_p\) (the probability that a voter who has been reached agrees to take part), then the overall sampling probability is \(p = p_r \times p_p\). With these assumptions, we will have
\(p = \text{Prob}\{x \in P\}\) ,
\(\text{E}\{|P|\} = p |V|\) .
Now, similar to what we did above, we can write the variables \(\alpha_P\) and \(\beta_P\) as probabilities:
\(\alpha_P = \text{Prob}\{v(x) = n_A | x \in P\} = {\text{Prob}\{v(x) = n_A \land x \in P\} \over \text{Prob}\{x \in P\}} = {\alpha_V \times p \over p} = \alpha_V\) ,
\(\beta_P = \text{Prob}\{v(x) = n_B | x \in P\} = {\text{Prob}\{v(x) = n_B \land x \in P\} \over \text{Prob}\{x \in P\}} = {\beta_V \times p \over p} = \beta_V\) ,
which seems to imply that a reliable estimate of \(\alpha_V\) can be obtained from a reliable estimate of \(\alpha_P\) (which we could obtain with a large enough sample size). Note, however, that the above two equations rely on the assumption that the two events \(x \in P\) and \(v(x) = n_A\) (or \(n_B\)) are independent; this is what allows the joint probability in the numerator to be written as the product \(\alpha_V \times p\). This is what we meant above when we assumed, for the moment, that the probability of participation was the same for all potential voters, and denoted that probability by \(p_p\). In reality, this may not be the case, and the dependence can skew the results considerably.
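The equality \(\alpha_P = \alpha_V\) under the independence assumption can be checked numerically. The sketch below is only an illustration: the helper `run_poll`, the electorate size, and the values of \(\alpha_V\) and \(p\) are hypothetical, and every potential voter is included in the poll independently with the same probability \(p\).

```python
import random

def run_poll(electorate, p, rng):
    """Include each potential voter independently with the same probability p."""
    return [choice for choice in electorate if rng.random() < p]

rng = random.Random(42)

# Hypothetical electorate: each voter prefers "A" with probability 0.52.
alpha_v = 0.52
electorate = ["A" if rng.random() < alpha_v else "B" for _ in range(200_000)]

# Overall sampling probability p = p_r * p_p, identical for every voter.
poll = run_poll(electorate, p=0.05, rng=rng)

alpha_v_actual = electorate.count("A") / len(electorate)  # share of A in V
alpha_p = poll.count("A") / len(poll)                     # share of A in P
print(f"alpha_V = {alpha_v_actual:.4f}, alpha_P = {alpha_p:.4f}, |P| = {len(poll)}")
```

Because inclusion in the poll does not depend on the voter's preference here, the two printed shares should agree up to ordinary sampling noise.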
Consider, for example, the ideal scenario in which \(p_p\) is the same for all voters regardless of their candidate preference. In this case, the above equations hold, and the pollster can simply focus on making the estimate of \(\alpha_P\) as accurate as possible by, e.g., ensuring that \(p_r\) is also independent of voters’ preferred candidates, and that the sample size \(|P|\) is large enough for the estimation error to be small. If these conditions, which are largely under the pollster’s control, are satisfied, the polling will be successful, and the results will be reliable.
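To give a sense of what "large enough" means for \(|P|\), here is a minimal sketch based on the usual normal approximation for a sample proportion, which is a standard result and not something derived in this post. The function name `margin_of_error` and the default 95% z-value of 1.96 are assumptions made for the example, and the formula applies only in the ideal scenario of uniform, independent sampling.

```python
import math

def margin_of_error(alpha_p, sample_size, z=1.96):
    """Approximate half-width of a confidence interval for a sample proportion.

    Uses the normal approximation z * sqrt(alpha_p * (1 - alpha_p) / n);
    z = 1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(alpha_p * (1.0 - alpha_p) / sample_size)

# How the error shrinks as the poll size |P| grows, for a close race.
for n in (100, 1_000, 10_000):
    print(n, round(margin_of_error(0.5, n), 4))
```

The error shrinks only with the square root of the sample size, so halving it requires roughly four times as many participants.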
Now consider two extreme scenarios:
- Scenario 1: \(p_p(x) = \left\{ \begin{matrix} 1 & \text{if } v(x) = n_A \\0 & \text{otherwise} \end{matrix} \right.\)
- Scenario 2: \(p_p(x) = \left\{\begin{matrix} 1 & \text{if } v(x) = n_B \\0 & \text{otherwise} \end{matrix} \right.\)
Note the dependence on \(x\) added to \(p_p\), since in these scenarios the participation distribution is not uniform across all potential voters. It is not too difficult to verify that in Scenario 1, \(\alpha_P = 1\) (and hence \(\beta_P = 0\)), and in Scenario 2, \(\alpha_P = 0\) (and hence \(\beta_P = 1\)). These results are obviously very unlikely to be anywhere near the true values. Although these two examples are very extreme and unlikely, they show how polling results can be affected by factors that are not under the pollster’s control. In the above two examples, even if the sampling distribution is perfect, in the sense that it uniformly selects/reaches out to potential voters regardless of their preferred candidate, the end results can be very skewed due to the non-uniform distribution of participation.
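The two scenarios can be viewed as special cases of a more general expression. As a sketch, assume (these symbols are introduced here only for illustration) that the probability of ending up in the poll depends only on the preferred candidate, with \(p_A = \text{Prob}\{x \in P | v(x) = n_A\}\) and \(p_B = \text{Prob}\{x \in P | v(x) = n_B\}\). Writing the denominator \(\text{Prob}\{x \in P\}\) as \(\alpha_V \, p_A + \beta_V \, p_B\), we get

\(\alpha_P = \text{Prob}\{v(x) = n_A | x \in P\} = {\alpha_V \, p_A \over \alpha_V \, p_A + \beta_V \, p_B}\) .

When \(p_A = p_B = p\), this reduces to \(\alpha_P = \alpha_V\), as before. Scenario 1 corresponds to \(p_B = 0\), which gives \(\alpha_P = 1\), and Scenario 2 to \(p_A = 0\), which gives \(\alpha_P = 0\). Much milder differences already matter: with the hypothetical values \(\alpha_V = 0.5\), \(p_A = 0.012\) and \(p_B = 0.008\), we get \(\alpha_P = {0.006 \over 0.006 + 0.004} = 0.6\), i.e., an evenly split electorate would appear as a 60/40 lead in the poll.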
One may ask how probable it is for the participation distribution to be non-uniform. Below, I will argue that it is indeed very likely, and that it can result in large discrepancies between the polling predictions and actual election results.
It seems reasonable to assume that the probability of participation (which indicates the level of willingness of a certain individual to participate in a poll) itself is affected by at least the following two factors:
- The level of passion the individual has about voting in the election or in general in the cause they believe in, and
- The amount of extra time the individual has at hand to answer the questions in the polling questionnaire.
For close elections, it is true that most potential voters have a somewhat elevated level of concern and passion, but that does not mean that every potential voter feels as strongly about their cause as every other. For example, one very passionate voter may decide to volunteer at a local campaign office for their preferred candidate. Another very passionate voter, however, may decide to send threatening texts or social media messages to the first individual, to intimidate and discourage them from participating in the campaign efforts, if the preferred candidates of the two are different. Although both individuals demonstrate strong passion for their cause, their levels of passion are certainly on different scales, and one may not assume that the corresponding \(p_p\) values for these two individuals are the same. The same goes for the second factor: no matter how passionate a specific individual may feel about the upcoming election, if they have a very demanding work and/or life schedule, they may not be able to participate in the poll. On the other hand, an individual with a lot of free time on their hands will be more inclined to participate in the poll, even if they do not feel the same level of passion as the first individual.
One way to address this issue is to make \(p_r\) also dependent on voters’ candidate preference, to offset the effect of the non-uniform participation distribution. However, this intentional adjustment of the sampling distribution should be done very carefully; otherwise, the adjustment itself may introduce additional inaccuracies in the polling results.
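The effect of non-uniform participation, and of correcting for it, can be sketched with a small simulation. Note that this sketch does not adjust \(p_r\); it swaps in a related technique, reweighting the collected responses by the inverse of the participation probability. It also assumes, unrealistically, that the participation probabilities of the two groups are known exactly. All function names and numerical values below are hypothetical.

```python
import random

def biased_poll(electorate, p_r, p_p, rng):
    """Keep each voter with probability p_r * p_p[preference] (outreach, then participation)."""
    return [v for v in electorate if rng.random() < p_r * p_p[v]]

def raw_share(responses, candidate="A"):
    """Unweighted share of responses for the given candidate (alpha_P)."""
    return responses.count(candidate) / len(responses)

def weighted_share(responses, p_p, candidate="A"):
    """Share for the candidate after weighting each response by 1 / p_p[preference]."""
    weights = [1.0 / p_p[v] for v in responses]
    favorable = sum(w for v, w in zip(responses, weights) if v == candidate)
    return favorable / sum(weights)

rng = random.Random(1)

# Hypothetical, evenly split electorate.
electorate = ["A" if rng.random() < 0.5 else "B" for _ in range(400_000)]
true_share = electorate.count("A") / len(electorate)

# Assumed participation probabilities: supporters of A respond more readily.
p_p = {"A": 0.6, "B": 0.4}
responses = biased_poll(electorate, p_r=0.1, p_p=p_p, rng=rng)

raw = raw_share(responses)
adjusted = weighted_share(responses, p_p)
print(f"true = {true_share:.3f}, raw poll = {raw:.3f}, reweighted = {adjusted:.3f}")
```

With these assumed values, the raw poll share should land near 0.6 while the reweighted share should land near the true value of about 0.5. The correction is only as good as the estimates of \(p_p\): if the assumed participation probabilities are wrong, the reweighting introduces its own error, which is the caution raised above.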
With the seemingly very close upcoming presidential election, the different levels/scales of passion that the followers of each candidate exhibit, and the potentially large variance in the types of occupations and family obligations of the potential voters for each candidate, I believe political pollsters have a greater responsibility to try to quantify the participation distribution more accurately (values of \(p_p\) for different sectors of the voting population), and to factor those values in when making predictions about the results of the election. This is, of course, a suggestion for legitimate pollsters, who are interested in providing accurate data and predictions regardless of their own political views. Those who intentionally engineer polling data and results in order to influence the actual election are political operatives rather than true pollsters, and have no serious regard for the integrity of the data or the truthfulness of the results they present.
One response to “How Reliable Are Election Polls?”
Today’s parliamentary election results from France can be considered as an example of the issue discussed in this post. While certainly there were other factors involved as well, I believe the lack of accurate accounting for the participation bias discussed above has played a significant role in making the predictions from election polls so incorrect.