Motivation and Goal

The Problem: About 9-12% of voters in past Michigan elections were late registrants, voters who registered within a year of that election. This portion of the electorate is currently either not registered in Michigan or registered at an alternate address.

The Need: A representative registration-based sample of the expected electorate in Michigan’s 2020 November election using voter file information that is from September 2019.

The Hope: That there are predictive characteristics of late registration and turnout that allow us to oversample from currently registered voters so that they can approximate the profile that characterizes our expectations of these future late registrants.

Background on the Data

Campaign pollsters often complain about the Michigan Qualified Voter File (QVF) being a “mess”. It is actually two different files. The first is the election history file, which records which voter identification numbers turned out in each past election. The second is the registered voter file, which provides limited personal information attached to a voter identification number.

These complaints have several sources. There is no party registration and no state record of which party’s primary a voter participated in. These records may be available from local clerks, and some commercial files claim to offer them, but they are not part of the official state record, and it is unclear how a private file would accurately retrieve them.

A more relevant reason why Michigan’s file is difficult to utilize is that it is the work of 1599 separate jurisdictions and clerks that are in charge of maintaining registration and voter history records. The QVF provides state level coordination to this effort, but it is updated at different times by different jurisdictions. The only thing that is static within the file is the record of which voter identification numbers turned out in past elections.

Each time a Michigan resident moves out of one jurisdiction and into another jurisdiction, that resident’s voter identification number gets changed, even if that voter remains in the state. There is no institutional mechanism for identifying voter turnout under prior voter identification numbers.

For instance, 17.5% of registered voters and 16.9% of those who turned out in the 2016 election registered since the 2014 midterm and had no available turnout record. That makes voter-file projections based on past turnout history difficult, since much of that history is missing.

The Predictability of Late Registration

These data features complicate any attempt to make a survey drawn from currently registered voters representative of the projected electorate in November 2020.

However, there are some patterns to late registration and turnout that correlate with some of the limited voter characteristics recorded in the QVF.

By Age

One of those is age. Not surprisingly, about 40-50% of the late registrants who turn out are 30 or younger. That represents a substantial difference in the projected age distribution of the electorate.

These plots show that young people would have been underrepresented in a survey in 2016 or 2018 if using an early voter file sample. But how much the two samples differ in their share of young voters is obscured by the difference in the overall turnout rates of 2016 and 2018.

To demonstrate this point, I recalibrate these proportions into factor changes, which divide the proportions represented by the dots in red (proportion of all voters) by the dots in blue (proportion of early registered voters). These factors measure the extent to which each age category proportion grows or shrinks when including all registered voters on election day.
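As a minimal sketch of that calculation, the factor change for a group is its share among all voters divided by its share among early registered voters. The function name, age bins, and counts below are made up for illustration and are not QVF figures.

```python
def factor_changes(all_voters, early_voters):
    """Share of each group among all voters divided by its share
    among early registered voters."""
    total_all = sum(all_voters.values())
    total_early = sum(early_voters.values())
    factors = {}
    for group, n_all in all_voters.items():
        share_all = n_all / total_all
        share_early = early_voters[group] / total_early
        factors[group] = share_all / share_early
    return factors


# Hypothetical turnout counts by age bin (illustration only).
all_voters = {"18-29": 200, "30-39": 200, "40-64": 400, "65+": 200}
early_voters = {"18-29": 120, "30-39": 160, "40-64": 340, "65+": 180}

factors = factor_changes(all_voters, early_voters)
for group, factor in factors.items():
    print(group, round(factor, 3))
```

A factor above 1 means the group makes up a larger share of the final electorate than of the early registrants; a factor below 1 means its share shrinks.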

So now we see a type of relationship that can be projected into 2020. The pattern of how much an age group is over- or under-represented retains proportionality from 2016 to 2018 and looks fairly smooth. People under 40 register late at a higher rate and become a larger proportion of the electorate than people over 40, who ultimately constitute a smaller proportion of the electorate than among the early registrants.

By Jurisdiction

Another clear characteristic associated with late registrant turnout is the jurisdiction of residence. As above, I focus on factor change scores, which measure the relative change in a group’s proportion of the electorate when moving from early registrants to all registered voters.

When comparing 2016 and 2018 factor changes in each jurisdiction’s proportion of the early registered and final electorate, we see a strong, nearly linear correlation. Since age is such a strong predictor of late registrant turnout, it is perhaps not surprising that jurisdictions also have a cross-election tendency to provide either mostly early registered voters or mostly late registered voters.

Indeed, given the high correlation, a somewhat backhanded but effective approach is to simply average the two factor scores and then recalibrate them for 2020 to be on the same turnout scale as 2016. That is, take the average of the factor scores by jurisdiction, regress the 2016 score on that average score, and then use the fitted value as the expectation for 2020.

The fitted line above demonstrates the basic idea. For those jurisdictions that averaged a factor change of 1.1, the linear regression rescaling would translate that average factor change across 2016 and 2018 to an expected level of 1.15 on the 2016 scale.
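The averaging and rescaling step can be sketched as follows, using ordinary least squares with a single predictor. The factor scores here are invented for four hypothetical jurisdictions, and `fit_line` is an illustrative helper rather than the code behind the figure.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical factor changes for four jurisdictions (illustration only).
factor_2016 = [0.90, 1.00, 1.15, 1.30]
factor_2018 = [0.94, 1.00, 1.05, 1.14]

# Step 1: average the two elections' factor scores by jurisdiction.
average = [(a + b) / 2 for a, b in zip(factor_2016, factor_2018)]

# Step 2: regress the 2016 score on the average score.
intercept, slope = fit_line(average, factor_2016)

# Step 3: use the fitted values as the 2020 expectation (on the 2016 scale).
expected_2020 = [intercept + slope * a for a in average]
```

Because the regression target is the 2016 score, the fitted values put the pooled signal from both elections back on the presidential-year turnout scale.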

It is possible to refine this estimate even further, down to the level of a jurisdiction’s precincts. However, the smaller numbers in many communities make this tactic less efficient, and it may be unnecessary if age is also included as an identifier of greater rates of late registration turnout.

Proposing a Split-Sample RBS and Synthetic Sample Approach

Having identified two fairly predictable patterns of late registration and turnout contribution, we can now consider possible approaches for achieving representativeness from an early registration sample. I see two possible approaches:

The well-worn approach would be to apply post-stratification weighting. This approach would simply overweight respondents in a registration-based sample to re-adjust the sample to approximate the projected distribution of voters by age and jurisdiction. A big problem with this approach, however, is that it would likely require some aggregation of jurisdiction units and binning of age to eliminate zeroed-out cells and allow practical weight estimates. Considering the degree of aggregation this would require, it is unclear whether the reduction in bias would offset the loss in efficiency.
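To make the empty-cell problem concrete, here is a minimal sketch of cell-based post-stratification, where each respondent's weight is the cell's target share divided by its sample share. The cell definitions, sample, and target shares are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_cells, target_shares):
    """Return (weights by cell, list of target cells with no respondents).

    weight = target share of the cell / sample share of the cell.
    Cells with no respondents cannot be weighted and are reported separately.
    """
    counts = Counter(sample_cells)
    n = len(sample_cells)
    weights = {}
    empty_cells = []
    for cell, target in target_shares.items():
        if counts[cell] == 0:
            empty_cells.append(cell)
        else:
            weights[cell] = target / (counts[cell] / n)
    return weights, empty_cells


# Hypothetical respondents, each tagged with an (age bin, jurisdiction) cell.
sample_cells = (
    [("under40", "A")] * 2 + [("40plus", "A")] * 5 + [("40plus", "B")] * 3
)
# Hypothetical projected electorate shares for each cell.
target_shares = {
    ("under40", "A"): 0.3,
    ("40plus", "A"): 0.4,
    ("40plus", "B"): 0.2,
    ("under40", "B"): 0.1,
}

weights, empty_cells = poststratification_weights(sample_cells, target_shares)
```

In this toy example the under-40 cell in jurisdiction B has no respondents, so it would have to be merged with a neighboring cell before weighting. With many jurisdictions crossed with fine age bins, that kind of merging becomes the rule rather than the exception.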

A more complicated but appealing approach would be to use split sampling, in which one sample is based on an RBS and the other uses sample matching from a synthetic projection of late registrants. There are various degrees of intensity to this approach, but it would follow the same basic process.

The idea is to perform registration-based sampling of voters from the current 2019 QVF based on predicted turnout patterns. This would constitute approximately 90% of the overall sample. The second sample approximates the distribution of late registrants by modeling a synthetic sample based on the patterns identified above. That sample then draws a matching profile from the current QVF (based on age and location) but with a modified turnout profile to represent a late registrant.

I think the logic of this approach is straightforward enough and appealing. But I remain open to considering alternative means for developing the synthetic sample. My intuition is that, instead of simply generating a synthetic sample from the past 2016 and 2018 files, I would prefer to generate a synthetic sample from the current QVF using the pattern estimates found in 2016 and 2018. That routine proceeds as follows:

  1. Estimate the relationship between late registrant turnout and age (i.e., Figure 2) in 2016 and 2018 across the state using a (weighted) multilevel random effects regression with a log transformation of age and various election-specific constants[1], estimating random slope and constant variation at either the precinct or jurisdiction level.

  2. Generate fitted values from those estimates using the empirical Bayes estimates of constant and slope variation for each precinct/jurisdiction.

  3. Apply these fitted values to estimate an individual-level profile factor score for each registrant in the 2019 QVF based on the registrant’s age and precinct/jurisdiction.

  4. Draw a weighted sample of individuals from the current QVF based on this factor score. Modify the profiles in this sample to represent a late registrant for turnout modeling and sample auditing.
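Step 4 can be sketched as follows. This is only an illustration under stated assumptions: selection probability is taken to be proportional to the factor score, the draw is without replacement (using the Efraimidis-Spirakis key method), and "representing a late registrant" is assumed to mean flagging the record and blanking its turnout history. The field names `factor_score`, `turnout_history`, and `late_registrant` are hypothetical.

```python
import random


def draw_synthetic_late_registrants(records, k, seed=0):
    """Weighted sample of k records without replacement, with weights
    given by each record's factor score (which must be positive)."""
    rng = random.Random(seed)
    # Efraimidis-Spirakis: key = u ** (1 / weight); keep the k largest keys.
    keyed = [(rng.random() ** (1.0 / r["factor_score"]), r) for r in records]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    sample = []
    for _, record in keyed[:k]:
        profile = dict(record)  # copy so the source file is left untouched
        # Assumption: a late registrant has no usable turnout history.
        profile["late_registrant"] = True
        profile["turnout_history"] = []
        sample.append(profile)
    return sample


# Hypothetical registrant records with fitted factor scores (illustration only).
records = [
    {"id": 1, "age": 22, "factor_score": 1.40, "turnout_history": ["2018"]},
    {"id": 2, "age": 35, "factor_score": 1.10, "turnout_history": ["2016", "2018"]},
    {"id": 3, "age": 58, "factor_score": 0.90, "turnout_history": ["2016"]},
    {"id": 4, "age": 71, "factor_score": 0.85, "turnout_history": ["2016", "2018"]},
]

synthetic = draw_synthetic_late_registrants(records, k=2, seed=42)
```

Whether the raw factor score is the right sampling weight, or whether it should first be converted into an expected late-registrant share, is an open design choice that this sketch does not settle.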

I am working on implementing this approach right now. If you have any comments on this approach please let me know. Otherwise, I hope to provide a later post updating you all on its application and success.


[1] One can even include additional observations and controls to get contrasting estimates of these factor changes for primary and general elections.