Note: This report is an abbreviated version of Scott M. Smith’s Masters Paper in the UNC Department of Statistics and Operations Research. You can download and read the full paper as a PDF.
North Carolina has changed rapidly over the last 50 years, and this has had a large effect on the demographics and politics of the state. Growing from a population of a little over 5 million people in 1970 to over 10 million today, it is now the ninth most populous state in the U.S. Once a largely Democratic state, the tide has shifted, resulting in the election of a Republican majority in both houses of the state legislature in 2010 for the first time in 112 years.
Too often, we are tempted to oversimplify political segments into red vs. blue or Republican vs. Democratic. As data scientists, we were asked by Professor Ryan Thornburg to look beyond these standard labels to try to better understand the complexity of our state. To quote Professor Thornburg, “…could we begin to use data to describe voters in different ways than just Democrat/Republican or rural/urban? Are there some shared concerns that, if we can just figure out how to connect groups of people …might have some positive effect on civil discourse in North Carolina...”
Our objective was to use data to help journalists gain insight into political and demographic groupings of the North Carolina electorate. These groupings should enable journalists to identify interesting patterns that could be explored in detail for potential story ideas. These groupings should not mirror the simple, divisive categories commonly used today; they need to dig deeper to find subtle similarities and differences that can help explain the current climate. They must be granular enough to be interesting but not so granular as to be complex or confusing.
In 2014, the Pew Research Center issued an analysis of the “political typology” of the U.S. voter. This analysis used an extensive survey to sort voters “… into cohesive groups based on their attitudes and values.” They used cluster analysis to categorize voters into 8 segments, basing this clustering on survey respondents’ answers to 23 questions. We were tasked with using a similar clustering approach, based on publicly available data and focused on North Carolina voters.
We extracted publicly available data containing registered voters’ demographics, turnout in the 2016 general election, and precinct-level election returns. This data was used as input to build a model to cluster the 2,704 election precincts in the state. We chose precincts rather than individual voters so that we could use actual 2016 election results (which of course are not available at the individual voter level). We used 14 characteristics (features) as input to our cluster model, and the result was seven clusters containing precincts with similar combinations of voting results, turnout, and demographics. We intentionally did not use geography as input to the model, preferring instead to create an interactive cluster map that would allow us to visualize geographical dimensions independent of the model. Our expectation was that geographic patterns would become evident, on the hypothesis that people sort themselves into geographically homogeneous groups. The picture below displays these clusters in seven unique colors.
Click on the Picture Below to View an Interactive Version of the North Carolina Cluster Map
Clustering is a machine learning technique designed to sort entities (e.g. documents, people, products, precincts) into groups (clusters) with similar characteristics (features). It is a type of unsupervised learning, meaning that an algorithm arranges the entities into clusters without a human labeling them in advance. As with all modeling, cluster modeling will almost never produce perfect results. Although many entities may have a strong affinity to a specific cluster, there will be entities that could be part of several clusters but must be assigned to one.
Table 1 below lists the features that we used to build the cluster model. Since we were clustering precincts rather than voters, we computed statistical summaries of the voter data to be used as features for each precinct. This table summarizes the mean value of each feature over the 2,704 precincts in North Carolina.
|Feature||Mean Value|
|Fraction Voted for Trump||0.53|
|Trump Premium over Newton||0.01|
|Turnout 2016 General Election||0.69|
|Self Identified as Not Born in N.C.||0.39|
|Self Identified as Having Driver's License||0.83|
Included in our feature list is a synthetic feature called Trump Premium. In the 2016 general election, North Carolina held a statewide race for Attorney General between Republican Buck Newton and Democrat Josh Stein. This race was unusual in having no incumbent running and minimal name recognition for either candidate, so we treated it as a proxy for a contest between a generic Republican and a generic Democratic candidate. We calculated the difference between the fraction of votes received by Trump and the fraction of votes received by Newton and used this as a measure of Trump’s appeal (or lack thereof) to voters over and above their traditional party preferences.
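The Trump Premium calculation can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the function names and the vote counts are hypothetical, and it assumes each fraction is the candidate's votes divided by the total votes cast in that contest within the precinct.

```python
def vote_fraction(candidate_votes: int, total_votes: int) -> float:
    """Share of a contest's votes won by one candidate (0.0 if no votes were cast)."""
    return candidate_votes / total_votes if total_votes else 0.0


def trump_premium(trump_votes: int, president_total: int,
                  newton_votes: int, ag_total: int) -> float:
    """Trump's vote fraction minus Newton's vote fraction for one precinct."""
    return (vote_fraction(trump_votes, president_total)
            - vote_fraction(newton_votes, ag_total))


# Hypothetical precinct: Trump wins 550 of 1,000 presidential votes and
# Newton wins 520 of 1,000 Attorney General votes.
premium = trump_premium(550, 1000, 520, 1000)
print(round(premium, 2))  # 0.03
```

A positive value means Trump ran ahead of Newton in the precinct, and a negative value means he ran behind.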
Click on the picture below to view an interactive map of the Trump Premium. Positive numbers (red and orange) indicate that Trump outperformed Newton, and negative numbers (blue and green) indicate that Newton outperformed Trump.
We created a series of Gaussian Mixture Models using the above features, aiming to produce enough clusters to provide interesting, meaningful insight into the North Carolina electorate while avoiding a confusingly large number of clusters. We considered many models, varying both the number of clusters and the degree of feature filtering, before selecting a seven-cluster model with moderate filtering as the best representation of demographic and political characteristics. Please click here to see the details on GitHub.
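As a rough illustration of fitting a series of Gaussian Mixture Models with different numbers of clusters, the sketch below uses scikit-learn's `GaussianMixture` on synthetic data. Everything specific here is an assumption made for illustration: the data is randomly generated rather than the precinct features, the features are standardized first, and candidates are compared by BIC, whereas the report describes choosing the model partly on how interpretable the clusters were.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a precinct feature matrix: 300 rows, 4 features,
# drawn around three hypothetical centers. Not the North Carolina data.
centers = np.array([
    [0.2, 0.7, 0.1, 0.5],
    [0.8, 0.3, 0.6, 0.4],
    [0.5, 0.5, 0.9, 0.9],
])
X = np.vstack([c + 0.05 * rng.standard_normal((100, 4)) for c in centers])

# Put all features on a comparable scale before clustering (an assumption).
X_scaled = StandardScaler().fit_transform(X)

# Fit one mixture model per candidate cluster count and record its BIC.
bic_by_k = {}
for k in range(2, 6):
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          n_init=3, random_state=0)
    gmm.fit(X_scaled)
    bic_by_k[k] = gmm.bic(X_scaled)

# Lower BIC is better; this is only one of several reasonable criteria.
best_k = min(bic_by_k, key=bic_by_k.get)
print(bic_by_k, best_k)
```

In practice a statistical criterion like this would be weighed against whether the resulting clusters are few enough, and distinct enough, to be useful to journalists.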
An important benefit of our modeling method is that, rather than only assigning each precinct to a single cluster, our Gaussian Mixture Model provided the probability of each precinct belonging to each of the seven clusters. Each precinct was then assigned to the cluster with the highest probability. We named the probability of a precinct belonging to its assigned cluster its cluster strength. The strongest possible cluster strength is 1.00, while the theoretical minimum is 1/7, or about 0.14, because the seven probabilities sum to one and the largest cannot fall below their average. Most precincts had cluster strength greater than 0.50.
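The assignment rule and the cluster strength measure can be sketched as follows. The probability matrix is made up and uses three clusters instead of seven to keep it readable; with scikit-learn, such a matrix would typically come from a fitted model's `predict_proba` method. The function name `assign_clusters` is hypothetical.

```python
import numpy as np


def assign_clusters(probs):
    """Return (assigned cluster index, cluster strength) for each row.

    Each row of `probs` holds one precinct's membership probabilities,
    which are expected to sum to 1.
    """
    probs = np.asarray(probs, dtype=float)
    labels = probs.argmax(axis=1)   # cluster with the highest probability
    strength = probs.max(axis=1)    # probability of that assigned cluster
    return labels, strength


# Hypothetical probabilities for three precincts and three clusters.
probs = np.array([
    [0.90, 0.05, 0.05],   # strong member of cluster 0
    [0.25, 0.40, 0.35],   # weak member of cluster 1
    [1/3, 1/3, 1/3],      # evenly split: strength equals the minimum, 1/3
])
labels, strength = assign_clusters(probs)
print(labels, strength)
```

The third row shows why the minimum strength is one divided by the number of clusters: a precinct that is equally likely to belong to every cluster still has to be assigned to one of them.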
Our final cluster model consists of the seven clusters described in Table 2 below. Detailed analyses of the characteristics of each cluster are available by clicking on the cluster name. An interactive visualization of the model is available by clicking this text or the thumbnail below. In the visualization, you can pan around the state, zoom in and out, and control the transparency of the cluster colors using a slider below the map. If you click on a precinct, demographic and political details on that specific precinct will be displayed.
|Cluster Name||Cluster Description|
|White, Balanced||Mostly white precincts that voted a little above average for Trump. Above average Republican and Unaffiliated, below average Democratic.|
|Diverse, Balanced||Diverse precincts with significant non-White composition. Voted below average for Trump, lower than for Newton.|
|Discouraged African-Americans||Mostly African-American precincts with very low voter turnout. Very Democratic.|
|Passionate Whites||Mostly white precincts with very high voter turnout. Very high vote for Trump, higher than for Newton. Very Republican.|
|Maybe-Trump Democrats||Precincts with a meaningful number of Democratic and/or Unaffiliated voters that supported Trump.|
|Discouraged Native Americans||Mostly Native-American precincts with very low voter turnout. Significant Trump support.|
|Soldiers and Students||Very young precincts with very low voter turnout. Large number of Unaffiliated voters. Significant Clinton support.|
Click on the thumbnail below to display the interactive visualization.
Precincts in the ‘White, Balanced’ cluster were moderate in nature. They tended to have primarily white voters, with some Asian and Hispanic voters but very few African-American voters. Politically, these precincts were moderately more Republican and Unaffiliated than average, and a little less Democratic. Other characteristics, including Trump/Clinton support, were about average. Click here to view a heat map of this cluster. Note that this heat map displays the degree to which each precinct belongs to cluster ‘White, Balanced’. Precincts with low values are, in fact, assigned to a cluster other than ‘White, Balanced’.
Precincts in the ‘Diverse, Balanced’ cluster had significant non-White composition. African-American, Asian, and Hispanic/Latino voters were well represented but not dominant. Clinton support was strong, and there was a meaningfully negative Trump Premium. These precincts tended to be more Democratic and less Republican. Turnout was average, voters were younger, and they tended not to be native to North Carolina, in the sense that they identified as non-NC born. Click here to view a heat map of this cluster. Note that this heat map displays the degree to which each precinct belongs to cluster ‘Diverse, Balanced’. Precincts with low values are, in fact, assigned to a cluster other than ‘Diverse, Balanced’.
Precincts in the ‘Discouraged African-American’ cluster tended to be located in urban areas. They were strongly African-American and non-White, and had very low turnout. They showed extremely strong support for Clinton and had a very negative Trump Premium. They were mostly Democratic, with extremely low Republican registration. Voters were also younger and tended to be native to North Carolina, in the sense that they did not identify as non-NC born. Click here to view a heat map of this cluster. Note that this heat map displays the degree to which each precinct belongs to cluster ‘Discouraged African-American’. Precincts with low values are, in fact, assigned to a cluster other than ‘Discouraged African-American’.
Precincts in the ‘Passionate Whites’ cluster tended to be located in the rural central part of NC. They showed extremely strong support for Trump and had a very positive Trump Premium. White voters tended to dominate, with many precincts over 90% White. There were almost no African-American voters. Turnout was very high, and voters tended to be older. Democratic registration was extremely low, and Republican registration was extremely high. Click here to view a heat map of this cluster. Note that this heat map displays the degree to which each precinct belongs to cluster ‘Passionate Whites’. Precincts with low values are, in fact, assigned to a cluster other than ‘Passionate Whites’.
Precincts in the ‘Maybe-Trump Democrats’ cluster tended to be located in the rural eastern part of NC. Precinct racial demographics were not extreme, but tended to be more African-American than White. Most precincts had above average Democratic and below average Republican registration. Voters were a little older and tended to be native-born North Carolinians. These precincts also tended to have a meaningfully positive Trump Premium. Click here to view a heat map of this cluster. Note that this heat map displays the degree to which each precinct belongs to cluster ‘Maybe-Trump Democrats’. Precincts with low values are, in fact, assigned to a cluster other than ‘Maybe-Trump Democrats’.
Precincts in the ‘Discouraged Native-Americans’ cluster were primarily located in a small area of southern NC. Native Americans were a large percentage of registered voters, and turnout was extremely low. Most precincts had high Democratic and low Republican registration. Support for Trump was high, and there was also a very high Trump Premium. Most voters were native-born North Carolinians. Click here to view a heat map of this cluster. Note that this heat map displays the degree to which each precinct belongs to cluster ‘Discouraged Native-Americans’. Precincts with low values are, in fact, assigned to a cluster other than ‘Discouraged Native-Americans’.
Precincts in the ‘Soldiers and Students’ cluster were primarily located near military installations or college campuses. Voters were young and tended not to affiliate with a political party. Republican registration was very low, and voters provided very strong support for Clinton, with a very negative Trump Premium. Turnout was very low, but we suspect this is because of the transient nature of these voters. Since voters don’t normally de-register, students who graduate and soldiers who are reassigned remain registered for several years but don’t vote. Click here to view a heat map of this cluster. Note that this heat map displays the degree to which each precinct belongs to cluster ‘Soldiers and Students’. Precincts with low values are, in fact, assigned to a cluster other than ‘Soldiers and Students’.
Ishan Shah also contributed to this project.