Data about voting and elections has been around for a long time. Twenty-five years ago, University of North Carolina political science professor Thad Beyle was using Excel to help students analyze and publish N.C. DataNet, and Phil Meyer was teaching classes about precision journalism. But growing computing power, the availability of open-source analytics tools and the ability to share and collect data over the Internet have drastically changed the power of data to tell stories about democracy.
But that shift has also increased the power inequality between those who have the time and skill to digest these vast troves of data and those who don’t. Journalists, and certainly average citizens, almost always fall into that second camp.
With the launch of the N.C. Votes project, I hope to use the technical advances of the last quarter century to help make the power of political data more widely accessible to all North Carolinians. That won’t happen at once, so here’s the roadmap of what we’ve done so far and where we plan on going.
Over the last year, I’ve worked with students and faculty from across UNC to collect, organize and analyze the public data that the state board of elections makes available online. As someone who used to transcribe numbers from photocopied papers, I can’t tell you how helpful, and how important, it is for the board to make so much of its data available for public download in machine-readable formats. It’s this kind of data transparency that improves trust in institutions.
But while there is a lot of data available from the state board of elections, it isn’t always in the format that is most useful for answering the questions that journalists, candidates and voters have about how our democracy works. For example, while the state board provides snapshots of voter registration, it doesn’t provide the data in a way that makes it easy to see changes in the state’s electorate in real time. In a dynamic state like North Carolina, that change is important to see. Our first step was to get the help of Caleb Smith in writing a program that would automatically download voter registration data from the board’s site and load it into our own database in a way that would allow us to follow trends in voter registration county-by-county and week-by-week. Today we’ve made the code that Caleb wrote publicly available on GitHub.
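To make the idea of following registration week-by-week concrete, here is a minimal Python sketch of how stored snapshots could be turned into county-level changes. It is not Caleb’s program: the function name, the tuple layout and the sample numbers are all made up for illustration, and the real database will look different.

```python
from collections import defaultdict


def weekly_change(snapshots):
    """Compute the week-over-week change in registered voters per county.

    `snapshots` is an iterable of (week, county, registered) tuples, where
    `week` is an ISO date string such as "2000-01-08" (ISO dates sort
    correctly as plain text). This layout is an assumption for the sketch.
    """
    by_county = defaultdict(dict)
    for week, county, registered in snapshots:
        by_county[county][week] = registered

    changes = {}
    for county, weeks in by_county.items():
        ordered = sorted(weeks)
        # Pair each snapshot with the one that follows it.
        changes[county] = [
            (later, weeks[later] - weeks[earlier])
            for earlier, later in zip(ordered, ordered[1:])
        ]
    return changes


# Invented numbers and placeholder dates, not real registration totals.
sample = [
    ("2000-01-01", "ORANGE", 110_000),
    ("2000-01-08", "ORANGE", 110_450),
    ("2000-01-01", "WAKE", 700_000),
    ("2000-01-08", "WAKE", 699_200),
]
changes = weekly_change(sample)
```

With the invented sample above, each county ends up with one entry per pair of consecutive snapshots, which is the kind of table a trend chart could be drawn from.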
Since the end of July we’ve been automatically downloading this data weekly and will soon be building web applications that make it easy for reporters to better understand this data in real time.
After we wrapped our heads around the challenge of tracking changes to the state’s electorate in real time, we tackled the challenge of gathering and organizing all of the state board’s election returns in one place. For this task, we turned to Bill Shi from UNC’s Odum Institute for Research in Social Science. Bill first needed to standardize precinct-level election returns into a format that would allow users to compare election returns from one campaign to the next. Over the last 10 years, the structure of the board’s election returns data has changed five times.
With Bill’s program, which we’re also making publicly available today, we now have a common format for most elections going back to 2011. Today we are publishing a single file containing election results for nearly 1.5 million combinations of candidates and precincts. This was data that was stored in about 20 files in several directories on the board of elections website. You can download it now as a 14.8 MB tab-delimited text file. But be careful when you open it: it is 137 MB uncompressed and has more than 1.5 million rows and 27 columns. It also still contains a decent amount of “dirty” data in its original format from the board’s files, such as a precinct in Union County that appears to be named “???”.
As we continue to work on this project we will add older election results that aren’t yet in machine-readable format and we’ll try to track down some of the results from municipal elections that aren’t on the board’s website.
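If a file this size is too much for a spreadsheet, one option is to read it a row at a time. The Python sketch below uses only the standard library and shows the general idea: count rows per county and flag precinct names with no letters or digits, like the “???” example. The column names (`county`, `precinct`, `candidate`, `votes`) are assumptions for illustration and may not match the headers in the published file, and the sample rows are made up.

```python
import csv
import io
from collections import Counter


def scan_results(handle, county_field="county", precinct_field="precinct"):
    """Stream a tab-delimited results file without loading it all at once.

    Returns a Counter of rows per county and a list of (county, precinct)
    pairs whose precinct name contains no letters or digits.
    The default field names are hypothetical.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows_per_county = Counter()
    suspicious = []
    for row in reader:
        rows_per_county[row[county_field]] += 1
        name = row[precinct_field].strip()
        if not any(ch.isalnum() for ch in name):
            suspicious.append((row[county_field], name))
    return rows_per_county, suspicious


# A tiny made-up stand-in for the real file.
sample = io.StringIO(
    "county\tprecinct\tcandidate\tvotes\n"
    "UNION\t???\tCandidate A\t12\n"
    "UNION\t001\tCandidate A\t340\n"
    "WAKE\t01-01\tCandidate B\t512\n"
)
counts, suspicious = scan_results(sample)
```

For the real download, the same function could be handed a file opened with `open(path, newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.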
As Caleb and Bill were acquiring the data and organizing it into a useful format, students from UNC’s Department of Statistics and Operations Research asked how they could get involved in the project, and they spent the summer testing out various statistical analysis techniques that might help us talk about voting and elections beyond the usual partisan labels. Is every Democrat like every other Democrat? How about Republicans? Do rural and urban voters have anything in common? A lot of the national framing of politics might have you thinking so, but Scott Smith and Ishan Shah have done enough analysis of the data that we think there might be a lot of other — and less divisive — ways we can explain the political environment in North Carolina. And if there’s not, their analysis will show that, too. Stay tuned. As we have updates and find new insights we’ll be posting them right here.
In the meantime, the code they have written to analyze the data is publicly available. Remember, this analysis is a first rough draft. We’re publishing it here so that if anyone sees anything we’re missing, they can let us know.
But the analysis of this data can’t just be left to the lab. That’s why I’ve been using it to teach journalism students about data reporting techniques in my class this semester. And it’s why journalism students Kirk Bado and Danielle Chemtob have been working with the News & Observer’s database editor, David Raynor, to begin to brainstorm about what kind of tools newsrooms would need to make this data truly useful. Danielle’s also been working on a story that will explain all of the pitfalls that citizens might encounter in the data. For example, the database shows more than a thousand 117-year-old voters in North Carolina, and one voter who appears to have already voted in the 2018 elections. Danielle dug deeper and found out the board has explanations for these and several other pitfalls that still create a steep learning curve for anyone who wants to better understand state politics.
So what comes next?
We’ve been a tightly knit little group of journalists and statisticians here in Chapel Hill for the last few months, but we want to make sure our work solves real problems and answers real questions for folks like you — reporters, publishers, candidates, and curious citizens alike.
Over the next few months we will begin releasing the cleaned data in structures that allow you to answer common questions about voter participation, demographic changes and precinct-level characteristics across the state and across time. We’ll be posting the data as CSVs that should be small enough for more people to be able to simply open them in Excel or Google Sheets.
The first full tool we plan to release will be a web-based application that will allow reporters to create random samples of voters based on characteristics the reporters define. For example, want to talk to young, urban white women who aren’t affiliated with a political party? We can help reporters make sure those voices get into their stories. Or rural African American voters who voted in 2008 and 2012, but stayed home in 2016? Or maybe you’re doing a story on your municipal elections and you want to avoid interviewing people who never vote in local contests. We think every reporter could use a tool like that and we think it will make the journalism in our state not only easier, but more diverse and representative of voices that don’t normally end up in news stories. And that, we think, can help elected officials better understand the citizens they serve and keep all of us from being caught off guard by changing voter attitudes.
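The tool itself doesn’t exist yet, but the core idea of filtering voters by reporter-defined characteristics and then drawing a random sample can be sketched in a few lines of Python. Everything here is hypothetical: the function name, the record fields (`area`, `voted`) and the made-up voters are stand-ins, not the design of the planned application.

```python
import random


def sample_voters(voters, matches, k, seed=None):
    """Return up to k randomly chosen voters for whom matches(voter) is true.

    `matches` is any function the reporter defines; `seed` makes the draw
    repeatable, which helps when documenting how sources were chosen.
    """
    pool = [v for v in voters if matches(v)]
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


# Made-up records; the field names are illustrative only.
voters = [
    {"id": 1, "area": "rural", "voted": {2008, 2012}},
    {"id": 2, "area": "rural", "voted": {2008, 2012, 2016}},
    {"id": 3, "area": "urban", "voted": {2008, 2012}},
    {"id": 4, "area": "rural", "voted": {2008, 2012}},
    {"id": 5, "area": "rural", "voted": {2012}},
]


def rural_dropoff(v):
    """Rural voters who voted in 2008 and 2012 but not in 2016."""
    return (
        v["area"] == "rural"
        and {2008, 2012} <= v["voted"]
        and 2016 not in v["voted"]
    )


picked = sample_voters(voters, rural_dropoff, k=2, seed=1)
```

Drawing at random from everyone who fits the description, rather than from whoever is easiest to reach, is what would help the sample reflect voices that don’t normally end up in news stories.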
While we’re working on building that tool, we’re also going to be analyzing hundreds of voting precincts across the state to see if there are other ways of categorizing them beyond the rural/urban or Democrat/Republican divides. By looking at voter demographics, voting behavior and election outcomes in different precincts we’ll be better able to understand how the state is changing. We wonder things like whether there are areas of unaffiliated voters that really behave like strong Republican precincts. Or are there some districts that consistently have high turnout and others that don’t? We want to make a website that shows the state in something other than just Red and Blue.
And while that’s all interesting, what can an individual voter do with this data? We don’t quite know yet. Maybe it’s better understanding the effects of legislative districts or municipal election turnout on the relative value of an individual vote. Maybe it’s connecting voters who are separated by geography but who otherwise share common backgrounds and interests. This is what I’ve been calling the productization of the reporting process. Yes, we can use the data to find and tell more interesting and relevant stories, but can we also build services and experiences that solve problems? It’s a little bit humanities, a little bit social science and what I think is the future of media and journalism.
We hope you’ll join us. You can share your ideas with us on Twitter or via email to email@example.com. If you’d like to join our workspace on Slack or fork our code on GitHub, we’d love for you to do that, too. We’re always looking for students, academics or anyone else who would like to contribute with questions, code or reporting that uses this data. And when we find them, they’ll be getting lots of love right here, along with all our other updates as the project moves along.