Differentiating Democrats using debate data:
2020 Democratic Primary Debate Bingo
Primary Elections in the United States are the first step towards choosing a President, with each of the major parties holding their own “mini elections” to decide who they will nominate for the position. One of the most well-known traditions of these contests is the meeting of the candidates on-stage for a series of debates.
A map of delegate allocation for the 2020 Democratic Primary election
Ostensibly, these debates help voters differentiate between candidates and decide which of them best represents their interests. In practice, however, it’s difficult to do so when there are 15-20 people all trying to get voters’ attention. Gamifying the debates using Bingo could help alleviate this problem, making them more fun and entertaining.
Bingo is a simple game in which players are given a 5x5 grid of randomly assigned values from a pool of possible results. Values from this pool are then drawn via some process, and if a drawn value appears on a player's card, its cell is filled in. A player wins when five filled cells form a line: horizontally, vertically, or diagonally. Traditionally, the values are randomly drawn integers, but for our purposes they will be words or phrases said by the candidates, which we hope are at least somewhat predictable.
An example of a traditional B-I-N-G-O card
People have long been playing games based on predictions (such as Bingo or drinking games), but we wanted to see if we could apply some data science techniques to the problem of generating Bingo cards for these debates that (1) result in a winner, (2) don’t result in too many winners, and (3) aren’t generally something that would be produced by a single person’s recollection of candidates’ typical speech.
Data was gathered with BeautifulSoup to scrape election transcripts from rev.com, a website that employs freelancers to type up transcripts of public events. There were a few formatting inconsistency issues, but we eventually managed to clean the data and get each statement by a candidate across all debates into a single data frame with the text of the statement, the speaker, the timestamp of the statement, and the specific debate as columns.
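The scraping step can be sketched roughly as follows. The HTML snippet and the speaker/timestamp line format here are hypothetical stand-ins (rev.com's real markup differs and changed over time, which is where the formatting inconsistencies came from), but the parsing shape is the same:

```python
import re
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical transcript markup; rev.com's actual page structure differs.
SAMPLE_HTML = """
<div class="transcript">
  <p>Joe Biden: (00:01:12) We have to restore the soul of this nation.</p>
  <p>Elizabeth Warren: (00:02:30) I have a plan for that.</p>
</div>
"""

# speaker, then a parenthesized HH:MM:SS timestamp, then the statement text
LINE_RE = re.compile(r"^(?P<speaker>[^:]+):\s*\((?P<time>[\d:]+)\)\s*(?P<text>.*)$")

def parse_transcript(html, debate_name):
    """Pull (speaker, timestamp, text) rows out of one transcript page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for p in soup.find_all("p"):
        m = LINE_RE.match(p.get_text(strip=True))
        if m:
            rows.append({"debate": debate_name,
                         "speaker": m["speaker"].strip(),
                         "timestamp": m["time"],
                         "text": m["text"]})
    return pd.DataFrame(rows)

df = parse_transcript(SAMPLE_HTML, "miami-night-1")
```

Concatenating one such frame per debate yields the single data frame described above.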
During exploratory data analysis, we produced word clouds that showed which words the candidates used most often, removing stop words such as “and” or “the”.
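The frequency counts underlying those clouds amount to counting non-stop-words per candidate. A minimal sketch (the tiny stop-word list and sample statements here are illustrative; the project used a standard English stop-word list):

```python
import re
from collections import Counter

# Illustrative stop-word list; a real one has a few hundred entries.
STOP_WORDS = {"and", "the", "to", "of", "a", "we", "i", "that", "is", "in"}

def top_words(statements, n=25):
    """Frequency counts that feed a word cloud, minus stop words."""
    counts = Counter()
    for text in statements:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)

sample = ["We have got to fight for the people",
          "The people have got real problems"]
print(top_words(sample, n=3))
```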
Doing a simple count of words spoken, it’s evident that there is a lot of overlap between candidates: the words “got” and “people” appear in almost every cloud. However, these aren’t words that we feel should qualify as stop words to be removed from consideration altogether, and even if we did remove them, other words common to all candidates would simply take their place. From this, we determined that in order to make Bingo cards that met our goals, we would need some way of filtering out words that are common to many candidates.
We also used the timestamps in the data frame to get estimates of how long a candidate speaks on average, as well as their total speaking time.
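One way to recover these estimates, assuming statements are in chronological order within each debate, is to treat the gap to the next timestamp as a statement's duration. The rows below are toy data shaped like our transcript frame:

```python
import pandas as pd

# Toy rows shaped like the transcript data frame described above.
df = pd.DataFrame({
    "debate":    ["d1"] * 4,
    "speaker":   ["Biden", "Warren", "Biden", "Sanders"],
    "timestamp": ["00:00:00", "00:00:45", "00:01:30", "00:02:00"],
})

# Convert HH:MM:SS timestamps to seconds, then estimate each statement's
# duration as the gap to the next statement within the same debate.
df["seconds"] = pd.to_timedelta(df["timestamp"]).dt.total_seconds()
df["duration"] = df.groupby("debate")["seconds"].diff(-1).abs()

per_candidate = df.groupby("speaker")["duration"].agg(["mean", "sum"])
```

The last statement of each debate has no following timestamp, so its duration is unknown (NaN) under this scheme.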
Interestingly, the candidates with a high average statement duration are the candidates who spend less time speaking overall. We assume this is because a higher proportion of their statements are opening and closing statements, which are much longer than a typical response in these debates.
The total duration in minutes is aggregated over the first few debates, but it makes clear something we already knew: the amount of time each candidate is allowed to speak varies wildly, and any analysis we do will have to account for that imbalance.
In order to generate lists of words that were unique to each candidate, we created a dictionary of all of the words that were spoken in the debate. We then created a binary vector for each statement in which the elements of the vector corresponded to the words in the dictionary. If a word appeared in a statement, its corresponding vector value was 1; the value was 0 otherwise.
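This is a binary bag-of-words encoding, which can be sketched directly (the example statements are made up):

```python
def build_vocabulary(statements):
    """Map each word spoken in any debate to a vector index."""
    vocab = {}
    for text in statements:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def binarize(text, vocab):
    """1 if the word appears in the statement, 0 otherwise."""
    vec = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] = 1
    return vec

statements = ["medicare for all", "all of us"]
vocab = build_vocabulary(statements)
print(binarize("medicare for all", vocab))  # [1, 1, 1, 0, 0]
```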
We then randomly separated the data for each candidate into a training set and a test set, each composed of two classes of equal size: statements made by the candidate and statements made by other candidates. To keep the classes balanced, we downsampled the "other candidates" class to match the number of statements made by the candidate.
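A sketch of that balanced split, with a hypothetical `balanced_split` helper (the exact split fraction is an assumption):

```python
import random

def balanced_split(own, others, test_frac=0.2, seed=0):
    """Downsample the 'other candidates' class to the candidate's own
    statement count, then split both classes into train/test portions."""
    rng = random.Random(seed)
    others = rng.sample(others, k=min(len(own), len(others)))

    def split(rows):
        rows = rows[:]
        rng.shuffle(rows)
        cut = int(len(rows) * (1 - test_frac))
        return rows[:cut], rows[cut:]

    (own_tr, own_te), (oth_tr, oth_te) = split(own), split(others)
    # Label 1 = this candidate's statement, 0 = someone else's.
    train = [(x, 1) for x in own_tr] + [(x, 0) for x in oth_tr]
    test = [(x, 1) for x in own_te] + [(x, 0) for x in oth_te]
    return train, test
```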
A support vector machine classifier was then trained on each of the candidates to determine which words in our dictionary were significant in separating each candidate from all of the other candidates. This process was repeated ten times and the results averaged.
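A minimal sketch of the one-vs-rest training and averaging, assuming a linear SVM (scikit-learn's `LinearSVC`) and toy data in place of the real statement vectors; the hyperparameters shown are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def separating_vector(X, y, runs=10, seed=0):
    """Average the linear SVM weight vector over several random restarts;
    large-magnitude weights mark words that separate this candidate."""
    rng = np.random.default_rng(seed)
    weights = []
    for _ in range(runs):
        idx = rng.permutation(len(y))  # reshuffle the rows each run
        clf = LinearSVC(C=1.0, max_iter=5000)
        clf.fit(X[idx], y[idx])
        weights.append(clf.coef_[0])
    return np.mean(weights, axis=0)

# Toy data: word 0 is used only by the candidate, word 1 only by others.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]] * 5)
y = np.array([1, 1, 0, 0] * 5)
w = separating_vector(X, y)
```

Sorting the vocabulary by these averaged weights gives the candidate-differentiating word lists used below.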
After getting the separating vectors, we verified that our results were consistent with our goal of generating words for each candidate that are not used by other candidates. To do this, we computed the dot product of these vectors for each pair of candidates.
Many of the dot products are negative, especially among candidates for which there is a lot of data, implying that their prediction vectors are misaligned. We were also curious to know which candidates were most similar to each other in this measure, so we wrote a script that picked out the other candidates with which each candidate was most and least misaligned.
We also recorded the accuracy with which the SVM classifier was able to classify candidates’ statements on average. This can be interpreted as a proxy for the distinctness of each candidate’s rhetoric, but this should be qualified by the fact that candidates with more data have generally higher prediction accuracies, so only candidates with comparable amounts of data should be compared in this way.
After all of the above work, we were ready to make some Bingo cards! For this task we used the support vector machine classifier discussed above to find the words that most differentiate a candidate from the rest of the pack. With these lists in hand, we used Overleaf to generate PDF Bingo cards, each filled in randomly from a candidate's word list. Due to timing difficulties, we decided to focus on retroactive Bingo cards, as the differentiating words were more interesting and representative of the candidates' policies and personalities.
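The card-filling step itself is a simple random layout. A sketch (we did the actual typesetting in Overleaf/LaTeX; the word list here is a placeholder), assuming the traditional free center square:

```python
import random

def make_card(words, seed=None):
    """Draw 24 distinct differentiating words and lay them out on a
    5x5 grid with the traditional free center square."""
    rng = random.Random(seed)
    picks = rng.sample(words, 24)
    picks.insert(12, "FREE")  # index 12 is the center of a row-major 5x5 grid
    return [picks[i * 5:(i + 1) * 5] for i in range(5)]

# Placeholder word list standing in for a candidate's SVM-derived words.
words = [f"word{i}" for i in range(30)]
card = make_card(words, seed=42)
```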
Unfortunately, this project was cut short by the rise of COVID-19, and we were unable to take it further. Still, the project was a success: it was both a great introduction for many members to the world of data science and political speech, and an excellent opportunity for our more experienced members to demonstrate their knowledge of language processing.