Unsupervised Learning of Political Ideology by Word Vector Projections

(Source code available here.)

When I tried out TensorBoard’s embedding projector, I was particularly intrigued by an example of its custom projection functionality: finding the 100 nearest neighbors of the vector “politics” and projecting them onto a “good”-to-“bad” axis, which yields “violence” on the bad end and “discussion” on the good end.

Here is another fun example: finding the nearest neighbors of the vector “economics” and projecting them onto the “democrat”-to-“republican” and “left”-to-“right” axes. Note that the philosopher F. A. Hayek, an adamant proponent of classical liberalism and the free market, is correctly projected into the “republican” and “right” corner.

Inspired by these amusing examples, I propose two hypotheses:

  1. The default word2vec example is trained on a Wikipedia corpus. If we are only interested in exploring vectors of political topics, can we train a model on a political corpus and learn a better representation of political knowledge?
  2. Given a list of, say, U.S. senators, compute the projection of each senator's vector onto the "conservative"-to-"liberal" axis; can the projections' scalar components be interpreted as a valid metric of political ideology?
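Hypothesis 2 boils down to a scalar projection. Here is a minimal NumPy sketch, with toy 3-dimensional vectors standing in for real embeddings (the vectors and the `axis_projection` helper are invented for illustration; in practice the poles and the senator vector would come from the trained model):

```python
import numpy as np

def axis_projection(v, pole_a, pole_b):
    """Scalar component of v projected onto the axis running from pole_a to pole_b."""
    axis = pole_b - pole_a
    return float(np.dot(v, axis) / np.linalg.norm(axis))

# Toy vectors standing in for real word embeddings (illustrative only).
liberal = np.array([1.0, 0.0, 0.0])
conservative = np.array([-1.0, 0.0, 0.0])
senator = np.array([0.6, 0.2, 0.1])

# Negative scores lean toward the "liberal" pole, positive toward "conservative".
score = axis_projection(senator, liberal, conservative)
```

Because every senator is projected onto the same shared axis, the resulting scalars are comparable across senators, which is what makes them a candidate ideology metric.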

 

Data & Methods

I wrote a few web crawlers to collect articles from three premier national newspapers: The New York Times, The Washington Post, and The Wall Street Journal. Based on the crawled metadata, I filtered out articles that were not U.S. news, world news, or opinion. I pre-processed the text by removing non-alphanumeric characters and converting all letters to lowercase, as is standard practice when training word vectors. The resulting corpora are as follows:
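The pre-processing step above amounts to something like the following sketch (the exact regex and tokenization in my scripts may differ):

```python
import re

def preprocess(text):
    # Replace anything that is not alphanumeric or whitespace, then lowercase and tokenize.
    text = re.sub(r"[^0-9A-Za-z\s]", " ", text)
    return text.lower().split()

tokens = preprocess("Sen. Smith (D-Va.) voted YES!")
```

One side effect worth noting: stripping punctuation this way also splits hyphenated forms like "D-Va." into separate tokens.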

I trained vanilla word2vec as a baseline as well as its extended model, fastText (Bojanowski et al. 2016, on which Tomáš Mikolov himself is a co-author, so presumably he approves of fastText as the second generation of word2vec). I slightly preferred fastText because it is a character n-gram–based model, which lets us query out-of-vocabulary words and multi-word phrases. All models are trained for 15 epochs with 300-dimensional vectors; other parameters are left at their defaults.
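The character n-gram idea is what makes out-of-vocabulary queries possible: fastText represents a word as the sum of vectors for its character n-grams, so an unseen word can still be assembled from n-grams seen in training. A rough sketch of the n-gram extraction (the boundary markers and the 3–6 length range follow the fastText paper; this is illustrative, not the library's internal code):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams as in fastText, with < and > as word-boundary markers.
    (fastText also keeps the full word itself as a special token, omitted here.)"""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

grams = char_ngrams("hayek")
```

Even if "hayek" never appeared in the corpus, n-grams like "hay" shared with other words would still give it a usable vector.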

To evaluate our model’s ideology prediction, we need a reliable metric of political ideology. The natural choice is NOMINATE, a multidimensional scaling model well respected among political scientists, which measures the ideology of members of Congress based on their roll-call votes (Poole & Rosenthal 1985; data available at voteview.com).

An example of NOMINATE. Note that, by default, NOMINATE estimates two dimensions of ideology. Most of the time, however, congressional votes are uni-dimensional. It is therefore common to simply use the first dimension in time-series analyses.

 

Experiments

Not to keep you in suspense: the vector-projected ideology far exceeds my expectations. Plotting our ideology scores against the corresponding DW-NOMINATE scores reveals a strong correlation, evident in plain sight:

Remarkably, this correlation is consistent across years, with low variance. Most of the Pearson’s r and Spearman’s rho values stay in the 0.7–0.8 range. (See this IPython notebook for complete results.)
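For reference, both statistics can be computed directly from the two score lists. A NumPy sketch (the Spearman here ranks without tie handling, which SciPy's `spearmanr` does properly):

```python
import numpy as np

def pearson_spearman(x, y):
    """Pearson's r and (tie-free) Spearman's rho for two equal-length samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    r = np.corrcoef(x, y)[0, 1]
    # Spearman's rho is Pearson's r computed on the ranks.
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    rho = np.corrcoef(rx, ry)[0, 1]
    return r, rho
```

Spearman's rho only asks whether the model orders senators the same way DW-NOMINATE does, so it is the more forgiving of the two for a relative ideology ranking.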

Moreover, this correlation is absent in models trained on the commonly used Wikipedia corpora, where the domain of knowledge is too broad. Similarly, I trained a model on all sections of The New York Times, as opposed to only political news; the correlation in that model is likewise considerably weaker, even though the corpus is much bigger (8.52 GB vs. 2.19 GB). This is a noteworthy caveat to the conventional wisdom in machine learning that more training data = better. For a domain-specific task, it is better to use only domain-specific corpora.

All of the above models are trained with the character n-gram–based fastText, partly because I assumed some lesser-known members of Congress might be out of vocabulary. This turned out not to be the case: each of the three newspaper corpora includes all members of Congress in its vocabulary, with fewer than three exceptions. Comparing the fastText models with the word2vec models, the differences are only marginal.

My hometown newspaper, The Washington Post, triumphs over the other two hashtag failing papers.

 

Null Results or Future Work?

I strive to be intellectually honest, so here are some sub-publishable results that I think are still worth discussing. First, this method works wonderfully for the Senate, but not at all for the House of Representatives. The reason is probably quite simple: there are 435 members of the House, all of whom face reelection every two years. That’s a lot of elected officials, so national newspapers have little reason to cover an average member of the House consistently. When there is no substantive coverage of them in the corpora, our model cannot learn an accurate representation of their ideology.

Pearson's r = 0.4758, Spearman's rho = 0.4812

Further, in addition to members of Congress, I (somewhat arbitrarily) compiled a list of public policies and talking points that have been prominent in recent years. I was hoping I could query these policies and get a sense of their ideology and, by projecting them onto a “good”-to-“bad” axis, their popular approval.

Unlike concrete written laws, some of these policies are only abstract concepts, for which quantitative metrics are not readily available. For now, we have to resort to a bit of domain knowledge to evaluate the projections. The results are a mixed bag:

(Figure: policy vector projections from the New York Times model.)

Many of these vector projections do match our intuitions, e.g., "civil rights" and "minimum wage" are on the liberal side, while "supply and demand" and "religious freedom" are on the conservative side. Nonetheless, quite a few are perplexing, e.g., why are "tax cuts" on the left and "medicaid" on the right? One possible explanation: Democrats frequently criticize tax cuts, while Republicans vociferously oppose Medicaid. Because the fundamental training task of all word embedding methods is to predict context words from a center word (or from a co-occurrence matrix), a policy may become associated with a party simply because that party mentions the policy frequently. To wit, the "liberal"-to-"conservative" axis measures an issue's salience within an ideological group, not support or opposition.

Lastly, it's fun to compare two newspapers known for their opposing ideologies. Unlike The New York Times, The Wall Street Journal is not so critical of filibusters, the Iraq War, or F. A. Hayek.

Americans hate big government more than Darth Vader, which doesn’t make sense considering the Galactic Empire is the epitome of big government.

Again, feel free to check out the source code of this project and play around with it. Please let me know of any interesting findings.