
Structural Topic Models to study political texts: an application to the Five Star Movement’s blog

In the last few weeks at New York University I had the opportunity to meet Brandon Stewart (Princeton), who gave a talk on Structural Topic Models at the NYU Text as Data Speaker Series; Molly Roberts (University of California), who presented her work at the 2015 SMaPP Global Conference; and Kenneth Benoit (LSE), who gave us an introductory workshop on text analysis with his `quanteda` R package. The three of them are significantly advancing the study of political texts, an extremely promising sub-field of political methodology.

The Structural Topic Model

I wanted to learn more about these models, so last weekend I practiced a bit with the `stm` R package developed by Dustin Tingley, Molly Roberts, Brandon Stewart and colleagues. This post shows an application to the political blog of Beppe Grillo. The comedian Beppe Grillo is the leader of the Five Star Movement, and his blog is the center of the movement’s political communication. My intention is to show how easy it has become to do text analysis in empirical studies, even for researchers who are not advanced in R programming and Bayesian statistics (although clearly the more background knowledge the better).

The reader looking for an introduction to text analysis can find an excellent one in the paper by Grimmer and Stewart (2013). Structural Topic Models (STM) are mixed-membership topic models that can incorporate contextual covariates (i.e. document-specific metadata) in the prior distributions. If you are new to the field, a great but short (and almost math-free) introduction to probabilistic topic models can be found in this post by Ted Underwood. The basic intuition behind probabilistic topic models is that each document is represented as a mixture of topics, and each topic in turn is represented as a mixture of words. Thus, each word has a certain probability of belonging to each topic. STM also belongs to the class of unsupervised methods of text analysis. These methods are very appealing because they don’t require the researcher to know the categories of interest ex ante. Thus: 1) they don’t require a hand-coded training set to build a classifier; 2) they allow for the discovery of new categories not previously considered. Of course, if the researcher knows the categories of interest beforehand, a supervised method is preferable (e.g. Gary King’s `ReadMe` package is useful to estimate documents’ category proportions).
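To make the "mixture of mixtures" intuition concrete, here is a minimal base-R sketch. All the numbers, topic names and words are made up for illustration (they are not estimates from the blog), and the sketch leaves out what is specific to STM, namely the covariates entering the priors.

```r
# Hypothetical toy numbers: 2 topics and a vocabulary of 4 (stemmed) words
vocab <- c("euro", "debito", "lavoro", "precari")

# beta: each row is a topic, i.e. a probability distribution over words
beta <- rbind(crisis = c(0.50, 0.40, 0.05, 0.05),
              labour = c(0.05, 0.05, 0.50, 0.40))
colnames(beta) <- vocab

# theta: each row is a document, i.e. a mixture over topics
theta <- rbind(doc1 = c(crisis = 0.9, labour = 0.1),
               doc2 = c(crisis = 0.3, labour = 0.7))

# Implied word distribution of each document: topics weighted by their shares
word_probs <- theta %*% beta
word_probs

# Each row is again a probability distribution over the vocabulary
rowSums(word_probs)
```

Estimating a topic model goes in the opposite direction: we only observe the words in each document, and the model recovers plausible values for `theta` and `beta`.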

Beppe Grillo and his blog

Going back to the original spirit of this post, what makes the corpus of Beppe Grillo’s blog posts interesting is the emphasis the movement puts on new technologies and its nearly complete avoidance of the “moribund” mainstream media. The blog is simply the key to understanding what this movement talks about. Briefly put, the 5SM is the political movement (fiercely not one of the “dead” political parties!) that obtained 25.56% of the vote for the Chamber of Deputies in the last general elections of 2013 (excluding Aosta Valley and the vote of Italians abroad). In the latest polls, the 5SM registers between 26.1% and 27.2% (Italian polls).

The 5SM is a populist anti-establishment movement whose success probably lies in its successful attempt to combine the leftist-sounding advocacy of direct democracy and citizens’ involvement in political decisions with a charismatic and authoritative (when not authoritarian) leadership, particularly effective in mobilising discouraged right-leaners, against the background of a vast resentment towards the perceived corruption of the political establishment. There is a nice 5-minute read in this post by Filippo Tronconi on the LSE blog, and for a more thorough analysis I recommend his edited book.

STM in practice: preliminary steps

Hands on! The first step of this exercise is the construction of the corpus, which in this case means harvesting the blog posts. This has become a relatively straightforward task thanks to steady improvements in software. Python is advisable for more challenging tasks, but for this purpose I preferred to work in R: to harvest the webpages I used the `rvest` package, and to identify and extract from the html code the components I was interested in (the `<body>`, the tags and the date of the post) I found `SelectorGadget` very useful. Readers can find more information in this RStudio blog post. The task was also made easier by the fact that the blog has a freely accessible archive. Overall, after deleting the posts from related blogs hosted there (e.g. “LaCosa”, Grillo’s YouTube channel) and empty posts (i.e. video posts), the corpus contains 10,175 articles with their posting dates (from August 1 2005 to October 9 2015). The data frame has four columns (post, date, tags, link), with each row representing a blog post.

To give you a sense, the code is something like this:

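library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

HarvestCorpus <- function(x) {
  page <- read_html(x, encoding = "UTF-8")  # guess_encoding() can be useful here
  # Title and body of the blog post, collapsed into a single string
  post <- page %>% html_nodes(".BodyPost p, .titolopost") %>% html_text() %>% paste(collapse = " ")
  # Date
  date <- (page %>% html_nodes(".posted") %>% html_text())[1]
  tags <- NA_character_  # placeholder: the tag selector is not shown in this snippet
  data.frame(post, tags, date, link = x, stringsAsFactors = FALSE)
}
# links is a character vector with all the http links (read from a .csv);
# pages that fail to load are skipped instead of stopping the whole loop
pages <- lapply(links, function(u) tryCatch(HarvestCorpus(u), error = function(e) NULL))
corpus <- do.call(rbind, pages)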

I experienced some problems with the “UTF-8” encoding: in some cases accents and apostrophes were displayed as strange characters. I found `guess_encoding()` useful here. This task is clearly language-specific, and other issues might appear in other languages. Finally, I inspected the corpus directly at multiple random points to rule out other oddities.
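For readers who have never met this kind of problem, the base-R sketch below reproduces a typical case. It assumes the strange characters come from UTF-8 bytes being read as Latin-1, which is only one possible cause and not necessarily what happened with this corpus; the example word is made up.

```r
# "perché" written with a Unicode escape, so the string is stored as UTF-8
good <- "perch\u00e9"

# Simulate the bug: the UTF-8 bytes are (wrongly) interpreted as Latin-1
bad <- iconv(good, from = "latin1", to = "UTF-8")
bad  # the accented letter now shows up as two strange characters

# Undo it: convert back to the original bytes, then declare them as UTF-8
fixed <- iconv(bad, from = "UTF-8", to = "latin1")
Encoding(fixed) <- "UTF-8"

identical(fixed, good)
```

This round trip is only safe when you know which wrong encoding was applied, which is why guessing the encoding first and then reading some posts by eye is still worth the time.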

We have the corpus and we can now dig into the `stm` functions. The package provides a comprehensive set of functions to support the researcher from the creation of the document-feature matrix to the display of results, which makes text analysis much friendlier for novice researchers. The package vignettes are detailed, and further support is provided at the dedicated GitHub repository. The first step involves pre-processing the corpus. This includes removing stop-words, rare words and punctuation, and stemming (reducing words to their root, e.g. “democratic” and “democracy” into “democr”). Stemming is a cruder version of lemmatization, but it is quite standard pre-processing in bag-of-words approaches. Next, we have to transform the matrix into a format readable by `stm()`. These tasks are also supported by dedicated functions (like `textProcessor()` and `prepDocuments()`; see the vignettes for details).

corpus.proc <-textProcessor(documents=post,

out <- prepDocuments(corpus.proc$documents,
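To show what the pre-processing described above does to the text, here is a small base-R sketch on a toy corpus. The sentences, the tiny stop-word list and the helper `tokenise()` are all hypothetical; stemming is left out because it needs a Snowball stemmer. This is not what `textProcessor()` and `prepDocuments()` do internally, only the same idea in miniature.

```r
# Toy corpus (made-up sentences, not taken from the blog)
docs <- c("La democrazia diretta e i cittadini!",
          "I cittadini e la crisi economica.",
          "La crisi del debito, la crisi dell'euro.")

# A tiny hand-made stop-word list; real work would use a full Italian list
stopwords_it <- c("la", "e", "i", "del", "dell")

# Lower-case, replace punctuation with spaces, split on whitespace,
# then drop empty tokens and stop-words
tokenise <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)
  tokens <- strsplit(x, "[[:space:]]+")[[1]]
  tokens[nzchar(tokens) & !(tokens %in% stopwords_it)]
}
tokens <- lapply(docs, tokenise)

# Document-feature matrix: one row per document, one column per word
vocab <- sort(unique(unlist(tokens)))
dfm <- t(sapply(tokens, function(tk) tabulate(match(tk, vocab), nbins = length(vocab))))
colnames(dfm) <- vocab

# Drop rare words: keep only terms that appear in at least two documents
keep <- colSums(dfm > 0) >= 2
dfm <- dfm[, keep, drop = FALSE]
dfm
```

Only “cittadini” and “crisi” survive the rare-word filter here, which shows how aggressive such a threshold is on a small corpus; with ten thousand posts it mostly removes typos and one-off names.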

The topics: how many? and how to identify them?

To estimate the structural topic model we have a decision to make: how many topics are we going to choose? In topic models this number is chosen ex ante by the researcher. The decision sounds arbitrary, but we can make a more informed guess by looking at indices such as exclusivity (are the top words of one topic different from the top words of other topics?) and semantic coherence (do the top words of one topic tend to co-occur in a document?). Given that it is practically impossible to guess the exact number of topics in the corpus, I prefer to consider a larger number of topics rather than a potentially too small one. This will probably result in some odd clusters (in fact I find one containing English words, and another of stop-words that were not included in the snowball list). My choice was to use the Lee and Mimno (2014) procedure to produce a number of topics automatically (check the `stm` vignette at footnote 11 for more information). The algorithm identified 77 topics, so I then looked at the performance indices of the models with 20, 40, and 77 topics, and in light of these scores I decided to proceed with 77 topics.
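To give an idea of what semantic coherence measures, the base-R sketch below computes it for a toy corpus, following the co-occurrence formula of Mimno and colleagues (2011) as I understand it. The matrix, the words and the helper `semantic_coherence()` are hypothetical; the `stm` package has its own diagnostics (such as `searchK()`, `semanticCoherence()` and `exclusivity()`), which are what I would use in practice.

```r
# Binary document-term matrix for a made-up toy corpus (rows = documents)
dtm <- rbind(c(1, 1, 0, 0),
             c(1, 1, 1, 0),
             c(0, 0, 1, 1),
             c(1, 0, 0, 1))
colnames(dtm) <- c("euro", "debito", "lavoro", "precari")

# Semantic coherence of a topic's top words (ordered from most to least probable):
# sum over word pairs of log((co-document frequency + 1) / document frequency)
semantic_coherence <- function(top_words, dtm) {
  score <- 0
  for (m in 2:length(top_words)) {
    for (l in 1:(m - 1)) {
      both  <- sum(dtm[, top_words[m]] > 0 & dtm[, top_words[l]] > 0)
      first <- sum(dtm[, top_words[l]] > 0)
      score <- score + log((both + 1) / first)
    }
  }
  score
}

# Words that often share a document score higher (closer to zero)
coherent   <- semantic_coherence(c("euro", "debito"), dtm)
incoherent <- semantic_coherence(c("euro", "lavoro"), dtm)
c(coherent = coherent, incoherent = incoherent)
```

Coherence alone tends to reward a few broad topics made of very common words, which is why it is usually read together with exclusivity.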

One caveat: careful validation (i.e. reading samples) is always required before drawing any substantive conclusion. However, since this post makes no scientific claims but just shows how these models work (and also for the fun of it!), these are my guesses: cluster 1 is likely to be about “foreign policy”; 5 the 5SM itself; 6 economic crisis; 26 unemployment and precarious working conditions; 29 environmental issues; 33 immigration; 36 mafia and corruption; 38 local politics; 43 anti-austerity / no-euro / ordoliberal monetary policy; 54 PM Renzi; 56 general / (anti) left-right politics; 76 water management.

Here follow the top words for these topics (four indices are considered, check out the vignette for more information):

labelTopics(stm, c(1, 5, 6, 26, 29, 33, 36, 38, 43, 54, 56, 76))

Topic 1 Top Words:
##      Highest Prob: guerr, stat, lib, russ, paes, unit, iraq
##      FREX: iraq, gheddaf, iran, occident, lib, sir, bombard
##      Lift: cong, teheran, sunn, sudanes, persic, irachen, isis
##     Score: cong, guerr, iraq, gheddaf, iran, lib, russ

Topic 5 Top Words:
##       Highest Prob: grill, bepp, blog, casalegg, vide, comic, scriv
##       FREX: grill, bepp, vinciamono, casalegg, comic, popul, sanrem
##       Lift: rozz, pulitochiunqu, grill, itb, vinciamono, swift, bepp
##       Score: rozz, grill, bepp, casalegg, vinciamono, comic, blog
Topic 6 Top Words:
##       Highest Prob: ital, deb, paes, eur, stat, europe, europ
##       FREX: deb, rovesc, spread, spagn, grec, pil, bce
##       Lift: rovesc, piigs, eurosovran, pil, fuoridalleur, euroil
##       Score: rovesc, deb, ital, grec, bce, miliard, europ
Topic 26 Top Words:
##       Highest Prob: lavor, anni, azi, disoccup, pension, precar, sol
##       FREX: precar, licenz, disoccup, lavor, schi, opera, operai
##       Lift: bazzonioperai, bazzonimtinit, cocopr, bazzon, parasubordin, cococ
##       Score: hop, lavor, precar, disoccup, pension, licenz, contratt
Topic 29 Top Words:
##       Highest Prob: nucl, central, stat, ogm, animal, nuclear, sol
##       FREX: ogm, contamin, verones, mais, enel, carbon, coltiv
##       Lift: hans, resources, transgen, nuclearizz, overshoot, silic, monsant
##       Score: hans, nucl, ogm, nuclear, animal, radioatt, mais
Topic 33 Top Words:
##       Highest Prob: immigr, clandestin, leg, ital, cacc, paes, legg
##       FREX: immigr, clandestin, accoglient, migrant, asil, profug, mine
##       Lift: famosissim, frontalier, immigr, clandestin, richiedent, mine, bossifin
##       Score: famosissim, immigr, clandestin, migrant, profug, dublin, mine
Topic 36 Top Words:
##       Highest Prob: maf, arrest, corruzion, stat, polit, mafios, indag
##       FREX: arrest, corruzion, domiciliar, clan, pen, ndranghet, tangent
##       Lift: pioltell, odevain, gregant, ferrandin, gav, binasc, rolex
##       Score: pioltell, arrest, maf, corruzion, mafios, indag, gav
Topic 38 Top Words:
##       Highest Prob: comun, sindac, cittadin, comunal, inceneritor, amministr
##       FREX: differenz, comunal, sindac, rif, inceneritor, ragus, ricicl
##       Lift: lucc, wwwcomunivirtuosiorg, amiat, alvis, assemin, wheeler, piccitt
##       Score: lucc, comun, sindac, comunal, rif, inceneritor, differenz
Topic 43 Top Words:
##       Highest Prob: econom, paes, eur, cris, polit, cresc, stat
##       FREX: german, monetar, pil, inflazion, monet, cresc, tedesc
##       Lift: vulg, evanspritchard, macroeconom, volkswagen, petr, stagnazion
##       Score: vulg, econom, german, deflazion, monetar, eurozon, cresc
Topic 54 Top Words:
##       Highest Prob: renz, govern, mister, matte, premier, riform, sol
##       FREX: mister, renz, nazaren, premier, matte, firenz, ebetin
##       Lift: mister, renziement, renz, cottarell, carra, cipollin, notopregiudic
##       Score: mister, renz, nazaren, premier, matte, riform, firenz
Topic 56 Top Words:
##       Highest Prob: polit, sinistr, part, destr, cos, far, sol
##       FREX: sinistr, destr, intellettual, democraz, rivolu, ideolog, programm
##       Lift: mercantegg, girotondin, epitet, antisistem, cosicc, arrocc, antipolit
##       Score: mercantegg, sinistr, destr, polit, mov, vot, democraz
Topic 76 Top Words:
##       Highest Prob: acqua, pubblic, privatizz, serviz, priv, cittadin, gestion
##       FREX: acqua, privatizz, idric, zuccher, acquedott, potabil, profitt
##       Lift: sbrigat, zuccher, petrell, tubatur, veol, acquedott, acqua
##       Score: sbrigat, acqua, privatizz, idric, zuccher, acquedott, potabil

This shows the frequency of the topic words in the corpus.

Overall, we see that what I labelled “immigration” (or rather, cluster 33) and “no-euro” (cluster 43) have not been, on average, the main focus of the 5SM’s political communication, which is more centered on the economic crisis (topic 6) and unemployment/precarious working conditions (topic 26).
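The “on average” in the previous paragraph can be read as the mean share of each topic across documents. A minimal base-R sketch, with a simulated document-topic matrix standing in for the estimated one (the numbers and topic names are made up):

```r
set.seed(42)
# Hypothetical document-topic matrix: 6 documents (rows) by 3 topics (columns);
# each row is normalised to sum to 1, like the topic proportions of a fitted model
raw   <- matrix(rexp(18), nrow = 6)
theta <- raw / rowSums(raw)
colnames(theta) <- c("crisis", "labour", "immigration")

# Average share of the corpus devoted to each topic, from largest to smallest
prevalence <- sort(colMeans(theta), decreasing = TRUE)
prevalence
```

With a fitted `stm` model, the same calculation would be applied to its matrix of document-topic proportions.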

(Finally) estimating marginal effects with STMs

To illustrate the potential of STMs I will pick two topics. For example, I try to operationalise one consideration from Tronconi’s blog post: “[In the aftermath of 2014 European elections] it also seemed evident that, when forced to choose, Beppe Grillo preferred to align with rightist movements, despite the rhetorical refusal to position the FSM on the left-right dimension of the political spectrum”. So let’s focus on clusters 33 and 43, namely immigration and anti-austerity/no-euro (I assume these two labels are correct). These are two issues owned by the right in Italy; intuitively, following this statement we would expect the 5SM to focus more on immigration and on opposing the euro after the May 2014 European elections. The workhorse function of the `stm` package, `estimateEffect()`, fits a regression model in which the proportion of each document devoted to a topic is regressed on a document-specific covariate (in our case `date`, modelled with a b-spline to capture non-linearity). `plot.estimateEffect()` then plots the results.

marg.eff33 <- estimateEffect(c(33) ~ s(, # b-spline transformed date
uncertainty = "Global",
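To show the kind of regression involved, here is a base-R sketch on simulated data: a topic share regressed on a b-spline of the posting date. Everything here is hypothetical (the data, the jump after day 3200, the 10 degrees of freedom), and a plain `lm()` ignores the uncertainty in the topic proportions that `estimateEffect()` accounts for with its `uncertainty` argument. As far as I know, the `s()` in the `stm` formula is a thin wrapper around `bs()` from the `splines` package.

```r
library(splines)  # ships with R; provides bs()

set.seed(1)
# Hypothetical data: 500 posts, each with a date (days since the first post)
# and the share of the post devoted to one topic, which rises after day 3200
n     <- 500
day   <- sort(runif(n, 0, 3700))
share <- 0.02 + 0.03 * (day > 3200) + rnorm(n, sd = 0.01)

# Regress the topic share on a b-spline of the date to allow a non-linear trend
fit <- lm(share ~ bs(day, df = 10))

# Predicted share, with a 95% confidence band, on a grid of dates
grid <- data.frame(day = seq(min(day), max(day), length.out = 200))
pred <- predict(fit, newdata = grid, interval = "confidence")

# Compare the fitted level before and after the simulated change
early <- mean(pred[grid$day < 3000, "fit"])
late  <- mean(pred[grid$day > 3400, "fit"])
c(early = early, late = late)
```

Plotting the `fit`, `lwr` and `upr` columns of `pred` against `grid$day` gives a curve with a confidence band, which is the same kind of picture discussed in what follows.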

This is not trivial: to give you an idea of the information contained in this plot, consider that the list with the estimates is about 1 MB. Yet the estimation took just a few seconds. Anyway, we notice immediately that the series looks quite flat, which is to be expected given that it covers a 10-year span. We can observe an increase around January 2011, which could be related to the Arab Spring and the escalation of revolts in Libya (in January the 2008 treaty of friendship between Italy and Libya was suspended).
In January 2014 the Senate was about to vote on legislation on detention policy that included the decriminalisation of illegal immigration (an offence previously introduced by the Berlusconi government). Grillo was against the decriminalisation but some of his Senators were in favour, so on 13 January he called a referendum (notice that the referendum was opened at 10:00 but the post is dated 13 Jan 10:37…) to ask the 5SM’s members to express their “binding” view. The majority of them turned out to be in favour of the measure, and this led to further challenges to Grillo’s leadership. If we trust the model, we would say that the issue was silenced on the blog (given its relation to the leadership disputes), signalling discrepancies between the public discourse and the discourse on the blog. However, we can still observe a second increase in 2015 that is probably related to the refugee crisis. The evidence is therefore mixed, and the reason could be that the immigration issue was politicised and tied to the leadership of the movement. The second rightist issue could be more informative.

We notice immediately that the anti-austerity topic becomes more important in 2012 (in fact, the speculative attack against the Italian 10-year bonds was triggered in July 2011), and after mid-2014 we observe a quite marked increase of this topic in Grillo’s political discourse. We don’t draw any conclusion from these quick estimates, but putting together the spike of immigration in 2015 and the anti-austerity plot, the sniff test suggests that Tronconi might be right.

Most importantly, the STM proved to be quite effective in operationalising this statement about the change in the movement’s political discourse, which makes the field of potential application of the STM extremely wide.
