Topic Modeling

Ulrike Henny-Krahmer
(CLiGS, University of Würzburg, Germany)

Workshop "Digital tools for the analysis of literary texts"
Verona, 23 & 24 October, 2017

Slides at: https://hennyu.github.io/verona_17

Sessions on Topic Modeling

Monday
- 14-15.30: An Introduction
- 15.45-17: Corpus Preparation
Tuesday
- 09.30-11: Using MALLET and tmw
- 11.30-13: Post Processing, Visualization and Interpretation of Results

Topic Modeling: An Introduction

What is Topic Modeling and how does it work?
Application fields

1. What is Topic Modeling and how does it work?

What is Topic Modeling?

"Topic modeling is complicated and potentially messy but useful and even fun. The best way to understand how it works is to try it."

(Megan R. Brett, "Topic Modeling: A Basic Introduction")

What is Topic Modeling?

Topic Modeling is a quantitative method in text analysis
distributions of words are detected statistically in a corpus of documents
Topic Modeling is especially useful for large collections of texts

The goal of Topic Modeling is...

...to detect hidden semantic structures.

How does it work?

Basic idea from Distributional Semantics:

"a word is characterized by the company it keeps"

(John Firth, 1957)

How does it work?

Topic Modeling automatically identifies recurring themes, motifs, discourses
important: without explicit semantic knowledge!

Where does it come from?

Topic Modeling has primarily been developed empirically
originally developed for Information Retrieval (search for subject matter)
current, widely used method: LDA (Latent Dirichlet Allocation), 2003

How does it work?

basic idea

discovers words that occur together again and again, that is, words that occur in similar contexts ⇒ Topics
calculates how important each topic is in each document

How does it work?

a little bit more technically

a topic is a probability distribution over words
a document is a probability distribution over topics
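
The two definitions above can be made concrete with a minimal Python sketch. The topic names, words and all probabilities are invented for illustration and do not come from a real model. Under these definitions, the probability of a word in a document is the sum, over all topics, of the topic's weight in the document times the word's probability in that topic.

```python
# Hypothetical toy values, chosen only for illustration.
# A topic is a probability distribution over words.
topics = {
    "sea":  {"ship": 0.5, "wave": 0.3, "love": 0.1, "heart": 0.1},
    "love": {"ship": 0.1, "wave": 0.1, "love": 0.4, "heart": 0.4},
}

# A document is a probability distribution over topics.
document = {"sea": 0.75, "love": 0.25}


def word_probability(word, document, topics):
    """Sum over topics of P(topic | document) * P(word | topic)."""
    return sum(weight * topics[topic].get(word, 0.0)
               for topic, weight in document.items())


for word in ["ship", "wave", "love", "heart"]:
    print(word, round(word_probability(word, document, topics), 3))
```

Because the document is mostly about the "sea" topic, "ship" and "wave" come out as more probable than "love" and "heart", and the four probabilities still add up to 1.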

Words, Topics, Documents

(David Blei, "Probabilistic Topic Models", 2012)

Generative, iterative

generative

at the heart of the technique is a generative model
how could the documents have come into being?

iterative

for each __document__ in the collection:

	choose a topic distribution

		for each __word__ in the document:

			choose a topic to which the word belongs
			choose a word from the topic

repeat the whole process!
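
The generative steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two topics, their words and all probabilities are invented, and the topic distribution of each document is fixed by hand, whereas LDA draws it from a Dirichlet prior. The helper names `draw` and `generate_document` are hypothetical.

```python
import random

# Hypothetical toy model: all names and numbers are invented.
topics = {
    "sea":  {"ship": 0.5, "wave": 0.3, "storm": 0.2},
    "love": {"heart": 0.5, "kiss": 0.3, "letter": 0.2},
}


def draw(distribution, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(distribution)
    weights = [distribution[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def generate_document(topic_distribution, topics, length, rng):
    """Generate a document word by word from a given topic distribution."""
    words = []
    for _ in range(length):
        topic = draw(topic_distribution, rng)   # choose a topic for this word
        words.append(draw(topics[topic], rng))  # choose a word from that topic
    return words


rng = random.Random(42)  # fixed seed so that the run is repeatable

# One topic distribution per document (assumption: set by hand here).
corpus_topic_distributions = [
    {"sea": 0.9, "love": 0.1},
    {"sea": 0.2, "love": 0.8},
]
for topic_distribution in corpus_topic_distributions:
    print(generate_document(topic_distribution, topics, 8, rng))
```

The sketch runs the model forwards, from topics to words. Topic modeling works in the opposite direction: only the words are observed, and the topics and their weights in each document have to be inferred.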

Generative, iterative

(Steyvers and Griffiths, "Probabilistic Topic Modeling", 2006)

And this is how it works exactly:

(Video: "Topic Modeling Workshop: Mimno", from MITH in MD, on Vimeo)

Terms and concepts

The process might be a black box.

But the results are not.

And neither is what we put into the process!

the terms word, topic and document have special meanings in topic modeling

Terms and concepts

words

tokens
sentences are split into tokens by tokenization
tokens are not always words
"Topic Modeling" can also be a token
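
As a rough illustration of these points, the following Python sketch tokenizes a sentence with a simple regular expression and then merges a two-word phrase into a single token. The functions `tokenize` and `merge_phrase` are hypothetical helpers written for this example; real projects often use a dedicated tokenizer, and the rules chosen here (lowercasing, letters only) are assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters as tokens.

    Apostrophes and hyphens inside a word are kept as part of the token.
    """
    return re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text.lower())


def merge_phrase(tokens, phrase):
    """Merge each occurrence of a multi-word phrase into a single token."""
    n = len(phrase)
    merged, i = [], 0
    while i < len(tokens):
        if n and tokens[i:i + n] == phrase:
            merged.append(" ".join(phrase))
            i += n
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = tokenize("Topic Modeling is fun. Try topic modeling!")
print(tokens)
print(merge_phrase(tokens, ["topic", "modeling"]))
```

After merging, "topic modeling" is counted as one unit, so the model sees it as a single "word" instead of two separate ones.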

Terms and concepts

documents

not: sequences of words and punctuation marks
but: a collection of word counts
e.g. ["to" : 2, "be" : 2, "or" : 1, "not" : 1]
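
A minimal Python sketch of this "bag of words" view, assuming the example counts above come from the phrase "to be or not to be" and using a naive whitespace split as the tokenizer:

```python
from collections import Counter

text = "to be or not to be"
tokens = text.split()        # naive whitespace tokenization, enough here
document = Counter(tokens)   # word order is discarded, only counts remain

print(dict(document))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}

# A different word order yields exactly the same document.
print(Counter("not to be or to be".split()) == document)
# True
```

Two texts with the same words in a different order are therefore identical from the model's point of view.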

Terms and concepts

corpus

a collection of documents

Terms and concepts

topics

in the underlying model, topics are not, at first, what a text, discourse or conversation is about
technically: a probability distribution over a word vocabulary

important: before we start topic modeling, we decide for ourselves what counts as a word and what counts as a document!