"Topic modeling is complicated and potentially messy but useful and even fun. The best way to understand how it works is to try it."(Megan R. Brett, "Topic Modeling: A Basic Introduction")
...to detect hidden semantic structures.
Basic idea from distributional semantics:
"a word is characterized by the company it keeps" (John Firth, 1957)
basic idea
a little bit more technically
generative
iterative
for each __document__ in the collection:
choose a topic distribution
for each __word__ in the document:
choose a topic to which the word belongs
choose a word from the topic
repeat the whole process!
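The generative story above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm as MALLET implements it: the function names, the toy vocabulary, and the Dirichlet parameters (0.5) are all made up for the example. It only runs the process forwards (from topics to words); actual topic modeling iteratively infers the topics backwards from the observed words.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one probability vector from a symmetric Dirichlet distribution."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocabulary, num_topics, num_docs, doc_length, seed=0):
    """Generate toy documents by following the generative story step by step."""
    rng = random.Random(seed)
    # Each topic is a probability distribution over the whole vocabulary.
    topics = [sample_dirichlet(0.5, len(vocabulary), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Choose a topic distribution for this document.
        topic_dist = sample_dirichlet(0.5, num_topics, rng)
        words = []
        for _ in range(doc_length):
            # Choose the topic to which the word belongs ...
            topic = rng.choices(range(num_topics), weights=topic_dist)[0]
            # ... then choose a word from that topic.
            words.append(rng.choices(vocabulary, weights=topics[topic])[0])
        corpus.append(words)
    return corpus


# Hypothetical toy vocabulary, loosely based on the topic labels used in these slides.
vocabulary = ["school", "teacher", "travel", "train", "business", "money"]
corpus = generate_corpus(vocabulary, num_topics=3, num_docs=4, doc_length=8)
for doc in corpus:
    print(" ".join(doc))
```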
The process might be a black box.
But the results are not.
And neither is what we put into the process!
The terms word, topic, and document have a special meaning in topic modeling
words
documents
corpus
topics
important: before we start topic modeling, we decide for ourselves what counts as a word and what counts as a document!
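To make this decision concrete, here is a small sketch of one possible pair of definitions. Both are assumptions chosen for illustration: a "word" is any lower-cased run of letters (similar in spirit to the `\p{L}+` token pattern used with MALLET later on), and a "document" is a paragraph. The function names and the sample text are hypothetical.

```python
import re

# Assumption: a "word" is any unbroken run of letters (digits and punctuation are dropped).
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Return the lower-cased words of a text according to WORD_PATTERN."""
    return [match.lower() for match in WORD_PATTERN.findall(text)]


def split_into_documents(text, unit="paragraph"):
    """Assumption: a 'document' is either one paragraph or the whole text."""
    if unit == "paragraph":
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [text.strip()]


# Hypothetical sample text with two paragraphs.
sample = "La vida en el campo.\n\nUn amigo llegó a la estancia en 1901."
documents = split_into_documents(sample)
tokens = [tokenize(doc) for doc in documents]
print(tokens)
```

Changing either definition (for example, keeping numbers, or treating the whole novel as one document) changes what the model gets to see, and therefore the topics it finds.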
"school"
"travel"
"business"
"French intervention in Mexico (1861-1867)"
"description of landscape"
"somewhere in Argentina?"
alternative visualization for words in topics
visualization for topics in documents
Roberto Payró, El falso Inca (Argentina, 1905)
The Starry Night (Anne Sexton), in a Topic Model by Lisa Rhody (2012)
Example: Signs at 40 (grid)
Example: Signs at 40 (over time)
Literary genres as classes, as prototypes, as families
Corpus
Example topic
Topic-based distances to prototypes
Topic-based distances between all novels
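As a rough sketch of how topic-based distances might be computed: each novel is represented by its vector of topic proportions, a genre prototype is taken to be the mean vector of the novels in that genre, and distance is measured as cosine distance. All of these choices, as well as the novel names, genre labels, and numbers, are assumptions for illustration; the study shown on the slides may have used a different distance measure.

```python
import math


def cosine_distance(p, q):
    """Return 1 minus the cosine similarity of two topic vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def prototype(vectors):
    """Return the mean topic vector of a group of documents."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical topic proportions (three topics) and genre labels.
novels = {
    "novel_a": [0.7, 0.2, 0.1],
    "novel_b": [0.6, 0.3, 0.1],
    "novel_c": [0.1, 0.2, 0.7],
}
genres = {"novel_a": "historical", "novel_b": "historical", "novel_c": "sentimental"}

# One prototype per genre.
prototypes = {
    genre: prototype([novels[n] for n in novels if genres[n] == genre])
    for genre in set(genres.values())
}

# Distances from every novel to every prototype.
to_prototypes = {
    (name, genre): cosine_distance(vector, proto)
    for name, vector in novels.items()
    for genre, proto in prototypes.items()
}

# Distances between all pairs of novels.
pairwise = {
    (a, b): cosine_distance(novels[a], novels[b])
    for a in novels
    for b in novels
}
```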
How about questions of variation and normalization?
How well do I need to know the texts from close reading?
⇒ What is the research question?
To consider:
Example: Ciro B. Ceballos, Un adulterio, 1901, short novel, Mexico
topics in the document
topic 31: vida - campo - estancia (life - countryside - ranch)
word | weight in the topic | number of occurrences in the text |
---|---|---|
vida (life) | 193 | 20 |
campo (countryside/field) | 179 | 5 |
estancia (ranch) | 152 | 1 |
año (year) | 151 | 5 |
amigo (friend) | 137 | 11 |
mate (mate tea) | 129 | 0 |
cuero (leather) | 109 | 0 |
sargento (sergeant) | 86 | 0 |
pata (paw) | 84 | 0 |
cuchillo (knife) | 76 | 0 |
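The comparison in the table can be summarized with a short script. The rows are copied from the table; the two helper functions are hypothetical, and summing the listed weights is only a rough heuristic for how much of the topic's vocabulary is missing from this particular text.

```python
# (word, weight in the topic, occurrences in the text), copied from the table above.
TOPIC_31 = [
    ("vida", 193, 20), ("campo", 179, 5), ("estancia", 152, 1),
    ("año", 151, 5), ("amigo", 137, 11), ("mate", 129, 0),
    ("cuero", 109, 0), ("sargento", 86, 0), ("pata", 84, 0),
    ("cuchillo", 76, 0),
]


def absent_words(rows):
    """Return the top topic words that never occur in the text."""
    return [word for word, _weight, count in rows if count == 0]


def absent_weight_share(rows):
    """Return the share of the listed topic weight carried by absent words."""
    total = sum(weight for _word, weight, _count in rows)
    missing = sum(weight for _word, weight, count in rows if count == 0)
    return missing / total


missing_words = absent_words(TOPIC_31)
share = absent_weight_share(TOPIC_31)
print(missing_words)
print(f"{share:.0%} of the listed topic weight comes from words absent from the text")
```

A high share of absent words suggests that the topic was assigned to the text because of its general vocabulary (vida, amigo) rather than its regionally specific vocabulary (mate, cuero, cuchillo).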
topic 31: vida - campo - estancia (life - countryside - ranch)
reconsider:
"somewhere in Argentina?"
ideal: similar length
Example: corpus of German long fiction
Example: Grenzboten corpus
no definite answers yet
workarounds:
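One workaround that is often used when documents differ strongly in length is to split long texts into segments of similar size before modeling. The following sketch is offered as an assumption about what such a workaround could look like, not as the procedure used for the corpora named above; the function name and the merging rule for short final segments are hypothetical.

```python
def split_into_segments(tokens, segment_length):
    """Split a token list into consecutive segments of similar length."""
    segments = [
        tokens[start:start + segment_length]
        for start in range(0, len(tokens), segment_length)
    ]
    # Assumption: a final segment shorter than half the target length
    # is merged into the previous segment instead of standing alone.
    if len(segments) > 1 and len(segments[-1]) < segment_length // 2:
        last = segments.pop()
        segments[-1].extend(last)
    return segments


# Hypothetical token list standing in for a tokenized novel.
tokens = [f"w{i}" for i in range(23)]
segments = split_into_segments(tokens, segment_length=10)
print([len(segment) for segment in segments])
```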
what we need:
Example
text as linguistic code
the basis for
Example: TreeTagger
also important for Topic Modeling!
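Linguistic annotation matters for topic modeling because it lets us choose which word forms go into the model, for example only the lemmata of nouns. The sketch below assumes TreeTagger-style output with one token per line and three tab-separated columns (token, part-of-speech tag, lemma), and it assumes that noun tags start with "N"; the function name and the sample lines are hypothetical and should be adapted to the tagset actually in use.

```python
def noun_lemmas(tagged_text, noun_prefix="N"):
    """Extract lower-cased noun lemmata from TreeTagger-style output."""
    lemmas = []
    for line in tagged_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip empty or malformed lines
        token, pos, lemma = parts
        if pos.startswith(noun_prefix):
            # Fall back to the token when the tagger does not know the lemma.
            lemmas.append((token if lemma == "<unknown>" else lemma).lower())
    return lemmas


# Hypothetical tagged sample: token, POS tag, lemma.
tagged_sample = (
    "The\tDT\tthe\n"
    "teachers\tNNS\tteacher\n"
    "travel\tVVP\ttravel\n"
    "to\tTO\tto\n"
    "schools\tNNS\tschool\n"
)
print(noun_lemmas(tagged_sample))
```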
what we need:
(MALLET and Python; see http://github.com/cligs/tmw.)
Name | Description | Developer | Language | Link |
---|---|---|---|---|
MALLET | machine learning for language toolkit | Andrew McCallum et al. | Java | http://mallet.cs.umass.edu/topics.php |
Gensim | topic modeling for humans | Radim Řehůřek | Python | https://radimrehurek.com/gensim |
tmw | topic modeling workflow | Christof Schöch | Python | https://github.com/cligs/tmw |
dfr-browser | a simple topic-model browser | Andrew Goldstone | JavaScript | http://agoldst.github.io/dfr-browser/ |
Where we start from:
Two steps:
/home/ulrike/Programme/mallet-2.0.8RC3/bin/mallet import-dir \
  --input /home/ulrike/Dokumente/GS/Veranstaltungen/2017_Verona/exercises/5_lemmata_N \
  --output /home/ulrike/Dokumente/GS/Veranstaltungen/2017_Verona/exercises/mallet_model/en_lemmata_N.mallet \
  --keep-sequence \
  --token-regex "\p{L}+" \
  --remove-stopwords TRUE \
  --stoplist-file /home/ulrike/Dokumente/GS/Veranstaltungen/2017_Verona/exercises/en_stopwords.txt
(using the cmd Terminal; from C:\Programs\mallet\)
bin\mallet import-dir ^
  --input C:\Users\[USER]\Desktop\2017_Verona\exercises\5_lemmata_N ^
  --output C:\Users\[USER]\Desktop\2017_Verona\exercises\mallet_model\en_lemmata_N.mallet ^
  --keep-sequence ^
  --remove-stopwords TRUE ^
  --stoplist-file C:\Users\[USER]\Desktop\2017_Verona\exercises\en_stopwords.txt
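Since the same import step has to be written differently on Linux and Windows, it can be convenient to build the command in Python. The following is a sketch only: `build_import_command` is a hypothetical helper, the relative paths are placeholders, and the flags are simply those from the commands above. It assembles the argument list without running MALLET.

```python
from pathlib import Path


def build_import_command(mallet_bin, exercises_dir):
    """Assemble the MALLET import-dir call as a list of arguments."""
    exercises = Path(exercises_dir)
    return [
        str(mallet_bin), "import-dir",
        "--input", str(exercises / "5_lemmata_N"),
        "--output", str(exercises / "mallet_model" / "en_lemmata_N.mallet"),
        "--keep-sequence",
        "--token-regex", r"\p{L}+",
        "--remove-stopwords", "TRUE",
        "--stoplist-file", str(exercises / "en_stopwords.txt"),
    ]


# Placeholder paths; replace them with the location of MALLET and the exercises folder.
command = build_import_command(Path("bin") / "mallet", "exercises")
print(" ".join(command))

# To actually run it (requires a local MALLET installation):
# import subprocess
# subprocess.run(command, check=True)
```

On Windows, `bin\mallet` is a batch file, so running it through `subprocess` may additionally require `shell=True`; this has not been verified here.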
/home/ulrike/Programme/mallet-2.0.8RC3/bin/mallet train-topics \
  --input /home/ulrike/Dokumente/GS/Veranstaltungen/2017_Verona/exercises/mallet_model/en_lemmata_N.mallet \
  --num-topics 30 \
  --optimize-interval 50 \
  --num-iterations 500 \
  --num-top-words 30 \
  --output-topic-keys /home/ulrike/Dokumente/GS/Veranstaltungen/2017_Verona/exercises/mallet_model/topics-with-words.txt \
  --output-doc-topics /home/ulrike/Dokumente/GS/Veranstaltungen/2017_Verona/exercises/mallet_model/topics-in-texts.txt \
  --topic-word-weights-file /home/ulrike/Dokumente/GS/Veranstaltungen/2017_Verona/exercises/mallet_model/word-weights.txt
bin\mallet train-topics ^
  --input C:\Users\[USER]\Desktop\2017_Verona\exercises\mallet_model\en_lemmata_N.mallet ^
  --num-topics 30 ^
  --optimize-interval 50 ^
  --num-iterations 500 ^
  --num-top-words 20 ^
  --output-topic-keys C:\Users\[USER]\Desktop\2017_Verona\exercises\mallet_model\topics-with-words.txt ^
  --output-doc-topics C:\Users\[USER]\Desktop\2017_Verona\exercises\mallet_model\topics-in-texts.txt ^
  --topic-word-weights-file C:\Users\[USER]\Desktop\2017_Verona\exercises\mallet_model\word-weights.txt
tmw has modules for:
tmw on GitHub:
https://github.com/cligs/tmw/tree/next
after the topic modeling:
by example (30 Portuguese novels)
for the author Camilo Castelo Branco
by subgenre
by narrative perspective
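Grouping topic scores by a metadata category such as subgenre or narrative perspective amounts to averaging the topic proportions of all documents that share a label. The sketch below shows this with invented data; the document identifiers, labels, scores, and the function name are hypothetical, and tmw may aggregate differently.

```python
from collections import defaultdict


def mean_topic_scores(doc_topics, metadata, category):
    """Average the topic proportions of all documents sharing a metadata label."""
    grouped = defaultdict(list)
    for doc_id, scores in doc_topics.items():
        grouped[metadata[doc_id][category]].append(scores)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


# Invented topic proportions (two topics) and metadata for three novels.
doc_topics = {"n1": [0.6, 0.4], "n2": [0.2, 0.8], "n3": [0.4, 0.6]}
metadata = {
    "n1": {"subgenre": "historical", "perspective": "first-person"},
    "n2": {"subgenre": "sentimental", "perspective": "third-person"},
    "n3": {"subgenre": "historical", "perspective": "third-person"},
}

by_subgenre = mean_topic_scores(doc_topics, metadata, "subgenre")
by_perspective = mean_topic_scores(doc_topics, metadata, "perspective")
print(by_subgenre)
print(by_perspective)
```

Comparing the averaged scores across labels is one simple way to spot topics that are more prominent in one group than in another.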
distinctive topics for different subgenres
collection of Portuguese novels
collection of Spanish novels
Clustering of authors by topic
Topics in the progression of text
Theory and method
Examples of Topic Modeling analyses
Tools
Slides at: https://hennyu.github.io/verona_17
tmw (next): https://github.com/cligs/tmw/tree/next
CLiGS: http://cligs.hypotheses.de/