LDA Tutorial: Exploring topics

On this page we illustrate various ways to analyze a trained LDA model. As an example, we use a 100-topic LDA model trained on 1553 articles from the Stanford Encyclopedia of Philosophy (SEP).

First, we load both a corpus (sep_corpus.npz) and a trained LDA model (sep_model.npz).

$ from vsm.util.corpustools import Corpus
$ c = Corpus.load('sep_corpus.npz')
Loading corpus from sep_corpus.npz

$ from vsm.model.ldagibbs import LDAGibbs as LDA
$ m = LDA.load('sep_model.npz')
Loading LDA-Gibbs data from sep_model.npz

To analyze an LDA model, we create a viewer object from the corpus and the model:

$ from vsm.viewer.ldagibbsviewer import LDAGibbsViewer as LDAViewer
$ v = LDAViewer(c, m)

Next, let's plot the log probabilities (plotting is not available in a remote IPython session).

$ v.logp_plot()
<module 'matplotlib.pyplot' from '/Library/Python/2.7/site-packages/matplotlib-1.3.x-py2.7-macosx-10.8-intel.egg/matplotlib/pyplot.pyc'>

[img here]

The chain appears to have roughly converged.
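
A visual check of the plot can be complemented by a rough numerical one. The following sketch is not part of vsm: roughly_converged is a hypothetical helper, the log probabilities are made-up values, and the window size and tolerance are arbitrary choices. It compares the mean of the last few iterations with the mean of the few iterations before them.

```python
def roughly_converged(log_probs, window=5, tol=1e-3):
    """Return True if the mean of the last `window` values differs from
    the mean of the preceding `window` values by less than `tol`,
    measured relative to the preceding mean."""
    if len(log_probs) < 2 * window:
        return False  # not enough iterations to compare two windows
    recent = log_probs[-window:]
    previous = log_probs[-2 * window:-window]
    mean_recent = sum(recent) / float(window)
    mean_previous = sum(previous) / float(window)
    return abs(mean_recent - mean_previous) <= tol * abs(mean_previous)

# Hypothetical log probabilities, for illustration only.
example = [-9.0e6, -8.2e6, -7.9e6, -7.81e6, -7.80e6, -7.80e6,
           -7.80e6, -7.80e6, -7.80e6, -7.80e6, -7.80e6, -7.80e6]

print(roughly_converged(example, window=3))
```

Such a check only says that the log probability has stopped changing much; it does not replace looking at the plot.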

View topics

To see all topics in the model from the command line, type

$ print v.topics()

This gives a list of all topics, each of which is a list of words and their corresponding probabilities. In an IPython notebook,

$ v.topics()

gives a compact HTML table in which each topic is represented by its 10 most probable words:

Topics Sorted by Index
Topic	Words
0	medieval, abelard, william, socrates, john, ockham, de, bacon, boethius, century
1	one, see, example, may, way, might, many, different, kind, things
2	motion, newton, space, bodies, force, body, matter, atoms, leibniz, forces
3	truth, true, propositions, theory, russell, proposition, facts, false, correspondence, fact
4	hegel, berlin, philosophy, german, religion, herder, fichte, jacobi, cohen, history
5	x, y, f, 0, set, n, algebra, b, c, function
...	........

If the probabilities of the words are needed as well as the words themselves, type

$ v.topics(compact_view=False)

which gives the words and their corresponding probabilities for each topic:

Topics Sorted by Index
Topic 0		Topic 1		Topic 2
Word	Prob	Word	Prob	Word	Prob
medieval	0.01971	one	0.04451	motion	0.04242
abelard	0.01298	see	0.01712	newton	0.02785
william	0.01152	example	0.01708	space	0.01944
socrates	0.01084	may	0.01640	bodies	0.01890
john	0.01016	way	0.01537	force	0.01337
ockham	0.00981	might	0.01476	body	0.01236
de	0.00954	many	0.01405	matter	0.01198
bacon	0.00952	different	0.01374	atoms	0.01132
boethius	0.00908	kind	0.01362	leibniz	0.00939
century	0.00892	things	0.01315	forces	0.00937

...
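
The compact view and the full view differ only in how much of each topic's word distribution is shown. As a minimal sketch (not vsm's implementation), picking the top words of a topic amounts to sorting the words by probability. Here top_words is a hypothetical helper, and the dictionary holds only a few entries of topic 2 copied from the table above; a real topic assigns a probability to every word in the vocabulary.

```python
def top_words(word_probs, n=10):
    """Return the n (word, probability) pairs with the highest probability."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# A few entries of topic 2, deliberately listed out of order.
topic_2 = {
    'space': 0.01944,
    'force': 0.01337,
    'motion': 0.04242,
    'bodies': 0.01890,
    'newton': 0.02785,
}

for word, prob in top_words(topic_2, n=3):
    print(word, prob)
```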

To see specific topics, use k_indices:

$ print v.topics(k_indices=[2,6,13])

which lists just the three topics 2, 6 and 13.

Find similar topics

Looking at the topics shown above, topic 2 seems to be related to classical physics. Are there other topics similar to it? To see similarities between topics, we use sim_top_top:

$ print v.sim_top_top([2])
--------------------
     Topics: 2
--------------------
Topic     Cosine
--------------------
2         1.00000
89        0.20470
79        0.20083
93        0.19487
21        0.17824
83        0.16438
75        0.11835
36        0.10747
51        0.09974
73        0.09409

These are the topics most similar to topic 2. Let's look at the top six topics from this list by passing k_indices to the topics method.

$ v.topics(k_indices=[2, 89, 79, 93, 21, 83])

Topics Sorted by Index
Topic	Words
2	motion, newton, space, bodies, force, body, matter, atoms, leibniz, forces
89	soul, knowledge, human, body, natural, nature, matter, things, mind, material
79	time, change, infinite, past, temporal, sequence, state, finite, chance, future
93	spacetime, theory, relativity, einstein, field, physical, physics, quantum, general, space
21	energy, bohr, principle, quantum, entropy, state, mechanics, correspondence, boltzmann, theory
83	descartes, god, leibniz, spinoza, substance, ideas, mind, malebranche, nature, substances

Hence topics related to general and contemporary physics and to modern philosophy are judged 'similar' to topic 2.
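
The 'Cosine' column suggests that topics are compared as vectors. The following is a minimal sketch of that idea, assuming each topic is represented by its vector of word probabilities (an assumption; the viewer's internal representation is not shown on this page). cosine and most_similar are hypothetical helpers, and the three toy topics over a four-word vocabulary are made up.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(topics, k, n=3):
    """Rank all topics (including k itself) by cosine similarity to topic k."""
    scores = [(j, cosine(topics[k], topics[j])) for j in range(len(topics))]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]

# Toy topic-word distributions, for illustration only.
toy_topics = [
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
]

for topic, score in most_similar(toy_topics, 0):
    print(topic, round(score, 5))
```

As in the sim_top_top output, the query topic itself comes first with a similarity of 1.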

We can also look at the similarity between each pair of topics in a given set by using simmat_topics. This returns a numpy array containing the similarity matrix for the given topics:

$ v.simmat_topics(k_indices=[2, 89, 79, 93, 21, 83])

IndexedSymmArray([[ 1.        ,  0.20469593,  0.2008326 ,  0.19487031,  0.17824172, 0.16437827],
                  [ 0.20469593,  1.        ,  0.05569496,  0.08155678,  0.07375815, 0.17739257],
                  [ 0.2008326 ,  0.05569496,  1.        ,  0.15310955,  0.19200546, 0.05160133],
                  [ 0.19487031,  0.08155678,  0.15310955,  1.        ,  0.30703874, 0.05445239],
                  [ 0.17824172,  0.07375815,  0.19200546,  0.30703874,  1.        , 0.04247789],
                  [ 0.16437827,  0.17739257,  0.05160133,  0.05445239,  0.04247789, 1.        ]])
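
Assuming the rows and columns follow the order of k_indices (which is consistent with the first row matching the similarities to topic 2 listed earlier), the matrix can be scanned for the most similar pair of distinct topics. In this sketch closest_pair is a hypothetical helper, and the values are copied from the output above into a plain list of lists rather than the returned array.

```python
labels = [2, 89, 79, 93, 21, 83]
simmat = [
    [1.0,        0.20469593, 0.2008326,  0.19487031, 0.17824172, 0.16437827],
    [0.20469593, 1.0,        0.05569496, 0.08155678, 0.07375815, 0.17739257],
    [0.2008326,  0.05569496, 1.0,        0.15310955, 0.19200546, 0.05160133],
    [0.19487031, 0.08155678, 0.15310955, 1.0,        0.30703874, 0.05445239],
    [0.17824172, 0.07375815, 0.19200546, 0.30703874, 1.0,        0.04247789],
    [0.16437827, 0.17739257, 0.05160133, 0.05445239, 0.04247789, 1.0],
]

def closest_pair(labels, matrix):
    """Return (label_i, label_j, similarity) for the largest off-diagonal entry.
    Only the upper triangle is scanned because the matrix is symmetric."""
    best = None
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if best is None or matrix[i][j] > best[2]:
                best = (labels[i], labels[j], matrix[i][j])
    return best

print(closest_pair(labels, simmat))
```

On these values the closest pair is topics 93 and 21 (similarity 0.30703874), that is, the relativity topic and the quantum mechanics topic.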

Explore topics by document

In LDA, each document in the corpus is assigned a probability distribution over topics, which characterizes the content of the document. Suppose we are interested in the SEP article on Descartes and want to know which topics are discussed in it. For this we use doc_topics:

$ print v.doc_topics('descartes.txt')
-----------------------
Document: descartes.txt
-----------------------
Topic     Prob
-----------------------
83        0.21227
89        0.19220
82        0.10778
9         0.08628
2         0.08313
59        0.04773
70        0.04472
51        0.04415
48        0.04071
52        0.03239

Let's look at the top five topics:

$ v.topics(k_indices=[83, 89, 82, 9, 2])

Topics Sorted by Index
Topic	Words
83	descartes, god, leibniz, spinoza, substance, ideas, mind, malebranche, nature, substances
89	soul, knowledge, human, body, natural, nature, matter, things, mind, material
82	work, published, first, time, new, one, years, also, could, book
9	would, even, whether, could, two, however, since, rather, another, also
2	motion, newton, space, bodies, force, body, matter, atoms, leibniz, forces

In this list we recognize terms related to modern philosophy (topic 83), the mind-body problem (topic 89), and classical physics (topic 2), as expected.
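
To get a feel for how concentrated a document's topic distribution is, one can ask how many of its top topics are needed to cover a given share of the probability mass. The sketch below works on the ten (topic, probability) pairs printed by doc_topics above, copied by hand into a list; topics_covering is a hypothetical helper and not part of vsm.

```python
# (topic, probability) pairs for descartes.txt, copied from the output above.
descartes = [
    (83, 0.21227), (89, 0.19220), (82, 0.10778), (9, 0.08628), (2, 0.08313),
    (59, 0.04773), (70, 0.04472), (51, 0.04415), (48, 0.04071), (52, 0.03239),
]

def topics_covering(doc_topic_probs, mass):
    """Return the most probable topics, in order, until their
    cumulative probability reaches `mass`."""
    ranked = sorted(doc_topic_probs, key=lambda pair: pair[1], reverse=True)
    chosen, total = [], 0.0
    for topic, prob in ranked:
        if total >= mass:
            break
        chosen.append(topic)
        total += prob
    return chosen

print(topics_covering(descartes, 0.5))
```

For these values the top three topics (83, 89 and 82) already account for just over half of the probability mass.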

Explore topics by words

One can also ask which topics are most relevant to a given word. This reveals the contexts in which a particular word is used in the corpus. Let's take the word "anthropomorphism" as an example and see which topics are related to it. To do this we use sim_word_top:

$ v.sim_word_top('anthropomorphism')

Sorted by Word Similarity
Topic	Words
76	behavior, psychology, cognitive, mental, human, mind, psychological, attention, imagery, animals
19	god, divine, world, human, religion, theological, power, christian, creation, nature
31	world, one, reality, within, experience, process, human, time, self, individual
...	.........

Topic 76 looks like psychology, whereas topic 19 is about theology. So we see that "anthropomorphism" is discussed in at least these two contexts. This makes sense: in psychology it is often debated whether one can legitimately attribute human abilities to animals, whereas in theology anthropomorphism has been a traditional point of contention concerning the relationship between God and humans.
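
One simple way to relate a word to topics is to rank topics by the probability they assign to that word. This is a simplification for illustration and is not necessarily how sim_word_top computes its ranking. In the sketch below topics_for_word is a hypothetical helper and the probabilities are made up; only the topic numbers echo the example above.

```python
def topics_for_word(word, topic_word_probs, n=3):
    """Rank topics by the probability they assign to `word`.
    Topics that do not list the word get probability 0."""
    scores = [(k, probs.get(word, 0.0)) for k, probs in topic_word_probs.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]

# Made-up word probabilities for three topics, for illustration only.
toy = {
    76: {'behavior': 0.03, 'anthropomorphism': 0.002},
    19: {'god': 0.05, 'anthropomorphism': 0.001},
    5: {'x': 0.1},
}

print(topics_for_word('anthropomorphism', toy, n=2))
```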

Cluster topics

When there are a lot of topics, some of them may form groups of related topics. In such cases, it is useful to look at clusters of topics. Our LDA viewer supports k-means, spectral clustering and affinity propagation as clustering algorithms. For a description of each algorithm see e.g. here.

Here we use k-means to illustrate topic clusters in our LDA model. The k-means algorithm requires the number of clusters to be fixed in advance. We choose 10 clusters.

$ cls = v.cluster_topics(method='k-means', n_clusters=10)
Initialization complete
Iteration 0, inertia 149.883810088
Iteration 1, inertia 84.901591324
Iteration 2, inertia 84.7033890556
Iteration 3, inertia 84.4245218161
Converged to similar centers at iteration 3

cls now contains a list of 10 clusters:

$ cls
[[19, 23, 31, 69, 83],
 [4, 7, 10, 13, 18, 27, 34, 35, 50, 53, 57, 59, 61, 86, 98],
 [1, 6, 9, 40, 47, 48, 52, 70, 82, 88, 91],
 [2, 21, 28, 51, 72, 75, 79, 93],
 [15, 24, 30, 45, 58, 65, 81, 85, 87],
 [5, 12, 42, 46, 55, 64, 84, 90],
 [11, 16, 17, 20, 29, 33, 38, 41, 43, 44, 56, 63, 66, 67, 74, 77, 78, 96],
 [8, 32, 36, 49, 73, 76, 80, 92, 97, 99],
 [0, 14, 25, 26, 39, 62, 89, 94, 95],
 [3, 22, 37, 54, 60, 68, 71]]

One can look at each cluster by using the topics method:

$ v.topics(k_indices=cls[0])

Topics Sorted by Index
Topic	Words
19	god, divine, world, human, religion, theological, power, christian, creation, nature
23	possible, worlds, world, modal, true, object, w, actual, objects, kripke
31	world, one, reality, within, experience, process, human, time, self, individual
69	god, theism, hartshorne, evil, world, universe, existence, chisholm, whitehead, process
83	descartes, god, leibniz, spinoza, substance, ideas, mind, malebranche, nature, substances

This looks like a mixture of theology, modern philosophy and possible-world semantics. As another example, let's look at cluster 5:

$ v.topics(k_indices=cls[5])

Topics Sorted by Index
Topic	Words
5	x, y, f, 0, set, n, algebra, b, c, function
12	x, set, theory, sets, y, frege, type, ph, f, axiom
42	mathematics, mathematical, proof, godel, numbers, hilbert, logic, brouwer, intuitionistic, arithmetic
46	logic, logical, reasoning, formal, calculus, rules, inference, default, ai, form
55	p, 1, b, 2, 3, see, 4, two, following, section
64	probability, evidence, e, h, probabilities, hypothesis, hypotheses, inductive, induction, p
84	logic, ph, m, b, l, g, formula, formulas, logics, semantics
90	b, p, belief, lewis, conditional, set, conditionals, k, w, probability

which forms a more coherent cluster relating to logic and formal epistemology.

Note: the clustering algorithms used in our viewer are stochastic, so you should expect to get different clusterings each time you execute the function.
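
The randomness in k-means comes from the choice of the initial cluster centers. The sketch below is a bare-bones k-means written for illustration only; it is not the viewer's implementation, and whether cluster_topics accepts a seed is not something this page shows. kmeans, its seed parameter and the toy points are all hypothetical. It shows that fixing the seed of the random number generator makes the result repeatable.

```python
import random

def kmeans(points, n_clusters, seed=0, n_iter=20):
    """Cluster `points` (lists of floats) and return a list of clusters,
    each a list of point indices. Initial centers are drawn at random
    from the points using a generator seeded with `seed`."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)
    clusters = []
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(n_clusters)]
        for idx, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(idx)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for members, old in zip(clusters, centers):
            if members:
                new_centers.append([
                    sum(points[i][d] for i in members) / float(len(members))
                    for d in range(len(old))
                ])
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    return clusters

# Toy data: two well-separated groups of two points each.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
print(kmeans(points, 2, seed=0))
```

Calling kmeans twice with the same seed returns the same clusters in the same order; with different seeds the order of the clusters, and on less clear-cut data their membership, may change.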
