LDA Tutorial: Exploring topics

junotk edited this page Apr 16, 2013 · 3 revisions

On this page we illustrate various ways to analyze a trained LDA model. As an example, we use a 100-topic LDA model trained on 1553 articles from the Stanford Encyclopedia of Philosophy (SEP).

First, we load both a corpus (sep_corpus.npz) and a trained LDA model (sep_model.npz).

$ from vsm.util.corpustools import Corpus
$ c = Corpus.load('sep_corpus.npz')
Loading corpus from sep_corpus.npz

$ from vsm.model.ldagibbs import LDAGibbs as LDA
$ m = LDA.load('sep_model.npz')
Loading LDA-Gibbs data from sep_model.npz

To analyze an LDA model, we create a viewer object from the corpus and the model:

$ from vsm.viewer.ldagibbsviewer import LDAGibbsViewer as LDAViewer
$ v = LDAViewer(c, m)

Let's start by plotting the log probabilities (plotting is not available in a remote IPython session).

$ v.logp_plot()
<module 'matplotlib.pyplot' from '/Library/Python/2.7/site-packages/matplotlib-1.3.x-py2.7-macosx-10.8-intel.egg/matplotlib/pyplot.pyc'>

[img here]

The chain appears to have roughly converged.
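Judging convergence by eye can be supplemented with a rough numerical check. The following is a minimal sketch, not part of the viewer: it assumes the log probabilities are available as a plain sequence, and the function name, window size, and tolerance are all hypothetical choices. It compares the mean of the last window of iterations with the mean of the window before it.

```python
import numpy as np

def roughly_converged(log_probs, window=50, tol=0.01):
    """Heuristic check: has the mean log probability stopped changing?

    Compares the mean over the last `window` iterations with the mean over
    the `window` iterations before that. `window` and `tol` are arbitrary
    illustrative defaults, not values taken from the library.
    """
    lp = np.asarray(log_probs, dtype=float)
    if len(lp) < 2 * window:
        raise ValueError("need at least 2 * window iterations")
    prev = lp[-2 * window:-window].mean()
    last = lp[-window:].mean()
    # Relative change between the two window means.
    return bool(abs(last - prev) <= tol * abs(prev))

# Synthetic traces (made up for illustration, not from the SEP model).
iters = np.arange(300)
flat_trace = -1000.0 - 500.0 * np.exp(-iters / 20.0)   # levels off near -1000
rising_trace = -2000.0 + 5.0 * iters                   # still climbing

converged_flat = roughly_converged(flat_trace)
converged_rising = roughly_converged(rising_trace)
print(converged_flat)
print(converged_rising)
```

A check like this only says that the trace has flattened; it does not by itself show that the sampler has reached a good region of the posterior.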

View topics

To see all topics in the model from the command line, type

$ print v.topics()

This gives a list of all topics, each of which is a list of words and their corresponding probabilities. In an IPython notebook,

$ v.topics()

gives a compact HTML table in which each topic is represented by its 10 most probable words:

Topics Sorted by Index
Topic Words
0 medieval, abelard, william, socrates, john, ockham, de, bacon, boethius, century
1 one, see, example, may, way, might, many, different, kind, things
2 motion, newton, space, bodies, force, body, matter, atoms, leibniz, forces
3 truth, true, propositions, theory, russell, proposition, facts, false, correspondence, fact
4 hegel, berlin, philosophy, german, religion, herder, fichte, jacobi, cohen, history
5 x, y, f, 0, set, n, algebra, b, c, function
... ........
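The idea behind the compact view can be sketched with plain NumPy. This is only an illustration under assumptions: the topic-word matrix, the vocabulary, and the helper name top_words below are made up and are not the viewer's internals.

```python
import numpy as np

# Hypothetical toy data: 2 topics over a 5-word vocabulary (not the SEP model).
vocab = np.array(['motion', 'god', 'space', 'soul', 'force'])
topic_word = np.array([
    [0.40, 0.05, 0.30, 0.05, 0.20],   # topic 0
    [0.05, 0.45, 0.05, 0.40, 0.05],   # topic 1
])

def top_words(topic_word, vocab, k, n_words=10):
    """Return the n_words most probable (word, probability) pairs for topic k."""
    row = topic_word[k]
    order = np.argsort(row)[::-1][:n_words]   # indices by descending probability
    return [(str(vocab[i]), float(row[i])) for i in order]

top0 = top_words(topic_word, vocab, 0, n_words=3)
print(top0)
```

Each row of the matrix is one topic's distribution over the vocabulary, so sorting a row in descending order yields that topic's most characteristic words.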

If you need not only the words that characterize each topic but also their probabilities, type

$ v.topics(compact_view=False)

which gives the words and their corresponding probabilities for each topic:

Topics Sorted by Index
Topic 0               Topic 1                Topic 2
Word      Prob        Word       Prob        Word     Prob
medieval  0.01971     one        0.04451     motion   0.04242
abelard   0.01298     see        0.01712     newton   0.02785
william   0.01152     example    0.01708     space    0.01944
socrates  0.01084     may        0.01640     bodies   0.01890
john      0.01016     way        0.01537     force    0.01337
ockham    0.00981     might      0.01476     body     0.01236
de        0.00954     many       0.01405     matter   0.01198
bacon     0.00952     different  0.01374     atoms    0.01132
boethius  0.00908     kind       0.01362     leibniz  0.00939
century   0.00892     things     0.01315     forces   0.00937
...

To see specific topics, pass their indices as k_indices:

$ print v.topics(k_indices=[2,6,13])

which lists just three topics, 2, 6 and 13.

Find similar topics

Looking at the topics shown above, topic 2 seems to be related to classical physics. Are there other topics similar to it? To see similarities between topics, we use sim_top_top:

$ print v.sim_top_top([2])
--------------------
     Topics: 2
--------------------
Topic     Cosine
--------------------
2         1.00000
89        0.20470
79        0.20083
93        0.19487
21        0.17824
83        0.16438
75        0.11835
36        0.10747
51        0.09974
73        0.09409

These are the topics most similar to topic 2. Let's view the first six topics in this list (topic 2 itself and its five closest neighbours) by passing k_indices to the topics method.

$ v.topics(k_indices=[2, 89, 79, 93, 21, 83])
Topics Sorted by Index
Topic Words
2 motion, newton, space, bodies, force, body, matter, atoms, leibniz, forces
89 soul, knowledge, human, body, natural, nature, matter, things, mind, material
79 time, change, infinite, past, temporal, sequence, state, finite, chance, future
93 spacetime, theory, relativity, einstein, field, physical, physics, quantum, general, space
21 energy, bohr, principle, quantum, entropy, state, mechanics, correspondence, boltzmann, theory
83 descartes, god, leibniz, spinoza, substance, ideas, mind, malebranche, nature, substances

Hence the topics related to general and contemporary physics and to modern philosophy are judged 'similar' to topic 2.
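The Cosine column in the sim_top_top output can be read as a cosine similarity between topic vectors. A minimal sketch of that computation follows, assuming each topic is represented by its row of word probabilities; the toy matrix and the function name cosine_to_all are hypothetical, and the viewer may prepare its vectors differently.

```python
import numpy as np

def cosine_to_all(topic_word, k):
    """Cosine similarity between topic k and every topic (including itself)."""
    # Scale each row to unit length so a dot product equals the cosine.
    unit = topic_word / np.linalg.norm(topic_word, axis=1, keepdims=True)
    return unit.dot(unit[k])

# Hypothetical toy topic-word matrix: 3 topics, 4 words.
topic_word = np.array([
    [0.5, 0.3, 0.1, 0.1],
    [0.4, 0.4, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
])

sims = cosine_to_all(topic_word, 0)
ranking = np.argsort(sims)[::-1]   # most similar first; topic 0 itself leads
print(ranking)
print(sims[ranking])
```

As in the output above, the query topic always comes first with a similarity of 1, so the interesting neighbours start at the second row.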

We can also look at the similarity between every pair in a given set of topics by using simmat_topics. This returns a NumPy array containing the similarity matrix for the given topics, with rows and columns in the order given by k_indices:

$ v.simmat_topics(k_indices=[2, 89, 79, 93, 21, 83])

IndexedSymmArray([[ 1.        ,  0.20469593,  0.2008326 ,  0.19487031,  0.17824172, 0.16437827],
                  [ 0.20469593,  1.        ,  0.05569496,  0.08155678,  0.07375815, 0.17739257],
                  [ 0.2008326 ,  0.05569496,  1.        ,  0.15310955,  0.19200546, 0.05160133],
                  [ 0.19487031,  0.08155678,  0.15310955,  1.        ,  0.30703874, 0.05445239],
                  [ 0.17824172,  0.07375815,  0.19200546,  0.30703874,  1.        , 0.04247789],
                  [ 0.16437827,  0.17739257,  0.05160133,  0.05445239,  0.04247789, 1.        ]])
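The structure of this matrix (symmetric, with ones on the diagonal) follows directly from computing cosine similarity for every pair. Here is a small sketch under the same assumption that topics are rows of a topic-word matrix; cosine_matrix and the random toy data are illustrative names and values only.

```python
import numpy as np

def cosine_matrix(topic_word, k_indices):
    """Pairwise cosine similarities among the selected topics."""
    rows = topic_word[k_indices]
    unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    return unit.dot(unit.T)   # entry (i, j) compares k_indices[i] and k_indices[j]

# Hypothetical toy model: 6 random topics over a 20-word vocabulary.
rng = np.random.RandomState(0)
topic_word = rng.dirichlet(np.ones(20) * 0.5, size=6)

sim = cosine_matrix(topic_word, [0, 2, 5])
print(np.round(sim, 3))
```

Row i of the result corresponds to the i-th entry of k_indices, which is why the first row of the matrix above repeats the values listed by sim_top_top for topic 2.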

Explore topics by document

In LDA, each document in the corpus is assigned a probability distribution over topics, which characterizes the content of the document. Suppose we are interested in the SEP article on Descartes and want to know which topics it discusses. For this we use doc_topics:

$ print v.doc_topics('descartes.txt')
-----------------------
Document: descartes.txt
-----------------------
Topic     Prob
-----------------------
83        0.21227
89        0.19220
82        0.10778
9         0.08628
2         0.08313
59        0.04773
70        0.04472
51        0.04415
48        0.04071
52        0.03239

Let's look at the top five topics:

$ v.topics(k_indices=[83, 89, 82, 9, 2])
Topics Sorted by Index
Topic Words
83 descartes, god, leibniz, spinoza, substance, ideas, mind, malebranche, nature, substances
89 soul, knowledge, human, body, natural, nature, matter, things, mind, material
82 work, published, first, time, new, one, years, also, could, book
9 would, even, whether, could, two, however, since, rather, another, also
2 motion, newton, space, bodies, force, body, matter, atoms, leibniz, forces

In this list we recognize terms related to modern philosophy (topic 83), the mind-body problem (topic 89), and classical physics (topic 2), as expected.
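Conceptually, the doc_topics listing is one row of a document-topic matrix sorted by probability. The sketch below shows that lookup on made-up data; the labels, the matrix values, and the helper top_topics are assumptions for illustration and do not reflect how the viewer stores its data.

```python
import numpy as np

# Hypothetical toy data: 2 documents, 3 topics; each row sums to 1.
doc_labels = ['a.txt', 'b.txt']
doc_topic = np.array([
    [0.1, 0.6, 0.3],
    [0.7, 0.2, 0.1],
])

def top_topics(doc_topic, doc_labels, label, n=10):
    """Return the n most probable (topic index, probability) pairs for a document."""
    row = doc_topic[doc_labels.index(label)]
    order = np.argsort(row)[::-1][:n]   # topic indices by descending probability
    return [(int(k), float(row[k])) for k in order]

best = top_topics(doc_topic, doc_labels, 'a.txt', n=2)
print(best)
```

The topic indices returned this way can be passed on to a topic listing, just as the top five topics for descartes.txt were passed to v.topics above.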

Explore topics by words

One can also ask which topics are most relevant to a given word. This reveals the contexts in which a particular word is used in the corpus. Let's take the word "anthropomorphism" as an example and see which topics are related to it. To do this we use sim_word_top:

$ v.sim_word_top('anthropomorphism')
Sorted by Word Similarity
Topic Words
76 behavior, psychology, cognitive, mental, human, mind, psychological, attention, imagery, animals
19 god, divine, world, human, religion, theological, power, christian, creation, nature
31 world, one, reality, within, experience, process, human, time, self, individual
... .........

Topic 76 looks like psychology, whereas topic 19 is about theology. So we see that "anthropomorphism" is discussed in at least these two contexts. This makes sense: in psychology it is often debated whether one can legitimately attribute human abilities to animals, whereas in theology anthropomorphism has been a traditional point of contention concerning the relationship between God and humans.
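One simple way to think about relating a word to topics is to compare how much probability each topic gives that word. The sketch below ranks topics by the word's normalized column in a toy topic-word matrix. This is an assumption made for illustration and is not necessarily the similarity measure that sim_word_top uses internally; topics_for_word and the data are hypothetical.

```python
import numpy as np

# Hypothetical toy data: 3 topics over a 3-word vocabulary.
vocab = ['god', 'animal', 'mind']
topic_word = np.array([
    [0.7, 0.1, 0.2],   # topic 0
    [0.1, 0.6, 0.3],   # topic 1
    [0.2, 0.2, 0.6],   # topic 2
])

def topics_for_word(topic_word, vocab, word):
    """Rank topics by the share of the word's probability mass they hold."""
    col = topic_word[:, vocab.index(word)]
    weights = col / col.sum()              # normalize across topics
    order = np.argsort(weights)[::-1]
    return [(int(k), float(weights[k])) for k in order]

ranked = topics_for_word(topic_word, vocab, 'mind')
print(ranked)
```

A word that ranks highly in several unrelated topics, as "anthropomorphism" does here, is a sign that it is used in more than one context.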

Cluster topics

When there are many topics, some of them may form groups of related topics. In such cases, it is useful to look at clusters of topics. Our LDA viewer supports k-means, spectral clustering, and affinity propagation as clustering algorithms. For a description of each algorithm see e.g. here.

Here we use k-means to illustrate topic clusters in our LDA model. The k-means algorithm requires the number of clusters to be fixed in advance. We choose 10 clusters.

$ cls = v.cluster_topics(method='k-means', n_clusters=10)
Initialization complete
Iteration 0, inertia 149.883810088
Iteration 1, inertia 84.901591324
Iteration 2, inertia 84.7033890556
Iteration 3, inertia 84.4245218161
Converged to similar centers at iteration 3

cls now contains a list of 10 clusters:

$ cls
[[19, 23, 31, 69, 83],
 [4, 7, 10, 13, 18, 27, 34, 35, 50, 53, 57, 59, 61, 86, 98],
 [1, 6, 9, 40, 47, 48, 52, 70, 82, 88, 91],
 [2, 21, 28, 51, 72, 75, 79, 93],
 [15, 24, 30, 45, 58, 65, 81, 85, 87],
 [5, 12, 42, 46, 55, 64, 84, 90],
 [11, 16, 17, 20, 29, 33, 38, 41, 43, 44, 56, 63, 66, 67, 74, 77, 78, 96],
 [8, 32, 36, 49, 73, 76, 80, 92, 97, 99],
 [0, 14, 25, 26, 39, 62, 89, 94, 95],
 [3, 22, 37, 54, 60, 68, 71]]

One can look at each cluster by using the topics function:

$ v.topics(k_indices=cls[0])
Topics Sorted by Index
Topic Words
19 god, divine, world, human, religion, theological, power, christian, creation, nature
23 possible, worlds, world, modal, true, object, w, actual, objects, kripke
31 world, one, reality, within, experience, process, human, time, self, individual
69 god, theism, hartshorne, evil, world, universe, existence, chisholm, whitehead, process
83 descartes, god, leibniz, spinoza, substance, ideas, mind, malebranche, nature, substances

This cluster looks like a mixture of theology, modern philosophy, and possible-world semantics. As another example, let's look at cluster 5:

$ v.topics(k_indices=cls[5])
Topics Sorted by Index
Topic Words
5 x, y, f, 0, set, n, algebra, b, c, function
12 x, set, theory, sets, y, frege, type, ph, f, axiom
42 mathematics, mathematical, proof, godel, numbers, hilbert, logic, brouwer, intuitionistic, arithmetic
46 logic, logical, reasoning, formal, calculus, rules, inference, default, ai, form
55 p, 1, b, 2, 3, see, 4, two, following, section
64 probability, evidence, e, h, probabilities, hypothesis, hypotheses, inductive, induction, p
84 logic, ph, m, b, l, g, formula, formulas, logics, semantics
90 b, p, belief, lewis, conditional, set, conditionals, k, w, probability

This forms a more coherent cluster, relating to logic and formal epistemology.

Note: the clustering algorithms used in our viewer are stochastic, so you should expect to get different clusterings each time you execute the function.
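To make the source of this randomness concrete, here is a minimal NumPy sketch of k-means (Lloyd's algorithm) with random initialization. It is an illustration only and not the implementation the viewer calls; the function names, the seed parameter, and the toy points are hypothetical, and nothing here implies that cluster_topics accepts a seed.

```python
import numpy as np

def kmeans(X, n_clusters, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; the random start is why results can vary."""
    rng = np.random.RandomState(seed)
    # Pick n_clusters distinct data points as the initial centers.
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n, k).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def as_index_lists(labels, n_clusters):
    """Convert a label array into a list of index lists, one per cluster."""
    return [np.flatnonzero(labels == j).tolist() for j in range(n_clusters)]

# Hypothetical toy "topics": two well-separated groups of 2-d points.
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0]])

clusters = as_index_lists(kmeans(X, 2, seed=0), 2)
print(sorted(clusters))
```

Fixing the seed makes this sketch repeatable. With real topic vectors the groups are rarely this well separated, so different starting points can lead to genuinely different clusterings.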