TED Talks

Topic Modeling vs Document Clustering

I wanted to write a second post on the final project I did for the Data Science course I took at General Assembly. You can see a short presentation here, all the code here, and one cluster in a more interactive way here.

So the basic idea was to see how clustering algorithms compare to probabilistic topic models, in this case LDA. The problem in a more general sense can be thought of like this: there exists some collection of textual data that more than one person could categorize. How do you extract topics and sort or categorize the documents automatically in a repeatable, reliable way?

For my data, I used the TED Talks from 2013, which already come labeled with topics. Rather than treat those labels as categories for a supervised classification task, I used them to double-check my own results while approaching the problem with unsupervised clustering algorithms, to see how clustering compares to Latent Dirichlet Allocation.

Process
  • scrub / clean / parse
  • General EDA
  • vectorize
  • cluster (kmeans, then agglomerative)
  • count words / topics
  • compare
  • reduce dimensionality & visualize

For the scrubbing, I found out that TONS of words were combined (somehow missing a space). Also, I wanted to get rid of stopwords.

I got a huge word list and split words from each other like this:

def getRealWords(word, dictionary):
    # Return the word unchanged if it's already valid; otherwise try to
    # split it into two dictionary words (to repair the missing-space cases).
    if word in dictionary:
        return str(word)
    for i in range(1, len(word)):  # start at 1 so the prefix is never empty
        part = word[:i]
        if part in dictionary and word[i:] in dictionary:
            return str(part) + ' ' + str(word[i:])
    return str(word)  # no valid split found
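
For the dictionary itself I just needed a big set of valid English words; NLTK's words corpus is one option (an assumption on my part, since the post doesn't name the exact word list used):

import nltk
from nltk.corpus import words

nltk.download('words')  # one-time download of the English word list
dictionary = set(w.lower() for w in words.words())

print(getRealWords('worldpeace', dictionary))  # hopefully -> 'world peace'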

And I used a bunch of stopword lists from ranks.nl as well as all the stopwords in NLTK for English.

Vectorize

The sklearn tfidf vectorizer made that ridiculously easy. I truncated the terms to include just the top 10,000 most important words (by tfidf values), though I'm not sure if that's reflected in the notebook (I had a bunch of versions and just put up one that gives the general idea).
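
The vectorizing step was roughly this (a minimal sketch; documents and custom_stopwords stand in for the cleaned transcripts and the combined stopword list, and max_features keeps the most frequent terms, which only approximates the tfidf-based truncation described above):

from sklearn.feature_extraction.text import TfidfVectorizer

# documents: list of cleaned transcript strings; custom_stopwords: ranks.nl + NLTK English stopwords
vectorizer = TfidfVectorizer(stop_words=custom_stopwords, max_features=10000)
tfidf_matrix = vectorizer.fit_transform(documents)
print(tfidf_matrix.shape)  # (number of talks, 10000)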

Note: from clustering to dimensionality reduction to visualization, I compared hierarchical agglomerative clustering and kmeans clustering. I ended up pretty much sticking with kmeans for the final versions of everything because it performed better (although they were relatively close).
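
A rough sketch of that comparison, assuming the tfidf_matrix from the vectorizing step (the value of k is illustrative here; choosing it is discussed below):

from sklearn.cluster import KMeans, AgglomerativeClustering

km = KMeans(n_clusters=10, random_state=0)
km_labels = km.fit_predict(tfidf_matrix)

# hierarchical agglomerative clustering needs a dense array
agg = AgglomerativeClustering(n_clusters=10)
agg_labels = agg.fit_predict(tfidf_matrix.toarray())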

Topic Count

Topics as they are listed in the actual data, just a literal count:

[figure: counts of the pre-assigned topic labels]

Now this is some sort of gold standard; the talks are already labeled. The only place I actually used this data was when I guessed what k should be. In the previous image you can see that after the topic health, the remaining topics are used at pretty similar rates, in a slow, steady decline. The main topics are obviously the first ~10 or so.

but how'd u kno k=10 dawg?

I did try to pretend that I didn't know how many topics there were, but in the end you're going to make a decision somehow. The between sum of squares (the total sum of squares minus the within-cluster sum of squares, where the within-cluster sum of squares is the sum of squared distances from each point to its own cluster's mean) divided by the total sum of squares (the sum of squared distances from each point to the global sample mean) is one way of measuring goodness of fit for clustering. That metric gives us just about .75 when k=10, meaning basically that 75% of the variance is explained by these clusters.

When k=20, we get .78, and when k=40, we get .81. So if quadrupling the number of clusters only gives us 6% more variance explained, I figured that even if I hadn't had the topics from the actual data, I probably would've picked a relatively similar k anyway, since ideally I want a small number of clusters that represent distinct topic groupings.
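
That ratio can be computed straight from KMeans, since inertia_ is the within-cluster sum of squares (a sketch assuming the tfidf_matrix from earlier; exact numbers will depend on the data and the random seed):

from sklearn.cluster import KMeans

X = tfidf_matrix.toarray()
total_ss = ((X - X.mean(axis=0)) ** 2).sum()    # total SS: squared distances to the global mean

for k in (10, 20, 40):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    between_ss = total_ss - km.inertia_          # inertia_ is the within-cluster SS
    print(k, round(between_ss / total_ss, 2))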

Top Terms per Cluster (kmeans, k=10)

So if we look at the topics as they're assigned (to see if our clusters are representing any legitimate groupings in relation to the already-assigned topic labels), we see a few clusters that represent distinct themes.

Topic 1: Everything

technology       160
design           145
culture          116
science           96
business          73
entertainment     66
arts              52
art               49
creativity        42
education         40
Well, that seems pretty general. Can't take too much from that, except that it talks a lot about technology, design, and culture... This is the most general category; it covers the T., E., and D. of TED. Technology is number one, with design a close second; as we move on, we start to see some of our other topics show up.

Topic 2: Everything Pt.2

culture          119
entertainment     78
issues            71
global            71
arts              49
storytelling      46
technology        45
design            40
education         38
business          35
Much more focused here on culture, entertainment, and global issues.

Topic 3: Green Tech?

technology     81
science        53
design         51
business       42
global         42
issues         42
environment    36
green          35
energy         30
invention      25
Hmm, once again, technology, design, business, but focusing more towards green energy at the end there.

Topic 4 : Politics / Global Issues

issues        106
global        106
politics       49
culture        44
business       34
economics      31
health         29
Africa         28
technology     28
war            22
Here we actually see a departure from the earlier topics; we're talking more about global issues, business, war, and politics.

Topic 5 : Art?

music            38
entertainment    37
performance      19
talk             16
arts             16
short            16
technology       16
design           13
live             12
culture          11

This is obviously more about art ! Music, performance, live performances.

Topic 6 : Health

science       53
technology    37
medicine      25
health        22
brain         19
biology       15
cancer        10
care          10
research       9
medical        9

Topic 7: Oceans

science        33
oceans         28
technology     15
issues         12
mission        12
global         12
fish           12
blue           12
environment    10
exploration    10

Topic 8: Space

science        27
physics        17
universe       17
technology     16
astronomy      12
cosmos          6
space           6
exploration     6
education       4
change          4

Topic 9: Robots

robots           12
technology       12
design            8
science           5
entertainment     3
engineering       3
evolution         3
animals           2
demo              2
AI                2

Topic 10: ?

animals         3
issues          2
oceans          2
global          2
science         2
biodiversity    1
storytelling    1
culture         1
photography     1
creativity      1

So across 10 clusters we see a few distinct topics. Now what if we just counted the top words in each cluster?

Topic 1: Super general ! Unhelpful !

thing     3388
people    2888
time      2106
kind      1684
year      1571
world     1538
work      1523
lot       1136
life      1091
idea      1047
dtype: int64
----------------------------------------------------------------------

Topic 2: Getting closer...

people     3856
world      2048
thing      1683
year       1594
time       1378
country    1027
life        879
lot         798
good        795
problem     772
dtype: int64
----------------------------------------------------------------------

Topic 3: Still very general...

thing     390
people    366
time      311
music     310
world     264
year      240
good      233
life      218
sound     206
kind      189
dtype: int64
----------------------------------------------------------------------

Topic 4: Women ! 

woman     927
people    632
year      492
time      472
story     463
child     450
thing     428
girl      425
world     413
life      384
dtype: int64
----------------------------------------------------------------------

Topic 5: Health / Medicine / Biology

brain      975
cell       774
people     563
thing      544
cancer     520
time       476
year       465
patient    357
life       332
body       329
dtype: int64
----------------------------------------------------------------------

Topic 6: Ecology / Ocean / Global

year      595
water     490
ocean     471
thing     449
time      433
life      353
people    334
world     302
planet    271
earth     259
dtype: int64
----------------------------------------------------------------------

Topic 7: Food 

food      450
people    307
year      276
thing     269
plant     210
world     205
time      195
lot       166
kind      159
tree      152
dtype: int64
----------------------------------------------------------------------

Topic 8: Energy

energy        426
car           424
people        410
thing         406
year          385
time          282
world         262
technology    241
oil           225
city          218
dtype: int64
----------------------------------------------------------------------

Topic 9: Space

universe    462
galaxy      222
thing       205
star        199
year        199
space       198
planet      167
time        165
earth       153
life        140
dtype: int64
----------------------------------------------------------------------

Topic 10: ???

robot     344
thing      60
foot       59
animal     58
time       54
leg        49
doe        45
people     42
work       41
kind       37
dtype: int64
----------------------------------------------------------------------

So even just counting the top occurring words does help us learn a little more about certain clusters, but for the most part, very general words like "thing" and "time" sort of obscure what we're really trying to get at.

LDA

So when I used LDA, we see similar topics, but pretty much after the first four groupings it's hard to tell what the groupings are really about... (a sketch of how these topic-term lists can be generated follows the output below)

0.015	cell 
 0.011	patient 
 0.011	food 
 0.008	disease 
 0.008	cancer 
 0.007	body 
 0.007	brain 
 0.006	heart 
 0.006	people 
 0.006	year
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0.014	space 
 0.009	universe 
 0.008	particle 
 0.007	thing 
 0.007	earth 
 0.007	light 
 0.006	planet 
 0.006	tree 
 0.006	theory 
 0.006	time
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0.017	brain 
 0.014	human 
 0.009	thing 
 0.008	people 
 0.006	life 
 0.006	time 
 0.006	year 
 0.006	gene 
 0.004	evolution 
 0.004	genome
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0.014	people 
 0.013	world 
 0.013	country 
 0.009	africa 
 0.009	year 
 0.007	woman 
 0.006	government 
 0.005	war 
 0.005	aid 
 0.005	india
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0.026	people 
 0.012	world 
 0.012	thing 
 0.006	time 
 0.005	kind 
 0.005	idea 
 0.005	good 
 0.005	year 
 0.005	work 
 0.004	lot
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0.010	life 
 0.008	music 
 0.008	compassion 
 0.007	people 
 0.006	time 
 0.006	sound 
 0.005	thing 
 0.005	world 
 0.005	god 
 0.004	year
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0.013	year 
 0.009	technology 
 0.008	thing 
 0.008	people 
 0.007	energy 
 0.007	time 
 0.006	water 
 0.006	percent 
 0.005	world 
 0.005	system
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0.011	thing 
 0.010	kind 
 0.007	time 
 0.006	water 
 0.006	animal 
 0.006	data 
 0.005	ocean 
 0.005	lot 
 0.005	robot 
 0.005	design
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0.013	thing 
 0.010	people 
 0.010	time 
 0.007	work 
 0.007	year 
 0.006	day 
 0.006	life 
 0.005	kid 
 0.005	story 
 0.005	school
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0.009	people 
 0.007	language 
 0.007	baby 
 0.007	child 
 0.006	love 
 0.006	time 
 0.005	year 
 0.005	thing 
 0.004	learning 
 0.004	english
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
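
For reference, topic-term lists like the ones above can be produced with something along these lines (the post doesn't say which LDA implementation was used, so sklearn's LatentDirichletAllocation and the variable names here are assumptions; note that LDA works on raw term counts rather than tf-idf weights):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# documents and custom_stopwords as in the vectorizing step
count_vec = CountVectorizer(stop_words=custom_stopwords, max_features=10000)
counts = count_vec.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(counts)

# print the ten highest-weight terms in each topic
terms = count_vec.get_feature_names_out()
for topic in lda.components_:
    top = topic.argsort()[::-1][:10]
    print(' '.join(terms[i] for i in top))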

Dimensionality Reduction

Google:

In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.

Visualizing data is important... (See Anscombe's Quartet)

However, it's quite hard to visualize thousands of dimensions. So below, I plot our clusters in 3 dimensions; yes, we are losing tons of data! But this reduction allows us to actually see it. To do these reductions, I compare the following dimensionality reduction algorithms.

Principal Component Analysis

Simply put, PCA is a way of finding the most important parts of some data set. More exactly, it's an orthogonal transformation of the observations into some number of linearly uncorrelated variables, in this case trying to summarize thousands of dimensions in three. Basically, the first principal component is the component that accounts for the highest variance in the data (it explains the most), and each subsequent component explains the next-most variance while also being orthogonal (i.e. uncorrelated) to the previous components.
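
In sklearn that reduction is a couple of lines (a sketch with the tfidf_matrix assumed from earlier; PCA wants a dense array, so the sparse matrix is densified first):

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
coords_3d = pca.fit_transform(tfidf_matrix.toarray())
print(pca.explained_variance_ratio_)  # variance kept by each of the three components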

Singular Value Decomposition

Very similar to PCA. As a gross simplification: SVD is a way of factorizing a large matrix into three parts. Those three parts can re-create the matrix exactly, so by keeping only the largest components we can make a smaller, approximate copy of the original.
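
sklearn's TruncatedSVD does exactly that keep-only-the-top-components approximation, and it works directly on the sparse tf-idf matrix (again a sketch with assumed variable names):

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=3, random_state=0)
coords_3d = svd.fit_transform(tfidf_matrix)  # no dense copy needed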

t-Distributed Stochastic Neighbor Embedding

This is a fascinating algorithm. It has a few main parts. First, it creates a probability distribution that represents similarity between points in the high-dimensional space. Then it creates a similar probability distribution over the low-dimensional space and minimizes the difference between the two (the Kullback-Leibler divergence).
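
A minimal sketch with sklearn's TSNE (in practice it's common to reduce to a few dozen dimensions first, which is what the double-reduction idea below does; applying it straight to the dense tf-idf matrix as here is slow but works):

from sklearn.manifold import TSNE

tsne = TSNE(n_components=3, random_state=0)
coords_3d = tsne.fit_transform(tfidf_matrix.toarray())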

Independent Component Analysis

Wikipedia:

ICA finds the independent components (also called factors, latent variables or sources) by maximizing the statistical independence of the estimated components.

Basically:

Typical algorithms for ICA use centering (subtract the mean to create a zero mean signal), whitening (usually with the eigenvalue decomposition), and dimensionality reduction as preprocessing steps in order to simplify and reduce the complexity of the problem for the actual iterative algorithm. Whitening and dimension reduction can be achieved with principal component analysis or singular value decomposition.
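
As a sketch, sklearn's FastICA handles the centering and whitening for you (variable names assumed as before):

from sklearn.decomposition import FastICA

ica = FastICA(n_components=3, random_state=0)
coords_3d = ica.fit_transform(tfidf_matrix.toarray())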

I compared quite a few different combinations. A large assumption I was making (or exploring) was that using more than one type of dimensionality reduction would work best, so each time I compared two types. I did test using only one and ended up liking the double reduction better.

Some were not so great, like PCA then t-SNE:

[figure: 3-D scatter of the clusters after PCA then t-SNE]

Some were alright, PCA then SVD for example:

[figure: 3-D scatter of the clusters after PCA then SVD]

Note that the second number always needs to be 3, since we're trying to create visualizations for humans. I tried varying the first number (the intermediate dimensionality) too, but there are of course an infinite number of variations, so I ended up just picking a few.
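
Concretely, a double reduction looks something like this (the numbers are illustrative; the "first number" is the intermediate dimensionality and the "second number" is fixed at 3):

from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# first reduction: sparse tf-idf matrix down to 50 dense dimensions
intermediate = TruncatedSVD(n_components=50, random_state=0).fit_transform(tfidf_matrix)

# second reduction: down to 3 dimensions for plotting
coords_3d = TSNE(n_components=3, random_state=0).fit_transform(intermediate)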

Better Visualization

So even though we have these nice clusters, they're sort of hard to see, since we can't spin them around, turn certain clusters on and off, or zoom in at all. After reading an interesting post by datacratic on dimensionality reduction, I saw they had open-sourced their visualization, listed on GitHub as datacratic/data-projector. The data projector takes data that looks like this:

			y				x				z	   cid
0	-13.0348188266	-10.0552407715	34.6405097522	4
1	-15.0734196879	-16.0528409209	0.0902980978558	7
2	16.0918851228	-29.1836998758	-11.0857616069	3
3	-0.935772800338	21.9657836093	5.74032885375	0
4	-11.3979443438	-18.4993158635	-7.80118007466	7

where each cid is a cluster id. After translating my data into that format, I forked the repo, put my data in, and put it here.
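
Translating the coordinates and cluster labels into that layout is a small pandas step (the exact file name and delimiter data-projector expects aren't spelled out here, so treat those as assumptions):

import pandas as pd

# coords_3d: 3-D coordinates from the reduction step; km_labels: k-means cluster assignments
df = pd.DataFrame(coords_3d, columns=['y', 'x', 'z'])  # column order mirrors the sample above
df['cid'] = km_labels
df.to_csv('data.tsv', sep='\t')  # assumed filename; one row per talk, cid = cluster id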