Evil cause; evil effect

I found this kotowaza (Japanese proverb or axiom) particularly poignant given a natural language processing project I was working on this past weekend.

I started with a corpus of TED Talks from 2013 on OPUS, the open parallel corpus. The idea was to do some clustering, then do some topic modeling, and see how close the models might get to the (human-assigned) topic labels. (For anyone interested in a quick introduction to topic modeling and its relevance to the modern world, data science, medicine, etc., I suggest David Blei's article here.)

Sadly, because it's all based on subtitle data, there were some issues with the text. For example, a ton of words were run together (no space between them), so I used a dictionary-checking function to push them apart.
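For the curious, here is a minimal sketch of what a dictionary-based splitter can look like. This is not the exact function from my notebook: the name split_joined, the tiny vocabulary, and the "fewest words wins" rule are all assumptions made for illustration. A real run would load a full word list instead.

```python
def split_joined(token, vocabulary, max_word_len=20):
    """Split a run-together token into dictionary words.

    Returns the split with the fewest words, or None if the token
    cannot be fully covered by words from the vocabulary.
    """
    n = len(token)
    # best[i] holds the fewest-words split of token[:i], or None if none exists.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Hypothetical toy vocabulary; swap in a real dictionary file in practice.
vocabulary = {"thank", "than", "you", "very", "much", "so", "a"}

print(split_joined("thankyouverymuch", vocabulary))
print(split_joined("xyzzy", vocabulary))
```

Tokens that come back as None are better left alone (names, jargon, typos) than forced apart.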

First, I scrubbed. Then I scrubbed. Then I modeled. Then I scrubbed more. And in the end, it's hard to know if the data itself is the issue, or the algorithm, or the implementation... (it could never be me ;) Haha, OK, I did make quite a few mistakes here and there, e.g. accidentally using a tf-idf term-document matrix instead of a count matrix.)
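To show why that last mix-up matters, here is a small sketch using scikit-learn (an assumption; swap in whatever library you prefer). LDA models word counts, so it is usually fed a count matrix, while tf-idf produces fractional weights. The four toy documents and the choice of two topics are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up stand-ins for scrubbed transcripts.
docs = [
    "brain neurons memory brain science",
    "neurons science memory research",
    "music rhythm melody music piano",
    "piano melody rhythm concert",
]

counts = CountVectorizer().fit_transform(docs)  # integer word counts
tfidf = TfidfVectorizer().fit_transform(docs)   # fractional, row-normalized weights

print(counts.dtype, counts.max())
print(tfidf.dtype)

# Fit LDA on the counts, not on the tf-idf weights.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic mixture per document
print(doc_topics.shape)
```

Both matrices have the same shape, which is exactly why it is so easy to pass the wrong one along without noticing.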

Novice analysis though it is, I have (some) confidence in blaming (some of) the issues on the original data; hence the proverb up top.

So, like the open source advocate that I am, I've posted my analysis on GitHub here. Furthermore, I've put all my code in an IPython notebook here in case you want to see the code in execution (as well as some fun charting!).

If you're interested in the details of the analysis:

The Details
  • scrub 2.5 million words from here
  • term-frequency inverse-document-frequency (tf-idf)
  • dimensionality reduction
    • singular value decomposition (SVD)
    • t-distributed stochastic neighbor embedding (t-SNE)
    • k-means clustering
  • latent dirichlet allocation (LDA)
    • but that failed more or less (so far)
    • so I'll try a few more topic models soon
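The non-LDA steps above can be sketched roughly like this. It is a toy version assuming scikit-learn, not the code from my notebook: the twelve fake "talks", the 5 SVD components, the perplexity, and the 3 clusters are all placeholder choices. Running k-means on the 3D t-SNE coordinates is also an assumption; clustering the SVD output instead would be just as defensible.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Made-up stand-ins for scrubbed talk transcripts.
talks = [
    "brain neurons memory science research",
    "neurons brain learning memory experiment",
    "science experiment brain research data",
    "memory learning neurons science study",
    "music piano melody rhythm concert",
    "rhythm melody music orchestra piano",
    "concert orchestra music violin melody",
    "piano violin rhythm music song",
    "climate ocean carbon energy planet",
    "energy solar carbon climate emissions",
    "ocean planet climate warming carbon",
    "solar emissions energy planet warming",
]

# 1. tf-idf: one sparse row per talk, one column per vocabulary word.
tfidf = TfidfVectorizer().fit_transform(talks)

# 2. SVD: squash the wide sparse matrix down to a handful of dense columns.
svd_coords = TruncatedSVD(n_components=5, random_state=0).fit_transform(tfidf)

# 3. t-SNE: embed into 3 dimensions for plotting.
#    Perplexity has to be smaller than the number of samples.
xyz = TSNE(
    n_components=3, perplexity=3, init="random", random_state=0
).fit_transform(svd_coords)

# 4. k-means: assign each talk a cluster label (used for coloring the points).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(xyz)

print(tfidf.shape, svd_coords.shape, xyz.shape)
print(labels)
```

Running SVD before t-SNE is a common trick: t-SNE gets slow and noisy on very wide inputs, so it helps to hand it a few dozen dense columns instead of the raw term matrix.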

Even though LDA didn't work very well with the data I had, I still wanted to make something neat from the (initial) analysis. What I ended up doing was using a really cool interactive 3D cluster chart I saw on this post.

So I took their open source code, massaged my data a little bit, then fit it to their visualization. You can actually interact with (sorta) and zoom into the clusters here. You'll need to combine hitting a few keys, zooming, clicking, etc. to get a feel for what the visualization is like. If you hit D, you can see a top-down (2D) view, and then hit A to see the colors pop around. Enjoy!

One of my colleagues asked me: what do those clusters even represent? Well, they're approximations of huge matrices; they may or may not relate to anything (I've only spent a few days on this project). They could be a very reasonable clustering; they could be total gibberish, just algorithmically generated. I will update the GitHub repo and notebook though if anyone is curious about what my conclusions are ^__^v

I find it particularly awesome that you can take something with over 100,000 dimensions and somehow approximate it to 3 in order to let us skeletal humans actually see it.


I ran this analysis on 2.5 million words from 2013 TED Talks, then fed the results into someone else's visualization tool to end up with this.