Language processing


“The trophy doesn’t fit in the suitcase because it is too big.” - Geoff Hinton

“The trophy doesn’t fit in the suitcase because it is too small.” - Geoff Hinton

When we try to get algorithms to make sense of the ambiguity of human language, we begin to appreciate just how much we take for granted. We hardly notice the tiny feats of disambiguation our brains perform when reading sentences like the ones above: in the first, "it" is the trophy; in the second, "it" is the suitcase.

“You shall know a word by the company it keeps.” - J.R. Firth

What kinds of language-oriented tasks might we be interested in?

This chapter is about applications of machine learning to natural language processing (NLP). Like ML, NLP is a nebulous term with several definitions, most of which have something to do with making sense of text. This chapter will take a broad view of NLP.

“Deep Learning waves have lapped at the shores of computational linguistics for several years now, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences.” -Dr. Christopher D. Manning, Dec 2015

  • check manning pdf statement

quote/link from:

word vectors

Word vectors are a representation in which relationships between words are preserved as geometric relationships in the embeddings (e.g. the king/queen analogy, and its reverse).
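The king/queen idea can be made concrete with a small sketch. The vectors below are hand-made toy values, not learned embeddings, and the `analogy` helper is a hypothetical name used only for illustration. A real word2vec model would have hundreds of dimensions and a much larger vocabulary, so treat this as a picture of the arithmetic rather than of any particular library.

```python
import math

# Toy 3-dimensional vectors, hand-made for illustration (NOT learned embeddings).
# The dimensions loosely stand for: royalty, maleness, femaleness.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("man", "king", "woman", vectors)    # man : king :: woman : ?
reverse = analogy("woman", "queen", "man", vectors)  # woman : queen :: man : ?
print(result, reverse)
```

With these toy values the forward analogy lands on "queen" and the reversed one on "king". The input words are excluded from the candidates because the nearest vector to the target is often one of the inputs themselves.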

Cover tf-idf in detail (link to it from the t-SNE chapter). Link to the t-SNE chapter from here.

Since then, a number of writings have tried to interpret these word vectors, e.g. rejecting the gender binary.

tf-idf examples -> t-SNE examples

aparrish generative poetry


  • analogies
  • kcimc antonyms
  • rejecting gender binary

tf-idf -> t-SNE

LSA + LDA -> t-SNE

RNNs annotating?

word2vec chapter

  • anything2vec


  • NeuralTalk and Walk

Mario RNN

attention + DRAW

the trophy can’t fit into the suitcase because it’s too big (it = trophy)
the trophy can’t fit into the suitcase because it’s too small (it = suitcase)

colah word2vec

metal + NLP

lda2vec Chris Moody hybrid word2vec and LDA

historical word embeddings



Language modeling a billion words

demystifying word2vec

keras word2vec

hotel reviews: rejecting gender binary

Voynich Manuscript: word vectors and t-SNE visualization of some patterns

kcimc synonyms + antonyms

CNN sentence classification

Chris Olah Word2Vec + tSNE

paragraph vectors:

text_analytics_on_mpp doc2vec from newsgroups

Harvard NLP

Stanford NLP



  • Common schema for datasets
  • downloading images from google [python] [js]
  • Freely available datasets
  • Extracting/scraping data from the web
  • Data “munging”

Feature extraction and word embeddings

  • Bag of words
  • tf-idf
  • latent Dirichlet allocation (LDA), LSA
  • word2vec, doc2vec

Organizing, retrieving documents

  • document classification
  • clustering and visualizing documents
  • document retrieval, similarity ranking
  • combining with filters
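
Document retrieval by similarity ranking can be sketched as follows, under the assumption that documents are represented as plain word-count vectors and compared by cosine similarity. The helper names (`vectorize`, `rank`) and the sample documents are made up for illustration. In practice the counts would usually be replaced by tf-idf weights or learned embeddings.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a text as a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return (score, doc_index) pairs, most similar document first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(d)), i) for i, d in enumerate(docs)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))

docs = [
    "word vectors capture meaning",
    "topic models group documents",
    "documents can be ranked by similarity",
]
ranking = rank("ranking documents by similarity", docs)
for score, i in ranking:
    print(f"{score:.3f}  {docs[i]}")
```

Note that "ranking" in the query does not match "ranked" in the third document: without stemming or embeddings, only exact token overlap counts toward the score.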

NLP tasks

  • sentiment analysis
  • named entity recognition
  • quote attribution
  • anomaly detection

Speculative NLP tasks

  • fact-checking

    • topic-modeling (tfidf, lsa, lda)
    • document retrieval/similarity/visualization
    • word2vec
    • sentiment analysis
    • skip-thoughts/doc2vec/lda2vec

db extract wikipedia


  • debiasing embeddings

NLP topics

  • semantic hashing for fast document retrieval: use an autoencoder to learn a binary address for each document, use that address as a key in a hash map, and look for similar documents in nearby memory cells (very fast)


  • skipgram retrieval
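
The lookup half of semantic hashing can be sketched without any learning at all. The binary codes below are hypothetical stand-ins for what an autoencoder would produce, and `build_index`, `nearby_addresses` and `lookup` are illustrative names. The point is only to show why retrieval is fast: finding neighbours means probing a handful of addresses that differ from the query in a few bits, rather than scanning every document.

```python
from collections import defaultdict
from itertools import combinations

N_BITS = 8  # code length for this toy example; real codes would be longer

def build_index(codes):
    """Map each binary code (an int used as a memory address) to doc ids."""
    index = defaultdict(list)
    for doc_id, code in codes.items():
        index[code].append(doc_id)
    return index

def nearby_addresses(code, radius, n_bits=N_BITS):
    """Yield every address within the given Hamming distance of code."""
    yield code
    for r in range(1, radius + 1):
        for bits in combinations(range(n_bits), r):
            flipped = code
            for b in bits:
                flipped ^= 1 << b  # flip bit b
            yield flipped

def lookup(index, query_code, radius=1):
    """Collect documents stored at the query address or in nearby cells."""
    found = []
    for address in nearby_addresses(query_code, radius):
        found.extend(index.get(address, []))
    return found

# Hypothetical codes; in semantic hashing an autoencoder would produce them.
codes = {
    "doc_a": 0b10110010,
    "doc_b": 0b10110011,  # differs from doc_a in one bit
    "doc_c": 0b01001101,  # differs from doc_a in every bit
}
index = build_index(codes)
hits = lookup(index, 0b10110010, radius=1)
print(hits)
```

The number of probed addresses grows quickly with the radius (it is a sum of binomial coefficients), so this approach relies on similar documents landing within a small Hamming distance of each other.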



dimensionality reduction

  • explanation of manifolds
  • PCA + SVD
  • eigenfaces

text representations

  • tf-idf + bag-of-words
  • lsa/lda

applications of text representations

  • document retrieval/similarity
  • document clustering/visualization
  • topic modeling

word vectors

paragraph vectors / skip-thoughts

  • nearest skip-thought retrieval
  • next skip-thought prediction

NLP tasks

  • named entity recognition
  • POS tagging
  • sentiment analysis
  • translation
  • summarization
  • speculative nlp tasks
  • stylometry / deanonymization


  • semantic hashing + fast retrieval
  • summarization (TextSum)
  • document retrieval
    • tf-idf
    • lsa/lda
  • document clustering, organization, visualization
    • unsupervised: t-SNE
    • topic modeling
    • classification: neural net
  • word2vec + t-SNE
  • skip-thought vectors


  • gensim, sklearn

sebastian ruder blog:

  • others