Quote
“The trophy doesn’t fit in the suitcase because it is too big.” - Geoff Hinton
“The trophy doesn’t fit in the suitcase because it is too small.” - Geoff Hinton
In trying to get algorithms to make sense of the ambiguity of human language, we begin to appreciate just how much we take for granted. We hardly notice the tiny feats of disambiguation our brains perform when reading sentences like the ones above.
“You shall know a word by the company it keeps.” - J.R. Firth
What kinds of language-oriented tasks might we be interested in?
This chapter is about applications of machine learning to natural language processing (NLP). Like machine learning, NLP is a nebulous term with several precise definitions, most of which have something to do with making sense of text. This chapter will take a broad view of NLP.
“Deep Learning waves have lapped at the shores of computational linguistics for several years now, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences.” -Dr. Christopher D. Manning, Dec 2015
- check manning pdf statement
quote/link from: http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning/
word vectors
Word vectors are representations of words in which geometric relationships between words are preserved in the embedding space; the canonical example is the analogy king - man + woman ≈ queen.
Cover tf-idf in detail (link to it from the t-SNE chapter), and link to the t-SNE chapter from here.
Since then, a number of writings have tried to interpret these word vectors, e.g. analyses of how the gender binary is encoded in embeddings.
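A minimal sketch of this analogy arithmetic with gensim; the pretrained vectors file ("vectors.bin") is a placeholder, any word2vec-format file would do:

```python
# Sketch: word-vector analogy arithmetic with gensim.
# "vectors.bin" is a placeholder for any pretrained word2vec-format file.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# king - man + woman should land near "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# "the company it keeps": words whose vectors sit nearest a given word
print(wv.most_similar("suitcase", topn=5))
```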
tf-idf examples -> t-SNE examples
aparrish generative poetry
word2vec
- analogies
- kcimc antonyms
- rejecting gender binary
tf-idf -> t-SNE
LSA + LDA -> t-SNE
RNNs annotating?
word2vec chapter
- anything2vec http://www.lab41.org/anything2vec/
captioning
- NeuralTalk and Walk
Mario RNN
attention + DRAW
https://www.youtube.com/watch?v=XG-dwZMc7Ng
the trophy can’t fit into the suitcase because it’s too big (it = trophy)
the trophy can’t fit into the suitcase because it’s too small (it = suitcase)
colah word2vec http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
metal + NLP https://www.reddit.com/r/MachineLearning/comments/4r1np7/heavy_metal_and_natural_language_processing_part_1/?utm_source=twitterfeed&utm_medium=twitter
https://github.com/explosion/spaCy/tree/master/examples/keras_parikh_entailment
http://sebastianruder.com/secret-word2vec/index.html
https://civisanalytics.com/blog/data-science/2016/09/22/neural-network-visualization/
lda2vec Chris Moody hybrid word2vec and LDA http://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=38&lambda=1&term=
historical word embeddings http://nlp.stanford.edu/projects/histwords/
textsum https://github.com/tensorflow/models/tree/master/textsum
http://wiki.dbpedia.org/Datasets/NLP
https://datahub.io/dataset?tags=nlp
doc2vec http://nbviewer.jupyter.org/github/fbkarsdorp/doc2vec/blob/master/doc2vec.ipynb
Language modeling a billion words http://torch.ch/blog/2016/07/25/nce.html
demystifying word2vec https://buss_jan.gitbooks.io/word2vec/content/chapter2.html https://github.com/facebookresearch/fastText
keras word2vec https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
hotel reviews: https://blog.monkeylearn.com/machine-learning-1m-hotel-reviews-finds-interesting-insights/
https://github.com/thoppe/transorthogonal-linguistics
rejecting gender binary: http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html
Voynich Manuscript: word vectors and t-SNE visualization of some patterns blog.christianperone.com/2016/01/voynich-manuscript-word-vectors-and-t-sne-visualization-of-some-patterns/
kcimc synonyms + antonyms https://gist.github.com/kylemcdonald/3463caf86ffca5c950c2 https://gist.github.com/kylemcdonald/9bedafead69145875b8c#file-_tsne-pdf
CNN sentence classification https://github.com/yoonkim/CNN_sentence
Chris Olah Word2Vec + tSNE http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
paragraph vectors: https://arxiv.org/pdf/1507.07998.pdf
text_analytics_on_mpp doc2vec from newsgroups https://github.com/vatsan/text_analytics_on_mpp/blob/master/neural_language_models/01_news_groups_doc2vec.ipynb
Harvard NLP https://github.com/harvardnlp Stanford NLP
https://lamyiowce.github.io/word2viz/
===========
Datasets
- Common schema for datasets
- downloading images from google [python] [js]
- Freely available datasets
- Extracting/scraping data from the web
- Data “munging”
Feature extraction and word embeddings
- Bag of words
- tf-idf (see the sketch after this list)
- latent Dirichlet allocation (LDA), latent semantic analysis (LSA)
- word2vec, doc2vec
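As a concrete reference for the bag-of-words and tf-idf items above, a minimal scikit-learn sketch (toy documents made up here):

```python
# Toy bag-of-words and tf-idf representations with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the trophy does not fit in the suitcase",
    "the suitcase is too small",
    "the trophy is too big",
]

# bag of words: raw term counts per document
bow = CountVectorizer()
counts = bow.fit_transform(docs)        # sparse (n_docs x n_terms) matrix
print(bow.get_feature_names_out())
print(counts.toarray())

# tf-idf: counts reweighted so terms common to all documents count for less
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))
```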
Organizing, retrieving documents
- document classification
- clustering and visualizing documents
- document retrieval, similarity ranking (see the sketch after this list)
- combining with filters
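A rough sketch of tf-idf-based document retrieval and similarity ranking (cosine similarity between a query and each document); the corpus and query here are placeholders:

```python
# Rank documents by cosine similarity to a query in tf-idf space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "machine learning applied to natural language",
    "a recipe for tomato soup",
    "word vectors capture relationships between words",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

query = "word embeddings and language"
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors)[0]
for i in scores.argsort()[::-1]:        # best match first
    print(round(scores[i], 3), corpus[i])
```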
NLP tasks
- sentiment analysis
- named entity recognition (see the spaCy sketch after this list)
- quote attribution
- anomaly detection
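For the named entity recognition item above, a minimal sketch with spaCy (assumes a recent spaCy and that the small English model has been downloaded with `python -m spacy download en_core_web_sm`):

```python
# Named entity recognition and part-of-speech tags with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Geoff Hinton gave a talk in Toronto in December 2015.")

for ent in doc.ents:
    print(ent.text, ent.label_)     # e.g. PERSON, GPE, DATE

for token in doc:
    print(token.text, token.pos_)   # coarse part-of-speech tags
```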
Speculative NLP tasks
- fact-checking (https://fullfact.org/blog/2016/aug/automated-factchecking/)
- topic modeling (tf-idf, LSA, LDA) (see the LDA sketch after this list)
- document retrieval/similarity/visualization
- word2vec
- sentiment analysis
- skip-thoughts/doc2vec/lda2vec
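A minimal gensim sketch of LDA topic modeling, as referenced in the topic-modeling item above (toy corpus, tiny number of topics):

```python
# Toy LDA topic model with gensim.
from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["trophy", "suitcase", "big", "small"],
    ["word", "vectors", "embedding", "space"],
    ["stock", "market", "prices", "fell"],
]

dictionary = corpora.Dictionary(texts)              # term <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]     # bag-of-words per document

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```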
https://www.quora.com/What-are-good-resources-tutorials-to-learn-Keras-deep-learning-library-in-Python
http://u.cs.biu.ac.il/~yogo/nnlp.pdf
http://rare-technologies.com/making-sense-of-word2vec/
http://lxmls.it.pt/2014/socher-lxmls.pdf
http://nlp.stanford.edu/courses/NAACL2013/NAACL2013-Socher-Manning-DeepLearning.pdf
http://nlp.stanford.edu/~socherr/SocherBengioManning-DeepLearning-ACL2012-20120707-NoMargin.pdf
https://github.com/jtoy/awesome-tensorflow/
db:
- http://www-nlp.stanford.edu/links/statnlp.html
- https://datahub.io/dataset?tags=nlp
- http://wiki.dbpedia.org/Datasets/NLP
extract wikipedia: https://github.com/bwbaugh/wikipedia-extractor
links
- debiasing embeddings http://arxiv.org/pdf/1607.06520.pdf
- https://github.com/facebookresearch/fastText
- http://nlp.stanford.edu/projects/glove/
NLP topics
- semantic hashing for fast document retrieval: use an autoencoder to learn binary addresses for documents, then use each address as a key into a hash map and look up similar documents in nearby memory cells (very fast); see the sketch below
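A toy sketch of the semantic hashing idea above (not a faithful reproduction of the original paper): a small autoencoder with a sigmoid bottleneck learns codes that are thresholded into binary addresses, and retrieval checks buckets within a small Hamming distance. Keras and the toy corpus here are assumptions.

```python
# Toy semantic hashing: autoencoder bottleneck -> binary address -> hash buckets.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow import keras

docs = ["cats sit on mats", "dogs chase cats", "markets fell sharply", "stocks and markets"]
X = TfidfVectorizer().fit_transform(docs).toarray().astype("float32")

n_bits = 8                                                          # length of the binary address
inputs = keras.Input(shape=(X.shape[1],))
code = keras.layers.Dense(n_bits, activation="sigmoid")(inputs)     # bottleneck, pushed toward 0/1
outputs = keras.layers.Dense(X.shape[1])(code)
autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, code)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=200, verbose=0)

# threshold the bottleneck activations into binary addresses, bucket documents by address
addresses = (encoder.predict(X) > 0.5).astype(int)
buckets = {}
for i, bits in enumerate(map(tuple, addresses)):
    buckets.setdefault(bits, []).append(i)

def retrieve(query_bits, max_hamming=1):
    """Return documents whose address is within max_hamming bits of the query address."""
    hits = []
    for bits, ids in buckets.items():
        if sum(a != b for a, b in zip(bits, query_bits)) <= max_hamming:
            hits.extend(ids)
    return hits

print(retrieve(tuple(addresses[0])))
```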
ideas
- skipgram retrieval
syntaxNet https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
Syllabus
dimensionality reduction
- explanation of manifolds
- PCA + SVD
- eigenfaces
text representations
- tf-idf + bag-of-words
- lsa/lda
applications of text representations
- document retrieval/similarity
- document clustering/visualization
- topic modeling
word vectors
paragraph vectors / skip-thoughts
- nearest skip-thought retrieval
- next skip-thought prediction
NLP tasks
- named entity recognition
- POS tagging
- sentiment analysis
- translation
- summarization
- speculative nlp tasks
- stylometry / deanonymization
etc
- semantic hashing + fast retrieval
- summarization: TextSum
NOTEBOOKS
- document retrieval
- tf-idf
- lsa/lda
- document clustering, organization, visualization
- unsupervised: t-SNE
- topic modeling
- classification: neural net
- word2vec + t-SNE (see the sketch after this list)
- skip-thought vectors
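For the word2vec + t-SNE notebook above, a rough sketch of projecting pretrained word vectors to 2-D with scikit-learn's t-SNE and plotting them; the vectors file and word list are placeholders:

```python
# Project a handful of word vectors to 2-D with t-SNE and scatter-plot them.
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # placeholder path
words = ["king", "queen", "man", "woman", "paris", "france", "rome", "italy"]
vectors = np.array([wv[w] for w in words])

coords = TSNE(n_components=2, perplexity=3, init="random").fit_transform(vectors)

plt.figure(figsize=(6, 6))
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()
```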
SOFTWARE
- gensim, sklearn
sebastian ruder blog: http://sebastianruder.com/word-embeddings-softmax/index.html#hierarchicalsoftmax http://sebastianruder.com/word-embeddings-1/index.html
- others
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
https://nlp.stanford.edu/projects/histwords/ https://github.com/williamleif/histwords
https://www.youtube.com/playlist?list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6
https://juliasilge.com/blog/gender-pronouns/
http://ruder.io/optimizing-gradient-descent/index.html#adam
https://code.facebook.com/posts/1978007565818999/a-novel-approach-to-neural-machine-translation/
https://github.com/oxford-cs-deepnlp-2017/lectures