#89 Machine Learning & Data Science Challenge 89

What is Doc2Vec?

Paragraph Vector (more popularly known as Doc2Vec) — Distributed Memory (PV-DM)

Paragraph Vector (Doc2Vec) is an extension of Word2Vec: where Word2Vec learns to project words into a latent d-dimensional space, Doc2Vec learns to project an entire document into a latent d-dimensional space.

The basic idea behind PV-DM is inspired by Word2Vec. In the CBOW model of Word2Vec, the model learns to predict a center word based on the context.

  • For example, given the sentence “The cat sat on the table”, the CBOW model would learn to predict the word “sat” from the context words “the”, “cat”, “on”, and “table”.

  • Similarly, in PV-DM the main idea is to randomly sample a window of consecutive words from the paragraph and predict the center word of that window, taking as input both the surrounding context words and the paragraph ID.
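To make the training input concrete, here is a minimal sketch of how (paragraph ID, context words, center word) triples could be built. The function name `pv_dm_samples` and the `window` parameter are illustrative assumptions, and the sketch enumerates every window rather than sampling at random, purely for simplicity.

```python
def pv_dm_samples(paragraphs, window=2):
    """Yield (paragraph_id, context_words, center_word) triples.

    `window` is the number of words taken on each side of the center
    word (an assumed value for this sketch).
    """
    for pid, text in enumerate(paragraphs):
        tokens = text.lower().split()
        for i, center in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            yield pid, left + right, center


samples = list(pv_dm_samples(["The cat sat on the table"], window=2))
print(samples[2])  # (0, ['the', 'cat', 'on', 'the'], 'sat')
```

The only difference from a CBOW training sample is the extra paragraph ID, which lets the model look up a vector shared by every window drawn from the same paragraph.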

Let’s have a look at the model diagram for some more clarity.

  • The model has three sections: the paragraph matrix, the average/concatenate step, and the classifier.

  • Paragraph matrix: It is the matrix where each column represents the vector of a paragraph.

  • Average/Concatenate: This step determines whether the word vectors and the paragraph vector are averaged or concatenated.

  • Classifier: It takes the hidden layer vector (the one that was concatenated/averaged) as input and predicts the center word.
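The three sections above can be sketched as a single forward pass in NumPy. This is an illustration under assumptions, not a reference implementation: the sizes, the random initialisation, the plain softmax classifier, and the names `D`, `W`, `U`, `hidden`, and `predict` are all chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "table"]
word_to_id = {w: i for i, w in enumerate(vocab)}
d, n_paragraphs = 8, 3  # assumed embedding size and paragraph count

D = rng.normal(scale=0.1, size=(d, n_paragraphs))  # one column per paragraph
W = rng.normal(scale=0.1, size=(d, len(vocab)))    # one column per word


def hidden(pid, context, mode="average"):
    """Combine the paragraph vector with the context word vectors."""
    vectors = [D[:, pid]] + [W[:, word_to_id[w]] for w in context]
    if mode == "average":
        return np.mean(vectors, axis=0)
    return np.concatenate(vectors)


def predict(h, U, b):
    """Softmax classifier over the vocabulary."""
    logits = U @ h + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


context = ["the", "cat", "on", "the"]
h_avg = hidden(0, context, "average")      # shape (8,)
h_cat = hidden(0, context, "concatenate")  # shape (40,): 5 vectors of size 8

U_avg = rng.normal(scale=0.1, size=(len(vocab), d))
probs = predict(h_avg, U_avg, np.zeros(len(vocab)))
```

Averaging keeps the hidden vector at size d regardless of context length, while concatenation preserves word order at the cost of a larger classifier input and a fixed context size.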

Matrix D holds the embeddings for “seen” paragraphs (i.e. arbitrary-length documents), learned the same way Word2Vec models learn embeddings for words.

  • For unseen paragraphs, gradient descent is run again (5 or so iterations) to infer a document vector, keeping the word vectors and classifier weights fixed and updating only the new paragraph vector.
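The inference step for an unseen paragraph can be sketched as follows. It assumes the averaging variant with a plain softmax classifier; the random `W` and `U` stand in for trained parameters, and `infer_paragraph_vector`, the learning rate, and the toy samples are hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 5, 8

# Stand-ins for parameters learned during training; they stay frozen here.
W = rng.normal(scale=0.5, size=(d, vocab_size))  # word vectors (columns)
U = rng.normal(scale=0.5, size=(vocab_size, d))  # classifier weights
b = np.zeros(vocab_size)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def loss_and_grad(p_vec, samples):
    """Mean cross-entropy and its gradient w.r.t. the paragraph vector only."""
    loss, grad = 0.0, np.zeros_like(p_vec)
    for context_ids, target_id in samples:
        n = len(context_ids) + 1
        h = (p_vec + W[:, context_ids].sum(axis=1)) / n  # averaged hidden vector
        probs = softmax(U @ h + b)
        loss -= np.log(probs[target_id])
        d_logits = probs.copy()
        d_logits[target_id] -= 1.0
        grad += (U.T @ d_logits) / n  # chain rule through the average
    return loss / len(samples), grad / len(samples)


def infer_paragraph_vector(samples, steps=5, lr=0.5):
    """Run a few gradient descent steps on a fresh paragraph vector."""
    p_vec = rng.normal(scale=0.1, size=d)
    start_loss, _ = loss_and_grad(p_vec, samples)
    for _ in range(steps):
        _, grad = loss_and_grad(p_vec, samples)
        p_vec = p_vec - lr * grad  # W, U and b are never touched
    end_loss, _ = loss_and_grad(p_vec, samples)
    return p_vec, start_loss, end_loss


# (context word ids, center word id) pairs from the unseen paragraph
samples = [([0, 1, 3, 0], 2), ([1, 2, 0, 4], 3)]
W_before, U_before = W.copy(), U.copy()
new_vec, start_loss, end_loss = infer_paragraph_vector(samples)
```

Because only the new vector is updated, inference is cheap and leaves the trained model unchanged. In the gensim library this step is exposed as `Doc2Vec.infer_vector`.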
