Word Embedding


A guest article by Mona Srivastava

This article explains the concept of word embeddings and how to develop and train a word embedding model for NLP applications in Python using Gensim.

What is word embedding?

Word embedding is a type of word representation that allows words with similar meanings to have similar representations.

It is an improvement over the Bag-of-Words model. In the bag-of-words model, word encoding results in large, sparse vectors that describe the document rather than the meaning of the words.

Word embeddings are in fact a class of techniques where individual words are represented as real-valued vectors.

Each word is mapped to one vector.

The representation is learned from the usage of words.

Words that are used in a similar way end up with similar representations. This implies that words appearing in similar contexts have similar meanings: the usage of a word defines its meaning.

The vector representation of the words provides a projection where words with similar meanings are locally clustered within the space.

Word Embedding Algorithm -

Word embedding provides a real-valued vector* representation for a predefined, fixed-size vocabulary drawn from a corpus of text.

  • *Real-valued vectors — vectors whose components are real numbers.

Techniques to learn word embeddings from text data -

  1. Word2Vec (Google) — i) Continuous Bag of Words, ii) Continuous Skip-Gram
  2. GloVe (Stanford)

Illustration of a 4-dimensional embedding -

An embedding is a dense vector of floating point values.

Sentence — “Tomatoes are red”

A 4-dimensional embedding

Each word above is represented as a 4-dimensional vector of floating point values.

We can think of an embedding as a ‘look-up table’.

Once the weights have been learned, we can encode each word by looking up the corresponding dense vector in the table.

The embedding layer can be understood as a look-up table that maps integer indices, each standing for a specific word, to dense vectors, i.e. their embeddings.

We can change the dimensionality of the embedding according to what works well for us.
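
For illustration only, here is a minimal NumPy sketch of such a look-up table for the sentence above; the vocabulary, the integer indices and the 4-dimensional vector values are invented, not learned:

    import numpy as np

    # Hypothetical vocabulary for "Tomatoes are red";
    # each word gets an integer index into the embedding table.
    word_to_index = {"tomatoes": 0, "are": 1, "red": 2}

    # A made-up 4-dimensional embedding table (one row per word).
    # In a real model these values are learned during training.
    embedding_table = np.array([
        [0.21, -0.53, 0.87, 0.14],   # tomatoes
        [0.05,  0.61, -0.22, 0.33],  # are
        [0.48, -0.11, 0.09, -0.76],  # red
    ])

    # Encoding a word is just a row look-up by its integer index.
    vector_for_red = embedding_table[word_to_index["red"]]
    print(vector_for_red)  # -> [ 0.48 -0.11  0.09 -0.76]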

Develop Word2Vec Embedding -

We can train a set of fixed-length, dense, continuous-valued vectors on a large corpus of text. Each word is represented by a point in the embedding space, and these points are learned and moved around based on the words that surround the target word.

Gensim is a Python library for NLP with a focus on topic modeling. We can use it to implement Word2Vec word embedding and learn new word vectors from text.

There are two main training algorithms that can be used to learn the embedding from text — Continuous Bag of Words and Continuous Skip-Gram.

The algorithm looks at a window of words around each target word; this window provides the context, and in turn the meaning, of the word.

To learn a word embedding from text, we have to load and organize the text into sentences and provide them to the constructor of a new Word2Vec() instance.

Each sentence must be tokenized.
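
For example, the training data can be passed as a list of token lists; the sentences below are only placeholders to show the expected format:

    # Each sentence is a list of word tokens (a list of lists of strings).
    sentences = [
        ["tomatoes", "are", "red"],
        ["apples", "are", "red"],
        ["bananas", "are", "yellow"],
    ]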

The parameters of this constructor are -

  • size (default 100) — The number of dimensions of the embedding, i.e. the length of the dense vector used to represent each token (word).
  • window (default 5) — The maximum distance between a target word and the words around it.
  • min_count (default 5) — The minimum count of words to consider when training the model; words with an occurrence lower than this are ignored.
  • workers (default 3) — The number of threads to use while training. We can increase the number of workers as needed.
  • sg (default 0, i.e. CBOW) — The training algorithm: CBOW (0) or Skip-Gram (1).

We will work on a list of pre-tokenized sentences.

We will set min_count to 1 while training the model so that no words are ignored.

Example

  1. Train the Model
  2. Print Summary of Trained Model
  3. Print Vocabulary.
  4. Print a single vector for the word ‘tomato’.
  5. Save the Model for further use
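
A minimal sketch of these five steps with Gensim is shown below. The training sentences are invented placeholders, and the parameter names follow the Gensim 3.x API described above (Gensim 4.x renames size to vector_size and replaces model.wv.vocab with model.wv.key_to_index):

    from gensim.models import Word2Vec

    # A few pre-tokenized sentences (illustrative placeholders only).
    sentences = [
        ["tomatoes", "are", "red"],
        ["apples", "are", "red"],
        ["bananas", "are", "yellow"],
        ["the", "tomato", "is", "a", "fruit"],
    ]

    # 1. Train the model (min_count=1 so that no words are ignored).
    model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=3, sg=0)

    # 2. Print a summary of the trained model.
    print(model)

    # 3. Print the vocabulary (Gensim 4.x: list(model.wv.key_to_index)).
    words = list(model.wv.vocab)
    print(words)

    # 4. Print the single vector learned for the word 'tomato'.
    print(model.wv["tomato"])

    # 5. Save the model for further use.
    model.save("model.bin")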

Output
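
Running the sketch above prints a one-line model summary (vocabulary size, vector size and learning rate), the list of vocabulary words, and the 100-dimensional vector for ‘tomato’; the exact vector values will differ between runs because the weights are randomly initialized.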

Summary —

We have learnt how to develop and train our own Word2Vec word embedding model on text data.

Cheers!
