Introduction to Gensim in NLP

Overview

Natural Language Processing (NLP) helps computers interpret and analyze human language. Python offers a number of packages and libraries for implementing NLP applications. One such library is Gensim, the subject of this article.

Introduction

Data scientists work with text data in various ways to get meaningful results. Techniques such as word2vec, doc2vec, topic modeling, and TF-IDF make it easier to train models on text data. These techniques play a significant role in Natural Language Processing applications, so we need a Python library that handles them efficiently. Gensim is such a library.

What is a Corpus?

Before we go deep into the world of Gensim, we need to learn about the term corpus. A corpus (plural: corpora) is a large, structured set of machine-readable texts, written or spoken by native speakers of a language and organized into datasets. In other words, the corpus is the text data we obtain after preprocessing the raw data, ready to be fed into our machine learning models. In natural language processing, a corpus contains the text or audio data used to train ML models.

What is Gensim?

Gensim is an open-source Python package for natural language processing used mainly for unsupervised topic modeling. It uses state-of-the-art academic models and modern statistical machine learning to perform complex NLP tasks.

Advantages

For text processing, Gensim is often more convenient than general-purpose packages such as scikit-learn.
It can handle large text files without loading the whole file into memory.
Since it uses unsupervised models, Gensim does not require documents to be tagged.

Features

Robustness: Gensim is used in various systems over a wide range of applications.
Scalability: Gensim is highly scalable. It uses incremental online training algorithms, so large text data does not need to reside fully in Random Access Memory (RAM). As a result, the memory used by the algorithms is independent of the corpus size.
Platform Agnostic: Gensim is written in Python, so it can be used on a variety of operating systems such as Windows, Unix, and Linux.

How to Install Gensim?

Installing Gensim is quick and easy. We can install the library with either pip or conda.

With pip, run: pip install --upgrade gensim

With conda, run: conda install -c conda-forge gensim

Hands-on with Gensim

Creating a Dictionary with Gensim

While working with text documents, Gensim maps each word or token to a unique number or id. Hence a dictionary data structure is required. We can create a dictionary from a list of sentences, a text file, or multiple text files.

1. List of sentences

We have to convert the text data to a list of tokens before creating the dictionary.

Implementation example:

First, import the necessary libraries and packages.

Load the data

Convert the text into tokens

The following script creates the dictionary:

Output:

2. Single text file

Gensim's simple_preprocess function tokenizes the text file one line at a time, and the resulting tokens are converted to a dictionary. For the code implementation, the sentences from the example above are saved in a text file named "text.txt".

Code Implementation:

Output:

3. Multiple text files

The code implementation for multiple text files is very similar. For this example, we take the three sentences from the example above and save them in three text files: "A.txt", "B.txt", and "C.txt". To handle multiple files, we create a helper that iterates through all the files in a directory and yields their tokens.

We then call Dictionary on that iterator, which is given the path of the directory.

Output:

Creating a bag-of-words

In Gensim, a bag-of-words corpus stores, for every document, the id and frequency of each word it contains. Creating a Bag of Words (BOW) is quick and simple: we use the Dictionary.doc2bow() function. Let's see how to create a bag of words from a text file. For the following example, we use the same text file used while creating the dictionary.

Import the libraries

Tokenize and call the required function

Output: the corresponding token ids with their frequencies.

Saving and Loading a Gensim Dictionary and BOW

Gensim provides its own save() and load() methods to save a dictionary and load it back, as well as the serialize() method of its corpus classes (for example, MmCorpus) to save a bag of words.

Save a dictionary

Load a Dictionary

Save a Bag of Words

Creating TF-IDF

TF-IDF (term frequency-inverse document frequency) is analogous to a bag of words. The only difference is that it down-weights words that occur in many documents of the corpus. We use TfidfModel from the gensim.models module.

Code Implementation

Output:

Creating Bigrams and Trigrams

A bigram is a pair of words that frequently occur together in the corpus and carry meaning as a unit, whereas a trigram is such a group of three words. Both are essential in the linguistic analysis of the corpus. While working with a bag-of-words model, it is quite important to form bigrams and trigrams from sentences.

Code Implementation

For this example, we use the "text8" dataset, available through Gensim's downloader API, which is the first 100,000,000 bytes of plain text from Wikipedia.

Constructing Bigrams

Output:

Constructing trigrams

Output:

Conclusion

The popularity of Gensim is increasing steadily, helped by the scalability and robustness of the package. The key takeaways from this article are:

Gensim is an open-source Python package used mainly for unsupervised topic modeling.
It is scalable, robust, and platform agnostic.
A dictionary in Gensim is created using corpora.Dictionary().
Dictionary.doc2bow() helps in creating a bag of words.
models.TfidfModel creates a TF-IDF model.
gensim.models.phrases.Phrases helps in creating bigrams and trigrams.