Introduction to Gensim in NLP
Overview
Natural Language Processing (NLP) helps computers interpret and analyze human language. Python offers a number of packages and libraries for building NLP applications. Gensim is one of them, and it is the subject of this article.
Introduction
Data scientists work with text data in many ways to extract meaningful results. Techniques such as word2vec, doc2vec, topic modeling, and TF-IDF make it easier to train models on text data. These techniques play a significant role in NLP applications, so we need a Python library that implements them efficiently. Gensim is such a library.
What is a Corpus?
Before we go deeper into Gensim, we need to understand the term corpus. A corpus (plural: corpora) is a large, structured collection of machine-readable text or transcribed speech, produced by native speakers of a language and organized into datasets. In practice, the text data we obtain after preprocessing the raw data, and then feed into our machine learning models, is the corpus.
What is Gensim?
Gensim is an open-source Python package for natural language processing used mainly for unsupervised topic modeling. It uses state-of-the-art academic models and modern statistical machine learning to perform complex NLP tasks.
Advantages
- For text processing, Gensim is often more convenient than general-purpose packages such as scikit-learn.
- It can handle large text files without even loading the whole file into the memory.
- Since it uses unsupervised models, Gensim does not require documents to be tagged or labeled.
Features
- Robustness: Gensim is used in various systems over a wide range of applications.
- Scalability: Gensim is highly scalable because it uses incremental, online training algorithms. The text data does not need to reside fully in Random Access Memory (RAM), so the memory the algorithms need is independent of the corpus size.
- Platform agnostic: Gensim is written in Python, so it runs on a variety of operating systems such as Windows, Linux, and other Unix-like systems.
How to Install Gensim?
Installing Gensim is quick and easy. We can install the library with either pip or conda:
- pip: pip install --upgrade gensim
- conda: conda install -c conda-forge gensim
Hands-on with Gensim
Creating a Dictionary with Gensim
When working with text documents, Gensim maps each word (token) to a unique integer id, so it needs a dictionary data structure. We can create a dictionary from a list of sentences, a single text file, or multiple text files.
1. List of sentences
We have to convert the text data into a list of tokens before creating the dictionary.
Implementation Example:
First, import the necessary libraries and packages.
Load the data
Convert the text into tokens
The following script creates the dictionary:
Output:
2. Single Text File
We read the file one line at a time and tokenize each line with simple_preprocess, so the whole file never has to be loaded into memory. The resulting token lists are then used to build the dictionary. For the code implementation, the sentences from the example above are saved in a text file named "text.txt".
Code Implementation:
Output:
3. Multiple text files
The code for multiple text files is very similar. For this example, we take the three sentences from the example above and save them in three text files: "A.txt", "B.txt", and "C.txt". To handle multiple files, we create a helper that iterates through all the files in a directory and yields their tokens.
We then call Dictionary on this helper, which is initialized with the path of the directory.
Output:
Creating a bag-of-words
In Gensim, a bag-of-words corpus stores the id and frequency of every word in each document. Creating a bag of words (BOW) is quick and simple: we use the Dictionary.doc2bow() function. Let's see how to create a bag of words from a text file. For the following example, we use the same text file that we used while creating the dictionary.
Import the libraries
Tokenize and call the required function
Output: the token ids with their corresponding frequencies.
Saving and Loading a Gensim Dictionary and BOW
Gensim provides its own save() and load() methods to save and load back a dictionary, as well as a serialize() method on its corpus classes (for example, MmCorpus) to store a bag of words on disk.
Save a dictionary
Load a dictionary
Save a Bag of Words
Creating TF-IDF
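TF-IDF (term frequency-inverse document frequency) builds on the bag-of-words representation. The difference is that it down-weights words that appear in many documents of the corpus, so common words contribute less than distinctive ones. We use TfidfModel from Gensim's models module.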
Code Implementation
Output:
Creating Bigrams and Trigrams
Bigrams are pairs of words that frequently occur together in the corpus and carry meaning as a unit, whereas trigrams are such groups of three words. Both are essential in the linguistic analysis of a corpus. When working with a bag-of-words model, it is often important to form bigrams and trigrams from sentences, because the model otherwise ignores word order.
Code Implementation
For this example, we use the "text8" dataset from Gensim, which consists of the first 100,000,000 bytes of plain text from Wikipedia.
Constructing Bigrams
Output:
Constructing trigrams
Output:
Conclusion
The popularity of Gensim is increasing steadily, and the scalability and robustness of the package contribute to this. The key takeaways from this article are:
- Gensim is an open-source Python package used mainly for unsupervised topic modeling.
- It is scalable, robust, and platform agnostic.
- A dictionary in Gensim is created using corpora.Dictionary().
- Dictionary.doc2bow() creates a bag of words.
- models.TfidfModel creates a TF-IDF model.
- gensim.models.phrases.Phrases helps in creating bigrams and trigrams.