LSA – Latent Semantic Analysis
Overview
Latent Semantic Analysis, or LSA, is a technique in Natural Language Processing that uses a matrix factorization approach to identify associations among the words in a document. More precisely, it helps us identify the hidden concepts (or hidden topics) in a given set of documents.
When we need to know how many possible categories there could be in a given corpus of M documents, we can use the Latent Semantic Analysis algorithm to identify those hidden categories or topics.
LSA is an example of an unsupervised learning technique and its mathematical roots lie in Singular Value Decomposition from Linear Algebra.
Introduction
LSA is primarily used for concept searching and automated document categorization. It tries to mimic the human way of sorting words and categorizing documents.
LSA is a popular dimensionality reduction technique that relies on Singular Value Decomposition. LSA ultimately reformulates text data in terms of latent (i.e. hidden) features, where the number of these features should be smaller than both the number of documents and the number of terms in the data.
LSA can use a document-term (or document-word) matrix which describes the occurrences of terms in documents; it is a sparse matrix whose rows correspond to documents and whose columns correspond to terms.
After the construction of the occurrence matrix, LSA finds a low-rank approximation to the document-term matrix using SVD to reveal patterns in the relationships between the terms & concepts and documents & concepts which are contained in an unstructured collection of text.
That's a lot of technical jargon. Let's try to understand LSA intuitively and get comfortable with its principles; then we'll deep dive into its maths and code implementation.
The Intuition Behind LSA
In very simple words - if you have M documents and you do the categorization manually, you might come up with k topics into which you can categorize these documents.
LSA also tries to come up with something close to those k topics algorithmically. Just as identifying k topics manually may not be perfectly accurate, the same goes for the LSA algorithm. The final decision on whether these k topics/categories classify the documents accurately will always be a judgement call, and depends on the use case as well.
Let's have a look at what the raw data and the document-term matrix look like for sample documents belonging to several categories such as Data Science, Religion, Science Fiction Movies, etc. (M documents).
Doc 1: Data science is a concept to unify statistics, data analytics, informatics, and their related methods to understand and analyze actual phenomena with data
Doc 2: Religious practices may include rituals, sermons, commemoration, sacrifices, festivals, feasts, trances, initiations, funerary services, matrimonial services, meditation, prayer, music, art, dance, public service, or other aspects of human culture
Doc 3: Science fiction is a film genre that uses speculative, fictional science-based depictions of phenomena that are not fully accepted by mainstream science, such as extraterrestrial lifeforms, spacecraft, robots, cyborgs, interstellar travel, time travel, or other technologies
Doc 4, Doc 5, Doc 6, . . ., Doc M
Here we already know that doc 1 is about data science, doc 2 is about religion, and doc 3 is about science fiction films.
A sample of the document-term matrix is shown in the image below:
This table represents the word/term frequency in each document.
But we will use a standardized version of this table for the LSA implementation. The standardization is done using TF-IDF vectorization.
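As a quick illustration of this step, here is a minimal sketch of building such a TF-IDF document-term matrix for the three sample documents with scikit-learn. The variable names and preprocessing choices here are illustrative, not taken from the original implementation:

```python
# A minimal sketch: build a TF-IDF document-term matrix with scikit-learn.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Data science is a concept to unify statistics, data analytics, informatics ...",
    "Religious practices may include rituals, sermons, commemoration, sacrifices ...",
    "Science fiction is a film genre that uses speculative, fictional science-based depictions ...",
]

vectorizer = TfidfVectorizer(stop_words="english")
A = vectorizer.fit_transform(documents)   # sparse matrix of shape (M documents x n terms)

# View it as a document-term table: rows = documents, columns = terms
# (get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names)
dtm = pd.DataFrame(A.toarray(),
                   columns=vectorizer.get_feature_names_out(),
                   index=["doc1", "doc2", "doc3"])
print(dtm.round(2))
```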
Below is the standardized version of this table:
The final document-term matrix could look something like this:
Here, in the cells, we would have different numbers that indicate how strongly a document belongs to a particular topic.
1. A document could strongly belong to a certain category
- This will be represented in a matrix called the document-category matrix (U), also called the document-topic matrix
- A higher cell value indicates that the document is better explained by that category
- For our sample documents, this matrix could look like this:
e.g. doc 1 strongly belongs to Category 1; similarly, doc 3 strongly belongs to Category 4
2. A term/word could also strongly belong to a certain category
- This will be represented in a matrix called the term-category matrix (V), also called the term-topic matrix
- A higher cell value indicates that the term is better explained by that category
- For our sample documents, this matrix could look like this:
e.g. the word/term "spacecraft" is more likely to fall into the "sci-fi" category, therefore its cell value will be higher for the "sci-fi" category
3. How strongly each topic/category represents or explains the data
- This set of information is represented in the form of numbers
These values tell us the relative importance of the hidden topics within our text data. Here we can see that "data science" has a higher weightage, meaning this hidden category explains our text very well.
The technical name for this array of numbers is the “singular values”.
In matrix notation (call it the Sigma or S matrix) it is represented as a diagonal matrix, e.g.:
If we look closely, all three of these matrices (U, V, and S) have one thing in common - hidden topics or hidden categories.
LSA tries to figure out these matrices through SVD.
LSA Matrix Decomposition Intuition
If we call:
- Document-term matrix as A (size m x n)
- Document-category matrix as U (size m x k)
- Term-category matrix as V (size n x k)
- Singular value matrix as S (size k x k)
Then we can see that our original matrix A can be written as a product of U, S, and V.
To follow the rules of matrix multiplication, we have to use the transposed version of V, so we can write:
A (m x n) = U (m x k) * S (k x k) * Vᵀ (k x n)
These matrices U, S, and V can be easily found by Singular Value Decomposition (SVD).
If we look closely, the value of k can ideally range from 1 to n, where "n" is the total number of unique terms across all the documents.
E.g. if we have 20,000 unique terms in total, we could choose to have 20,000 hidden topics in our data, but that may not be a reasonable approach. We can choose any value of "k" that seems reasonable.
Let's take a scenario where we have 100 documents that contain 20,000 unique words/terms,
And let's say we choose k = 3 (number of hidden topics that could explain these documents well),
Then the sizes of the above matrices would be:
A (100 x 20,000) = U (100 x 3) * S (3 x 3) * Vᵀ (3 x 20,000)
So basically, we have reduced the dimensions of the right-hand-side matrices. That's why LSA and SVD are also dimensionality reduction techniques; this is also called the truncated version of SVD.
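A small NumPy sketch of this truncated decomposition, using a random matrix as a stand-in for a real document-term matrix and the shapes from the example above, could look like this:

```python
# Truncated SVD sketch with the example shapes: A is (100 x 20,000), keep k = 3 topics.
import numpy as np

m, n, k = 100, 20_000, 3
A = np.random.rand(m, n)                            # placeholder document-term matrix

U, s, VT = np.linalg.svd(A, full_matrices=False)    # economy SVD: U (m x m), s (m,), VT (m x n)

# Keep only the top-k singular values/vectors -> truncated SVD
U_k, S_k, VT_k = U[:, :k], np.diag(s[:k]), VT[:k, :]
print(U_k.shape, S_k.shape, VT_k.shape)             # (100, 3) (3, 3) (3, 20000)

A_approx = U_k @ S_k @ VT_k                         # low-rank approximation of A
```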
So technically or mathematically LSA uses a low-rank approximation to the document-term matrix to remove irrelevant information, extract more important relations, and reduce the computational time.
The Math
Reader's Note: You can absolutely SKIP this section if you are not comfortable with Linear Algebra concepts, specifically eigenvectors and linear transformations, and jump directly to the coding implementation of LSA.
Now we know that LSA uses SVD for the matrix decomposition. Let's explore a few important aspects of SVD mathematically.
Before we deep dive into a little mathematics of SVD I would highly recommend going through the Essence of Linear Algebra Playlist on Youtube by 3Blue1Brown. This would give you a geometric understanding of matrices, determinants, eigen-stuffs, and more.
Also this Linear Algebra course by Khan Academy would make your linear algebra concepts stronger. The SVD maths is easier to follow if your basics are clear.
SVD Maths
If A is a square n x n matrix, it can be decomposed into its eigenvectors and eigenvalues.
We can write A as:
A = Q Λ Q⁻¹
Where:
- Q is the n x n matrix whose columns are the eigenvectors of A, and Λ is the diagonal matrix whose entries are the corresponding eigenvalues
- This is possible only if A is a square matrix and A has n linearly independent eigenvectors
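As a quick sanity check, here is a small NumPy example of this eigendecomposition; the 2 x 2 matrix is arbitrary, chosen only because it has linearly independent eigenvectors:

```python
# Verify A = Q Λ Q⁻¹ for a small matrix with distinct eigenvalues.
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, Q = np.linalg.eig(A)          # columns of Q are the eigenvectors of A
Lam = np.diag(eigvals)                 # Λ: diagonal matrix of eigenvalues

reconstructed = Q @ Lam @ np.linalg.inv(Q)
print(np.allclose(reconstructed, A))   # True: A = Q Λ Q⁻¹
```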
- Consider any m × n matrix A. The matrices AAᵀ and AᵀA will be square matrices of size (m x m) and (n x n) respectively. These matrices have very special properties:
- Both of them are square and symmetric, and their eigenvalues are zero or positive (they are called positive semidefinite)
- Since they are symmetric, we can leverage the properties of symmetric matrices and find their eigenvectors to be orthonormal, i.e. perpendicular to each other with unit length
- The eigenvectors of the matrices AAᵀ and AᵀA are called singular vectors in linear algebra.
Let's represent the set of eigenvectors of AAᵀ as u and the set of eigenvectors of AᵀA as v.
Both of these matrices have the same positive eigenvalues. The square roots of these eigenvalues are called singular values.
We collect these eigenvectors into the matrices U (containing the m eigenvectors of AAᵀ) and V (containing the n eigenvectors of AᵀA).
According to the SVD, any matrix A of size (m x n) can be factorized as:
A (m x n) = U (m x m) * S (m x n) * Vᵀ (n x n)
- Where U and V are orthogonal matrices with orthonormal eigenvectors chosen from AAᵀ and AᵀA respectively.
- S is a diagonal matrix with r elements equal to the square roots of the positive eigenvalues of AAᵀ or AᵀA (both matrices have the same positive eigenvalues).
- The diagonal elements of the S matrix are composed of the singular values σi, which are equal to the square roots of the eigenvalues λi:
σi = sqrt(λi)
- The eigenvectors inside matrices U and V are ordered according to their corresponding eigenvalues (highest to lowest).
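A quick NumPy check of this relationship, using a small random matrix, could look like this:

```python
# The singular values of A equal the square roots of the (positive) eigenvalues of AᵀA,
# sorted from highest to lowest.
import numpy as np

A = np.random.rand(5, 3)

singular_values = np.linalg.svd(A, compute_uv=False)

eigvals = np.linalg.eigvalsh(A.T @ A)       # eigenvalues of the symmetric matrix AᵀA
eigvals_sorted = np.sort(eigvals)[::-1]     # highest to lowest

print(np.allclose(singular_values, np.sqrt(eigvals_sorted)))   # True (up to numerical precision)
```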
Now we can easily see that if A is our original data matrix, we can write A as a sum of rank-1 matrices, each formed by multiplying a singular value with the outer product of the corresponding eigenvectors:
A = σ1·u1·v1ᵀ + σ2·u2·v2ᵀ + ... + σr·ur·vrᵀ
- Previously we had the tall U, the square S, and the long V-transpose matrices.
- Here we are taking the first vertical slice from U (the first eigenvector of AAᵀ), weighting (multiplying) all its values by the first singular value, and then taking an outer product with the first horizontal slice of V-transpose (the first eigenvector of AᵀA), creating a new matrix with the dimensions of those slices.
- Then we add those products together and we get the original data matrix A.
- If we don't do the full sum but only complete it partially, we get the truncated version (i.e. up to "k" concepts or categories).
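The same idea expressed in a few lines of NumPy (again with an arbitrary small matrix):

```python
# Adding up σi · ui · viᵀ over all i reconstructs A exactly;
# stopping at i = k gives the truncated approximation.
import numpy as np

A = np.random.rand(6, 4)
U, s, VT = np.linalg.svd(A, full_matrices=False)

A_rebuilt = np.zeros_like(A)
for i in range(len(s)):
    A_rebuilt += s[i] * np.outer(U[:, i], VT[i, :])   # outer product of the i-th slices

print(np.allclose(A_rebuilt, A))     # True: the full sum recovers A

k = 2
A_truncated = sum(s[i] * np.outer(U[:, i], VT[i, :]) for i in range(k))  # partial sum
```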
LSA Code Implementation
Let's implement LSA using text documents from news articles.
Expected outcome from the algorithm
We will extract important keywords for each hidden topic or concept from the news articles and interpret whether these keywords make sense in explaining the hidden concept.
Dataset Explanation
- We will use the raw data collected from news articles.
- This data set is a collection of approximately 20,000 newsgroup documents, partitioned almost evenly across 20 different newsgroups.
- This 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
More details about the data can be found here: http://qwone.com/~jason/20Newsgroups/
Python Code
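The original notebook code is not reproduced here, so below is a sketch of one possible implementation with scikit-learn, matching the setup described above: TF-IDF vectorization of the 20 Newsgroups training split (roughly 11k articles) followed by a truncated SVD with 20 hidden topics. Preprocessing choices such as max_features and max_df are assumptions, so the exact keywords you get may differ slightly from those listed below.

```python
# A sketch of an LSA pipeline on the 20 Newsgroups data, assuming scikit-learn.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Load ~11k news articles (headers/footers/quotes removed to keep only the body text)
newsgroups = fetch_20newsgroups(subset="train",
                                remove=("headers", "footers", "quotes"))
documents = newsgroups.data

# Step 1: build the TF-IDF document-term matrix
vectorizer = TfidfVectorizer(stop_words="english", max_features=10_000,
                             max_df=0.5, smooth_idf=True)
A = vectorizer.fit_transform(documents)          # shape: (n_documents, n_terms)

# Step 2: truncated SVD with k = 20 hidden topics (this is the LSA step)
svd = TruncatedSVD(n_components=20, random_state=42)
doc_topic = svd.fit_transform(A)                 # U * S : document-topic matrix
term_topic = svd.components_                     # Vᵀ    : topic-term matrix

# Step 3: print the top keywords for each hidden topic
terms = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(term_topic):
    top_terms = [terms[i] for i in topic.argsort()[::-1][:10]]
    print(f"Topic {topic_idx + 1}: {', '.join(top_terms)}")
```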
Output Interpretation
Let's explore a few keywords from 2 categories identified by the algorithm
Keywords from Category 1
Topic 1:
thanks
windows
card
drive
file
advance
files
software
program
These keywords look like they belong to the Microsoft Windows category
Topic 2:
game
team
year
games
season
players
good
play
hockey
league
These keywords look like they belong to the sports - hockey category
Applications for LSA
Latent Semantic Analysis (LSA) is a topic modeling technique that uncovers latent topics and has become a very useful tool when working on NLP problem statements.
Major Applications of LSA include:
- LSA can be used for dimensionality reduction. We can reduce the vector size drastically, from millions to thousands, without losing much context or information. As a result, it reduces the computation power and the time taken to perform the computation.
- LSA can also be used for document clustering. Since LSA assigns a topic to each document, we can cluster the documents based on the assigned topics.
- LSA is used in search engine optimization, where Latent Semantic Indexing (LSI) is a term often used in place of Latent Semantic Analysis.
- LSA is also used in Information Retrieval: given a query of terms, we translate it into the low-dimensional topic space and find matching documents, as sketched below.
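As an illustration of the information-retrieval use case, here is a small sketch that reuses the fitted vectorizer, svd, and doc_topic objects from the implementation sketch above; the query string is just an example:

```python
# Fold a query into the LSA topic space and rank documents by cosine similarity.
# Assumes `vectorizer`, `svd`, and `doc_topic` from the earlier sketch are in scope.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

query = "graphics card drivers for windows"   # example query

# Project the query into the same low-dimensional topic space as the documents
query_topic = svd.transform(vectorizer.transform([query]))    # shape: (1, 20)

# Rank documents by cosine similarity in topic space
similarities = cosine_similarity(query_topic, doc_topic).ravel()
top_matches = np.argsort(similarities)[::-1][:5]
for idx in top_matches:
    print(f"doc {idx}  (similarity {similarities[idx]:.2f})")
```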
Conclusion
- In this article, we discussed an interesting topic modeling approach called Latent Semantic Analysis (LSA)
- We discussed LSA conceptually and understood the important theory behind the working principles of LSA
- The mathematical roots of LSA lie in Linear Algebra, and the algorithm uses Singular Value Decomposition (SVD) for matrix factorization
- We implemented LSA using Python code to uncover 20 hidden topics from around 11k news articles
- Lastly, we discussed the Major Applications of LSA in Natural Language Processing and other Machine Learning related tasks