NLP spaCy Tutorial

Topics Covered

In this introductory spaCy NLP tutorial, we will learn what spaCy is. spaCy is an open-source Python library created by Matthew Honnibal and Ines Montani in 2015. It is used in data science, data analysis, and other machine-learning work, mainly on text. It is very fast and provides many tools for effectively handling large amounts of text data.

Prerequisites

To use the spaCy NLP library, the following prerequisites should be met:

  • Knowledge of the Python programming language.
  • A basic understanding of NLP concepts such as tokenization and lemmatization.

Introduction to spaCy

spaCy NLP is a free, open-source, advanced library for Natural Language Processing in Python and Cython. It is an industry-standard library that tackles many NLP tasks with state-of-the-art speed, accuracy, and performance.

spaCy is designed specifically for production use, helping us process and understand large volumes of text data. It can be used to extract information from text or to pre-process text for model training in deep learning.

Installation

spaCy is compatible with Python 3.6+ and runs on Linux, macOS, and Windows. It can be installed with both pip and conda.

To Install Using PIP

Before installing spaCy and its dependencies, ensure that your pip, setuptools, and wheel packages are up to date. It is also advisable to create a virtual environment and install all the packages inside it.

Code:
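    # Create and activate a virtual environment (optional but recommended)
    python -m venv .env
    source .env/bin/activate

    # Make sure pip, setuptools, and wheel are up to date, then install spaCy
    pip install -U pip setuptools wheel
    pip install -U spacy

    # Download the small English model used later in this tutorial
    python -m spacy download en_core_web_sm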

To Install Using Conda

We can install spaCy via the conda-forge channel.
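Code:

    conda install -c conda-forge spacy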

Basics of SpaCy

spaCy NLP is used in every sector of industry, from academia to businesses that need to analyze documents and reach decisions in seconds. NLP itself has been around for decades, but the growing promise of deep learning has expanded the field.

Humans can easily parse complex language in its proper context, but this is challenging for computers. NLP is a complex problem requiring complex solutions, which are often built with artificial neural networks.

Features of SpaCy NLP

Some of the main features of spaCy NLP are:

  • Support for 64+ languages
  • 63 trained pipelines, including pre-trained transformers such as BERT
  • Pretrained word vectors
  • State-of-the-art speed
  • Production-ready training system
  • Linguistically motivated tokenization
  • Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, lemmatization, and more
  • Easily extensible with custom components
  • Support for custom models in PyTorch, TensorFlow, and other frameworks
  • Built-in visualizers for syntax and NER
  • Easy model packaging, deployment, and workflow management
  • Robust, rigorously evaluated accuracy

Linguistic Annotations

The first step to working with spaCy in Python is to import spacy.
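Code:

    import spacy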

With spaCy imported, we can create an NLP object. To do that, we call spacy.load() with a model name as the parameter. Since many models are available, we will use the default English model, en_core_web_sm.
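Code:

    nlp = spacy.load("en_core_web_sm")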

Next, we will read a text file or a sample paragraph to process the text and get insights.

A sample text is given below; we will name this data text.
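The original sample paragraph is not reproduced here, so the snippet below uses a stand-in paragraph. It includes the sentence analyzed later in this tutorial, plus parentheses and an abbreviation, which become relevant below:

    text = ("Make no mistake, this was a football match (not a friendly). "
            "Mr. Smith said the match was worth watching.")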

Creating the Doc Container

It is time to create a Doc container. To create one, we call our NLP object and pass our text to it.
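Code:

    doc = nlp(text)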

Let's see what this doc contains.
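Code:

    print(doc)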

If you are trying to spot the difference between this and the text above, you will not see one when printing the Doc container. But things are quite different behind the scenes. Unlike the text object, the Doc container carries a lot of valuable metadata, or attributes, hidden behind it. To check this, let us examine the length of the doc object and the text object.
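Code:

    print(len(text))  # number of characters in the raw string
    print(len(doc))   # number of tokens in the Doc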

What is going on here? The same text, but a different length. Why does this occur? To answer that, let us dig deeper and try to print each item in each object.
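Starting with the raw string:

    for item in text:
        print(item)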

As expected, we have printed each character; let's check the same with the doc container.
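Code:

    for token in doc:
        print(token)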

We see the difference. On the surface, it may seem that the Doc container's length depends on the number of words, but notice that the open and close parentheses are also counted as items in the container. These items are known as tokens. Tokens are a fundamental building block of spaCy, or of any NLP framework. They can be words or punctuation marks; they have a syntactic purpose in a sentence and are self-contained. An example is the “don’t” contraction in English. When tokenized (tokenization is the process of converting text into tokens), it yields two tokens, “do” and “n’t”, because the contraction represents two words, “do” and “not”.
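A quick check (any short contraction works here):

    for t in nlp("Don't do it!"):
        print(t.text)
    # "Don't" is split into the tokens "Do" and "n't"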

You may think that you could simply use Python's split method to split on whitespace and get the same result. However, you would be wrong. Let us see why.
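Code:

    words = text.split()
    print(len(words))  # fewer items than len(doc)
    print(words[:8])   # punctuation and parentheses stay glued to the words, e.g. "mistake,"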

Sentence Boundary Detection

Sentence boundary detection (SBD) is a technique for identifying where sentences end in a text. It seems easy to do with rules; for example, we could just split on “.”. But in English, abbreviations are also written with a period. spaCy detects and handles these cases for us.
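With our doc, the sentences are exposed through the doc.sents attribute:

    for sent in doc.sents:
        print(sent)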

Let us move forward with just one of these sentences. First, let's try to grab index 0 of this attribute.
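Code:

    # doc.sents is a generator, so it cannot be indexed directly;
    # convert it to a list first.
    sentence1 = list(doc.sents)[0]
    print(sentence1)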

Token Attributes

The token object contains many different attributes that are important for performing NLP in spaCy. We will be working with a few of them, such as:

  • .text
  • .head
  • .left_edge
  • .right_edge
  • .ent_type
  • .ent_iob_
  • .lemma_
  • .morph
  • .pos_
  • .dep_
  • .lang_

We will briefly describe these attributes and show how to grab each one and what it looks like. To demonstrate, we will use one token, “no”, which is part of the sequence of tokens that makes up “Make no mistake, this was a football match”.
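Assuming the stand-in text above, “no” is the token at index 1 of the first sentence:

    token = sentence1[1]
    print(token.text)  # no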

Text

Verbatim text content. - spaCy docs

Head

The syntactic parent, or “governor”, of this token. - spaCy docs

This tells which word it is governed by.
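Continuing with our token (the exact output depends on the parse):

    print(token.head)  # e.g. "mistake", the noun that "no" modifies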

Left Edge

The leftmost token of this token’s syntactic descendants. - spaCy docs

Right Edge

The rightmost token of this token’s syntactic descendants. - spaCy docs

This will tell us where the multi-word token ends.
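Continuing with our token: a token with no syntactic descendants is its own edge, while the edges of its head span the whole phrase (exact output depends on the parse):

    print(token.left_edge, token.right_edge)            # e.g. no no
    print(token.head.left_edge, token.head.right_edge)  # e.g. no mistake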

Entity Type

Named entity type. - spaCy docs

Note the absence of the "_" at the end of the attribute. This will return an integer that corresponds to an entity type.
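Since “no” is not part of a named entity, the integer is 0 and the string form (.ent_type_) is empty:

    print(token.ent_type)   # 0
    print(token.ent_type_)  # ''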

Lemma

The base form of the token, with no inflectional suffixes. - spaCy docs

Morph

Morphological analysis. - spaCy docs

Part of Speech

Coarse-grained part-of-speech from the Universal POS tag set. - spaCy docs

Syntactic Dependency

Syntactic dependency relation. - spaCy docs

Language

Language of the parent document’s vocabulary. - spaCy docs
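Putting the remaining attributes together for our token (morphological and dependency output depends on the parse):

    print(token.lemma_)  # e.g. no
    print(token.morph)   # e.g. Polarity=Neg
    print(token.pos_)    # e.g. DET
    print(token.dep_)    # e.g. det
    print(token.lang_)   # en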

Vocab, Hashes and Lexemes

In spaCy, the Vocab object is central. It's a collection of look-up tables that make common information available across documents. This means that strings are only stored once in the Vocab. To save memory, spaCy also encodes all strings to hash values.

Vocab

Vocab is essentially a collection of symbols and their string representations. It's where spaCy stores data shared between multiple documents. By centralizing strings, word vectors, and lexical attributes, spaCy saves memory and ensures there's a single source of truth.

For example, when you process multiple documents, spaCy doesn’t keep saving words like “the”, “and”, or “a” over and over. Instead, it stores each word once in the Vocab and associates it with a unique ID.

Hashes

Every string spaCy encounters, whether it's a word, a part of speech tag, or any other value, gets hashed. This hashing is non-reversible, meaning you can't get the original string back from its hash.

This means spaCy doesn’t have to memorize the word “coffee”. Instead, it remembers a 64-bit hash and uses it as a unique ID for the word. When processing a text, spaCy only has to remember the hashes, which are much smaller than the strings they represent.
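A minimal sketch, using the example from the spaCy docs:

    coffee_doc = nlp("I love coffee")
    print(coffee_doc.vocab.strings["coffee"])             # 3197928453018144401
    print(coffee_doc.vocab.strings[3197928453018144401])  # 'coffee'
    # The reverse lookup works only because the string was stored in the
    # StringStore when the text was processed; the hash itself cannot be reversed.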

Lexemes

A Lexeme object is an entry in the vocabulary. You can get a lexeme by looking up a string or a hash ID in the Vocab. Lexemes expose attributes, just like tokens. They hold context-independent information about a word, like its text, hash ID, and morphological features.

While the Doc contains words in context – i.e. with their part-of-speech tags, dependencies, etc. – the Lexeme is context-free. Think of it as the word type, while the word in the Doc is a word token.
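Looking up “coffee” in the Vocab returns its lexeme, per the spaCy docs example:

    lexeme = nlp.vocab["coffee"]
    print(lexeme.text, lexeme.orth, lexeme.is_alpha)
    # the text, its hash ID, and a context-independent lexical attribute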

Here's a summary table for Vocab, Hashes, and Lexemes:

Concept | Description | Example Use-case
--- | --- | ---
Vocab | Stores data shared across multiple documents; a central collection of symbols and their representations. | Access shared data, like word vectors.
Hashes | Encodes strings to unique 64-bit integers; efficient for memory but non-reversible. | Efficiently check if a word is in the vocabulary.
Lexemes | Context-free entries in the vocabulary; store information about a word type. | Access lexical attributes without context.

Part of Speech Tagging (POS)

Understanding parts of speech is essential, and spaCy NLP makes it very easy.
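For example, we can print the coarse-grained (.pos_) and fine-grained (.tag_) tag for every token in a sentence:

    for token in nlp("Make no mistake, this was a football match"):
        print(token.text, token.pos_, token.tag_)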

Conclusion

  • This article briefly introduced spaCy NLP; it is recommended to go through it a couple of times.
  • spaCy is an open-source Python library developed for data science, data analysis, and other machine-learning activities, mainly focused on text.
  • We covered how to install spaCy and how to load the English model with nlp = spacy.load("en_core_web_sm").
  • Next, we discussed SBD, a technique for identifying where sentences end in a text.
  • We also discussed token attributes such as .text and .head.
  • At the end, we covered part-of-speech tagging.