spaCy NLP Tutorial
In this introduction to spaCy NLP, we will learn what spaCy is. spaCy is an open-source Python library developed by Matthew Honnibal and Ines Montani and first released in 2015. It is used in data science, data analysis, and other machine-learning work, mainly on text. It is very fast and provides many tools for handling large amounts of text data effectively.
Prerequisites
To use the spaCy NLP library, the following prerequisites should be met:
- Knowledge of the Python programming language.
- A basic understanding of NLP concepts such as tokenization and lemmatization.
Introduction to spaCy
spaCy is a free, open-source library for advanced Natural Language Processing in Python and Cython. It is an industry-standard library that solves many NLP tasks with state-of-the-art speed and accuracy.
spaCy is designed specifically for production use, which helps us process and understand large volumes of text data. It can be used to extract information from text or to pre-process text for training deep learning models.
Installation
spaCy is compatible with Python 3.6+ and runs on Linux, macOS, and Windows. It is available via both pip and conda.
To Install Using PIP
Before installing spaCy and its dependencies, ensure that your pip, setuptools, and wheel packages are up to date.
Code:
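The standard commands from the spaCy installation docs; the last line downloads the small English model used later in this tutorial:

```bash
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
```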
It is advisable to create a virtual environment first and install all the packages inside it.
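For example, on Linux or macOS:

```bash
# create and activate a virtual environment
python -m venv .env
source .env/bin/activate
```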
To Install Using Conda
We can install spaCy via the conda-forge channel.
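Code:

```bash
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
```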
Basics of spaCy
spaCy is used in nearly every sector of industry, from academia to business, to analyze documents in seconds and support decision-making. NLP itself has existed for decades, but the promise of deep learning has greatly expanded the field.
Humans can easily parse complex text in its proper context, but this is challenging for computers. NLP is a complex problem, and modern solutions typically rely on artificial neural networks.
Features of spaCy
Some of the main features of spaCy are:
- Support for 64+ languages
- 63 trained pipelines, including pre-trained transformers such as BERT
- Pre-trained word vectors
- State-of-the-art speed
- A production-ready training system
- Linguistically motivated tokenization
- Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, lemmatization, and more
- Easily extensible with custom components
- Support for custom models in PyTorch, TensorFlow, and other frameworks
- Built-in visualizers for syntax and NER
- Easy model packaging, deployment, and workflow management
- Robust, rigorously evaluated accuracy
Linguistic Annotations
The first step to working with spaCy in Python is to import spacy.
With spaCy imported, we can create an nlp object. To do that, we call spacy.load() with a model name as the parameter. Since many models are available, we will use the default small English model, en_core_web_sm.
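A minimal setup, assuming the en_core_web_sm model has already been downloaded as shown in the installation section:

```python
import spacy

# create the nlp object by loading the small English pipeline
nlp = spacy.load("en_core_web_sm")
```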
Next, we will read a text file or a sample paragraph to process the text and get insights.
A sample text is given below; we will store it in a variable named text.
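The original passage is not reproduced here, so the paragraph below is a hypothetical stand-in. It deliberately contains parentheses, an abbreviation, and the sentence referenced later in this tutorial:

```python
text = ("Make no mistake, this was a football match. "
        "The home side (unbeaten this season) controlled the game from the first whistle. "
        "Mr. Smith, the manager, praised his players afterwards.")
```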
Creating the Doc Container
It is time to create a Doc container. To create one, we call our nlp object and pass our text to it.
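```python
doc = nlp(text)
```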
Let's see what this doc contains.
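Printing it back out:

```python
print(doc)
```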
If you are trying to spot the difference between this and the text above, you will not see one when printing the doc container. But it is quite different behind the scenes: unlike the plain text string, the Doc container holds a lot of valuable metadata, or attributes. To check this, let us examine the lengths of the doc object and the text object.
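Comparing the two lengths:

```python
print(len(text))  # number of characters in the string
print(len(doc))   # number of tokens in the Doc
```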
What is going on here? The same text, but different lengths. Why does this occur? To answer that, let us look more closely and print each item in each object.
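First, the plain string (only the first few items are shown to keep the output short):

```python
for item in text[:10]:
    print(item)  # prints one character per line
```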
As expected, iterating over the string printed each character. Let's check the same with the doc container.
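```python
for token in doc[:10]:
    print(token)  # prints one token (word or punctuation mark) per line
```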
Now we see the difference. On the surface, it may seem that the Doc container’s length is simply the number of words, but notice that the opening and closing parentheses are also counted as items in the container. These items are known as tokens. Tokens are the fundamental building blocks of spaCy, or of any NLP framework. They can be words or punctuation marks; they have a syntactic purpose in a sentence and are self-contained. An example is the English contraction “don’t”: when the text is tokenized (the process of converting text into tokens), we get two tokens, “do” and “n’t”, because the contraction represents the two words “do” and “not”.
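A quick check of this behaviour:

```python
print([token.text for token in nlp("don't")])
# ['do', "n't"]
```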
You may think that you could simply use Python's split method to split on whitespace and get the same result. However, you would be wrong. Let us see why.
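A sketch of the comparison, using the sample text from above:

```python
words = text.split()
print(len(words))  # whitespace-separated chunks
print(len(doc))    # spaCy tokens
# split() leaves punctuation attached, e.g. "match." and "(unbeaten"
# remain single items, while spaCy separates the punctuation into its own tokens
```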
Sentence Boundary Detection
Sentence boundary detection (SBD) is a technique to identify where sentences end in a text. It seems easy to do with rules, for example by splitting on ".", but in English, abbreviations are also written with periods. spaCy handles such cases for us.
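With the doc created earlier, the detected sentences are exposed through the .sents attribute:

```python
for sent in doc.sents:
    print(sent)
```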
Let us move forward with just one of these sentences. First, let's try to grab the sentence at index 0 in this attribute.
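Because .sents is a generator, we convert it to a list before indexing:

```python
sentence1 = list(doc.sents)[0]
print(sentence1)  # Make no mistake, this was a football match.
```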
Token Attributes
The token object contains many different attributes that are important for performing NLP with spaCy. We will be working with a few of them, such as:
- .text
- .head
- .left_edge
- .right_edge
- .ent_type
- .ent_iob_
- .lemma_
- .morph
- .pos_
- .dep_
- .lang_
Below, we briefly describe each of these attributes and show how to grab each one and what it looks like. To demonstrate them, we will use one token, “no”, which is part of the sequence of tokens that makes up “Make no mistake, this was a football match”.
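Assuming the hypothetical sample text above, “no” is the second token of the first sentence:

```python
token2 = sentence1[1]
print(token2)  # no
```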
Text
Verbatim text content. (spaCy docs)
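Continuing with the token2 defined above:

```python
print(token2.text)  # no
```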
Head
The syntactic parent, or “governor”, of this token. (spaCy docs)
This tells us which word governs this token.
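For example (the exact head depends on the model's parse; with the sample sentence it is likely the noun “mistake” that “no” modifies):

```python
print(token2.head)  # e.g. mistake
```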
Left Edge
The leftmost token of this token’s syntactic descendants. (spaCy docs)
Right Edge
The rightmost token of this token’s syntactic descendants. (spaCy docs)
Together with the left edge, this tells us where a multi-word span begins and ends.
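A sketch covering both edges. A token with no syntactic descendants is its own left and right edge, so we also look at the edges of token2's head, whose subtree likely spans “no mistake” (exact values depend on the parser):

```python
print(token2.left_edge, token2.right_edge)            # no no (a leaf token is its own edge)
print(token2.head.left_edge, token2.head.right_edge)  # e.g. no mistake
```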
Entity Type
Named entity type. (spaCy docs)
Note the absence of the "_" at the end of the attribute. This will return an integer that corresponds to an entity type.
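Since “no” is not part of a named entity, the values below are the defaults; .ent_iob_ from the attribute list above is shown as well:

```python
print(token2.ent_type)   # 0 (no entity type)
print(token2.ent_type_)  # "" (the string form)
print(token2.ent_iob_)   # "O", i.e. outside any entity
```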
Lemma
The base form of the token, with no inflectional suffixes. (spaCy docs)
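For “no” the lemma equals the text, so an inflected verb makes a clearer example:

```python
print(token2.lemma_)         # no
print(nlp("was")[0].lemma_)  # be
```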
Morph
Morphological analysis. (spaCy docs)
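The exact features reported are model-dependent:

```python
print(token2.morph)  # morphological features of "no", e.g. Polarity=Neg (model-dependent)
```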
Part of Speech
Coarse-grained part-of-speech from the Universal POS tag set. (spaCy docs)
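With the sample sentence, “no” is most likely tagged as a determiner:

```python
print(token2.pos_)  # e.g. DET
```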
Syntactic Dependency
Syntactic dependency relation. (spaCy docs)
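Likewise for the dependency label (parse-dependent):

```python
print(token2.dep_)  # e.g. det
```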
Language
Language of the parent document’s vocabulary. (spaCy docs)
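Since we loaded an English model:

```python
print(token2.lang_)  # en
```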
Vocab, Hashes and Lexemes
In spaCy, the Vocab object is central. It's a collection of look-up tables that make common information available across documents. This means that strings are only stored once in the Vocab. To save memory, spaCy also encodes all strings to hash values.
Vocab
Vocab is essentially a collection of symbols and their string representations. It's where spaCy stores data shared between multiple documents. By centralizing strings, word vectors, and lexical attributes, spaCy saves memory and ensures there's a single source of truth.
For example, when you process multiple documents, spaCy doesn’t keep saving words like “the”, “and”, or “a” over and over. Instead, it saves each of them once in the Vocab and associates it with a unique ID.
Hashes
Every string spaCy encounters, whether it's a word, a part-of-speech tag, or any other value, gets hashed. The hash function itself is non-reversible: you can't compute the original string back from the hash. Instead, spaCy keeps the mapping in its StringStore, so a hash can only be resolved back to a string spaCy has already seen.
This means spaCy doesn’t have to memorize the word “coffee”. Instead, it remembers a 64-bit hash and uses it as a unique ID for the word. When processing a text, spaCy only has to remember the hashes, which are much smaller than the strings they represent.
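A small demonstration, adapted from the pattern in the spaCy docs:

```python
doc2 = nlp("I love coffee")
coffee_hash = nlp.vocab.strings["coffee"]
print(coffee_hash)                     # a 64-bit integer ID
print(nlp.vocab.strings[coffee_hash])  # "coffee", resolved via the StringStore
```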
Lexemes
A Lexeme object is an entry in the vocabulary. You can get a lexeme by looking up a string or a hash ID in the Vocab. Lexemes expose attributes, just like tokens. They hold context-independent information about a word, like its text, hash ID, and morphological features.
While the Doc contains words in context – i.e. with their part-of-speech tags, dependencies, etc. – the Lexeme is context-free. Think of it as the word type, while the word in the Doc is a word token.
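A minimal sketch of looking up a lexeme and reading a few of its context-independent attributes:

```python
lexeme = nlp.vocab["coffee"]
print(lexeme.text)      # coffee
print(lexeme.orth)      # the lexeme's hash ID
print(lexeme.is_alpha)  # True
```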
Here's a summary table for Vocab, Hashes, and Lexemes:
| Concept | Description | Example Use-case |
|---|---|---|
| Vocab | Stores data shared across multiple documents; a central collection of symbols and their representations. | Access shared data, like word vectors. |
| Hashes | Encode strings as unique 64-bit integers; memory-efficient but non-reversible. | Efficiently check whether a word is in the vocabulary. |
| Lexemes | Context-free entries in the vocabulary; store information about a word type. | Access lexical attributes without context. |
Part of Speech Tagging (POS)
Understanding parts of speech is essential for many NLP tasks, and spaCy makes it straightforward: the trained pipeline assigns a part-of-speech tag to every token.
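For example, tagging every token in the first sentence of our sample text:

```python
for token in sentence1:
    print(token.text, token.pos_)
```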
Conclusion
- This article briefly introduced spaCy NLP; it is worth going through a couple of times.
- spaCy is an open-source Python library developed for data science, data analysis, and other machine-learning work, focused mainly on text.
- We covered how to install spaCy and how to load the English model with nlp = spacy.load("en_core_web_sm").
- Next, we discussed sentence boundary detection (SBD), a technique to identify where sentences end in a text.
- We also discussed token attributes such as .text and .head.
- At the end, we covered part-of-speech tagging of text.