Remove Punctuation from a String in Python


Overview

Handling vast amounts of data requires various kinds of processing to make data useful. Raw data contains, in part, unnecessary and unfiltered information that must be filtered and sorted before further processing. The most common example is removing unwanted characters from strings for various purposes like web scraping, machine learning, data mining, etc. Cleaning strings may involve removing certain text from a long string or deleting certain symbols that are unnecessary for processing. This article will discuss various ways to remove punctuation from a string in Python.

How to Remove Punctuation from a String in Python?

Written English uses many symbols, such as the hyphen -, underscore _, exclamation mark !, question mark ?, comma ,, colon :, semicolon ;, braces {}, parentheses (), and so on. These are called punctuation marks, and they clarify the meaning of words and sentences. The information we obtain is usually in natural English, but for further data processing, text in this form can cause complications. To avoid these complications, we need to filter out the unnecessary symbols. The rest of this article discusses how to remove them.

Method 1: Using Loop + Punctuation String

The brute-force way to remove punctuation from a string in Python is to use a basic for loop. In this approach, we loop through the string we wish to clean and check each character. If the current character is not a punctuation mark, we append it to the end of a new string; if it is a punctuation mark, we skip it. When the loop finishes, the new string holds the text without punctuation.

Program:
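A minimal sketch of this approach, assuming that the punctuation set is the ASCII punctuation in string.punctuation. The function name remove_punctuation_loop and the sample string are illustrative choices, not taken from the original listing.

```python
import string

def remove_punctuation_loop(text):
    """Return text with every character in string.punctuation removed."""
    cleaned = ""
    for char in text:
        # Keep the character only if it is not an ASCII punctuation mark.
        if char not in string.punctuation:
            cleaned += char
    return cleaned

sample = "Hello, World! Isn't this fun?"
print(remove_punctuation_loop(sample))  # prints: Hello World Isnt this fun
```

Note that string.punctuation covers ASCII symbols only, so typographic marks such as curly quotes would pass through unchanged in this sketch.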

Output:

Method 2: Using Regex

Regular expressions, also known as regex, are widely used for pattern matching. A regular expression is a sequence of characters that specifies a pattern to search for in a string. In the program below, we use regex pattern matching to remove punctuation from a string in Python.

Python's re module provides the functions for using regular expressions.

Note: For simple punctuation removal, a regular expression is typically slower than translate(). The regex engine is a general-purpose matcher and carries more overhead per character than a direct table lookup. Relative timings depend on the input and the Python version, so measure before relying on them.

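The idea is to find every character that is neither a word character (a letter, a digit, or the underscore) nor whitespace, and to replace each of those characters with an empty string.

We use the re.sub() function to perform the substitution.

The syntax of re.sub() is re.sub(pattern, replacement, string), with optional count and flags arguments that are not needed here.

This function takes three required arguments: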

  • pattern: the regular expression pattern to match
  • replacement: a string, or a function that returns a string, to substitute for each match of the pattern
  • string: the string to perform the substitutions on

The return value of re.sub() is a new string with the substitutions applied; the original string is not modified.

Program
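A minimal sketch of the regex approach, using the pattern [^\w\s] that the article describes. The function name remove_punctuation_regex and the sample string are assumptions made for illustration.

```python
import re

def remove_punctuation_regex(text):
    # [^\w\s] matches any single character that is neither a word
    # character (letter, digit, underscore) nor whitespace.
    return re.sub(r"[^\w\s]", "", text)

sample = "Hello, World! Isn't this fun?"
print(remove_punctuation_regex(sample))  # prints: Hello World Isnt this fun
```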

Output:

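Note: The regular expression used here, [^\w\s], matches any single character that is neither a word character (\w: letters, digits, and the underscore) nor whitespace (\s). Because \w includes the underscore, this pattern leaves underscores in place.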
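The underscore behaviour can be checked with a short sketch. The sample text and the variable names kept and removed are illustrative; adding "_" as an alternative in the pattern is one possible way to strip underscores too, not necessarily the only one.

```python
import re

text = "user_name: alice!"

# \w includes the underscore, so [^\w\s] leaves it in place.
kept = re.sub(r"[^\w\s]", "", text)
# Adding "_" as an alternative removes underscores as well.
removed = re.sub(r"[^\w\s]|_", "", text)

print(kept)     # prints: user_name alice
print(removed)  # prints: username alice
```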

Method 3: Using Translate

Python's built-in string class provides a great utility to perform punctuation removal on a string. We can use the translate() function of the string class to perform string replacements using the translation table.

Note: A translation table can be pictured as a two-column table in which the character in the first column of each row is replaced by the entry in the second column of that row. Because each character is looked up directly in the table, the replacement is fast.

The syntax of the translate() function is string.translate(table).

It takes the translation table as its only argument. A translation table is a dictionary whose keys are Unicode code points (integers, as returned by ord()) and whose values are the replacements: a code point, a string, or None to delete the character.
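As an illustration, a translation table can be written by hand as a dictionary. This is a small sketch with made-up mappings; the keys are produced with ord() because translate() looks characters up by code point.

```python
# Keys are Unicode code points, hence ord(); None marks a character for deletion.
table = {ord("a"): "4", ord("e"): "3", ord("!"): None}

print("leet speak!".translate(table))  # prints: l33t sp34k
```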

Writing a translation table by hand as key-value pairs can get complicated. The string class has a built-in maketrans() method that eases this job by converting string arguments into a translation table.

The syntax of the maketrans() function is str.maketrans(x, y, delete).

In this form, the function takes three arguments:

  • x: a string that forms the first column of the translation table; these are the characters in the original string to be replaced
  • y: a string of the same length as x; each character in x is replaced by the character at the same position in y
  • delete: a string of characters to remove from the original string during translation
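The three arguments can be seen working together in a short sketch. The mappings below are arbitrary examples chosen for illustration.

```python
# "a"->"x", "b"->"y", "c"->"z"; "!" and "?" are deleted.
table = str.maketrans("abc", "xyz", "!?")

# The table is a dict keyed by code point; deleted characters map to None.
print(table[ord("a")] == ord("x"))  # prints: True
print(table[ord("!")] is None)      # prints: True

print("abc cab!?".translate(table))  # prints: xyz zxy
```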

Program
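A minimal sketch of the translate() approach. The variable name deletion_symbol follows the explanation in this section; its value here, string.punctuation, is an assumption, as is the sample string.

```python
import string

# Assumption: deletion_symbol holds the characters to strip.
deletion_symbol = string.punctuation

# Empty first and second arguments mean no replacements, only deletions.
table = str.maketrans("", "", deletion_symbol)

sample = "Hello, World! Isn't this fun?"
print(sample.translate(table))  # prints: Hello World Isnt this fun
```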

Output

Explanation

Here we have used a translation table with no replacement values and only deletion characters. The first and second arguments to maketrans() are empty strings, so the table contains no replacement rows and no character in the string is replaced by another. The third argument specifies the characters to be removed from the source string. Thus, translate() removes only the characters in the deletion_symbol variable.

Using a translation table is generally the fastest of the three ways to remove punctuation from a string in Python. Each character listed for deletion is mapped to None in the table, and translate() drops those characters from the result in a single pass over the string.
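The relative speed of the three methods can be checked with timeit. This is a rough sketch rather than a rigorous benchmark: the sample text, the repetition counts, and the function names are all assumptions, and the timings will vary with the machine, the input, and the Python version.

```python
import re
import string
import timeit

text = "Hello, World! Isn't this fun? " * 1000
table = str.maketrans("", "", string.punctuation)
pattern = re.compile(r"[^\w\s]")

def with_loop():
    cleaned = ""
    for char in text:
        if char not in string.punctuation:
            cleaned += char
    return cleaned

def with_regex():
    return pattern.sub("", text)

def with_translate():
    return text.translate(table)

# Time 100 runs of each method on the same input.
for func in (with_loop, with_regex, with_translate):
    seconds = timeit.timeit(func, number=100)
    print(f"{func.__name__}: {seconds:.4f} s")
```

All three functions return the same cleaned string for this input, so the comparison measures only speed.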

Conclusion

In this article, we learned that:

  • Punctuation marks are special symbols that add grammatical structure to natural English.
  • Natural English strings are not easily processed; hence we need to remove punctuation from strings before we can use them for further processing.
  • A loop iterating over the string sequentially is the simplest way to remove punctuation from a string in Python.
  • The regular expression-based substitution technique is flexible and easy to extend, but it is typically slower than translate().
  • The translate() function, used with a translation table, is generally the fastest way to remove punctuation.
