Urllib Python
Overview
The urllib Python library is a built-in library that comes with the Python interpreter. It is a collection of modules that help us work with URLs (Uniform Resource Locators). The library uses the urlopen() function to fetch URLs over several protocols. The urllib library contains these modules: urllib.request (for opening and reading URLs), urllib.parse (for parsing URLs), urllib.error (for the exceptions raised by urllib.request), and urllib.robotparser (for parsing robots.txt files).
What is the urllib Module in Python?
Python ships with a large standard library and has a wide ecosystem of third-party packages, which is a big part of its popularity. Let us learn about one of its most useful built-in packages, urllib.
The urllib Python library is one of the built-in Python libraries. It is a collection of modules that help us work with URLs (Uniform Resource Locators). The library uses the urlopen() function to fetch URLs over several protocols.
Note: A URL or Uniform Resource Locator is simply the address of a unique resource on the Web (Internet). A URL points to some unique resource like an HTML page, a CSS document, an image, etc.
Example: https://www.scaler.com/topics/
In simpler terms, the urllib library helps us access websites from Python. We can make HTTP requests such as GET, POST, PUT, and DELETE with it. Combined with the standard json module, it also lets us send and receive JSON data.
The urllib library is part of the Python 3 standard library, so it comes with the Python interpreter and does not need to be installed separately. (The similarly named urllib3 is a separate third-party package.)
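As a minimal sketch of how the request method is chosen, a urllib.request.Request object can be built and inspected without sending anything. The endpoint URL and the payload below are placeholders assumed for illustration, and the JSON encoding is done by the standard json module rather than by urllib itself.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and payload, used only for illustration.
url = "https://example.com/api/items"
payload = {"name": "notebook", "quantity": 2}

# urllib sends bytes, so the dictionary is serialized with json and encoded.
body = json.dumps(payload).encode("utf-8")

post_request = Request(
    url,
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
delete_request = Request(url + "/1", method="DELETE")

# Nothing has been sent yet; the objects only describe the requests.
print(post_request.get_method())    # POST
print(delete_request.get_method())  # DELETE
```

Passing either object to urllib.request.urlopen() would send it. A request created without data and without an explicit method defaults to GET.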
Some of the use cases of the urllib Python library are:
- We can download data.
- We can access websites.
- We can perform web scraping.
- We can parse data.
- We can modify webpage headers.
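One reading of the last use case is setting the headers of the outgoing request. A minimal sketch, assuming a placeholder URL and a made-up User-Agent string:

```python
from urllib.request import Request

# Placeholder URL and User-Agent value, assumed for illustration.
request = Request(
    "https://example.com/",
    headers={"User-Agent": "my-crawler/0.1"},
)

# Headers can also be added after the object is created.
request.add_header("Accept-Language", "en")

# Request stores header names with only the first letter capitalized,
# for example "User-agent" and "Accept-language".
print(request.header_items())
```

The headers are sent only when the request object is passed to urllib.request.urlopen().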
Let us learn about the various useful modules and methods of the urllib Python library in the next section.
Modules Under urllib in Python
As discussed earlier, the urllib library helps us access webpages using Python. For this purpose, it is split into several modules. Let us now learn about the most useful modules of the urllib Python library, along with their syntax and examples.
The modules that come under urllib are:
1. For Opening and Reading: "urllib.request"
The urllib.request module of the urllib Python library comes with classes and functions that help us open URLs (mostly HTTP and HTTPS URLs). We generally use the urlopen(url) function of the urllib.request module to open webpages.
Example
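A minimal sketch of fetching a page with urlopen() might look like the following. The helper name fetch_text, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Open url and return the response body decoded to a string."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_text("https://www.scaler.com/topics/") with a working network connection would return that page's HTML as a string.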
In the program above, we have accessed the HTML code of the requested URL.
2. For Parsing URLs: "urllib.parse"
The urllib.parse module of the urllib library contains functions and classes that can be used to manipulate the URLs and the corresponding parts of the URL. It can be used to break and build components of a URL.
The main use of the urllib.parse module is to split a URL into several smaller components. We can also use this module to join the components back into a URL string.
Syntax: urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True) returns a named tuple with six components: scheme, netloc, path, params, query, and fragment.
Example
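A short sketch of splitting a URL and joining it back is shown here. The query string and fragment appended to the URL are assumptions, added only so that more components are non-empty.

```python
from urllib.parse import urlparse, urlunparse

# The query string and fragment are added here only for illustration.
url = "https://www.scaler.com/topics/?q=urllib#intro"

parts = urlparse(url)
print(parts.scheme)    # https
print(parts.netloc)    # www.scaler.com
print(parts.path)      # /topics/
print(parts.query)     # q=urllib
print(parts.fragment)  # intro

# urlunparse() accepts the six components and rebuilds the URL string.
rebuilt = urlunparse(parts)
print(rebuilt == url)  # True
```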
In the above code, we have used the urlparse() and urlunparse() methods to get the various components of the URL and for getting back the original URL from the components respectively.
Refer to the table below to learn about some of the functions of the urllib.parse module.
Function Name | Use |
---|---|
urllib.parse.urlparse | This function is used to separate the various components of a URL. |
urllib.parse.urlunparse | This function is used to join back the various components of a URL. |
urllib.parse.urlsplit | This function behaves like urllib.parse.urlparse, but it does not split the params from the path, so it returns five components instead of six. |
urllib.parse.urlunsplit | This function is used to combine the components of the URL (in the form of tuple) returned by the urllib.parse.urlsplit function. |
urllib.parse.urldefrag | This function separates the fragment (if any) from a URL and returns the URL without the fragment together with the fragment itself. |
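The difference between these functions is easiest to see on a URL that has path parameters and a fragment. The example.com URL below is a placeholder chosen for illustration.

```python
from urllib.parse import urldefrag, urlparse, urlsplit, urlunsplit

url = "https://example.com/path;type=a?x=1#top"

parsed = urlparse(url)  # six components, params separated from the path
split = urlsplit(url)   # five components, params left inside the path

print(parsed.path, parsed.params)  # /path type=a
print(split.path)                  # /path;type=a

# urlunsplit() rebuilds the URL from the five-component result.
print(urlunsplit(split) == url)    # True

# urldefrag() returns the URL without its fragment, plus the fragment.
without_fragment, fragment = urldefrag(url)
print(without_fragment)  # https://example.com/path;type=a?x=1
print(fragment)          # top
```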
3. For the Raised Exceptions: "urllib.error"
The urllib.error module of the urllib Python library defines the exception classes raised by the urllib.request module. So, if there is an error while fetching a URL, we catch the exceptions from urllib.error to handle it.
In the case of URL fetching, there may be two types of exceptions. Let us learn about both of them by using examples.
- URLError: This exception is raised when a URL cannot be fetched because of a connectivity problem or a similar failure, such as an unknown host or a refused connection.
Let us take an example to understand the context better.
Example
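A minimal sketch of catching URLError follows. The helper name describe_fetch and the 5-second timeout are assumptions made for illustration.

```python
from urllib.error import URLError
from urllib.request import urlopen


def describe_fetch(url, timeout=5):
    """Try to fetch url and return a short message about the outcome."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return f"fetched {len(response.read())} bytes"
    except URLError as exc:
        # exc.reason holds the underlying cause, such as a DNS failure.
        return f"could not reach {url}: {exc.reason}"
```

With no internet connection, describe_fetch("https://www.scaler.com/topics/") would return the "could not reach" message instead of raising an exception.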
In the above code, we tried to access the website without internet connectivity, so a URLError was raised and its message was printed in the except block.
- HTTPError: The other type of error is an HTTP error, which is raised when the server responds with an error status code. HTTPError is a subclass of URLError. Common causes include authentication failures, missing pages, and forbidden requests.
Note: Some commonly seen HTTP status codes are:
- 404: page not found.
- 403: request forbidden.
- 401: authentication required.
- 308: permanent redirect.
Let us take an example to understand the context better.
Example
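A minimal sketch of handling both exception types is shown here. The helper name check_url and the 5-second timeout are assumptions made for illustration.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def check_url(url, timeout=5):
    """Return a short message describing what happened when fetching url."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return f"OK: {response.status}"
    # HTTPError is a subclass of URLError, so it has to be caught first.
    except HTTPError as exc:
        return f"HTTP error {exc.code}: {exc.reason}"
    except URLError as exc:
        return f"URL error: {exc.reason}"
```

For a page that does not exist, a server would typically answer with status 404, and check_url() would then return a message starting with "HTTP error 404".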
When the server answers with an error status such as 404, urlopen() raises an HTTPError, which can be caught in an except block. The status code is available as the code attribute of the exception.
4. For Parsing robots.txt Files: "urllib.robotparser"
The urllib.robotparser module of the urllib library contains a single class called RobotFileParser. This class is used to check whether a particular user agent is allowed to fetch a URL on a website that publishes a robots.txt file.
Note: robots.txt is a text file created by webmasters to instruct web robots (crawlers) on which pages of the website they may crawl.
Let us take an example to understand the context better.
Example
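A minimal sketch of RobotFileParser is shown here. To keep it independent of the network, the rules are passed in directly with parse(); the rules, the example.com URLs, and the user agent name MyCrawler are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, split into lines for parse().
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) reports whether the rules allow the request.
print(parser.can_fetch("MyCrawler", "https://example.com/index.html"))    # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/data"))  # False
```

For a live website, the usual pattern is to call set_url() with the address of the site's robots.txt file and then read() to download and parse it, before calling can_fetch().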
Conclusion
- The urllib Python library is a built-in Python library that has a collection of various modules that helps us to work with the URLs.
- We can make various types of HTTP web requests like GET, POST, PUT, and DELETE using the urllib Python library.
- Using the urllib Python library, we can download data, access websites, perform web scraping, parse data, and modify webpage headers.
- The urllib.request module helps us to open URLs (mostly HTTP and HTTPS URLs). We generally use the urlopen(url) function of the urllib.request module to open webpages.
- The urllib.parse module can be used to manipulate the URLs and the corresponding parts of the URL. It can be used to break and build components of a URL.
- The urllib.error module defines the exceptions raised by the urllib.request module. So, if there is an error while fetching a URL, we catch the exceptions from urllib.error to handle it.
- The urllib.robotparser module contains a single class called RobotFileParser, which is used to check whether a user agent is allowed to fetch a URL on a website that publishes a robots.txt file.