The Python Pickle Module
Pickle in Python is a powerful module for serializing and deserializing Python object structures, transforming them into a byte stream for storage or transmission. This process, known as pickling, enables efficient data exchange and persistence, essential for applications involving complex data manipulation.
Python Pickle - Python object serialization
In Python, everything is an object, and the pickle module lets us save the internal state of these objects to a file or database for later use. This process is called serialization, also known as pickling, marshalling, or flattening. The resulting byte stream contains all the information needed to reconstruct the object structure in another Python script.
The reverse process, rebuilding the object from the byte stream, is called unpickling, also known as deserialization or unmarshalling.
One of the most common uses of pickling in data science is saving the internal state or weights of a trained model, so that it can be loaded later to make predictions without training the model all over again.
To use the pickle module in a Python program, import it first with the statement import pickle.
Python Pickle Example
In this example, we demonstrate how to use the Pickle module to serialize and deserialize a Python dictionary, showcasing the simplicity and power of Pickle for object persistence.
Serialization (Pickling):
Deserialization (Unpickling):
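Both steps can be sketched in one short script. This is a minimal illustration rather than the article's original listing: the dictionary contents and the file name student.pickle are assumptions made for the example.

```python
import pickle

# Hypothetical sample data and file name, chosen only for illustration.
student = {"name": "Asha", "age": 21, "courses": ["math", "physics"]}

# Serialization (pickling): write the dictionary to a file in binary mode.
with open("student.pickle", "wb") as f:
    pickle.dump(student, f)

# Deserialization (unpickling): read the byte stream back into a new object.
with open("student.pickle", "rb") as f:
    restored = pickle.load(f)

print(restored)             # {'name': 'Asha', 'age': 21, 'courses': ['math', 'physics']}
print(restored == student)  # True: same contents
print(restored is student)  # False: a new object was built
```

The restored dictionary is equal to the original but is a separate object, which is what makes pickling suitable for persistence between program runs.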
Pickle module in python
Constants
- pickle.HIGHEST_PROTOCOL: The highest protocol version available in the running interpreter. Using it usually gives the fastest serialization and the smallest output, but the result may not be readable by older Python versions.
- pickle.DEFAULT_PROTOCOL: The protocol used when none is specified. It balances compatibility and efficiency. Python 3.8 made protocol 4 the default; the value can change between releases, so check this constant rather than hard-coding a number.
- Protocol Versions (0 to 5):
- 0: The original human-readable (ASCII) protocol, backward compatible with earlier versions of Python.
- 1: An old binary format, also compatible with earlier versions of Python.
- 2: Introduced in Python 2.3; provides more efficient pickling of new-style classes.
- 3: Introduced in Python 3.0 with explicit support for bytes objects; it cannot be unpickled by Python 2.x.
- 4: Introduced in Python 3.4; adds support for very large objects, pickling more kinds of objects, and some data format optimizations.
- 5: Introduced in Python 3.8; adds support for out-of-band data and speeds up in-band data, which benefits large buffers such as NumPy arrays.
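The constants and protocol numbers above can be inspected directly. The sketch below assumes a small sample dictionary and simply pickles it once per available protocol; the printed sizes depend on the interpreter version, so none are shown.

```python
import pickle

print(pickle.HIGHEST_PROTOCOL)  # highest protocol this interpreter supports
print(pickle.DEFAULT_PROTOCOL)  # protocol used when none is passed

data = {"a": [1, 2, 3]}  # hypothetical sample value
headers = {}
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(data, protocol=proto)
    # From protocol 2 onwards the stream starts with the PROTO opcode (0x80)
    # followed by one byte holding the protocol number.
    headers[proto] = blob[:2]
    print(proto, len(blob), headers[proto])
```

Because the protocol number is recorded in the stream for protocol 2 and later, and older formats are recognized as well, the unpickling functions need no protocol argument.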
Functions
- pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)
This function writes the pickled representation of the object obj to the open file object file, which must be opened for binary writing.
protocol is an optional integer that tells the pickler which protocol to use. If it is omitted, DEFAULT_PROTOCOL is used; a negative value selects HIGHEST_PROTOCOL.
If fix_imports is true and the protocol is less than 3, pickle tries to map the new Python 3 names to the old module names used in Python 2, so that the pickle data stream is readable by Python 2.
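A minimal sketch of pickle.dump() follows. The dictionary dic and the file name data.pickle match the names used in the surrounding text; the dictionary's contents are an assumption for illustration.

```python
import os
import pickle

dic = {"id": 1, "tags": ["a", "b"], "active": True}  # hypothetical contents

# The target file must be opened in binary write mode ("wb").
with open("data.pickle", "wb") as f:
    pickle.dump(dic, f, protocol=pickle.HIGHEST_PROTOCOL)

# The file now holds the pickled bytes; its size depends on the protocol.
print(os.path.getsize("data.pickle"))
```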
In the above example, we created a dictionary dic and used pickle.dump() to serialize the dictionary and store it in a data.pickle file for later use.
- pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
This function returns the pickled representation of the object obj as a bytes object instead of writing it to a file.
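As a small illustration of pickle.dumps(), the sketch below pickles a tuple in memory. The tuple's contents are made up for the example.

```python
import pickle

record = ("sensor-7", 21.5, [1, 2, 3])  # hypothetical sample value
blob = pickle.dumps(record)

print(type(blob))  # <class 'bytes'>
print(len(blob))   # size depends on the protocol in use
```

The resulting bytes object can be stored in a database column, sent over a socket, or passed to pickle.loads().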
Pickled data has no required file extension. By convention, files are usually named with a .pickle or .pkl suffix.
- pickle.load(file, *, fix_imports=True, encoding='ASCII', errors='strict', buffers=None)
This function reads the pickled representation of an object from the open file object file, which must be opened for binary reading, and returns the reconstituted object.
The encoding and errors arguments tell pickle how to decode 8-bit string instances pickled by Python 2; they default to 'ASCII' and 'strict', respectively.
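Each call to pickle.load() reads exactly one pickled object, so several objects written to the same file come back in the order they were written. The sketch below assumes a hypothetical file name, items.pickle, and two sample objects.

```python
import pickle

# Write two objects one after the other into the same file.
with open("items.pickle", "wb") as f:
    pickle.dump({"step": 1}, f)
    pickle.dump([10, 20, 30], f)

exhausted = False
with open("items.pickle", "rb") as f:
    first = pickle.load(f)   # {'step': 1}
    second = pickle.load(f)  # [10, 20, 30]
    try:
        pickle.load(f)       # nothing left to read
    except EOFError:
        exhausted = True

print(first, second, exhausted)
```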
- pickle.loads(data, *, fix_imports=True, encoding='ASCII', errors='strict', buffers=None)
This function reads the pickled representation of an object from the bytes-like object data and returns the reconstituted object.
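A short in-memory round trip shows pickle.loads() at work. The sample dictionary is an assumption for illustration.

```python
import pickle

original = {"matrix": [[1, 2], [3, 4]], "label": "demo"}  # hypothetical data
blob = pickle.dumps(original)
clone = pickle.loads(blob)

print(clone == original)                      # True: equal contents
print(clone["matrix"] is original["matrix"])  # False: nested objects are rebuilt too
```

Since every nested object is rebuilt, a dumps() and loads() round trip behaves like a deep copy of picklable data.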
The main difference between dump() and dumps() is that dumps() returns the pickled data as a bytes object, whereas dump() writes it to a file; the trailing s stands for string. load() and loads() differ in the same way.
Exceptions
- exception pickle.PickleError: The base class for the other exceptions raised by the pickle module.
- exception pickle.PicklingError: Raised when the pickler encounters an object that cannot be pickled.
- exception pickle.UnpicklingError: Raised when there is a problem unpickling an object, such as data corruption or a security violation. Unpickling can also raise other exceptions, for example EOFError, AttributeError, or ImportError.
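The exception hierarchy above can be checked and used as follows. The helper try_loads is a hypothetical name introduced for this sketch; it only guards against corrupted data from a trusted source and adds no protection against malicious pickles.

```python
import pickle

# Both specific exceptions derive from pickle.PickleError.
print(issubclass(pickle.PicklingError, pickle.PickleError))    # True
print(issubclass(pickle.UnpicklingError, pickle.PickleError))  # True

def try_loads(data):
    """Return the unpickled object, or None if the bytes are not a valid pickle."""
    try:
        return pickle.loads(data)
    except (pickle.UnpicklingError, EOFError) as exc:
        print(f"could not unpickle: {exc}")
        return None

print(try_loads(pickle.dumps([1, 2, 3])))  # [1, 2, 3]
print(try_loads(b"not a pickle"))          # None
```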
Pickling Class Instances
Class Instances can be pickled and unpickled without using any additional code. By default, the pickle will retrieve the class and attributes of an instance via introspection.
This default implementation of pickle in python can be altered by using one or more special methods explained as follows:
- object.__getnewargs_ex__()
This method controls the values passed to the __new__() method upon unpickling. It must return a pair (args, kwargs), where args is a tuple of positional arguments and kwargs is a dictionary of named arguments for constructing the object.
- object.__getnewargs__()
This method is similar to __getnewargs_ex__() but supports only positional arguments. It must return a tuple of arguments args, which is passed to the __new__() method upon unpickling.
- object.__getstate__()
If a class defines this method, it is called at pickling time and the object it returns is pickled as the contents of the instance, instead of the instance's dictionary.
- object.__setstate__(state)
Upon unpickling, if the class defines __setstate__(), it is called with the unpickled state, and the state object does not need to be a dictionary. Otherwise, the pickled state must be a dictionary, and its items are assigned to the new instance's dictionary.
- object.__reduce__()
This method takes no arguments and returns either a string or, preferably, a tuple. A string is treated as the name of a global variable; a tuple describes a callable and the arguments needed to rebuild the object.
- object.__reduce_ex__(protocol)
This method is similar to __reduce__() but takes a single integer argument, the protocol version. When defined, pickle prefers it over __reduce__(). Its main use is to provide backward-compatible reduce values for older Python releases.
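The most commonly overridden pair is __getstate__() and __setstate__(). The sketch below uses a hypothetical Counter class with a cache attribute that should not be saved; the class and attribute names are assumptions made for the example.

```python
import pickle

class Counter:
    """Hypothetical class with a cache that should not be persisted."""

    def __init__(self, start=0):
        self.value = start
        self._cache = {"expensive": "recomputable data"}

    def __getstate__(self):
        # Copy the instance dictionary and drop what should not be saved.
        state = self.__dict__.copy()
        del state["_cache"]
        return state

    def __setstate__(self, state):
        # Restore the saved attributes, then rebuild the dropped one.
        self.__dict__.update(state)
        self._cache = {}

c = Counter(5)
restored = pickle.loads(pickle.dumps(c))
print(restored.value)   # 5
print(restored._cache)  # {}
```

Note that __init__() is not called during unpickling, which is why __setstate__() has to recreate the cache itself.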
Protocol Formats of the Python pickle Module
There are six different protocols that the Python Pickle module uses.
- Protocol version 0 - It was the original human-readable protocol having backward compatibility with previous Python versions.
- Protocol version 1 - It was the first binary format supporting backward compatibility.
- Protocol version 2 - It was introduced in Python 2.3 and provides much more efficient pickling of new-style classes.
- Protocol version 3 - It was introduced in Python 3.0. It cannot be unpickled by the Python 2.x version. This was the default protocol in Python 3.0–3.7.
- Protocol version 4 - It was introduced in Python 3.4. It adds support for very large objects and for pickling more kinds of objects, and it became the default protocol in Python 3.8.
- Protocol version 5 - It was introduced in Python 3.8. It adds support for out-of-band data and improves speed for in-band data.
If we pickle the Pandas data frame using different versions, we can see the difference in pickled file size.
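The comparison can be sketched without pandas. As an assumption made to keep the example dependent only on the standard library, a list of row dictionaries stands in for the data frame; exact byte counts vary with the Python version, so they are not listed.

```python
import pickle

# Hypothetical tabular data: 1000 rows with three columns.
rows = [{"id": i, "name": f"user{i}", "score": i * 2} for i in range(1000)]

sizes = {}
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    sizes[proto] = len(pickle.dumps(rows, protocol=proto))
    print(f"protocol {proto}: {sizes[proto]} bytes")
```

For data like this, the text-based protocol 0 produces noticeably more bytes than the binary protocols.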
For most data, the higher versions do better than the lower ones in terms of
- The size of the pickled objects
- The performance of unpickling
What Can Be Pickled And Unpickled?
The following Python object types can be pickled:
- None, Boolean Values (True, False)
- int, float, complex numbers
- strings (normal and Unicode), bytes, byte arrays
- lists, sets, tuples, and dictionaries containing only picklable objects
Other than these, functions (both user-defined and built-in) and classes can also be pickled, but only if they are defined at the top level of a module. They are pickled by reference to their qualified name, not by value, so the same definition must be importable when unpickling.
Some objects cannot be pickled, including generators, lambda functions, module objects (such as the datetime module itself), and defaultdicts whose default factory is a lambda. Pickling a lambda function requires a third-party package such as dill, and a defaultdict becomes picklable when its factory is a module-level function. Instances of picklable classes, such as datetime.datetime objects, pickle normally. Live connection objects, such as a database or network connection, cannot be pickled because the operating-system resources behind them cannot be rebuilt from a byte stream.
Let us see an example in which the datetime module object itself cannot be pickled.
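A minimal sketch of this case follows. It tries to pickle the datetime module object, which fails, and then pickles a datetime.datetime instance, which succeeds; the date used is arbitrary.

```python
import datetime
import pickle

# A module object is not picklable.
try:
    pickle.dumps(datetime)
    module_pickled = True
except (TypeError, pickle.PicklingError) as exc:
    module_pickled = False
    print(f"cannot pickle the module: {exc}")

# An instance of datetime.datetime round-trips without trouble.
moment = datetime.datetime(2024, 1, 15, 9, 30)
restored = pickle.loads(pickle.dumps(moment))
print(restored == moment)  # True
```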
Another example, pickling a lambda function, is demonstrated below:
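The following sketch shows the failure and one way around it for defaultdict. The variable names are assumptions; the built-in int is used as the default factory because pickle can locate it by name, which is the same property a module-level function has.

```python
import pickle
from collections import defaultdict

square = lambda x: x * x  # anonymous function: pickle cannot look it up by name

try:
    pickle.dumps(square)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as exc:
    lambda_pickled = False
    print(f"cannot pickle the lambda: {exc}")

# A defaultdict with a lambda factory fails for the same reason, but a factory
# that pickle can find by name (here the built-in int) works.
counts = defaultdict(int)
counts["apples"] += 3
restored = pickle.loads(pickle.dumps(counts))
print(restored["apples"], restored["missing"])  # 3 0
```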
Compression of Pickled Objects
Compressing pickled objects can significantly reduce storage space and improve the efficiency of data transmission. This process involves serializing the Python object with Pickle and then compressing the serialized byte stream using a compression library such as gzip or bz2. Compression is particularly useful when dealing with large data structures or when serialized objects need to be stored or transmitted over a network.
Example of Compressing a Pickled Object with gzip:
Decompression and Deserialization:
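Both steps can be sketched with the standard gzip module. The data and the file name data.pickle.gz are assumptions for illustration; the data is deliberately repetitive so that it compresses well.

```python
import gzip
import pickle

data = {"readings": [i % 10 for i in range(10_000)]}  # hypothetical, repetitive data

# Compression: gzip.open returns a file object that pickle.dump can write to.
with gzip.open("data.pickle.gz", "wb") as f:
    pickle.dump(data, f)

# Decompression and deserialization: read through the same kind of file object.
with gzip.open("data.pickle.gz", "rb") as f:
    restored = pickle.load(f)

# Compare sizes in memory to see the effect of compression.
raw = pickle.dumps(data)
compressed = gzip.compress(raw)
print(len(raw), len(compressed))
print(restored == data)  # True
```

The bz2 module offers bz2.open() with the same pattern, so the two libraries can be swapped with little code change.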
Benefits:
- Compression can drastically reduce the size of serialized files, making storage and data transfer more efficient.
- gzip and bz2 store checksums alongside the compressed data, so decompression also acts as a basic integrity check: a corrupted file usually fails to decompress instead of being unpickled silently.
Security Concerns
Apart from the advantages, a developer must be aware of some drawbacks of the pickle module. The major drawback is that it is possible to craft malicious pickle data that executes arbitrary code during unpickling. Therefore, we should never unpickle data that comes from an untrusted source or that could have been tampered with in transit, for example over an insecure network. The risk can be reduced by signing the data with a secret key, using a module such as hmac, and verifying the signature before unpickling. Signing only proves that the data came from a holder of the key; it does not make untrusted pickles safe.
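One way to apply signing is sketched below, under stated assumptions: the key value, the function names sign and verify_and_load, and the layout (a 32-byte SHA-256 digest placed in front of the payload) are all choices made for this example, not a standard format.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key; keep real keys out of source code
DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes

def sign(obj):
    """Pickle obj and prepend an HMAC-SHA256 digest of the pickled bytes."""
    payload = pickle.dumps(obj)
    digest = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return digest + payload

def verify_and_load(blob):
    """Unpickle only if the digest matches; raise ValueError otherwise."""
    digest, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(digest, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)

signed = sign({"user": "alice"})
print(verify_and_load(signed))  # {'user': 'alice'}

# Flip one bit in the payload to simulate tampering.
tampered = signed[:-1] + bytes([signed[-1] ^ 1])
try:
    verify_and_load(tampered)
except ValueError as exc:
    print(exc)  # signature mismatch: refusing to unpickle
```

The check runs before pickle.loads() is ever called, which is the point: tampered bytes never reach the unpickler.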
Apart from security concerns, pickle files are not human-readable, and the format is specific to Python, so programs written in other languages generally cannot read pickled files.
Advantages and disadvantages of using pickle in python
Advantages of Using Pickle in Python
- Pickle's API is straightforward, making it simple to serialize and deserialize Python objects with minimal code.
- It can serialize a wide range of Python objects, including complex data structures like lists, dictionaries, custom classes, and more.
- Pickle is a part of Python's standard library, ensuring good integration with Python's ecosystem and no additional installation requirements.
- Pickle maintains the object's state and all its data attributes, allowing for an exact recreation of the original object upon deserialization.
Disadvantages of Using Pickle in Python
- Deserializing data from an untrusted source can execute arbitrary code, leading to significant security vulnerabilities.
- Pickle is tightly coupled with Python, making the serialized data not easily readable or usable from other programming languages.
- Pickle files may not be compatible across different Python versions, leading to potential issues when unpickling data with a different Python version than the one used for pickling.
- For large data sets, Pickle's performance may not be as efficient as more specialized libraries designed for high-performance serialization, such as numpy for arrays or pandas for data frames.
Conclusion
- Serialization (pickling) converts Python objects into a byte stream for storage or network transmission, while deserialization (unpickling) reverses this process.
- Despite its utility, the pickle module carries security risks and produces files that are not human-readable, and its use is limited to Python environments.
- Pickling is often favoured over CSV for data frames because it is faster and preserves data types, even though CSV files are human-readable.
- The pandas library's built-in methods for pickling and unpickling, DataFrame.to_pickle() and pandas.read_pickle(), make it convenient to persist data frames.