The approach is to first find the object that causes the serialization failure, and then remove it or replace it with an object that can be pickled.
Many related answers can be found on the Internet, but they tend to give people a fish rather than teach them how to fish. This problem shows up in many scenarios, but in the end the cause is always the same: some object that pickle cannot serialize has been defined, and that object is then passed as an argument to multiprocessing. So to solve the problem, we must find out which object is not serializable. Once you understand how multiprocessing works, the troubleshooting process is actually very simple.
First, here is my error message. I hit the serialization failure while running DDP. Specifically, multiprocessing is called when the data-loading worker processes are created, and the arguments passed to multiprocessing cannot be serialized.
  File "/home/pai/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 47, in train
    for i, data_batch in enumerate(self.data_loader):
  File "/home/pai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "/home/pai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/pai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
    w.start()
  File "/home/pai/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/pai/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/pai/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/pai/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/pai/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/pai/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/pai/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle _thread.RLock objects
When I saw this error, my first guess was a synchronization problem between the data-loading processes, because DDP uses multiple processes to load data and the error message mentions _thread.RLock. But then I found that _thread.RLock is simply a common type that pickle cannot serialize. The error itself has nothing to do with inter-process synchronization. multiprocessing uses serialization to encode Python objects into a byte stream and sends them to the child processes through pipes.
Here is a general troubleshooting method. First, locate the frame just before the failing call, that is, the frame above ForkingPickler(file, protocol).dump(obj). According to the traceback, this is reduction.dump(process_obj, fp) at line 47 of multiprocessing/popen_spawn_posix.py; the line number may vary depending on the Python version. Now we know that one of obj's attributes (or an attribute of an attribute) is not serializable, and that the type of this attribute is _thread.RLock. To find it, serialize the attributes of obj one by one. If one of the attributes fails to serialize but its type is not _thread.RLock, then something inside that attribute is not serializable, and the search has to continue recursively. In this way the object causing the problem can be found. Deleting or replacing it solves the problem.
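The recursive search described above can be sketched roughly as follows. This is a minimal illustration, not code from multiprocessing or PyTorch: the helper name find_unpicklable and the toy object are hypothetical, and it uses plain pickle.dumps as a stand-in for the ForkingPickler that multiprocessing actually calls. It only descends into instance attributes, dicts, lists and tuples.

```python
import pickle
import threading
from types import SimpleNamespace


def find_unpicklable(obj, path="obj", _seen=None):
    """Return (path, type name) pairs for the deepest objects that fail to pickle."""
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:          # guard against reference cycles
        return []
    _seen.add(id(obj))

    try:
        pickle.dumps(obj)
        return []                 # obj and everything below it is fine
    except Exception:
        pass                      # something in obj is not picklable

    # Collect the children to inspect, together with a readable path to each.
    if isinstance(obj, dict):
        children = [(f"{path}[{key!r}]", value) for key, value in obj.items()]
    elif isinstance(obj, (list, tuple)):
        children = [(f"{path}[{i}]", value) for i, value in enumerate(obj)]
    elif hasattr(obj, "__dict__"):
        children = [(f"{path}.{name}", value) for name, value in vars(obj).items()]
    else:
        children = []

    culprits = []
    for child_path, child in children:
        culprits.extend(find_unpicklable(child, child_path, _seen))

    # If no child is to blame, obj itself is the object that cannot be pickled.
    type_name = f"{type(obj).__module__}.{type(obj).__qualname__}"
    return culprits or [(path, type_name)]


# Toy object with a lock buried two levels deep (hypothetical example data).
toy = SimpleNamespace(
    name="toy",
    inner=SimpleNamespace(lock=threading.RLock()),
    items=[1, 2, 3],
)
culprits = find_unpicklable(toy, "toy")
for culprit_path, culprit_type in culprits:
    print(culprit_path, culprit_type)
```

In a real debugging session you would call the helper on the object that is about to be pickled, for example from a debugger stopped in the frame identified above.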
For the PyTorch DataLoader, the Python objects that get serialized are mostly related to the dataset. See the PyTorch 1.9.1 source code for details. Note that the error is raised at line 918, w.start(). Further debugging shows that the object that fails to serialize is w, which is defined in the excerpt below. However, w itself does not introduce anything that cannot be serialized, so the most likely cause of the failure is args.
# torch/utils/data/dataloader.py, lines 904-918
w = multiprocessing_context.Process(
    target=_utils.worker._worker_loop,
    args=(self._dataset_kind, self._dataset, index_queue,
          self._worker_result_queue, self._workers_done_event,
          self._auto_collation, self._collate_fn, self._drop_last,
          self._base_seed, self._worker_init_fn, i, self._num_workers,
          self._persistent_workers))
...
w.start()
Generally speaking, the dataset is a user-defined class, so it is the most likely place for problems. Start by trying to serialize the dataset with pickle. If that fails, the object causing the failure is an attribute of the dataset, and you can serialize the attributes of the dataset one by one. Of course, the other arguments may also be at fault. To be safe, you can apply the general method above to all of them.
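As a rough sketch of this check, the snippet below first tries to pickle a dataset as a whole and then each of its attributes in turn. ToyDataset and is_picklable are hypothetical names made up for this illustration; the RLock attribute stands in for whatever unpicklable object a real dataset might hold.

```python
import pickle
import threading


class ToyDataset:
    """Hypothetical stand-in for a user-defined dataset class."""

    def __init__(self):
        self.samples = list(range(10))
        self.root = "/tmp/data"            # made-up path, plain string
        self.lock = threading.RLock()      # stand-in for the unpicklable attribute


def is_picklable(value):
    """Return True if pickle can serialize value, False otherwise."""
    try:
        pickle.dumps(value)
        return True
    except Exception:
        return False


dataset = ToyDataset()
print("dataset picklable:", is_picklable(dataset))

# Check the attributes one by one to see which of them is to blame.
bad_attrs = [name for name, value in vars(dataset).items()
             if not is_picklable(value)]
print("unpicklable attributes:", bad_attrs)
```

If an attribute reported here is itself a complex object, the same check can be repeated on its attributes.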
The problem in my case was a logger object stored in the dataset. Before Python 3.7, logger objects could not be serialized because they contain lock objects of type _thread.RLock. The simplest fix is to delete all of these loggers. Alternatively, you can make the logger a global variable, so that it is not an attribute of the dataset and is never serialized. Here is a toy example of a serialization test.
import io
import pickle
import sys
import threading

print(sys.version_info)
lock = threading.RLock()  # same type as in the multiprocessing error
buffer = io.BytesIO()
pickle.dump(lock, buffer)  # raises TypeError
The expected output is as follows. The error matches the one raised by multiprocessing; only the wording differs slightly between Python versions.
sys.version_info(major=3, minor=8, micro=9, releaselevel='final', serial=0)
Traceback (most recent call last):
  File "debug.py", line 9, in <module>
    pickle.dump(lock, buffer)
TypeError: cannot pickle '_thread.RLock' object
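As for the fix mentioned earlier, making the logger a global variable, a minimal sketch could look like the following. ToyDataset is again a hypothetical class made up for illustration, not the dataset from my project. The point is that the logger lives in the module rather than in the instance, so pickle never sees it when it serializes the dataset.

```python
import logging
import pickle

# Module-level logger: it is not stored in any instance's __dict__,
# so pickling a dataset instance never touches it.
logger = logging.getLogger(__name__)


class ToyDataset:
    """Hypothetical dataset that logs through the module-level logger."""

    def __init__(self, samples):
        self.samples = samples             # only plain data on the instance

    def __getitem__(self, index):
        logger.debug("loading sample %d", index)
        return self.samples[index]

    def __len__(self):
        return len(self.samples)


dataset = ToyDataset([10, 20, 30])
print(sorted(vars(dataset)))               # the logger is not part of the state

# Round trip through pickle, as multiprocessing would do for a worker process.
restored = pickle.loads(pickle.dumps(dataset))
print(len(restored), restored[1])
```

Because the instance state holds only plain data, this should behave the same way on Python versions older than 3.7, where a logger stored on the instance would fail to pickle.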