
Data Loader does not work with Hdf5 file, when num_worker >1 #11929

@yunyundong

Description


Activity

soumith (Member) commented on Sep 21, 2018

closing as duplicate of #11887 and #11928

h5py doesn't allow reading from multiple processes:
https://github.com/h5py/h5py/blob/master/examples/multiprocessing_example.py#L17-L21

yunyundong (Author) commented on Sep 21, 2018

I do not think so. We have found a solution: https://gist.github.com/bkj/f448025fdef08c0609029489fa26ea2a#file-h5py-error-py
If we use it like this, is it right? @soumith

yunyundong (Author) commented on Sep 21, 2018

I encountered the very same issue. After spending a day trying to combine PyTorch's parallel DataLoader with HDF5 via h5py, I discovered that it is crucial to open the h5py.File inside each new process, rather than opening it in the main process and hoping it gets inherited by the underlying multiprocessing implementation.

Since PyTorch seems to initialize workers lazily, this means that the actual file opening has to happen inside the __getitem__ function of the Dataset wrapper. See https://stackoverflow.com/questions/46045512/h5py-hdf5-database-randomly-returning-nans-and-near-very-small-data-with-multi/52438133#52438133
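As an illustration of the pattern described above, here is a minimal stand-alone sketch. It uses a plain binary file as a stand-in for the HDF5 file so that h5py is not required to run it; the class name and file contents are made up for illustration:

```python
import tempfile

class LazyFileDataset:
    """Opens its backing file on first __getitem__, not in __init__."""

    def __init__(self, path):
        self.path = path  # just a string: cheap to copy into worker processes

    def __getitem__(self, i):
        if not hasattr(self, "_fh"):
            # First access in this process: open a process-local handle.
            self._fh = open(self.path, "rb")
        self._fh.seek(i)
        return self._fh.read(1)

# Write a small stand-in data file and read from it lazily.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"abcdef")
    path = f.name

ds = LazyFileDataset(path)
assert not hasattr(ds, "_fh")  # nothing opened yet
assert ds[2] == b"c"           # handle created on first access
assert hasattr(ds, "_fh")
```

Because __init__ stores only the path, each worker process that calls __getitem__ ends up with its own independent handle, which is exactly what h5py requires.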

This is the answer to the problem. I modified the code, and it works well. Can you explain more about it? Thank you in advance. @soumith

rs9899 commented on Apr 9, 2019

I do not think so. We have found a solution: https://gist.github.com/bkj/f448025fdef08c0609029489fa26ea2a#file-h5py-error-py

Can you please update the link?
This link is not working, and I am in need of the same for my project.

Thanks

rs9899 commented on Apr 9, 2019

Problem solved
https://gist.github.com/bkj/f448025fdef08c0609029489fa26ea2a
It seemed like a minor link issue.

alexisdrakopoulos commented on Jun 10, 2020

Problem solved
https://gist.github.com/bkj/f448025fdef08c0609029489fa26ea2a
It seemed like a minor link issue.

This does not seem to be working for me at least.

rs9899 commented on Jun 10, 2020

Can you elaborate more on the issue?

airsplay commented on Jun 25, 2020

Solution

This issue can be solved, and the solution is simple:

  1. Do not open the hdf5 file inside __init__.
  2. Open the hdf5 file at the first data iteration.

Here is an illustration:

import h5py
import torch

class LXRTDataLoader(torch.utils.data.Dataset):
    def __init__(self):
        """Do not open the hdf5 file here!"""

    def open_hdf5(self):
        self.img_hdf5 = h5py.File('img.hdf5', 'r')
        self.dataset = self.img_hdf5['dataset']  # if you want the dataset object

    def __getitem__(self, item: int):
        if not hasattr(self, 'img_hdf5'):
            self.open_hdf5()  # runs once per worker process
        img0 = self.img_hdf5['dataset'][0]  # do the loading here
        img1 = self.dataset[1]
        return img0, img1

Then a dataloader with num_workers > 1 can be used as normal:

train_loader = torch.utils.data.DataLoader(
    dataset=train_dset,
    batch_size=32,
    num_workers=4,
)

Explanation
The multi-processing actually happens when you create the data iterator (i.e., when entering for datum in dataloader:):

for i in range(self._num_workers):
    index_queue = multiprocessing_context.Queue()
    # index_queue.cancel_join_thread()
    w = multiprocessing_context.Process(
        target=_utils.worker._worker_loop,
        args=(self._dataset_kind, self._dataset, index_queue,
              self._worker_result_queue, self._workers_done_event,
              self._auto_collation, self._collate_fn, self._drop_last,
              self._base_seed + i, self._worker_init_fn, i, self._num_workers))

In short, it creates multiple processes that "copy" the state of the current process. Thus, if we open the hdf5 file at the first data iteration, each subprocess gets its own dedicated file object.

If you instead open an hdf5 file in __init__ and set num_workers > 0, it might cause two issues:

  1. Writing behavior is non-deterministic. (We do not need to write to hdf5, so this issue can be ignored.)
  2. The state of the hdf5 file object is copied, which might not faithfully reflect its current state.

Opening the file lazily, as above, bypasses these two issues.

kfeeeeee commented on Jul 6, 2020

Solution

This issue can be solved, and the solution is simple:

This works very well. I am just wondering if there is any way to call a destructor if the worker exited, e.g., closing the hdf5 file properly. Do you know how to do that?

airsplay commented on Jul 6, 2020

This works very well. I am just wondering if there is any way to call a destructor if the worker exited, e.g., closing the hdf5 file properly. Do you know how to do that?

Great to know that it works! If you want to explicitly close the hdf5 file, you could add a __del__ method to the dataset:

def __del__(self):
    if hasattr(self, 'img_hdf5'):
        self.img_hdf5.close()

However, this destructor does not need to be defined explicitly. Since the worker sub-processes are shut down when the iterator ends:

def __del__(self):
    self._shutdown_workers()

the Python interpreter and the OS will correctly close the hdf5 files (i.e., free the resources) that were opened inside the sub-processes. If a process exits normally, Python closes the hdf5 file when the process ends; otherwise (i.e., the process crashes), the OS reclaims the resources, so no side effects remain.

MustafaMustafa commented on Jul 9, 2020

@airsplay
Yes, your solution works. Thank you!

RSKothari commented on Jul 15, 2020

@airsplay Love it. The only downside is that __len__ needs to be defined in advance. How do you propose handling that? Actually, I managed to figure that out with a slightly hacky H5 read-and-close operation within __init__. This solution is smart. Love it and bookmarked.

airsplay commented on Jul 15, 2020

@airsplay Love it. The only downside is that __len__ needs to be defined in advance. How do you propose handling that? Actually, I managed to figure that out with a slightly hacky H5 read-and-close operation within __init__. This solution is smart. Love it and bookmarked.

A good point! A previous solution posted by others (I could not find the original link) mentions the with statement, which might be appropriate here:

def __init__(self):
    with h5py.File("X.hdf5", 'r') as f:
        self.length = len(f['dataset'])

def __len__(self):
    return self.length
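Putting the two ideas together (cache the length in __init__ with a short-lived handle, open lazily in __getitem__), here is a minimal sketch. It uses a plain file as a stand-in for the HDF5 file, so it runs without h5py; names and contents are illustrative only:

```python
import tempfile

class SizedLazyDataset:
    def __init__(self, path):
        self.path = path
        # Open-read-close: no handle outlives __init__, so the object
        # can still be safely copied into worker processes.
        with open(path, "rb") as f:
            self.length = len(f.read())

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not hasattr(self, "_fh"):
            self._fh = open(self.path, "rb")  # opened once per process
        self._fh.seek(i)
        return self._fh.read(1)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name

ds = SizedLazyDataset(path)
assert len(ds) == 5
assert ds[1] == b"e"
```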

pengzhi1998 commented on Jun 25, 2024

Hi all! Thank you for your help!! However, since this was 6 years ago, the solution is not working on my side: the problem of TypeError: h5py objects cannot be pickled still exists even after I open the hdf5 file only in the __getitem__ method. May I have your suggestions on that?
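One possible cause (an assumption, since the full traceback is not shown above): with the spawn start method, the default on Windows and recent macOS, the Dataset object itself is pickled and sent to each worker, so any handle that is already open at that point fails with exactly this TypeError. One workaround is to drop the handle in __getstate__ so only the path travels to the workers. A minimal sketch, with a lambda standing in for the unpicklable h5py object so h5py is not required to run it:

```python
import pickle

class H5PathDataset:
    """Holds only a file path across pickling; the (unpicklable) handle
    is dropped in __getstate__ and re-opened lazily in each worker."""

    def __init__(self, path):
        self.path = path
        self.img_hdf5 = None  # stand-in slot for an open h5py.File

    def __getstate__(self):
        state = self.__dict__.copy()
        state["img_hdf5"] = None  # never ship an open handle to workers
        return state

ds = H5PathDataset("img.hdf5")
ds.img_hdf5 = lambda: None  # lambdas cannot be pickled, like h5py objects
clone = pickle.loads(pickle.dumps(ds))
assert clone.img_hdf5 is None
assert clone.path == "img.hdf5"
```

Without __getstate__, pickle.dumps(ds) would raise on the open handle; with it, each worker receives a clean object and opens its own file on first __getitem__.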

