
Data Loader does not work with Hdf5 file, when num_worker >1 #11929

@yunyundong

Description


Activity

soumith (Member) commented on Sep 21, 2018

closing as duplicate of #11887 and #11928

h5py doesn't allow reading from multiple processes:
https://github.com/h5py/h5py/blob/master/examples/multiprocessing_example.py#L17-L21

yunyundong (Author) commented on Sep 21, 2018

I do not think so. We have found a solution: https://gist.github.com/bkj/f448025fdef08c0609029489fa26ea2a#file-h5py-error-py
If we use it like this, is it right? @soumith

yunyundong (Author) commented on Sep 21, 2018

I encountered the very same issue. After spending a day trying to combine PyTorch's parallel DataLoader with HDF5 via h5py, I discovered that it is crucial to open the h5py.File inside each new process, rather than opening it in the main process and hoping it gets inherited by the underlying multiprocessing implementation.

Since PyTorch seems to initialize workers lazily, this means that the actual file opening has to happen inside the __getitem__ function of the Dataset wrapper. See https://stackoverflow.com/questions/46045512/h5py-hdf5-database-randomly-returning-nans-and-near-very-small-data-with-multi/52438133#52438133
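As an illustration of the pattern described above, here is a minimal stand-alone sketch. It uses a plain binary file as a stand-in for the HDF5 file so that h5py is not required to run it; the class name and file contents are made up for illustration:

```python
import tempfile

class LazyFileDataset:
    """Opens its backing file on first __getitem__, not in __init__."""

    def __init__(self, path):
        self.path = path  # just a string: cheap to copy into worker processes

    def __getitem__(self, i):
        if not hasattr(self, "_fh"):
            # First access in this process: open a process-local handle.
            self._fh = open(self.path, "rb")
        self._fh.seek(i)
        return self._fh.read(1)

# Write a small stand-in data file and read from it lazily.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"abcdef")
    path = f.name

ds = LazyFileDataset(path)
assert not hasattr(ds, "_fh")  # nothing opened yet
assert ds[2] == b"c"           # handle created on first access
assert hasattr(ds, "_fh")
```

Because __init__ stores only the path, each worker process that calls __getitem__ ends up with its own independent handle, which is exactly what h5py requires.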

This is the answer to the problem. I modified the code, and it works well. Can you explain more about it? Thank you in advance. @soumith

rs9899 commented on Apr 9, 2019

I do not think so. We have found a solution: https://gist.github.com/bkj/f448025fdef08c0609029489fa26ea2a#file-h5py-error-py

Can you please update the link?
This link is not working, and I am in need of the same for my project.

Thanks

rs9899 commented on Apr 9, 2019

Problem solved
https://gist.github.com/bkj/f448025fdef08c0609029489fa26ea2a
It seemed like a minor link issue.

alexisdrakopoulos commented on Jun 10, 2020

Problem solved
https://gist.github.com/bkj/f448025fdef08c0609029489fa26ea2a
It seemed like a minor link issue.

This does not seem to be working for me at least.

rs9899 commented on Jun 10, 2020

Can you elaborate more on the issue?

airsplay commented on Jun 25, 2020

Solution

This issue can be solved, and the solution is simple:

  1. Do not open the hdf5 file inside __init__.
  2. Open the hdf5 file at the first data iteration.

Here is an illustration:

import h5py
import torch

class LXRTDataLoader(torch.utils.data.Dataset):
    def __init__(self):
        """Do not open the hdf5 file here!"""

    def open_hdf5(self):
        self.img_hdf5 = h5py.File('img.hdf5', 'r')
        self.dataset = self.img_hdf5['dataset']  # if you want the dataset object

    def __getitem__(self, item: int):
        if not hasattr(self, 'img_hdf5'):
            self.open_hdf5()  # runs once per worker process
        img0 = self.img_hdf5['dataset'][0]  # do the loading here
        img1 = self.dataset[1]
        return img0, img1

Then a dataloader with num_workers > 1 can be used as normal:

train_loader = torch.utils.data.DataLoader(
    dataset=train_dset,
    batch_size=32,
    num_workers=4,
)

Explanation
The multi-processing actually happens when you create the data iterator (i.e., when entering for datum in dataloader:):

for i in range(self._num_workers):
    index_queue = multiprocessing_context.Queue()
    # index_queue.cancel_join_thread()
    w = multiprocessing_context.Process(
        target=_utils.worker._worker_loop,
        args=(self._dataset_kind, self._dataset, index_queue,
              self._worker_result_queue, self._workers_done_event,
              self._auto_collation, self._collate_fn, self._drop_last,
              self._base_seed + i, self._worker_init_fn, i, self._num_workers))

In short, it creates multiple processes that "copy" the state of the current process. Thus, if we open the hdf5 file at the first data iteration, each subprocess gets its own dedicated file object.

If you instead open an hdf5 file in __init__ and set num_workers > 0, it might cause two issues:

  1. Writing behavior is non-deterministic. (We do not need to write to hdf5, so this issue can be ignored.)
  2. The state of the hdf5 file object is copied, which might not faithfully reflect its current state.

Opening the file lazily, as above, bypasses these two issues.

kfeeeeee commented on Jul 6, 2020

Solution

This issue can be solved, and the solution is simple:

This works very well. I am just wondering if there is any way to call a destructor if the worker exited, e.g., closing the hdf5 file properly. Do you know how to do that?

airsplay commented on Jul 6, 2020

This works very well. I am just wondering if there is any way to call a destructor if the worker exited, e.g., closing the hdf5 file properly. Do you know how to do that?

Great to know that it works! If you want to explicitly close the hdf5 file, you could add a __del__ method to the dataset:

def __del__(self):
    if hasattr(self, 'img_hdf5'):
        self.img_hdf5.close()

However, this destructor does not need to be defined explicitly. Since the worker sub-processes are shut down when the iterator ends:

def __del__(self):
    self._shutdown_workers()

the Python interpreter and the OS will correctly close the hdf5 files (i.e., free the resources) that were opened inside the sub-processes. If a process exits normally, Python closes the hdf5 file when the process ends; otherwise (i.e., the process crashes), the OS reclaims the resources, so no side effects remain.

MustafaMustafa commented on Jul 9, 2020

@airsplay
Yes, your solution works. Thank you!

RSKothari commented on Jul 15, 2020

@airsplay Love it. The only downside is that __len__ needs to be defined in advance. How do you propose handling that? Actually, I managed to figure that out with a slightly hacky H5 read-and-close operation within __init__. This solution is smart. Love it and bookmarked.

airsplay commented on Jul 15, 2020

@airsplay Love it. The only downside is that __len__ needs to be defined in advance. How do you propose handling that? Actually, I managed to figure that out with a slightly hacky H5 read-and-close operation within __init__. This solution is smart. Love it and bookmarked.

A good point! A previous solution posted by others (I could not find the original link) mentions the with statement, which might be appropriate here:

def __init__(self):
    with h5py.File("X.hdf5", 'r') as f:
        self.length = len(f['dataset'])

def __len__(self):
    return self.length
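Putting the two ideas together (cache the length in __init__ with a short-lived handle, open lazily in __getitem__), here is a minimal sketch. It uses a plain file as a stand-in for the HDF5 file, so it runs without h5py; names and contents are illustrative only:

```python
import tempfile

class SizedLazyDataset:
    def __init__(self, path):
        self.path = path
        # Open-read-close: no handle outlives __init__, so the object
        # can still be safely copied into worker processes.
        with open(path, "rb") as f:
            self.length = len(f.read())

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not hasattr(self, "_fh"):
            self._fh = open(self.path, "rb")  # opened once per process
        self._fh.seek(i)
        return self._fh.read(1)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name

ds = SizedLazyDataset(path)
assert len(ds) == 5
assert ds[1] == b"e"
```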

pengzhi1998 commented on Jun 25, 2024

Hi all! Thank you for your help!! However, since this was 6 years ago, the solution is not working on my side: the problem of TypeError: h5py objects cannot be pickled still exists even after I open the hdf5 file only in the __getitem__ method. May I have your suggestions on that?
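One possible cause (an assumption, since the full traceback is not shown above): with the spawn start method, the default on Windows and recent macOS, the Dataset object itself is pickled and sent to each worker, so any handle that is already open at that point fails with exactly this TypeError. One workaround is to drop the handle in __getstate__ so only the path travels to the workers. A minimal sketch, with a lambda standing in for the unpicklable h5py object so h5py is not required to run it:

```python
import pickle

class H5PathDataset:
    """Holds only a file path across pickling; the (unpicklable) handle
    is dropped in __getstate__ and re-opened lazily in each worker."""

    def __init__(self, path):
        self.path = path
        self.img_hdf5 = None  # stand-in slot for an open h5py.File

    def __getstate__(self):
        state = self.__dict__.copy()
        state["img_hdf5"] = None  # never ship an open handle to workers
        return state

ds = H5PathDataset("img.hdf5")
ds.img_hdf5 = lambda: None  # lambdas cannot be pickled, like h5py objects
clone = pickle.loads(pickle.dumps(ds))
assert clone.img_hdf5 is None
assert clone.path == "img.hdf5"
```

Without __getstate__, pickle.dumps(ds) would raise on the open handle; with it, each worker receives a clean object and opens its own file on first __getitem__.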

