data module
- class data.Data(loader, config, args)
The main data class for training. Sets up the HDF5 torch Dataset classes for parallel or non-parallel training.
- Parameters
- loader (class): The class that loads the data from disk. See loader module for more information.
- config (class): The class that holds the YAML configuration. See config module for more information.
- args (argparse object): Argparse object that holds the command line arguments.
Methods
Get the dataloader for testing (or validation) data.
Get the dataloader for training data.
- class data.HDF5Dataset(filename, x_label, y_label, rank, use_hist=False)
HDF5 Dataset class which wraps the torch.utils.data.Dataset class.
- Parameters
- filename (string): HDF5 filename.
- x_label (string): Dataset label for the input data.
- y_label (string): Dataset label for the output data.
- rank (int): Rank of the process that is creating this object.
- use_hist (bool): Generate a histogram and use Metropolis sampling to select training examples. This is experimental.
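The histogram-plus-Metropolis idea behind use_hist can be illustrated with a minimal sketch. How the class actually builds its histogram and what it samples over is not documented here, so this example assumes each training example has already been assigned a histogram bin, and that the goal is to visit every bin about equally often. The function name `metropolis_select` and its arguments are hypothetical.

```python
import random
from collections import Counter


def metropolis_select(bin_ids, n_draws, seed=0):
    """Pick example indices so each histogram bin is visited about equally often.

    Hypothetical helper: bin_ids[i] is the histogram bin of example i.
    """
    rng = random.Random(seed)
    counts = Counter(bin_ids)  # histogram: number of examples per bin
    n = len(bin_ids)
    current = rng.randrange(n)
    selected = []
    for _ in range(n_draws):
        proposal = rng.randrange(n)  # symmetric (uniform) proposal
        # Target weight of an example is 1 / (size of its bin), so the
        # Metropolis acceptance ratio is count[current bin] / count[proposal bin].
        ratio = counts[bin_ids[current]] / counts[bin_ids[proposal]]
        if rng.random() < min(1.0, ratio):
            current = proposal
        selected.append(current)
    return selected
```

With 900 examples in one bin and 100 in another, roughly half of the selected indices come from each bin, so rare examples are seen more often than under uniform sampling.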
Methods
Check the dataset size and, if it is larger than 32 GB, read from disk; otherwise load it into memory.
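The size check described above can be sketched as a small pure function. This is an illustration only, not the class's own code: the names `dataset_nbytes` and `should_load_into_memory` are hypothetical, and "32 GB" is assumed to mean 32 * 1024**3 bytes.

```python
import math

# Assumption: "32 GB" is interpreted as 32 * 1024**3 bytes.
MEMORY_LIMIT_BYTES = 32 * 1024**3


def dataset_nbytes(shape, itemsize):
    """Size in bytes of an array with the given shape and per-element size."""
    return math.prod(shape) * itemsize


def should_load_into_memory(shapes_and_itemsizes, limit=MEMORY_LIMIT_BYTES):
    """Return True if all datasets together are no larger than the limit."""
    total = sum(
        dataset_nbytes(shape, itemsize)
        for shape, itemsize in shapes_and_itemsizes
    )
    return total <= limit
```

When the file is opened with h5py, the shape and `dtype.itemsize` of each dataset could supply these inputs without reading the data itself.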
- class data.TwinHDF5Dataset(filename, x_label, y_label, n_samples, rank, use_hist=False)
Methods
checkDataSize
- data.dataSplit(fname, test_pct, hash_on_key)
Split an HDF5 dataset into a training file and a testing file.
- Parameters
- fname (string): Name of file to be written in the train and test directory.
- test_pct (float): Percentage of the data to assign to the test set.
- hash_on_key (string): Select which dataset to perform a hash on. This hash ensures that no element in the test set also appears in the training set.
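A hash-based split of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of dataSplit: the helper names are hypothetical, SHA-256 is an arbitrary choice of hash, and test_pct is assumed to be on a 0 to 100 scale. Because the assignment depends only on an element's bytes, duplicate elements always land in the same split.

```python
import hashlib


def assign_split(key_bytes, test_pct):
    """Return "test" or "train" for one element, based only on its content."""
    # Identical elements always produce the same digest.
    digest = hashlib.sha256(key_bytes).digest()
    # Map the first 8 bytes of the digest to a bucket in [0, 100).
    bucket = (int.from_bytes(digest[:8], "big") % 10_000) / 100.0
    return "test" if bucket < test_pct else "train"


def split_by_hash(elements, test_pct):
    """Partition an iterable of byte strings into (train, test) lists."""
    train, test = [], []
    for element in elements:
        target = test if assign_split(element, test_pct) == "test" else train
        target.append(element)
    return train, test
```

The resulting test fraction is only approximately test_pct, since it depends on how the hashes happen to fall, but the two sets cannot share an element.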