data module

class data.Data(loader, config, args)

The main data class for training. Sets up the HDF5 torch Dataset classes for parallel or non-parallel training.

Parameters

loader (class): The class that loads the data from disk. See loader module for more information.
config (class): The class that holds the YAML configuration. See config module for more information.
args (argparse object): Argparse object that holds the command line arguments.

Methods

`getTestingData`()	Get the dataloader for testing (or validation) data.
`getTrainingData`()	Get the dataloader for training data.

getTestingData()

Get the dataloader for testing (or validation) data. If the world size is greater than 1, a testing sampler is used.

Returns

getTrainingData()

Get the dataloader for training data. If the world size is greater than 1, a training sampler is used.

Returns
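Both getters above switch to a sampler when the world size is greater than 1. The source does not say which sampler is used, so the following is only a minimal sketch of the usual idea, assuming strided sharding similar to what torch.utils.data.distributed.DistributedSampler does without shuffling. The function name `shard_indices` is hypothetical and is not part of this module.

```python
import math


def shard_indices(n_samples, world_size, rank):
    """Return the sample indices a single rank would iterate over.

    Hypothetical helper for illustration only; not part of the data module.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    if world_size == 1:
        # Single process: no sampler is needed, every sample is visited.
        return list(range(n_samples))
    # Pad by wrapping around so every rank receives the same number of samples.
    per_rank = math.ceil(n_samples / world_size)
    total = per_rank * world_size
    indices = [i % n_samples for i in range(total)]
    # Strided assignment: rank r takes positions r, r + world_size, ...
    return indices[rank::world_size]
```

With 10 samples and 4 processes, each rank receives 3 indices, and two samples are repeated so that all ranks run the same number of steps.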

class data.HDF5Dataset(filename, x_label, y_label, rank, use_hist=False)

HDF5 Dataset class which wraps the torch.utils.data.Dataset class.

Parameters

filename (string): HDF5 filename.
x_label (string): Dataset label for the input data.
y_label (string): Dataset label for the output data.
rank (int): Rank of the process that is creating this object.
use_hist (bool): Generate a histogram and use Metropolis sampling to select training examples. This is experimental.

Methods

`checkDataSize`()	Check the dataset size; if it is larger than 32 GB, read from disk, otherwise load it into memory.

checkDataSize(): Check the dataset size; if it is larger than 32 GB, read from disk, otherwise load it into memory.
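The size check can be illustrated with a short sketch. This is not the module's implementation: the names `dataset_nbytes` and `should_read_from_disk` are hypothetical, and "32 GB" is assumed here to mean 32 GiB. The inputs are (shape, itemsize) pairs, which for an h5py dataset would come from its `shape` and `dtype.itemsize` attributes.

```python
import math

# Assumption: the documented "32 GB" threshold is interpreted as 32 GiB.
MEMORY_LIMIT_BYTES = 32 * 1024**3


def dataset_nbytes(shape, itemsize):
    """Uncompressed size in bytes of an array with this shape and element size."""
    return math.prod(shape) * itemsize


def should_read_from_disk(datasets, limit=MEMORY_LIMIT_BYTES):
    """Return True if the combined datasets exceed the in-memory limit.

    `datasets` is an iterable of (shape, itemsize) pairs, for example one
    pair for the input data and one for the output data.
    """
    total = sum(dataset_nbytes(shape, itemsize) for shape, itemsize in datasets)
    return total > limit
```

For example, 1000 float32 images of shape (3, 64, 64) take about 49 MB and would be loaded into memory, while 10 million float32 rows of width 1024 take about 41 GB and would be read from disk.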

class data.TwinHDF5Dataset(filename, x_label, y_label, n_samples, rank, use_hist=False)

Methods

checkDataSize

data.dataSplit(fname, test_pct, hash_on_key)

Split an HDF5 dataset into a training file and a testing file.

Parameters

fname (string): Name of the file to be written in the train and test directories.
test_pct (float): Percentage of the data to assign to the test set.
hash_on_key (string): Select which dataset to perform a hash on. This hash allows us to know
for sure that no element in the test set is in the training set.
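The hashing idea behind hash_on_key can be sketched as follows. This is only an illustration under stated assumptions, not the code used by dataSplit: the helper names are hypothetical, SHA-256 is an arbitrary choice of hash, each element of the chosen dataset is assumed to be available as bytes, and test_pct is assumed to be a value between 0 and 100. Because the split depends only on the element's content, identical elements always land on the same side.

```python
import hashlib


def in_test_set(key_bytes, test_pct):
    """Decide deterministically whether an element belongs to the test set.

    Hypothetical helper; test_pct is assumed to be in the range 0 to 100.
    """
    digest = hashlib.sha256(key_bytes).digest()
    # Map the hash to an integer bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < test_pct * 100


def split_indices(keys, test_pct):
    """Split element indices into (train, test) lists based on each key's hash."""
    train, test = [], []
    for i, key in enumerate(keys):
        (test if in_test_set(key, test_pct) else train).append(i)
    return train, test
```

Since the hash only approximates a uniform draw, the realized test fraction is close to test_pct for large datasets but will not match it exactly.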