egrecho.core.data_builder#

class egrecho.core.data_builder.DataBuilderConfig(data_dir=None, file_patterns=<factory>)[source]#

Bases: DataclassConfig

Base class for DataBuilder configuration.

Parameters:
  • data_dir (Optional[Union[str, Path]]) -- Path (e.g., "./data") to the directory containing data files.

  • file_patterns (Optional[Union[str, List[str], Dict[str, str]]]) -- String(s) of source data file(s); supports pattern matching, e.g., "egs.train.csv" or "egs.*.csv". Moreover, an absolute path pattern (e.g., "/export_path/egs.train.csv") invalidates data_dir and searches for files at that absolute path.

class egrecho.core.data_builder.DataBuilder(config)[source]#

Bases: GenericFileMixin

Base builder class for building datasets.

The subclass should define a class attribute CONFIG_CLS that extends the arguments, and the configuration class name should carry an additional "Config" suffix. Subclasses must implement the necessary dataset setup methods:

  • train_dataset()

  • val_dataset()

  • test_dataset()

Its instance stores the config instance, the data filenames of splits, infos, etc. The build_dataset() function returns either a single split dataset or a dict of split datasets according to the data files dict.
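
A minimal subclass sketch, assuming the config class behaves as a dataclass (the names MyBuilder, MyBuilderConfig, and the extra shuffle field are purely illustrative, not part of the library):

    from dataclasses import dataclass

    from egrecho.core.data_builder import DataBuilder, DataBuilderConfig

    @dataclass
    class MyBuilderConfig(DataBuilderConfig):
        # Hypothetical extra argument; not part of the base config.
        shuffle: bool = True

    class MyBuilder(DataBuilder):
        # Links the builder to its configuration class; note the
        # "Config" suffix naming convention.
        CONFIG_CLS = MyBuilderConfig

        def train_dataset(self):
            # Build and return the train datapipe from self.data_files.
            ...

        def val_dataset(self):
            ...

        def test_dataset(self):
            ...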

Note

If dataset building adds overhead to your procedure, you can use the wrapper functools.lru_cache in def _get_data_files(self): style.

Keep in mind that the cached result won't change if you modify the related data files on this instance. Use from_config(...) to get a new instance.
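
A sketch of that caching style, assuming _get_data_files() is the file-resolution hook the note refers to and that the base class provides it:

    import functools

    class CachedFilesBuilder(MyBuilder):
        @functools.lru_cache()
        def _get_data_files(self):
            # Resolve the (possibly expensive) split -> files mapping once;
            # repeated calls on the same instance reuse the cached result.
            # Modifying the underlying files later will NOT refresh the
            # cache -- create a fresh builder via from_config() instead.
            return super()._get_data_files()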

classmethod from_config(config=None, data_dir=None, file_patterns=None, **kwargs)[source]#

Creates a new DataBuilder instance by providing a configuration in the form of a dictionary or an instance of DataBuilderConfig. All parameters after config will overwrite it.

Parameters:
  • config (Optional[Union[dict, DataBuilderConfig]]) -- A dict or an instance of DataBuilderConfig.

  • data_dir (Optional[str]) -- Path (e.g., "./data") to the directory containing data files.

  • file_patterns (Optional[Union[str, List[str], Dict]]) -- String(s) of source data file(s); supports pattern matching, e.g., "egs.train.csv" or "egs.*.csv". Moreover, an absolute path pattern (e.g., "/export_path/egs.train.csv") invalidates data_dir and searches for files at that absolute path.

  • **kwargs (additional keyword arguments) -- Arguments to override config.

Returns:

The new DataBuilder instance.

Return type:

DataBuilder
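
For example (MyBuilder is the hypothetical subclass sketched above; the split-keyed file_patterns dict follows the documented Dict form):

    from egrecho.core.data_builder import DataBuilderConfig

    # From a plain dict:
    builder = MyBuilder.from_config(
        {"data_dir": "./data", "file_patterns": "egs.*.csv"}
    )

    # From a config instance, with a keyword override applied on top:
    config = DataBuilderConfig(data_dir="./data", file_patterns="egs.train.csv")
    builder = MyBuilder.from_config(
        config,
        file_patterns={"train": "egs.train.csv", "val": "egs.val.csv"},
    )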

save_config(path)[source]#

Save the configuration to a file.

Parameters:

path (Union[Path, str]) -- The path of the output file.

dump_config()[source]#

Dump the configuration to a dict.

Return type:

Dict[str, Any]
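
A usage sketch (the output filename and its format are illustrative; the method only documents that it takes a path):

    cfg_dict = builder.dump_config()  # plain Dict[str, Any]
    builder.save_config("exp/data_builder_config.yaml")  # hypothetical path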

build_dataset(split=None)[source]#

Build dataset.

Parameters:

split (Optional[Union[str, Split]]) -- If None, returns all splits in a dict; otherwise returns the specified split.

Returns:

The constructed datapipe(s).

Return type:

Union[IterableDataset, Dict[str, IterableDataset]]
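
For instance, with the builder above (the split names are illustrative and depend on the resolved data files):

    # All resolved splits as a dict, e.g. {"train": ..., "val": ...}:
    datasets = builder.build_dataset()

    # A single split:
    train_ds = builder.build_dataset("train")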

build_single_dataset(split=Split.TRAIN)[source]#

Function to build a single split datapipe.

Parameters:

split (Optional[Union[str, Split]]) -- The split name.

Returns:

The constructed datapipe.

Return type:

IterableDataset
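
Equivalently, for one split (Split.TRAIN mirrors the documented default; the import path for Split is an assumption):

    from egrecho.utils.types import Split  # hypothetical import path

    train_ds = builder.build_single_dataset(Split.TRAIN)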

property data_files: DataFilesDict#

Property method for returning data files information.

Returns:

The data files, which can be looked up by split key.

Return type:

DataFilesDict

estimate_length()[source]#

Estimate the dataset length.

property num_classes: int | None#

Property that returns the number of classes if it is a multiclass task.

property class_label: ClassLabel#

Property that returns the labels. Should be implemented in the derived class if needed.

property feature_extractor#

Property that returns the feature extractor. Should be implemented in the derived class if needed.

property feature_size: int#

Property that returns the feature dimension.

property inputs_dim#

Property that returns the inputs_dim for downstream model. Should be implemented in the derived class if needed.