egrecho.core.data_builder#

class egrecho.core.data_builder.DataBuilderConfig(data_dir=None, file_patterns=<factory>)[source]#

Bases: DataclassConfig

Base class for DataBuilder configuration.

Parameters:
  • data_dir (Optional[Union[str, Path]]) -- Path (e.g., "./data") to the directory containing data files.

  • file_patterns (Optional[Union[str, List[str], Dict[str, str]]]) -- String(s) of source data file(s); supports pattern matching, e.g., "egs.train.csv" or "egs.*.csv". Moreover, an absolute path pattern (e.g., "/export_path/egs.train.csv") invalidates data_dir and searches for files at that absolute path.

class egrecho.core.data_builder.DataBuilder(config)[source]#

Bases: GenericFileMixin

Base builder class for building datasets.

The subclass should define a class attribute CONFIG_CLS that extends the arguments, and the configuration class name should carry an additional "Config" suffix. Subclasses must implement the necessary dataset setup methods:

  • train_dataset()

  • val_dataset()

  • test_dataset()

Its instance stores the config instance, the data filenames of splits, infos, etc. The build_dataset() function returns either a single split dataset or a dict of split datasets according to the data files dict.
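
A minimal subclass sketch, assuming the config class behaves as a dataclass (the names MyBuilder, MyBuilderConfig, and the extra shuffle field are purely illustrative, not part of the library):

    from dataclasses import dataclass

    from egrecho.core.data_builder import DataBuilder, DataBuilderConfig

    @dataclass
    class MyBuilderConfig(DataBuilderConfig):
        # Hypothetical extra argument; not part of the base config.
        shuffle: bool = True

    class MyBuilder(DataBuilder):
        # Links the builder to its configuration class; note the
        # "Config" suffix naming convention.
        CONFIG_CLS = MyBuilderConfig

        def train_dataset(self):
            # Build and return the train datapipe from self.data_files.
            ...

        def val_dataset(self):
            ...

        def test_dataset(self):
            ...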

Note

If dataset building adds overhead to your procedure, you can use the wrapper functools.lru_cache in def _get_data_files(self): style.

Keep in mind that the cached result won't change if you modify the related data files on this instance. Use from_config(...) to get a new instance.
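
A sketch of that caching style, assuming _get_data_files() is the file-resolution hook the note refers to and that the base class provides it:

    import functools

    class CachedFilesBuilder(MyBuilder):
        @functools.lru_cache()
        def _get_data_files(self):
            # Resolve the (possibly expensive) split -> files mapping once;
            # repeated calls on the same instance reuse the cached result.
            # Modifying the underlying files later will NOT refresh the
            # cache -- create a fresh builder via from_config() instead.
            return super()._get_data_files()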

classmethod from_config(config=None, data_dir=None, file_patterns=None, **kwargs)[source]#

Creates a new DataBuilder instance by providing a configuration in the form of a dictionary or an instance of DataBuilderConfig. All parameters after config will overwrite it.

Parameters:
  • config (Optional[Union[dict, DataBuilderConfig]]) -- A dict or an instance of DataBuilderConfig.

  • data_dir (Optional[str]) -- Path (e.g., "./data") to the directory containing data files.

  • file_patterns (Optional[Union[str, List[str], Dict]]) -- String(s) of source data file(s); supports pattern matching, e.g., "egs.train.csv" or "egs.*.csv". Moreover, an absolute path pattern (e.g., "/export_path/egs.train.csv") invalidates data_dir and searches for files at that absolute path.

  • **kwargs (additional keyword arguments) -- Arguments to override config.

Returns:

The new DataBuilder instance.

Return type:

DataBuilder
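
For example (MyBuilder is the hypothetical subclass sketched above; the split-keyed file_patterns dict follows the documented Dict form):

    from egrecho.core.data_builder import DataBuilderConfig

    # From a plain dict:
    builder = MyBuilder.from_config(
        {"data_dir": "./data", "file_patterns": "egs.*.csv"}
    )

    # From a config instance, with a keyword override applied on top:
    config = DataBuilderConfig(data_dir="./data", file_patterns="egs.train.csv")
    builder = MyBuilder.from_config(
        config,
        file_patterns={"train": "egs.train.csv", "val": "egs.val.csv"},
    )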

save_config(path)[source]#

Save the configuration to a file.

Parameters:

path (Union[Path, str]) -- The path of the output file.

dump_config()[source]#

Dump the configuration to a dict.

Return type:

Dict[str, Any]
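
A usage sketch (the output filename and its format are illustrative; the method only documents that it takes a path):

    cfg_dict = builder.dump_config()  # plain Dict[str, Any]
    builder.save_config("exp/data_builder_config.yaml")  # hypothetical path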

build_dataset(split=None)[source]#

Build dataset.

Parameters:

split (Optional[Union[str, Split]]) -- If None, returns all splits in a dict; otherwise returns the specified split.

Returns:

The constructed datapipe(s).

Return type:

Union[IterableDataset, Dict[str, IterableDataset]]
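
For instance, with the builder above (the split names are illustrative and depend on the resolved data files):

    # All resolved splits as a dict, e.g. {"train": ..., "val": ...}:
    datasets = builder.build_dataset()

    # A single split:
    train_ds = builder.build_dataset("train")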

build_single_dataset(split=Split.TRAIN)[source]#

Function to build a single split datapipe.

Parameters:

split (Optional[Union[str, Split]]) -- The split name.

Returns:

The constructed datapipe.

Return type:

IterableDataset
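
Equivalently, for one split (Split.TRAIN mirrors the documented default; the import path for Split is an assumption):

    from egrecho.utils.types import Split  # hypothetical import path

    train_ds = builder.build_single_dataset(Split.TRAIN)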

property data_files: DataFilesDict#

Property method for returning data files information.

Returns:

The data files, which can be looked up by split key.

Return type:

DataFilesDict

estimate_length()[source]#

Estimate the dataset length.

property num_classes: int | None#

Property that returns the number of classes if it is a multiclass task.

property class_label: ClassLabel#

Property that returns the labels. Should be implemented in the derived class if needed.

property feature_extractor#

Property that returns the feature extractor. Should be implemented in the derived class if needed.

property feature_size: int#

Property that returns the feature dimension.

property inputs_dim#

Property that returns the inputs_dim for downstream model. Should be implemented in the derived class if needed.