egrecho.core.data_builder#
- class egrecho.core.data_builder.DataBuilderConfig(data_dir=None, file_patterns=<factory>)[source]#
Bases: DataclassConfig
Base class for DataBuilder configuration.
- Parameters:
data_dir (Optional[Union[str, Path]]) -- Path (e.g., "./data") to the directory that contains the data files.
file_patterns (Optional[Union[str, List[str], Dict[str, str]]]) -- String(s) locating the source data file(s); pattern matching is supported, e.g., "egs.train.csv" or "egs.*.csv". Moreover, an absolute path pattern (e.g., "/export_path/egs.train.csv") overrides data_dir, and files are searched at that absolute path.
- class egrecho.core.data_builder.DataBuilder(config)[source]#
Bases: GenericFileMixin
Base builder class for building datasets.
A subclass should define a class attribute CONFIG_CLS that extends the arguments, and the configuration class name should carry an additional "Config" suffix. Subclasses must implement the dataset setup methods:
train_dataset()
val_dataset()
test_dataset()
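The subclassing convention above can be illustrated with toy stand-in classes (the real classes would derive from DataBuilder and DataBuilderConfig; the names `CsvBuilder`, `CsvBuilderConfig`, and `delimiter` here are hypothetical):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CsvBuilderConfig:  # stand-in for a DataBuilderConfig subclass
    data_dir: Optional[str] = None
    file_patterns: List[str] = field(default_factory=lambda: ["egs.*.csv"])
    delimiter: str = ","  # extended argument contributed by the subclass


class CsvBuilder:  # stand-in for a DataBuilder subclass
    # Convention from the docs: CONFIG_CLS points at the config class,
    # whose name carries the extra "Config" suffix.
    CONFIG_CLS = CsvBuilderConfig

    def __init__(self, config: CsvBuilderConfig):
        self.config = config

    # The three required setup methods; real ones would return datasets.
    def train_dataset(self):
        return f"train split from {self.config.file_patterns}"

    def val_dataset(self):
        return f"val split from {self.config.file_patterns}"

    def test_dataset(self):
        return f"test split from {self.config.file_patterns}"
```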
An instance stores the config instance, the data filenames of the splits, infos, etc. The build_dataset() function returns either a single split dataset or a dict of split datasets, according to the data files dict.
Note
To avoid the overhead of repeatedly building datasets in your procedure, you can wrap methods such as def _get_data_files(self): with functools.lru_cache. Keep in mind that the cached result won't change if you modify the related data files on this instance; use from_config(...) to get a new instance instead.
- classmethod from_config(config=None, data_dir=None, file_patterns=None, **kwargs)[source]#
Creates a new DataBuilder instance from a configuration given as a dictionary or a DataBuilderConfig instance. All parameters after config override it.
- Parameters:
config (Optional[Union[dict, DataBuilderConfig]]) -- A dict or an instance of DataBuilderConfig.
data_dir (Optional[str]) -- Path (e.g., "./data") to the directory that contains the data files.
file_patterns (Optional[Union[str, List[str], Dict]]) -- String(s) locating the source data file(s); pattern matching is supported, e.g., "egs.train.csv" or "egs.*.csv". Moreover, an absolute path pattern (e.g., "/export_path/egs.train.csv") overrides data_dir, and files are searched at that absolute path.
**kwargs (additional keyword arguments) -- Arguments to override config.
- Returns:
The new DataBuilder instance.
- Return type:
DataBuilder
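The caching note above can be demonstrated with plain functools.lru_cache: a cached method's result is bound to the instance, so mutating related state does not refresh it, while a new instance (as from_config() would produce) computes afresh. The `FileLister` class is a hypothetical stand-in, not the egrecho class:

```python
import functools


class FileLister:  # toy stand-in, not the egrecho class
    def __init__(self, files):
        self.files = list(files)

    @functools.lru_cache(maxsize=None)
    def _get_data_files(self):
        # Cached per instance: the first call's result sticks,
        # even if self.files is mutated afterwards.
        return tuple(self.files)
```

Note that decorating an instance method this way keys the cache on `self`, which also keeps the instance alive for the lifetime of the cache; that trade-off is acceptable for a builder that lives as long as the run.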
- save_config(path)[source]#
Save the configuration to a file.
- Parameters:
path (Union[Path, str]) -- The path of the output file.
- build_dataset(split=None)[source]#
Build dataset.
- Parameters:
split (Optional[Union[str, Split]]) -- If None, returns all splits in a dict; otherwise returns the specified split.
- Returns:
The constructed datapipe(s).
- Return type:
Union[IterableDataset, Dict[str, IterableDataset]]
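The split-dispatch behavior of build_dataset() can be sketched as follows: with split=None every available split is built into a dict, otherwise only the named split is built. This is an illustrative sketch, with per-split builder callables standing in for the real dataset setup methods:

```python
from typing import Callable, Dict, Optional, Union


def build_dataset(
    builders: Dict[str, Callable[[], object]],
    split: Optional[str] = None,
) -> Union[object, Dict[str, object]]:
    """Sketch of the dispatch described above (illustrative only):
    split=None -> dict of all splits; a split name -> that split."""
    if split is None:
        return {name: build() for name, build in builders.items()}
    return builders[split]()
```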
- build_single_dataset(split=Split.TRAIN)[source]#
Build a single split datapipe.
- Parameters:
split (Optional[Union[str, Split]]) -- The split name.
- Returns:
The constructed data pipe.
- Return type:
IterableDataset
- property data_files: DataFilesDict#
Property method for returning data files information.
- Returns:
The data files, keyed by split.
- Return type:
DataFilesDict
- property num_classes: int | None#
Property that returns the number of classes if it is a multiclass task.
- property class_label: ClassLabel#
Property that returns the labels. Should be implemented in the derived class if needed.
- property feature_extractor#
Property that returns the feature extractor. Should be implemented in the derived class if needed.
- property feature_size: int#
Property that returns the feature dimension.
- property inputs_dim#
Property that returns the inputs_dim for downstream model. Should be implemented in the derived class if needed.