egrecho.core.tokenizer#
- class egrecho.core.tokenizer.TruncationStrategy(value)[source]#
Bases: StrEnum
Possible values for the truncation argument in Tokenizer.__call__(). Useful for tab-completion in an IDE.
- class egrecho.core.tokenizer.TensorType(value)[source]#
Bases: StrEnum
Possible values for the return_tensors argument. Useful for tab-completion in an IDE.
- class egrecho.core.tokenizer.BaseTokenizerConfig(extradir=None)[source]#
Bases: DataclassConfig
Base class for the BaseTokenizer configuration.
The path extradir will not be serialized in the config file. When deserialized from a config file (tokenizer_config.json), it is set to the directory of the config file by default, allowing files defined by extra_files_names() to be located.
- Parameters:
extradir (Optional[Union[str, Path]]) -- Path to the directory containing vocabulary files defined by extra_files_names().
- property extra_files_names#
Defines the extra file names required by the model. Can be either:
A dictionary with values being the filenames for saving the associated files (strings).
A tuple/list of filenames.
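For illustration, a minimal sketch of a derived config declaring one vocabulary file; the class name and file name are hypothetical, and it assumes DataclassConfig subclasses behave like standard dataclasses:

from dataclasses import dataclass
from egrecho.core.tokenizer import BaseTokenizerConfig

@dataclass
class CharTokenizerConfig(BaseTokenizerConfig):  # hypothetical subclass
    @property
    def extra_files_names(self):
        # one extra file, resolved relative to ``extradir`` when deserialized
        return {"vocab_file": "vocab.txt"}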
- get_extra_files(extra_files_names=None, check_local_exist=True)[source]#
Recursively prepends the prefix directory to locate the extra files.
- classmethod fetch_from(srcdir, **kwargs)[source]#
Instantiate a BaseTokenizerConfig (or a derived class) from a directory containing config files.
- Return type:
BaseTokenizerConfig
- copy_extras(savedir, extra_files_names=None, filename_prefix=None, excludes=None, **kwargs)[source]#
Copies the extra files of the tokenizer config.
Use BaseTokenizer.save_to() to save the whole configuration (config file + extra files) of the tokenizer. By default, this method copies all files defined by extra_files_names().
- Parameters:
savedir (str) -- The directory in which to save the extra files.
extra_files_names -- If None, use the default files defined by the class property extra_files_names.
filename_prefix (str, optional) -- An optional prefix to add to the names of the saved files.
excludes (Union[str, List[str]]) -- Filenames to exclude from copying.
- Returns:
Paths to the files saved.
- Return type:
Tuple(str)
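A usage sketch with the hypothetical config above, assuming a tokenizer_config.json and vocab.txt already exist under exp/tok (paths illustrative):

>>> cfg = CharTokenizerConfig.fetch_from("exp/tok")  # extradir defaults to "exp/tok"
>>> cfg.copy_extras("exp/tok_backup")  # copies vocab.txt; returns the saved paths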
- class egrecho.core.tokenizer.BaseTokenizer(config)[source]#
Bases: ABC, GenericFileMixin
A base class that offers serialization methods for tokenizers. Derived classes should implement its encode/decode methods (text2ids(), ids2text(), etc.).
The implementation of the tokenizer methods is intended for derived classes. Its purpose is to facilitate coordination between model inputs and the frontend data processor.
Unlike egrecho.core.feature_extractor.speaker.BaseFeature, which is designed to save its main attributes as the config itself, BaseTokenizer maintains an internal config instance.
Class attributes (overridden by derived classes)
CONFIG_CLS -- The type of the associated BaseTokenizerConfig (or a derived class).
- Parameters:
config (BaseTokenizerConfig) -- configuration object.
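As an illustration only, a minimal character-level subclass; the toy vocabulary is hypothetical, and text2ids()/ids2text() are assumed to take and return plain lists:

from egrecho.core.tokenizer import BaseTokenizer

class CharTokenizer(BaseTokenizer):  # hypothetical subclass
    CONFIG_CLS = CharTokenizerConfig  # the hypothetical config above

    def __init__(self, config):
        super().__init__(config)
        # a toy in-memory vocab; a real subclass would load it from
        # the files resolved by config.get_extra_files()
        self.vocab = {"<pad>": 0, "<unk>": 1, "a": 2, "b": 3}
        self.inv_vocab = {i: t for t, i in self.vocab.items()}

    def text2ids(self, text):
        return [self.vocab.get(ch, self.vocab["<unk>"]) for ch in text]

    def ids2text(self, ids):
        return "".join(self.inv_vocab.get(i, "<unk>") for i in ids)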
- property cls#
Returns cls_id if available.
- property sep#
Returns sep_id if available.
- property pad#
Returns pad_id if available.
- property pad_token_id#
Returns pad_id if available.
- property pad_token_type_id: int#
Id of the padding token type in the vocabulary.
- Type:
int
- property eod#
Returns eod_id if available.
- property bos#
Returns bos_id if available.
- property eos#
Returns eos_id if available.
- property mask#
Returns mask_id if available.
- tokenize(line, **kwargs)[source]#
Accepts kwargs for tokenization; override it in subclasses.
- Return type:
List[str]
- property all_special_ids: List[int]#
Returns: List[int]: The ids of the special tokens ('<unk>', '<cls>', etc.) mapped to class attributes.
- property vocab_size: int#
Returns: int: Size of the base vocabulary (without the added tokens).
- property config#
References the internal config object.
- property extradir#
References the extra files directory.
- classmethod fetch_from(srcdir, **kwargs)[source]#
Instantiate a BaseTokenizer (or a derived class) from a directory that has config files.
- Return type:
BaseTokenizer
- save_to(savedir, filename_prefix=None, **kwargs)[source]#
Saves the whole configuration (config file + extra files) of the tokenizer.
- Parameters:
savedir (str) -- The directory in which to save the config and extra files.
filename_prefix (str, optional) -- An optional prefix to add to the names of the saved files.
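A hypothetical save/load roundtrip with the sketch subclass above:

>>> tokenizer = CharTokenizer(CharTokenizerConfig())
>>> tokenizer.save_to("exp/tok")  # writes tokenizer_config.json + extra files
>>> reloaded = CharTokenizer.fetch_from("exp/tok")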
- class egrecho.core.tokenizer.Tokenizer(config)[source]#
Bases: BaseTokenizer
A base class that aims to prepare model inputs via the __call__ interface, derived from BaseTokenizer, and offers some useful methods for padding/truncation. Core methods:
__call__(): Abstract method to tokenize and prepare inputs for the model; can handle single or batch inputs.
prepare_for_model() (one sample): Prepares a sequence of input ids (tokenized by text2ids()), or a pair of sequences of input ids, so that it can be used by the model. The workflow typically follows:
Pre-define settings: get the truncate/pad strategy and compute the total size of the returned encodings via num_special_tokens_to_add(), which by default builds empty input ids through build_inputs_with_special_tokens().
Truncate: truncate_sequences().
Add special tokens like eos/sos; the listed methods should be overridden in a subclass:
build_inputs_with_special_tokens(): Build model inputs from given ids.
create_token_type_ids_from_sequences(): Create the token type IDs corresponding to the sequences.
Pad: pad a sample using pad().
batch_decode()/decode(): inverse of __call__(), depends on _decode() in subclasses.
Class attributes (overridden by derived classes)
CONFIG_CLS -- The type of the associated BaseTokenizerConfig (or a derived class).
model_input_names (List[str]) -- A list of inputs expected in the forward pass of the model.
padding_side (str) -- The default value for the side on which the model should have padding applied. Should be 'right' or 'left'.
truncation_side (str) -- The default value for the side on which the model should have truncation applied. Should be 'right' or 'left'.
Note
Heavily borrowed and adapted from the tokenizer module of HuggingFace.
- Parameters:
config (BaseTokenizerConfig) -- configuration object derived from BaseTokenizerConfig.
- classmethod input_text_batched(text, text_pair=None, is_split_into_words=False)[source]#
Detects whether the input text is a valid batched input.
- Return type:
bool
- __call__(text=None, text_pair=None, add_special_tokens=True, padding=False, truncation=None, max_length=None, is_split_into_words=False, **kwargs)[source]#
Main abstract method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences. Possible input formats are listed below.
Tip
A preferred paradigm of inputs:
is_split_into_words=False, input text as follows:
List[List[str]]: a list of lists of strings, a batch of tokenized tokens, i.e., needs tokens2ids.
List[str]: a list of strings, a batch of strings, i.e., needs text2ids.
str: a single string, i.e., needs text2ids directly.
is_split_into_words=True, input text as follows:
List[List[str]]: a list of lists of strings, a batch of pretokenized inputs (not tokenized but split), i.e., needs text2ids on each inner list.
List[str]: a list of strings, a single pretokenized input, i.e., needs text2ids one by one.
str: a single string, automatically falls back to is_split_into_words=False.
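For example, with the hypothetical CharTokenizer sketched earlier, the accepted shapes look like:

>>> tok = CharTokenizer(CharTokenizerConfig())
>>> tok("ab")                                      # single string -> text2ids
>>> tok(["ab", "ba"])                              # batch of strings -> text2ids per item
>>> tok([["ab", "ba"]], is_split_into_words=True)  # batch of one pretokenized sample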
- Parameters:
text (str, List[str], List[List[str]], optional) -- The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_pair (str, List[str], List[List[str]], optional) -- The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
add_special_tokens (bool, optional, defaults to True) -- Whether or not to add special tokens when encoding the sequences. This will use the underlying Tokenizer.build_inputs_with_special_tokens() function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.
padding (bool, str or PaddingStrategy, optional, defaults to False) -- Activates and controls padding. Accepts the following values:
True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
truncation (bool, str or TruncationStrategy, optional, defaults to False) -- Activates and controls truncation. Accepts the following values:
True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or 'do_not_truncate' (default): No truncation (i.e., can output a batch with sequence lengths greater than the model maximum admissible input size).
max_length (int, optional) -- Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
is_split_into_words (bool, optional, defaults to False) -- Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.
**kwargs -- Additional keyword arguments.
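A sketch of batched encoding with padding and truncation; the exact output keys depend on model_input_names, and the id values below assume the toy vocab from the earlier sketch:

>>> batch = tok(
...     ["ab", "abab"],
...     padding="longest",   # pad to the longest sequence in the batch
...     truncation=True,     # equivalent to 'longest_first'
...     max_length=8,
... )
>>> batch["input_ids"]  # e.g. [[2, 3, 0, 0], [2, 3, 2, 3]] with pad id 0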
- num_special_tokens_to_add(pair=False)[source]#
Returns the number of added tokens when encoding a sequence with special tokens.
- Parameters:
pair (bool, optional, defaults to False) -- Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence.
- Returns:
Number of special tokens added to sequences.
- Return type:
int
Note
This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.
- prepare_for_model(ids, pair_ids=None, add_special_tokens=True, padding=False, truncation=None, max_length=None, stride=0, pad_to_multiple_of=None, return_tensors=None, return_token_type_ids=None, return_attention_mask=None, return_overflowing_tokens=False, return_special_tokens_mask=False, return_length=False, verbose=True, prepend_batch_axis=False, **kwargs)[source]#
Prepares a sequence of input ids, or a pair of sequences of input ids, so that it can be used by the model. It adds special tokens, truncates sequences if overflowing while taking into account the special tokens, and manages a moving window (with user-defined stride) for overflowing tokens. Note that if pair_ids is not None and truncation_strategy is longest_first or True, it is not possible to return overflowing tokens; such a combination of arguments will raise an error.
- Parameters:
ids (List[int]) -- Tokenized input ids of the first sequence. Can be obtained from a string by text2ids().
pair_ids (List[int], optional) -- Tokenized input ids of the second sequence. Can be obtained from a string by text2ids().
add_special_tokens (bool, optional, defaults to True) -- Whether or not to add special tokens when encoding the sequences. This will use the underlying Tokenizer.build_inputs_with_special_tokens() function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.
padding (bool, str or PaddingStrategy, optional, defaults to False) -- Activates and controls padding. Accepts the following values:
True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
truncation (bool, str or TruncationStrategy, optional, defaults to False) -- Activates and controls truncation. Accepts the following values:
True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or 'do_not_truncate' (default): No truncation (i.e., can output a batch with sequence lengths greater than the model maximum admissible input size).
max_length (int, optional) -- Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) -- If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False) -- Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (int, optional) -- If set, will pad the sequence to a multiple of the provided value. Requires padding to be activated. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
return_tensors (str or TensorType, optional) -- If set, will return tensors instead of lists of python integers. Acceptable values are:
'pt': Return PyTorch torch.Tensor objects.
'np': Return Numpy np.ndarray objects.
return_token_type_ids (bool, optional) -- Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer's default, defined by the return_outputs attribute.
return_attention_mask (bool, optional) -- Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer's default, defined by the return_outputs attribute.
return_overflowing_tokens (bool, optional, defaults to False) -- Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead of returning overflowing tokens.
return_special_tokens_mask (bool, optional, defaults to False) -- Whether or not to return special tokens mask information.
return_length (bool, optional, defaults to False) -- Whether or not to return the lengths of the encoded inputs.
verbose (bool, optional, defaults to True) -- Whether or not to print more information and warnings.
**kwargs -- Passed to self.tokenize().
- Returns:
A BatchEncoding with the following fields:
input_ids -- List of token ids to be fed to a model.
token_type_ids -- List of token type ids to be fed to a model (when return_token_type_ids=True or if "token_type_ids" is in self.model_input_names).
attention_mask -- List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names).
overflowing_tokens -- List of overflowing token sequences (when a max_length is specified and return_overflowing_tokens=True).
num_truncated_tokens -- Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).
special_tokens_mask -- List of 0s and 1s, with 1 specifying added special tokens and 0 specifying regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
length -- The length of the inputs (when return_length=True).
- Return type:
BatchEncoding
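A usage sketch for one sample; the id values are illustrative and assume the toy vocab from the earlier CharTokenizer sketch:

>>> ids = tok.text2ids("abab")          # [2, 3, 2, 3]
>>> enc = tok.prepare_for_model(
...     ids,
...     padding="max_length",
...     max_length=6,
...     return_attention_mask=True,
... )
>>> enc["input_ids"]       # e.g. [2, 3, 2, 3, 0, 0]
>>> enc["attention_mask"]  # e.g. [1, 1, 1, 1, 0, 0]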
- truncate_sequences(ids, pair_ids=None, num_tokens_to_remove=0, truncation_strategy='longest_first', stride=0)[source]#
Truncates a sequence pair in-place following the strategy.
- Parameters:
ids (List[int]) -- Tokenized input ids of the first sequence. Can be obtained from a string by text2ids().
pair_ids (List[int], optional) -- Tokenized input ids of the second sequence. Can be obtained from a string by text2ids().
num_tokens_to_remove (int, optional, defaults to 0) -- Number of tokens to remove using the truncation strategy.
truncation_strategy (str or TruncationStrategy, optional, defaults to 'longest_first') -- The strategy to follow for truncation. Can be:
'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
'do_not_truncate': No truncation (i.e., can output a batch with sequence lengths greater than the model maximum admissible input size).
stride (int, optional, defaults to 0) -- If set to a positive number, the overflowing tokens returned will contain some tokens from the main sequence returned. The value of this argument defines the number of additional tokens.
- Returns:
The truncated ids, the truncated pair_ids and the list of overflowing tokens. Note: the longest_first strategy returns an empty list of overflowing tokens if a pair of sequences (or a batch of pairs) is provided.
- Return type:
Tuple[List[int], List[int], List[int]]
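A usage sketch, assuming truncation_side='right' and the toy ids from the earlier sketch:

>>> ids, pair_ids, overflow = tok.truncate_sequences(
...     [2, 3, 2, 3, 2],
...     pair_ids=[3, 2],
...     num_tokens_to_remove=2,
...     truncation_strategy="only_first",
... )
>>> ids  # e.g. [2, 3, 2]: two tokens removed from the right of the first sequence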
- build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]#
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens.
This implementation does not add special tokens and this method should be overridden in a subclass.
- Parameters:
token_ids_0 (List[int]) -- The first tokenized sequence.
token_ids_1 (List[int], optional) -- The second tokenized sequence.
- Returns:
The model input with special tokens.
- Return type:
List[int]
- create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)[source]#
Create the token type IDs corresponding to the sequences passed. Should be overridden in a subclass if the model has a special way of building those.
- Parameters:
token_ids_0 (List[int]) -- The first tokenized sequence.
token_ids_1 (List[int], optional) -- The second tokenized sequence.
- Returns:
The token type ids.
- Return type:
List[int]
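A sketch of overriding both methods for a BERT-style layout in a hypothetical subclass; it assumes the cls/sep ids documented above are defined:

class PairCharTokenizer(CharTokenizer):  # hypothetical
    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        # [CLS] A [SEP] (+ B [SEP])
        out = [self.cls] + token_ids_0 + [self.sep]
        if token_ids_1 is not None:
            out = out + token_ids_1 + [self.sep]
        return out

    def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
        first = [0] * (len(token_ids_0) + 2)           # [CLS] A [SEP]
        if token_ids_1 is None:
            return first
        return first + [1] * (len(token_ids_1) + 1)    # B [SEP]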
- get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]#
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer
prepare_for_model().
- Parameters:
token_ids_0 (List[int]) -- List of ids of the first sequence.
token_ids_1 (List[int], optional) -- List of ids of the second sequence.
already_has_special_tokens (bool, optional, defaults to False) -- Whether or not the token list is already formatted with special tokens for the model.
- Returns:
1 for a special token, 0 for a sequence token.
- Return type:
A list of integers in the range [0, 1]
- pad(encoded_inputs, padding=True, max_length=None, pad_to_multiple_of=None, return_attention_mask=None, return_tensors=None, verbose=True)[source]#
Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length in the batch.
Padding side (left/right) and padding token ids are defined at the tokenizer level (with self.padding_side, self.pad_token_id and self.pad_token_type_id).
Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Note
If the encoded_inputs passed are a dictionary of numpy arrays or PyTorch tensors, the result will use the same type unless you provide a different tensor type with return_tensors. In the case of PyTorch tensors, you will however lose the specific device of your tensors.
- Parameters:
encoded_inputs (BatchEncoding, list of BatchEncoding, Dict[str, List[int]], Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) -- Tokenized inputs. Can represent one input (BatchEncoding or Dict[str, List[int]]) or a batch of tokenized inputs (list of BatchEncoding, Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) so you can use this method during preprocessing as well as in a PyTorch Dataloader collate function. Instead of List[int] you can have tensors (numpy arrays, PyTorch tensors); see the note above for the return type.
padding (bool, str or PaddingStrategy, optional, defaults to True) -- Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among:
True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
max_length (int, optional) -- Maximum length of the returned list and optionally padding length (see above).
pad_to_multiple_of (int, optional) -- If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
return_attention_mask (bool, optional) -- Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer's default, defined by the return_outputs attribute.
return_tensors (str or TensorType, optional) -- If set, will return tensors instead of lists of python integers. Acceptable values are:
'pt': Return PyTorch torch.Tensor objects.
'np': Return Numpy np.ndarray objects.
verbose (bool, optional, defaults to True) -- Whether or not to print more information and warnings.
- Return type:
UserDict
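A sketch of using pad() as a PyTorch DataLoader collate function; tok is the hypothetical tokenizer from the earlier sketch:

def collate_fn(features):
    # features: List[Dict[str, List[int]]], one encoded sample per item
    return tok.pad(
        features,
        padding="longest",
        pad_to_multiple_of=8,   # Tensor Core friendly lengths
        return_tensors="pt",
    )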
- batch_decode(sequences, skip_special_tokens=False, **kwargs)[source]#
Convert a list of lists of token ids into a list of strings by calling decode.
- Parameters:
sequences (Union[List[int], List[List[int]], np.ndarray, torch.Tensor]) -- List of tokenized input ids. Can be obtained using the __call__() method.
skip_special_tokens (bool, optional, defaults to False) -- Whether or not to remove special tokens in the decoding.
**kwargs (additional keyword arguments, optional) -- Will be passed to the underlying model specific decode method.
- Returns:
The list of decoded sentences.
- Return type:
List[str]
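A usage sketch; outputs are illustrative for the toy vocab from the earlier sketch:

>>> batch = tok(["ab", "ba"], padding=True)
>>> tok.batch_decode(batch["input_ids"], skip_special_tokens=True)  # e.g. ['ab', 'ba']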
- decode(token_ids, skip_special_tokens=False, **kwargs)[source]#
Converts a sequence of ids into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
Similar to doing self.ids2text(token_ids).
- Parameters:
token_ids (Union[int, List[int], np.ndarray, torch.Tensor]) -- List of tokenized input ids. Can be obtained using the __call__() method.
skip_special_tokens (bool, optional, defaults to False) -- Whether or not to remove special tokens in the decoding.
**kwargs (additional keyword arguments, optional) -- Will be passed to the underlying model specific decode method.
- Returns:
The decoded sentence.
- Return type:
str
- egrecho.core.tokenizer.convert_to_tensors(encoded_inputs, tensor_type=None, prepend_batch_axis=False)[source]#
Convert the inner content of a dict to tensors.
- Parameters:
encoded_inputs (Union[Dict[str, EncodedInput], UserDict]) -- encoded inputs.
tensor_type (str or TensorType, optional) -- The type of tensors to use. If str, should be one of the values of the enum TensorType. If None, no modification is done.
prepend_batch_axis (bool, optional, defaults to False) -- Whether or not to add the batch dimension during the conversion.
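A usage sketch, assuming PyTorch is installed and that the converted mapping is returned:

>>> from egrecho.core.tokenizer import convert_to_tensors
>>> enc = {"input_ids": [2, 3, 2]}
>>> out = convert_to_tensors(enc, tensor_type="pt", prepend_batch_axis=True)
>>> out["input_ids"].shape  # e.g. torch.Size([1, 3])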