egrecho.core.tokenizer#
- class egrecho.core.tokenizer.TruncationStrategy(value)[source]#
Bases: StrEnum
Possible values for the truncation argument in Tokenizer.__call__(). Useful for tab-completion in an IDE.
- class egrecho.core.tokenizer.TensorType(value)[source]#
Bases: StrEnum
Possible values for the return_tensors argument. Useful for tab-completion in an IDE.
- class egrecho.core.tokenizer.BaseTokenizerConfig(extradir=None)[source]#
Bases: DataclassConfig
Base class for the BaseTokenizer configuration.
The path extradir will not be serialized in the config file. When deserialized from a config file (tokenizer_config.json), it is set to the directory of the config file by default, allowing files defined by extra_files_names() to be located.
- Parameters:
extradir (Optional[Union[str, Path]]) -- Path to the directory containing vocabulary files defined by extra_files_names().
- property extra_files_names#
Defines the extra file names required by the model. Can be either:
A dictionary with values being the filenames for saving the associated files (strings).
A tuple/list of filenames.
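For illustration, a minimal sketch of a derived config declaring one vocabulary file; the class name and file name are hypothetical, and it assumes DataclassConfig subclasses behave like standard dataclasses:

from dataclasses import dataclass
from egrecho.core.tokenizer import BaseTokenizerConfig

@dataclass
class CharTokenizerConfig(BaseTokenizerConfig):  # hypothetical subclass
    @property
    def extra_files_names(self):
        # one extra file, resolved relative to ``extradir`` when deserialized
        return {"vocab_file": "vocab.txt"}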
- get_extra_files(extra_files_names=None, check_local_exist=True)[source]#
Recursively prepends the prefix directory to locate the extra files.
- classmethod fetch_from(srcdir, **kwargs)[source]#
Instantiate a BaseTokenizerConfig (or a derived class) from a directory containing config files.
- Return type:
BaseTokenizerConfig
- copy_extras(savedir, extra_files_names=None, filename_prefix=None, excludes=None, **kwargs)[source]#
Copies the extra files of the tokenizer config.
Use BaseTokenizer.save_to() to save the whole configuration (config file + extra files) of the tokenizer. By default, this method copies all files defined by extra_files_names().
- Parameters:
savedir (str) -- The directory in which to save the extra files.
extra_files_names -- If None, use the default files defined by the class property extra_files_names.
filename_prefix (str, optional) -- An optional prefix to add to the names of the saved files.
excludes (Union[str, List[str]]) -- Filenames to exclude from copying.
- Returns:
Paths to the files saved.
- Return type:
Tuple(str)
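A usage sketch with the hypothetical config above, assuming a tokenizer_config.json and vocab.txt already exist under exp/tok (paths illustrative):

>>> cfg = CharTokenizerConfig.fetch_from("exp/tok")  # extradir defaults to "exp/tok"
>>> cfg.copy_extras("exp/tok_backup")  # copies vocab.txt; returns the saved paths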
- class egrecho.core.tokenizer.BaseTokenizer(config)[source]#
Bases: ABC, GenericFileMixin
A base class that offers serialization methods for tokenizers. Derived classes should implement its encode/decode methods (text2ids(), ids2text(), etc.).
The implementation of the tokenizer methods is intended for derived classes. Its purpose is to facilitate coordination between model inputs and the frontend data processor.
Unlike egrecho.core.feature_extractor.speaker.BaseFeature, which is designed to save its main attributes as the config itself, BaseTokenizer maintains an internal config instance.
Class attributes (overridden by derived classes)
CONFIG_CLS -- The type of the associated BaseTokenizerConfig (or a derived class).
- Parameters:
config (BaseTokenizerConfig) -- configuration object.
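As an illustration only, a minimal character-level subclass; the toy vocabulary is hypothetical, and text2ids()/ids2text() are assumed to take and return plain lists:

from egrecho.core.tokenizer import BaseTokenizer

class CharTokenizer(BaseTokenizer):  # hypothetical subclass
    CONFIG_CLS = CharTokenizerConfig  # the hypothetical config above

    def __init__(self, config):
        super().__init__(config)
        # a toy in-memory vocab; a real subclass would load it from
        # the files resolved by config.get_extra_files()
        self.vocab = {"<pad>": 0, "<unk>": 1, "a": 2, "b": 3}
        self.inv_vocab = {i: t for t, i in self.vocab.items()}

    def text2ids(self, text):
        return [self.vocab.get(ch, self.vocab["<unk>"]) for ch in text]

    def ids2text(self, ids):
        return "".join(self.inv_vocab.get(i, "<unk>") for i in ids)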
- property cls#
Returns cls_id if available.
- property sep#
Returns sep_id if available.
- property pad#
Returns pad_id if available.
- property pad_token_id#
Returns pad_id if available.
- property pad_token_type_id: int#
Id of the padding token type in the vocabulary.
- Type:
int
- property eod#
Returns eod_id if available.
- property bos#
Returns bos_id if available.
- property eos#
Returns eos_id if available.
- property mask#
Returns mask_id if available.
- tokenize(line, **kwargs)[source]#
Accepts kwargs for tokenization; override it in subclasses.
- Return type:
List[str]
- property all_special_ids: List[int]#
Returns: List[int]: The ids of the special tokens ('<unk>', '<cls>', etc.) mapped to class attributes.
- property vocab_size: int#
Returns: int: Size of the base vocabulary (without the added tokens).
- property config#
References the internal config object.
- property extradir#
References the extra files directory.
- classmethod fetch_from(srcdir, **kwargs)[source]#
Instantiate a BaseTokenizer (or a derived class) from a directory that has config files.
- Return type:
BaseTokenizer
- save_to(savedir, filename_prefix=None, **kwargs)[source]#
Saves the whole configuration (config file + extra files) of the tokenizer.
- Parameters:
savedir (str) -- The directory in which to save the config and extra files.
filename_prefix (str, optional) -- An optional prefix to add to the names of the saved files.
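A hypothetical save/load roundtrip with the sketch subclass above:

>>> tokenizer = CharTokenizer(CharTokenizerConfig())
>>> tokenizer.save_to("exp/tok")  # writes tokenizer_config.json + extra files
>>> reloaded = CharTokenizer.fetch_from("exp/tok")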
- class egrecho.core.tokenizer.Tokenizer(config)[source]#
Bases: BaseTokenizer
A base class that aims to prepare model inputs via the __call__ interface, derived from BaseTokenizer, and offers some useful methods for padding/truncation. Core methods:
__call__(): Abstract method to tokenize and prepare inputs for the model; can handle single or batch inputs.
prepare_for_model() (one sample): Prepares a sequence of input ids (tokenized by text2ids()), or a pair of sequences of input ids, so that it can be used by the model. The workflow typically follows:
Pre-define settings: get the truncate/pad strategy and compute the total size of the returned encodings via num_special_tokens_to_add(), which by default builds empty input ids through build_inputs_with_special_tokens().
Truncate: truncate_sequences().
Add special tokens like eos/sos; the listed methods should be overridden in a subclass:
build_inputs_with_special_tokens(): Build model inputs from given ids.
create_token_type_ids_from_sequences(): Create the token type IDs corresponding to the sequences.
Pad: pad a sample using pad().
batch_decode()/decode(): inverse of __call__(), depends on _decode() in subclasses.
Class attributes (overridden by derived classes)
CONFIG_CLS -- The type of the associated BaseTokenizerConfig (or a derived class).
model_input_names (List[str]) -- A list of inputs expected in the forward pass of the model.
padding_side (str) -- The default value for the side on which the model should have padding applied. Should be 'right' or 'left'.
truncation_side (str) -- The default value for the side on which the model should have truncation applied. Should be 'right' or 'left'.
Note
Heavily borrowed and adapted from the tokenizer module of HuggingFace.
- Parameters:
config (BaseTokenizerConfig) -- configuration object derived from BaseTokenizerConfig.
- classmethod input_text_batched(text, text_pair=None, is_split_into_words=False)[source]#
Detects whether the input text is a valid batched input.
- Return type:
bool
- __call__(text=None, text_pair=None, add_special_tokens=True, padding=False, truncation=None, max_length=None, is_split_into_words=False, **kwargs)[source]#
Main abstract method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences. Possible input formats are listed below.
Tip
A preferred paradigm of inputs:
is_split_into_words=False, input text as follows:
List[List[str]]: a list of lists of strings, a batch of tokenized tokens, i.e., needs tokens2ids.
List[str]: a list of strings, a batch of strings, i.e., needs text2ids.
str: a single string, i.e., needs text2ids directly.
is_split_into_words=True, input text as follows:
List[List[str]]: a list of lists of strings, a batch of pretokenized inputs (not tokenized but split), i.e., needs text2ids on each inner list.
List[str]: a list of strings, a single pretokenized input, i.e., needs text2ids one by one.
str: a single string, automatically falls back to is_split_into_words=False.
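For example, with the hypothetical CharTokenizer sketched earlier, the accepted shapes look like:

>>> tok = CharTokenizer(CharTokenizerConfig())
>>> tok("ab")                                      # single string -> text2ids
>>> tok(["ab", "ba"])                              # batch of strings -> text2ids per item
>>> tok([["ab", "ba"]], is_split_into_words=True)  # batch of one pretokenized sample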
- Parameters:
text (str, List[str], List[List[str]], optional) -- The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_pair (str, List[str], List[List[str]], optional) -- The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
add_special_tokens (bool, optional, defaults to True) -- Whether or not to add special tokens when encoding the sequences. This will use the underlying Tokenizer.build_inputs_with_special_tokens() function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.
padding (bool, str or PaddingStrategy, optional, defaults to False) -- Activates and controls padding. Accepts the following values:
True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
truncation (bool, str or TruncationStrategy, optional, defaults to False) -- Activates and controls truncation. Accepts the following values:
True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or 'do_not_truncate' (default): No truncation (i.e., can output a batch with sequence lengths greater than the model maximum admissible input size).
max_length (int, optional) -- Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
is_split_into_words (bool, optional, defaults to False) -- Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.
**kwargs -- Additional keyword arguments.
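A sketch of batched encoding with padding and truncation; the exact output keys depend on model_input_names, and the id values below assume the toy vocab from the earlier sketch:

>>> batch = tok(
...     ["ab", "abab"],
...     padding="longest",   # pad to the longest sequence in the batch
...     truncation=True,     # equivalent to 'longest_first'
...     max_length=8,
... )
>>> batch["input_ids"]  # e.g. [[2, 3, 0, 0], [2, 3, 2, 3]] with pad id 0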
- num_special_tokens_to_add(pair=False)[source]#
Returns the number of added tokens when encoding a sequence with special tokens.
- Parameters:
pair (bool, optional, defaults to False) -- Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence.
- Returns:
Number of special tokens added to sequences.
- Return type:
int
Note
This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.
- prepare_for_model(ids, pair_ids=None, add_special_tokens=True, padding=False, truncation=None, max_length=None, stride=0, pad_to_multiple_of=None, return_tensors=None, return_token_type_ids=None, return_attention_mask=None, return_overflowing_tokens=False, return_special_tokens_mask=False, return_length=False, verbose=True, prepend_batch_axis=False, **kwargs)[source]#
Prepares a sequence of input ids, or a pair of sequences of input ids, so that it can be used by the model. It adds special tokens, truncates sequences if overflowing while taking into account the special tokens, and manages a moving window (with user-defined stride) for overflowing tokens. Note that if pair_ids is not None and truncation_strategy is longest_first or True, it is not possible to return overflowing tokens; such a combination of arguments will raise an error.
- Parameters:
ids (List[int]) -- Tokenized input ids of the first sequence. Can be obtained from a string by text2ids().
pair_ids (List[int], optional) -- Tokenized input ids of the second sequence. Can be obtained from a string by text2ids().
add_special_tokens (bool, optional, defaults to True) -- Whether or not to add special tokens when encoding the sequences. This will use the underlying Tokenizer.build_inputs_with_special_tokens() function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.
padding (bool, str or PaddingStrategy, optional, defaults to False) -- Activates and controls padding. Accepts the following values:
True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
truncation (bool, str or TruncationStrategy, optional, defaults to False) -- Activates and controls truncation. Accepts the following values:
True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or 'do_not_truncate' (default): No truncation (i.e., can output a batch with sequence lengths greater than the model maximum admissible input size).
max_length (int, optional) -- Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) -- If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False) -- Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (int, optional) -- If set, will pad the sequence to a multiple of the provided value. Requires padding to be activated. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
return_tensors (str or TensorType, optional) -- If set, will return tensors instead of lists of python integers. Acceptable values are:
'pt': Return PyTorch torch.Tensor objects.
'np': Return Numpy np.ndarray objects.
return_token_type_ids (bool, optional) -- Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer's default, defined by the return_outputs attribute.
return_attention_mask (bool, optional) -- Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer's default, defined by the return_outputs attribute.
return_overflowing_tokens (bool, optional, defaults to False) -- Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead of returning overflowing tokens.
return_special_tokens_mask (bool, optional, defaults to False) -- Whether or not to return special tokens mask information.
return_length (bool, optional, defaults to False) -- Whether or not to return the lengths of the encoded inputs.
verbose (bool, optional, defaults to True) -- Whether or not to print more information and warnings.
**kwargs -- Passed to self.tokenize().
- Returns:
A BatchEncoding with the following fields:
input_ids -- List of token ids to be fed to a model.
token_type_ids -- List of token type ids to be fed to a model (when return_token_type_ids=True or if "token_type_ids" is in self.model_input_names).
attention_mask -- List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names).
overflowing_tokens -- List of overflowing token sequences (when a max_length is specified and return_overflowing_tokens=True).
num_truncated_tokens -- Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).
special_tokens_mask -- List of 0s and 1s, with 1 specifying added special tokens and 0 specifying regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
length -- The length of the inputs (when return_length=True).
- Return type:
BatchEncoding
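A usage sketch for one sample; the id values are illustrative and assume the toy vocab from the earlier CharTokenizer sketch:

>>> ids = tok.text2ids("abab")          # [2, 3, 2, 3]
>>> enc = tok.prepare_for_model(
...     ids,
...     padding="max_length",
...     max_length=6,
...     return_attention_mask=True,
... )
>>> enc["input_ids"]       # e.g. [2, 3, 2, 3, 0, 0]
>>> enc["attention_mask"]  # e.g. [1, 1, 1, 1, 0, 0]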
- truncate_sequences(ids, pair_ids=None, num_tokens_to_remove=0, truncation_strategy='longest_first', stride=0)[source]#
Truncates a sequence pair in-place following the strategy.
- Parameters:
ids (List[int]) -- Tokenized input ids of the first sequence. Can be obtained from a string by text2ids().
pair_ids (List[int], optional) -- Tokenized input ids of the second sequence. Can be obtained from a string by text2ids().
num_tokens_to_remove (int, optional, defaults to 0) -- Number of tokens to remove using the truncation strategy.
truncation_strategy (str or TruncationStrategy, optional, defaults to 'longest_first') -- The strategy to follow for truncation. Can be:
'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
'do_not_truncate': No truncation (i.e., can output a batch with sequence lengths greater than the model maximum admissible input size).
stride (int, optional, defaults to 0) -- If set to a positive number, the overflowing tokens returned will contain some tokens from the main sequence returned. The value of this argument defines the number of additional tokens.
- Returns:
The truncated ids, the truncated pair_ids and the list of overflowing tokens. Note: the longest_first strategy returns an empty list of overflowing tokens if a pair of sequences (or a batch of pairs) is provided.
- Return type:
Tuple[List[int], List[int], List[int]]
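A usage sketch, assuming truncation_side='right' and the toy ids from the earlier sketch:

>>> ids, pair_ids, overflow = tok.truncate_sequences(
...     [2, 3, 2, 3, 2],
...     pair_ids=[3, 2],
...     num_tokens_to_remove=2,
...     truncation_strategy="only_first",
... )
>>> ids  # e.g. [2, 3, 2]: two tokens removed from the right of the first sequence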
- build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]#
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens.
This implementation does not add special tokens and this method should be overridden in a subclass.
- Parameters:
token_ids_0 (List[int]) -- The first tokenized sequence.
token_ids_1 (List[int], optional) -- The second tokenized sequence.
- Returns:
The model input with special tokens.
- Return type:
List[int]
- create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)[source]#
Create the token type IDs corresponding to the sequences passed. Should be overridden in a subclass if the model has a special way of building those.
- Parameters:
token_ids_0 (List[int]) -- The first tokenized sequence.
token_ids_1 (List[int], optional) -- The second tokenized sequence.
- Returns:
The token type ids.
- Return type:
List[int]
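A sketch of overriding both methods for a BERT-style layout in a hypothetical subclass; it assumes the cls/sep ids documented above are defined:

class PairCharTokenizer(CharTokenizer):  # hypothetical
    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        # [CLS] A [SEP] (+ B [SEP])
        out = [self.cls] + token_ids_0 + [self.sep]
        if token_ids_1 is not None:
            out = out + token_ids_1 + [self.sep]
        return out

    def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
        first = [0] * (len(token_ids_0) + 2)           # [CLS] A [SEP]
        if token_ids_1 is None:
            return first
        return first + [1] * (len(token_ids_1) + 1)    # B [SEP]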
- get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]#
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer
prepare_for_model().
- Parameters:
token_ids_0 (List[int]) -- List of ids of the first sequence.
token_ids_1 (List[int], optional) -- List of ids of the second sequence.
already_has_special_tokens (bool, optional, defaults to False) -- Whether or not the token list is already formatted with special tokens for the model.
- Returns:
1 for a special token, 0 for a sequence token.
- Return type:
A list of integers in the range [0, 1]
- pad(encoded_inputs, padding=True, max_length=None, pad_to_multiple_of=None, return_attention_mask=None, return_tensors=None, verbose=True)[source]#
Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length in the batch.
Padding side (left/right) and padding token ids are defined at the tokenizer level (with self.padding_side, self.pad_token_id and self.pad_token_type_id).
Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Note
If the encoded_inputs passed are a dictionary of numpy arrays or PyTorch tensors, the result will use the same type unless you provide a different tensor type with return_tensors. In the case of PyTorch tensors, you will however lose the specific device of your tensors.
- Parameters:
encoded_inputs (BatchEncoding, list of BatchEncoding, Dict[str, List[int]], Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) -- Tokenized inputs. Can represent one input (BatchEncoding or Dict[str, List[int]]) or a batch of tokenized inputs (list of BatchEncoding, Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) so you can use this method during preprocessing as well as in a PyTorch Dataloader collate function. Instead of List[int] you can have tensors (numpy arrays, PyTorch tensors); see the note above for the return type.
padding (bool, str or PaddingStrategy, optional, defaults to True) -- Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among:
True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
max_length (int, optional) -- Maximum length of the returned list and optionally padding length (see above).
pad_to_multiple_of (int, optional) -- If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
return_attention_mask (bool, optional) -- Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer's default, defined by the return_outputs attribute.
return_tensors (str or TensorType, optional) -- If set, will return tensors instead of lists of python integers. Acceptable values are:
'pt': Return PyTorch torch.Tensor objects.
'np': Return Numpy np.ndarray objects.
verbose (bool, optional, defaults to True) -- Whether or not to print more information and warnings.
- Return type:
UserDict
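A sketch of using pad() as a PyTorch DataLoader collate function; tok is the hypothetical tokenizer from the earlier sketch:

def collate_fn(features):
    # features: List[Dict[str, List[int]]], one encoded sample per item
    return tok.pad(
        features,
        padding="longest",
        pad_to_multiple_of=8,   # Tensor Core friendly lengths
        return_tensors="pt",
    )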
- batch_decode(sequences, skip_special_tokens=False, **kwargs)[source]#
Convert a list of lists of token ids into a list of strings by calling decode.
- Parameters:
sequences (Union[List[int], List[List[int]], np.ndarray, torch.Tensor]) -- List of tokenized input ids. Can be obtained using the __call__() method.
skip_special_tokens (bool, optional, defaults to False) -- Whether or not to remove special tokens in the decoding.
**kwargs (additional keyword arguments, optional) -- Will be passed to the underlying model specific decode method.
- Returns:
The list of decoded sentences.
- Return type:
List[str]
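A usage sketch; outputs are illustrative for the toy vocab from the earlier sketch:

>>> batch = tok(["ab", "ba"], padding=True)
>>> tok.batch_decode(batch["input_ids"], skip_special_tokens=True)  # e.g. ['ab', 'ba']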
- decode(token_ids, skip_special_tokens=False, **kwargs)[source]#
Converts a sequence of ids into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
Similar to doing self.ids2text(token_ids).
- Parameters:
token_ids (Union[int, List[int], np.ndarray, torch.Tensor]) -- List of tokenized input ids. Can be obtained using the __call__() method.
skip_special_tokens (bool, optional, defaults to False) -- Whether or not to remove special tokens in the decoding.
**kwargs (additional keyword arguments, optional) -- Will be passed to the underlying model specific decode method.
- Returns:
The decoded sentence.
- Return type:
str
- egrecho.core.tokenizer.convert_to_tensors(encoded_inputs, tensor_type=None, prepend_batch_axis=False)[source]#
Convert the inner content of a dict to tensors.
- Parameters:
encoded_inputs (Union[Dict[str, EncodedInput], UserDict]) -- encoded inputs.
tensor_type (str or TensorType, optional) -- The type of tensors to use. If str, should be one of the values of the enum TensorType. If None, no modification is done.
prepend_batch_axis (bool, optional, defaults to False) -- Whether or not to add the batch dimension during the conversion.
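A usage sketch, assuming PyTorch is installed and that the converted mapping is returned:

>>> from egrecho.core.tokenizer import convert_to_tensors
>>> enc = {"input_ids": [2, 3, 2]}
>>> out = convert_to_tensors(enc, tensor_type="pt", prepend_batch_axis=True)
>>> out["input_ids"].shape  # e.g. torch.Size([1, 3])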