tokenizer

C2I

class diagnnose.tokenizer.c2i.C2I(w2i: Dict[str, int], max_word_length: int = 50, **kwargs: Any)[source]

Bases: W2I

Vocabulary containing character-level information.

Adapted from: https://github.com/tensorflow/models/tree/master/research/lm_1b

property max_word_length: int
token_to_char_ids(token: str)[source]
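
A minimal usage sketch (the toy vocabulary and the exact shape of the character-id output are illustrative assumptions, not guaranteed by the API):

    from diagnnose.tokenizer.c2i import C2I

    # Toy word-level vocabulary; in practice this is read from a vocab file.
    w2i = {"<unk>": 0, "<eos>": 1, "the": 2, "cat": 3}

    char_vocab = C2I(w2i, max_word_length=20)

    # Character-level representation of a token; assumed to be padded or
    # truncated to max_word_length, following the lm_1b character vocab.
    char_ids = char_vocab.token_to_char_ids("cat")
    print(char_vocab.max_word_length)  # 20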

Create

diagnnose.tokenizer.create.create_char_vocab(corpus_path: Union[str, List[str]], **kwargs) → C2I[source]

Creates a character-level vocabulary (C2I) from one or more corpus files.
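
A short sketch of building a character vocabulary from a corpus (the corpus path below is hypothetical):

    from diagnnose.tokenizer.create import create_char_vocab

    # "data/train.txt" stands in for a newline-separated corpus file.
    char_vocab = create_char_vocab("data/train.txt")

    # The result is a C2I instance, so character-level lookups are available.
    char_ids = char_vocab.token_to_char_ids("tokenizer")
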
diagnnose.tokenizer.create.create_tokenizer(path: str, notify_unk: bool = False, cache_dir: Optional[str] = None, **kwargs) → transformers.PreTrainedTokenizer[source]

Creates a tokenizer from a path.

An LSTM tokenizer is defined by a vocabulary file with one entry per line; path should point to that file.

A Transformer tokenizer is defined by its model name, and is imported using the AutoTokenizer class.

Parameters
  • path (str) – Either the path to a vocabulary file, or the model name of a Huggingface Transformer.

  • notify_unk (bool, optional) – Optional toggle to notify a user if a token is not present in the vocabulary of the tokenizer. Defaults to False.

  • cache_dir (str, optional) – Cache directory for Huggingface tokenizers.

Returns

tokenizer – The instantiated tokenizer that maps tokens to their indices.

Return type

PreTrainedTokenizer
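
For example, a Transformer tokenizer can be loaded by model name (the model name below is illustrative):

    from diagnnose.tokenizer.create import create_tokenizer

    # The path is interpreted as a Huggingface model name and loaded
    # through AutoTokenizer.
    tokenizer = create_tokenizer("gpt2")

    encoding = tokenizer("The cat sat on the mat.")
    print(encoding["input_ids"])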

diagnnose.tokenizer.create.token_to_index(path: str) → Dict[str, int][source]

Reads in a newline-separated file of tokenizer entries.

Parameters

path (str) – Path to a vocabulary file.

Returns

w2i – Dictionary mapping a token string to its index.

Return type

Dict[str, int]
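
A small sketch (the file contents, and the assumption that indices follow line order, are illustrative):

    from diagnnose.tokenizer.create import token_to_index

    # Write a tiny newline-separated vocabulary file.
    with open("vocab.txt", "w") as f:
        f.write("<unk>\n<eos>\nthe\ncat\n")

    w2i = token_to_index("vocab.txt")
    print(w2i["cat"])  # presumably 3, if indices follow line order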

W2I

class diagnnose.tokenizer.w2i.W2I(w2i: Dict[str, int], unk_token: str = '<unk>', eos_token: str = '<eos>', pad_token: str = '<pad>', notify_unk: bool = False)[source]

Bases: dict

Provides vocab functionality mapping words to indices.

Tokens that do not occur in the vocabulary are mapped to the index of an unk token, which should itself be present in the vocab file.

Parameters
  • w2i (Dict[str, int]) – Dictionary that maps strings to indices. This dictionary can be created using token_to_index.

  • unk_token (str, optional) – The unk token to which unknown words will be mapped. Defaults to <unk>.

  • eos_token (str, optional) – The end-of-sentence token that is used in the corpus. Defaults to <eos>.

  • pad_token (str, optional) – The padding token. Defaults to <pad>.

  • notify_unk (bool, optional) – Notify when a requested token is not present in the vocab. Defaults to False.

property w2i: Dict[str, int]
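
A minimal sketch of the unk fallback described above (the toy vocabulary is illustrative):

    from diagnnose.tokenizer.w2i import W2I

    vocab = W2I({"the": 0, "cat": 1, "<unk>": 2})

    print(vocab["cat"])  # 1
    print(vocab["dog"])  # 2: unknown tokens map to the <unk> index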