tokenizer

C2I

class diagnnose.tokenizer.c2i.C2I(w2i: Dict[str, int], max_word_length: int = 50, **kwargs: Any)[source]

Bases: W2I

Vocabulary containing character-level information.

Adapted from: https://github.com/tensorflow/models/tree/master/research/lm_1b

property max_word_length: int
token_to_char_ids(token: str)[source]
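
A minimal usage sketch (the toy vocabulary and the exact shape of the character-id output are illustrative assumptions, not guaranteed by the API):

    from diagnnose.tokenizer.c2i import C2I

    # Toy word-level vocabulary; in practice this is read from a vocab file.
    w2i = {"<unk>": 0, "<eos>": 1, "the": 2, "cat": 3}

    char_vocab = C2I(w2i, max_word_length=20)

    # Character-level representation of a token; assumed to be padded or
    # truncated to max_word_length, following the lm_1b character vocab.
    char_ids = char_vocab.token_to_char_ids("cat")
    print(char_vocab.max_word_length)  # 20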

Create

diagnnose.tokenizer.create.create_char_vocab(corpus_path: Union[str, List[str]], **kwargs) → C2I[source]

Creates a character-level vocabulary (C2I) from one or more corpus files.
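
A short sketch of building a character vocabulary from a corpus (the corpus path below is hypothetical):

    from diagnnose.tokenizer.create import create_char_vocab

    # "data/train.txt" stands in for a newline-separated corpus file.
    char_vocab = create_char_vocab("data/train.txt")

    # The result is a C2I instance, so character-level lookups are available.
    char_ids = char_vocab.token_to_char_ids("tokenizer")
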
diagnnose.tokenizer.create.create_tokenizer(path: str, notify_unk: bool = False, cache_dir: Optional[str] = None, **kwargs) → transformers.PreTrainedTokenizer[source]

Creates a tokenizer from a path.

An LSTM tokenizer is defined by a vocabulary file with one entry per line; path should point to that file.

A Transformer tokenizer is defined by its model name, and is imported using the AutoTokenizer class.

Parameters
  • path (str) – Either the path to a vocabulary file, or the model name of a Huggingface Transformer.

  • notify_unk (bool, optional) – Optional toggle to notify a user if a token is not present in the vocabulary of the tokenizer. Defaults to False.

  • cache_dir (str, optional) – Cache directory for Huggingface tokenizers.

Returns

tokenizer – The instantiated tokenizer that maps tokens to their indices.

Return type

PreTrainedTokenizer
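
For example, a Transformer tokenizer can be loaded by model name (the model name below is illustrative):

    from diagnnose.tokenizer.create import create_tokenizer

    # The path is interpreted as a Huggingface model name and loaded
    # through AutoTokenizer.
    tokenizer = create_tokenizer("gpt2")

    encoding = tokenizer("The cat sat on the mat.")
    print(encoding["input_ids"])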

diagnnose.tokenizer.create.token_to_index(path: str) → Dict[str, int][source]

Reads in a newline-separated file of tokenizer entries.

Parameters

path (str) – Path to a vocabulary file.

Returns

w2i – Dictionary mapping a token string to its index.

Return type

Dict[str, int]
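
A small sketch (the file contents, and the assumption that indices follow line order, are illustrative):

    from diagnnose.tokenizer.create import token_to_index

    # Write a tiny newline-separated vocabulary file.
    with open("vocab.txt", "w") as f:
        f.write("<unk>\n<eos>\nthe\ncat\n")

    w2i = token_to_index("vocab.txt")
    print(w2i["cat"])  # presumably 3, if indices follow line order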

W2I

class diagnnose.tokenizer.w2i.W2I(w2i: Dict[str, int], unk_token: str = '<unk>', eos_token: str = '<eos>', pad_token: str = '<pad>', notify_unk: bool = False)[source]

Bases: dict

Provides vocab functionality mapping words to indices.

Tokens that do not occur in the vocabulary are mapped to the index of an unk token, which should itself be present in the vocab file.

Parameters
  • w2i (Dict[str, int]) – Dictionary that maps strings to indices. This dictionary can be created using token_to_index.

  • unk_token (str, optional) – The unk token to which unknown words will be mapped. Defaults to <unk>.

  • eos_token (str, optional) – The end-of-sentence token that is used in the corpus. Defaults to <eos>.

  • pad_token (str, optional) – The padding token. Defaults to <pad>.

  • notify_unk (bool, optional) – Notify when a requested token is not present in the vocab. Defaults to False.

property w2i: Dict[str, int]
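
A minimal sketch of the unk fallback described above (the toy vocabulary is illustrative):

    from diagnnose.tokenizer.w2i import W2I

    vocab = W2I({"the": 0, "cat": 1, "<unk>": 2})

    print(vocab["cat"])  # 1
    print(vocab["dog"])  # 2: unknown tokens map to the <unk> index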