tokenizer
C2I
- class diagnnose.tokenizer.c2i.C2I(w2i: Dict[str, int], max_word_length: int = 50, **kwargs: Any)
Bases: W2I
Vocabulary containing character-level information.
Adapted from: https://github.com/tensorflow/models/tree/master/research/lm_1b
- property max_word_length: int
Create
- diagnnose.tokenizer.create.create_char_vocab(corpus_path: Union[str, List[str]], **kwargs) → C2I
- diagnnose.tokenizer.create.create_tokenizer(path: str, notify_unk: bool = False, cache_dir: Optional[str] = None, **kwargs) → transformers.PreTrainedTokenizer
Creates a tokenizer from a path.
An LSTM tokenizer is defined as a file with one entry per line, and path should point to that file.
A Transformer tokenizer is defined by its model name and is loaded using the AutoTokenizer class.
- Parameters
path (str) – Either the path to a vocabulary file, or the model name of a Hugging Face Transformer.
notify_unk (bool, optional) – Whether to notify the user when a token is not present in the tokenizer's vocabulary. Defaults to False.
cache_dir (str, optional) – Cache directory for Hugging Face tokenizers. Defaults to None.
- Returns
tokenizer – The instantiated tokenizer that maps tokens to their indices.
- Return type
PreTrainedTokenizer
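The vocabulary file format described above (one entry per line) can be illustrated with a short sketch. This is not diagnnose's own loading code: the helper name `read_vocab_file` is hypothetical, and assigning indices in order of first appearance, starting at 0, is an assumption made for illustration.

```python
import os
import tempfile
from typing import Dict


def read_vocab_file(path: str) -> Dict[str, int]:
    """Hypothetical helper: map each entry of a vocab file to an index.

    Assumption: indices are assigned in order of first appearance,
    starting at 0; blank lines and repeated entries are skipped.
    """
    w2i: Dict[str, int] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token = line.strip()
            if token and token not in w2i:
                w2i[token] = len(w2i)
    return w2i


# Write a tiny example vocabulary, one entry per line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<unk>\n<eos>\n<pad>\nthe\ncat\n")
    vocab_path = tmp.name

vocab = read_vocab_file(vocab_path)
os.remove(vocab_path)  # clean up the temporary file
print(vocab)
```

A dictionary of this shape is the kind of `w2i` mapping that the W2I class takes as its first argument.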
W2I
- class diagnnose.tokenizer.w2i.W2I(w2i: Dict[str, int], unk_token: str = '<unk>', eos_token: str = '<eos>', pad_token: str = '<pad>', notify_unk: bool = False)
Bases: dict
Provides vocabulary functionality that maps words to indices.
Tokens that are not in the vocabulary are mapped to the id of the unk token, which should be present in the vocab file.
- Parameters
w2i (Dict[str, int]) – Dictionary that maps strings to indices. This dictionary can be created using create_vocab.
unk_token (str, optional) – The unk token to which unknown words will be mapped. Defaults to <unk>.
eos_token (str, optional) – The end-of-sentence token that is used in the corpus. Defaults to <eos>.
pad_token (str, optional) – The padding token. Defaults to <pad>.
notify_unk (bool, optional) – Notify when a requested token is not present in the vocab. Defaults to False.
- property w2i: Dict[str, int]
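The unk fallback described for W2I can be sketched with a plain dict subclass. This is a minimal illustration under stated assumptions, not the library's implementation: the class name `UnkFallbackVocab` and the attribute `unk_idx` are hypothetical, and using `__missing__` is just one way to obtain this behaviour.

```python
from typing import Dict


class UnkFallbackVocab(dict):
    """Hypothetical stand-in for a word-to-index vocab with an unk fallback."""

    def __init__(
        self,
        w2i: Dict[str, int],
        unk_token: str = "<unk>",
        notify_unk: bool = False,
    ) -> None:
        if unk_token not in w2i:
            raise ValueError(f"unk token {unk_token!r} must be present in the vocab")
        super().__init__(w2i)
        self.unk_idx = w2i[unk_token]
        self.notify_unk = notify_unk

    def __missing__(self, token: str) -> int:
        # dict.__getitem__ calls __missing__ when the key is absent,
        # so unknown tokens resolve to the unk index without being stored.
        if self.notify_unk:
            print(f"'{token}' is not present in the vocab")
        return self.unk_idx


vocab = UnkFallbackVocab({"<unk>": 0, "<eos>": 1, "the": 2, "cat": 3})
ids = [vocab[tok] for tok in ["the", "dog", "<eos>"]]
print(ids)  # [2, 0, 1]
```

In this sketch only subscript access (`vocab[token]`) triggers the fallback; `dict.get` and the `in` operator keep their usual behaviour, so membership checks still report unknown tokens as absent.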