syntax.tasks¶
The library currently supports the following tasks.
New tasks can be added by inheriting from SyntaxEvalTask.
Lakretz¶
- class diagnnose.syntax.tasks.lakretz.LakretzTask(model: LanguageModel, tokenizer: transformers.PreTrainedTokenizer, ignore_unk: bool, use_full_model_probs: bool, **config: Dict[str, Any])[source]¶
Bases:
SyntaxEvalTask
- descriptions = {'adv': {'conditions': ['S', 'P'], 'items_per_condition': 900}, 'adv_adv': {'conditions': ['S', 'P'], 'items_per_condition': 900}, 'adv_conjunction': {'conditions': ['S', 'P'], 'items_per_condition': 600}, 'namepp': {'conditions': ['S', 'P'], 'items_per_condition': 900}, 'nounpp': {'conditions': ['SS', 'SP', 'PS', 'PP'], 'items_per_condition': 600}, 'nounpp_adv': {'conditions': ['SS', 'SP', 'PS', 'PP'], 'items_per_condition': 600}, 'simple': {'conditions': ['S', 'P'], 'items_per_condition': 300}}¶
- initialize(path: str, subtasks: Optional[List[str]] = None) Dict[str, Union[Corpus, Dict[str, Corpus]]] [source]¶
Performs the initialization for the tasks of Lakretz et al. (2019)
Arxiv link: https://arxiv.org/abs/1903.07435
Repo: https://github.com/FAIRNS/Number_and_syntax_units_in_LSTM_LMs
- Parameters
path (str) – Path to directory containing the Lakretz datasets that can be found in the github repo.
subtasks (List[str], optional) – The downstream tasks that will be tested. If not provided, this defaults to the full set of subtasks.
- Returns
corpora – Dictionary mapping a subtask to a Corpus.
- Return type
Dict[str, Corpus]
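The `descriptions` attribute above fully determines the size of each subtask corpus: the total item count is the number of conditions times `items_per_condition`. A minimal sketch, using a subset of the literal dict from the class attribute:

```python
# Subset of LakretzTask.descriptions, copied from the class attribute above.
descriptions = {
    "adv": {"conditions": ["S", "P"], "items_per_condition": 900},
    "nounpp": {"conditions": ["SS", "SP", "PS", "PP"], "items_per_condition": 600},
    "simple": {"conditions": ["S", "P"], "items_per_condition": 300},
}

# Total number of items per subtask: #conditions × items_per_condition.
totals = {
    name: len(desc["conditions"]) * desc["items_per_condition"]
    for name, desc in descriptions.items()
}
print(totals)  # {'adv': 1800, 'nounpp': 2400, 'simple': 600}
```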
Linzen¶
- class diagnnose.syntax.tasks.linzen.LinzenTask(model: LanguageModel, tokenizer: transformers.PreTrainedTokenizer, ignore_unk: bool, use_full_model_probs: bool, **config: Dict[str, Any])[source]¶
Bases:
SyntaxEvalTask
- initialize(path: str, subtasks: Optional[List[str]] = None, items_per_subtask: Optional[int] = 1000) Dict[str, Union[Corpus, Dict[str, Corpus]]] [source]¶
Performs the initialization for the tasks of Linzen et al. (2016)
Arxiv link: https://arxiv.org/abs/1611.01368
Repo: https://github.com/TalLinzen/rnn_agreement
- Parameters
path (str) – Path to directory containing the Linzen corpus that can be found in the github repo.
subtasks (List[str], optional) – The downstream tasks that will be tested. If not provided, this defaults to the full set of subtasks.
items_per_subtask (int, optional) – Number of items that are selected per subtask. If not provided, the full subtask set is used instead.
- Returns
corpora – Dictionary mapping a subtask to a Corpus.
- Return type
Dict[str, Corpus]
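The `items_per_subtask` parameter caps the number of items drawn from each subtask; when it is `None`, the full corpus is kept. A hypothetical sketch of that selection logic (illustrative only, not the library's actual implementation):

```python
import random

def subsample(items, items_per_subtask=None, seed=0):
    """Select at most `items_per_subtask` items; keep everything when None."""
    if items_per_subtask is None or items_per_subtask >= len(items):
        return list(items)
    rng = random.Random(seed)  # fixed seed for a reproducible selection
    return rng.sample(items, items_per_subtask)

corpus = [f"sentence {i}" for i in range(5000)]
print(len(subsample(corpus, items_per_subtask=1000)))  # 1000
print(len(subsample(corpus)))                          # 5000
```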
- class diagnnose.syntax.tasks.linzen.RawItem(sentence: str, orig_sentence: str, pos_sentence: str, subj: str, verb: str, subj_pos: str, has_rel: str, has_nsubj: str, verb_pos: str, subj_index: str, verb_index: str, n_intervening: str, last_intervening: str, n_diff_intervening: str, distance: str, max_depth: str, all_nouns: str, nouns_up_to_verb: str)[source]¶
Bases:
tuple
The original corpus structure contains these 18 fields.
- property all_nouns¶
Alias for field number 16
- property distance¶
Alias for field number 14
- property has_nsubj¶
Alias for field number 7
- property has_rel¶
Alias for field number 6
- property last_intervening¶
Alias for field number 12
- property max_depth¶
Alias for field number 15
- property n_diff_intervening¶
Alias for field number 13
- property n_intervening¶
Alias for field number 11
- property nouns_up_to_verb¶
Alias for field number 17
- property orig_sentence¶
Alias for field number 1
- property pos_sentence¶
Alias for field number 2
- property sentence¶
Alias for field number 0
- property subj¶
Alias for field number 3
- property subj_index¶
Alias for field number 9
- property subj_pos¶
Alias for field number 5
- property verb¶
Alias for field number 4
- property verb_index¶
Alias for field number 10
- property verb_pos¶
Alias for field number 8
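RawItem is a plain namedtuple, so the "alias for field number N" entries above simply reflect positional order in the underlying tuple. A sketch that re-creates the same structure with `collections.namedtuple`, with the field names taken from the signature above:

```python
from collections import namedtuple

# Field names in the order given by the RawItem signature above.
RawItem = namedtuple("RawItem", [
    "sentence", "orig_sentence", "pos_sentence", "subj", "verb",
    "subj_pos", "has_rel", "has_nsubj", "verb_pos", "subj_index",
    "verb_index", "n_intervening", "last_intervening",
    "n_diff_intervening", "distance", "max_depth", "all_nouns",
    "nouns_up_to_verb",
])

assert len(RawItem._fields) == 18
assert RawItem._fields[0] == "sentence"     # "alias for field number 0"
assert RawItem._fields[16] == "all_nouns"   # "alias for field number 16"
```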
Marvin¶
- class diagnnose.syntax.tasks.marvin.MarvinTask(model: LanguageModel, tokenizer: transformers.PreTrainedTokenizer, ignore_unk: bool, use_full_model_probs: bool, **config: Dict[str, Any])[source]¶
Bases:
SyntaxEvalTask
- initialize(path: str, subtasks: Optional[List[str]] = None) Dict[str, Union[Corpus, Dict[str, Corpus]]] [source]¶
Performs the initialization for the tasks of Marvin & Linzen (2018)
Arxiv link: https://arxiv.org/pdf/1808.09031.pdf
Repo: https://github.com/BeckyMarvin/LM_syneval
- Parameters
path (str) – Path to directory containing the Marvin datasets that can be found in the github repo.
subtasks (List[str], optional) – The downstream tasks that will be tested. If not provided, this defaults to the full set of subtasks.
- Returns
corpora – Dictionary mapping a subtask to a Corpus.
- Return type
Dict[str, Corpus]
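Per the `Dict[str, Union[Corpus, Dict[str, Corpus]]]` annotation, `initialize` may map a subtask either to a single Corpus or to a nested dict of condition-specific corpora. A sketch of iterating such a structure uniformly, with toy data (subtask names are illustrative and a plain list stands in for Corpus):

```python
corpora = {
    "sv_agreement": ["corpus item"],  # flat Corpus (stand-in: list)
    "npi": {"ever": ["item a"], "any": ["item b", "item c"]},  # nested dict
}

def iter_corpora(corpora):
    """Yield (name, corpus) pairs, flattening one level of nesting."""
    for name, value in corpora.items():
        if isinstance(value, dict):
            for sub, corpus in value.items():
                yield f"{name}.{sub}", corpus
        else:
            yield name, value

print(dict(iter_corpora(corpora)))
```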
Warstadt¶
- class diagnnose.syntax.tasks.warstadt.WarstadtTask(model: LanguageModel, tokenizer: transformers.PreTrainedTokenizer, ignore_unk: bool, use_full_model_probs: bool, **config: Dict[str, Any])[source]¶
Bases:
SyntaxEvalTask
- initialize(path: str, subtasks: Optional[List[str]] = None) Dict[str, Union[Corpus, Dict[str, Corpus]]] [source]¶
Performs the initialization for the tasks of Warstadt et al. (2019)
Paper: https://arxiv.org/pdf/1901.03438.pdf
Data: https://alexwarstadt.files.wordpress.com/2019/08/npi_lincensing_data.zip
- Parameters
path (str) – Path to directory containing the Warstadt corpus.
subtasks (List[str], optional) – The downstream tasks that will be tested. If not provided, this defaults to the full set of subtasks.
- Returns
corpora – Dictionary mapping a subtask to a Corpus.
- Return type
Dict[str, Corpus]
Warstadt Preproc¶
- diagnnose.syntax.tasks.warstadt_preproc.create_downstream_corpus(orig_corpus: Union[str, Dict[int, Dict[Tuple[int, int, int], Dict[str, Any]]]], output_path: Optional[str] = None, conditions: Optional[List[Tuple[int, int, int]]] = None, envs: Optional[List[str]] = None, skip_duplicate_items: bool = False) List[str] [source]¶
Create a new corpus from the original one that contains the subsentences up to the position of the NPI.
- Parameters
orig_corpus (str | CorpusDict) – Either the path to the original corpus, or a CorpusDict that has been created using preproc_warstadt.
output_path (str, optional) – Path to the output file that will be created in .tsv format. If not provided the corpus won’t be written to disk.
conditions (List[Tuple[int, int, int]], optional) – List of corpus item conditions (licensor, scope, npi_present). If not provided the correct NPI cases (1, 1, 1) will be used.
envs (List[str], optional) – List of licensing environments that should be used.
skip_duplicate_items (bool) – Some corpus items differ only in their post-NPI content and lead to equivalent results on a downstream task; if True, such duplicates are skipped. Defaults to False.
- Returns
corpus – List of strings representing each corpus item. Note that the first line of the list contains the .tsv header.
- Return type
List[str]
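The `conditions` parameter restricts which (licensor, scope, npi_present) triplets end up in the generated corpus; by default only the correct NPI case (1, 1, 1) is kept. A hypothetical sketch of that filtering step over a toy CorpusDict (the keys and sentences are illustrative, not the library's actual data):

```python
# Toy stand-in for a CorpusDict: sen_id -> condition triplet -> item.
corpus_dict = {
    0: {(1, 1, 1): {"sen": "Nobody has ever left."},
        (0, 1, 1): {"sen": "Somebody has ever left."}},
    1: {(1, 1, 1): {"sen": "No student has ever failed."}},
}

def filter_conditions(corpus_dict, conditions=None):
    """Keep only items whose (licensor, scope, npi_present) is selected."""
    if conditions is None:
        conditions = [(1, 1, 1)]  # the correct NPI case, as in the default
    return [
        item
        for items in corpus_dict.values()
        for cond, item in items.items()
        if cond in conditions
    ]

print(len(filter_conditions(corpus_dict)))  # 2
```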
- diagnnose.syntax.tasks.warstadt_preproc.preproc_warstadt(path: str) Dict[int, Dict[Tuple[int, int, int], Dict[str, Any]]] [source]¶
Reads and preprocesses the NPI corpus of Warstadt et al. (2019).
Paper: https://arxiv.org/pdf/1901.03438.pdf
Data: https://alexwarstadt.files.wordpress.com/2019/08/npi_lincensing_data.zip
- Parameters
path (str) – Path to .tsv corpus file.
- Returns
sen_id2items (CorpusDict) – Dictionary mapping a sen_id to a triplet (licensor, scope, npi_present) to the full corpus item.
env2sen_ids (EnvIdDict) – Dictionary mapping each env type to a list of sen_id’s of that type.
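The two return values are parallel indexes into the same data: `env2sen_ids` selects sentence ids by licensing environment, and `sen_id2items` resolves each id to its condition-keyed items. A toy sketch of how they fit together (contents are illustrative):

```python
# sen_id2items: sen_id -> (licensor, scope, npi_present) -> corpus item.
sen_id2items = {
    0: {(1, 1, 1): {"env": "adverbs", "sen": "Nobody has ever complained."}},
    1: {(1, 1, 1): {"env": "quantifiers", "sen": "No athlete has ever cheated."}},
}
# env2sen_ids: licensing environment -> sen_ids of that environment.
env2sen_ids = {"adverbs": [0], "quantifiers": [1]}

# Look up all items of one environment via the second index.
adverb_items = [sen_id2items[i] for i in env2sen_ids["adverbs"]]
print(adverb_items[0][(1, 1, 1)]["sen"])  # Nobody has ever complained.
```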
Winobias¶
- class diagnnose.syntax.tasks.winobias.WinobiasTask(model: LanguageModel, tokenizer: transformers.PreTrainedTokenizer, ignore_unk: bool, use_full_model_probs: bool, **config: Dict[str, Any])[source]¶
Bases:
SyntaxEvalTask
- initialize(path: str, subtasks: Optional[List[str]] = None) Dict[str, Union[Corpus, Dict[str, Corpus]]] [source]¶
- Parameters
path (str) – Path to directory containing the WinoBias datasets that can be found in the github repo.
subtasks (List[str], optional) – The downstream tasks that will be tested. If not provided, this defaults to the full set of subtasks.
- Returns
corpora – Dictionary mapping a subtask to a Corpus.
- Return type
Dict[str, Corpus]