syntax.tasks

The library currently supports the following tasks.

New tasks can be added by inheriting from SyntaxEvalTask.
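
A minimal sketch of such a subclass, using only the initialize signature documented on this page; the import paths and the Corpus.create call are assumptions, so substitute whatever your diagnnose version actually exposes:

    from typing import Dict, List, Optional, Union

    from diagnnose.corpus import Corpus                # import path is an assumption
    from diagnnose.syntax.tasks import SyntaxEvalTask  # import path is an assumption


    class MyAgreementTask(SyntaxEvalTask):
        """Hypothetical task that loads one Corpus per subtask."""

        def initialize(
            self, path: str, subtasks: Optional[List[str]] = None
        ) -> Dict[str, Union[Corpus, Dict[str, Corpus]]]:
            subtasks = subtasks or ["simple"]
            corpora: Dict[str, Corpus] = {}
            for subtask in subtasks:
                # Corpus.create is assumed to read a .tsv file into a Corpus;
                # use whichever corpus constructor your version provides.
                corpora[subtask] = Corpus.create(f"{path}/{subtask}.tsv")
            return corpora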

Lakretz

class diagnnose.syntax.tasks.lakretz.LakretzTask(model: LanguageModel, tokenizer: transformers.PreTrainedTokenizer, ignore_unk: bool, use_full_model_probs: bool, **config: Dict[str, Any])[source]

Bases: SyntaxEvalTask

descriptions = {
    'adv': {'conditions': ['S', 'P'], 'items_per_condition': 900},
    'adv_adv': {'conditions': ['S', 'P'], 'items_per_condition': 900},
    'adv_conjunction': {'conditions': ['S', 'P'], 'items_per_condition': 600},
    'namepp': {'conditions': ['S', 'P'], 'items_per_condition': 900},
    'nounpp': {'conditions': ['SS', 'SP', 'PS', 'PP'], 'items_per_condition': 600},
    'nounpp_adv': {'conditions': ['SS', 'SP', 'PS', 'PP'], 'items_per_condition': 600},
    'simple': {'conditions': ['S', 'P'], 'items_per_condition': 300},
}
initialize(path: str, subtasks: Optional[List[str]] = None) Dict[str, Union[Corpus, Dict[str, Corpus]]][source]

Performs the initialization for the tasks of Lakretz et al. (2019)

Arxiv link: https://arxiv.org/pdf/1903.07435.pdf

Repo: https://github.com/FAIRNS/Number_and_syntax_units_in_LSTM_LMs

Parameters
  • path (str) – Path to directory containing the Lakretz datasets that can be found in the github repo.

  • subtasks (List[str], optional) – The subtasks that will be tested. If not provided, this will default to the full set of subtasks.

Returns

corpora – Dictionary mapping a subtask to a Corpus.

Return type

Dict[str, Corpus]
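
A hedged usage sketch for this task, relying only on the constructor and initialize signatures shown above. The model value is a stand-in for a pre-loaded diagnnose LanguageModel, the data directory is hypothetical, and whether the constructor already triggers initialization through **config is not documented here, so the explicit initialize() call is an assumption:

    from transformers import AutoTokenizer

    from diagnnose.syntax.tasks.lakretz import LakretzTask

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = ...  # stand-in: any diagnnose LanguageModel loaded elsewhere

    task = LakretzTask(
        model=model,
        tokenizer=tokenizer,
        ignore_unk=True,
        use_full_model_probs=True,
    )

    # The class-level `descriptions` attribute lists the subtasks,
    # their conditions, and the number of items per condition.
    print(LakretzTask.descriptions["nounpp"]["conditions"])  # ['SS', 'SP', 'PS', 'PP']

    # Restrict the evaluation to two of the seven subtasks.
    corpora = task.initialize("data/lakretz", subtasks=["simple", "nounpp"])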

Linzen

class diagnnose.syntax.tasks.linzen.LinzenTask(model: LanguageModel, tokenizer: transformers.PreTrainedTokenizer, ignore_unk: bool, use_full_model_probs: bool, **config: Dict[str, Any])[source]

Bases: SyntaxEvalTask

initialize(path: str, subtasks: Optional[List[str]] = None, items_per_subtask: Optional[int] = 1000) Dict[str, Union[Corpus, Dict[str, Corpus]]][source]

Performs the initialization for the tasks of Linzen et al. (2016)

Arxiv link: https://arxiv.org/abs/1611.01368

Repo: https://github.com/TalLinzen/rnn_agreement

Parameters
  • path (str) – Path to the directory containing the Linzen corpus that can be found in the github repo.

  • subtasks (List[str], optional) – The subtasks that will be tested. If not provided, this will default to the full set of subtasks.

  • items_per_subtask (int, optional) – Number of items selected per subtask. If not provided, the full subtask set will be used instead.

Returns

corpora – Dictionary mapping a subtask to a Corpus.

Return type

Dict[str, Corpus]
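
Assuming a LinzenTask constructed the same way as in the Lakretz sketch above, items_per_subtask can be lowered for a quick evaluation pass; the directory and the value 250 are hypothetical:

    # `task` is a LinzenTask constructed as in the Lakretz sketch above.
    corpora = task.initialize(
        "data/linzen",          # hypothetical data directory
        items_per_subtask=250,  # subsample each subtask instead of the default 1000
    )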

class diagnnose.syntax.tasks.linzen.RawItem(sentence: str, orig_sentence: str, pos_sentence: str, subj: str, verb: str, subj_pos: str, has_rel: str, has_nsubj: str, verb_pos: str, subj_index: str, verb_index: str, n_intervening: str, last_intervening: str, n_diff_intervening: str, distance: str, max_depth: str, all_nouns: str, nouns_up_to_verb: str)[source]

Bases: tuple

The original corpus structure contains these 18 fields.

property all_nouns

Alias for field number 16

property distance

Alias for field number 14

property has_nsubj

Alias for field number 7

property has_rel

Alias for field number 6

property last_intervening

Alias for field number 12

property max_depth

Alias for field number 15

property n_diff_intervening

Alias for field number 13

property n_intervening

Alias for field number 11

property nouns_up_to_verb

Alias for field number 17

property orig_sentence

Alias for field number 1

property pos_sentence

Alias for field number 2

property sentence

Alias for field number 0

property subj

Alias for field number 3

property subj_index

Alias for field number 9

property subj_pos

Alias for field number 5

property verb

Alias for field number 4

property verb_index

Alias for field number 10

property verb_pos

Alias for field number 8
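
A sketch of mapping one line of the raw corpus onto RawItem. The assumption here is that the file is tab-separated with the 18 fields in exactly the constructor order above and that all values are kept as strings; the file name is hypothetical:

    from diagnnose.syntax.tasks.linzen import RawItem

    with open("data/linzen/corpus.tsv") as f:  # hypothetical file name
        for line in f:
            # Assumes exactly 18 tab-separated columns per line.
            item = RawItem(*line.rstrip("\n").split("\t"))
            if item.n_intervening != "0":  # fields are strings, not ints
                print(item.sentence, "|", item.subj, item.verb)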

Marvin

class diagnnose.syntax.tasks.marvin.MarvinTask(model: LanguageModel, tokenizer: transformers.PreTrainedTokenizer, ignore_unk: bool, use_full_model_probs: bool, **config: Dict[str, Any])[source]

Bases: SyntaxEvalTask

initialize(path: str, subtasks: Optional[List[str]] = None) Dict[str, Union[Corpus, Dict[str, Corpus]]][source]

Performs the initialization for the tasks of Marvin & Linzen (2018)

Arxiv link: https://arxiv.org/pdf/1808.09031.pdf

Repo: https://github.com/BeckyMarvin/LM_syneval

Parameters
  • path (str) – Path to directory containing the Marvin datasets that can be found in the github repo.

  • subtasks (List[str], optional) – The subtasks that will be tested. If not provided, this will default to the full set of subtasks.

Returns

corpora – Dictionary mapping a subtask to a Corpus.

Return type

Dict[str, Corpus]
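
Assuming a MarvinTask constructed as in the Lakretz sketch above, the default call loads every subtask; note that a value in the returned dictionary can itself be a nested dict of corpora, per the Union return type:

    # `task` is a MarvinTask constructed as in the Lakretz sketch above.
    corpora = task.initialize("data/marvin")  # hypothetical directory; all subtasks

    for subtask, corpus in corpora.items():
        # `corpus` is either a Corpus or a dict of condition -> Corpus.
        print(subtask, type(corpus).__name__)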

Warstadt

class diagnnose.syntax.tasks.warstadt.WarstadtTask(model: LanguageModel, tokenizer: transformers.PreTrainedTokenizer, ignore_unk: bool, use_full_model_probs: bool, **config: Dict[str, Any])[source]

Bases: SyntaxEvalTask

initialize(path: str, subtasks: Optional[List[str]] = None) Dict[str, Union[Corpus, Dict[str, Corpus]]][source]

Performs the initialization for the NPI tasks of Warstadt et al. (2019)

Paper: https://arxiv.org/pdf/1901.03438.pdf

Data: https://alexwarstadt.files.wordpress.com/2019/08/npi_lincensing_data.zip

Parameters
  • path (str) – Path to the directory containing the Warstadt NPI corpus that can be found at the data link above.

  • subtasks (List[str], optional) – The subtasks that will be tested. If not provided, this will default to the full set of subtasks.

Returns

corpora – Dictionary mapping a subtask to a Corpus.

Return type

Dict[str, Corpus]
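
Again assuming a task constructed as in the Lakretz sketch, a run can be restricted to a subset of licensing environments via subtasks; the environment names below are hypothetical examples, not a documented list:

    # `task` is a WarstadtTask constructed as in the Lakretz sketch above.
    corpora = task.initialize(
        "data/warstadt",                    # hypothetical data directory
        subtasks=["conditionals", "only"],  # hypothetical environment names
    )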

Warstadt Preproc

diagnnose.syntax.tasks.warstadt_preproc.create_downstream_corpus(orig_corpus: Union[str, Dict[int, Dict[Tuple[int, int, int], Dict[str, Any]]]], output_path: Optional[str] = None, conditions: Optional[List[Tuple[int, int, int]]] = None, envs: Optional[List[str]] = None, skip_duplicate_items: bool = False) List[str][source]

Create a new corpus from the original one that contains the subsentences up to the position of the NPI.

Parameters
  • orig_corpus (str | CorpusDict) – Either the path to the original corpus, or a CorpusDict that has been created using preproc_warstadt.

  • output_path (str, optional) – Path to the output file that will be created in .tsv format. If not provided, the corpus won’t be written to disk.

  • conditions (List[Tuple[int, int, int]], optional) – List of corpus item conditions (licensor, scope, npi_present). If not provided, the correct NPI cases (1, 1, 1) will be used.

  • envs (List[str], optional) – List of licensing environments that should be used.

  • skip_duplicate_items (bool) – Some corpus items differ only in their post-NPI content and therefore yield equivalent results on a downstream task; if True, such duplicates are skipped. Defaults to False.

Returns

corpus – List of strings representing each corpus item. Note that the first line of the list contains the .tsv header.

Return type

List[str]
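
A hedged sketch of calling create_downstream_corpus directly on the original corpus file, keeping the default correct-NPI condition, filtering to one environment, and writing the result to disk; all paths and the environment name are hypothetical:

    from diagnnose.syntax.tasks.warstadt_preproc import create_downstream_corpus

    lines = create_downstream_corpus(
        "data/warstadt/npi_data.tsv",  # hypothetical path to the original corpus
        output_path="data/warstadt/npi_downstream.tsv",
        envs=["conditionals"],         # hypothetical licensing environment
        skip_duplicate_items=True,
    )

    print(lines[0])                    # the first line holds the .tsv header
    print(len(lines) - 1, "corpus items")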

diagnnose.syntax.tasks.warstadt_preproc.preproc_warstadt(path: str) Dict[int, Dict[Tuple[int, int, int], Dict[str, Any]]][source]

Reads and preprocesses the NPI corpus of Warstadt et al. (2019).

Paper: https://arxiv.org/pdf/1901.03438.pdf

Data: https://alexwarstadt.files.wordpress.com/2019/08/npi_lincensing_data.zip

Parameters

path (str) – Path to .tsv corpus file.

Returns

  • sen_id2items (CorpusDict) – Dictionary mapping each sen_id to a dictionary from (licensor, scope, npi_present) triplets to the full corpus item.

  • env2sen_ids (EnvIdDict) – Dictionary mapping each env type to a list of sen_id’s of that type.
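
A sketch combining the two helpers: preprocess the raw .tsv once, inspect the returned dictionaries, and reuse the CorpusDict instead of re-reading the file. The Returns section above documents two values, so the sketch unpacks a tuple even though the signature line only shows the first annotation; the path is hypothetical:

    from diagnnose.syntax.tasks.warstadt_preproc import (
        create_downstream_corpus,
        preproc_warstadt,
    )

    sen_id2items, env2sen_ids = preproc_warstadt("data/warstadt/npi_data.tsv")

    print(len(sen_id2items), "sentence ids")
    print(sorted(env2sen_ids))  # the licensing environments present in the corpus

    # Reuse the preprocessed dict; keep only the correct-NPI condition (1, 1, 1).
    lines = create_downstream_corpus(sen_id2items, conditions=[(1, 1, 1)])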

Winobias

class diagnnose.syntax.tasks.winobias.WinobiasTask(model: LanguageModel, tokenizer: transformers.PreTrainedTokenizer, ignore_unk: bool, use_full_model_probs: bool, **config: Dict[str, Any])[source]

Bases: SyntaxEvalTask

initialize(path: str, subtasks: Optional[List[str]] = None) Dict[str, Union[Corpus, Dict[str, Corpus]]][source]

Performs the initialization for the WinoBias tasks, based on the corpus of Zhao et al. (2018)

Parameters
  • path (str) – Path to the directory containing the WinoBias datasets.

  • subtasks (List[str], optional) – The subtasks that will be tested. If not provided, this will default to the full set of subtasks.

Returns

corpora – Dictionary mapping a subtask to a Corpus.

Return type

Dict[str, Corpus]
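
As with the other tasks, a hedged sketch assuming a WinobiasTask constructed as in the Lakretz example; the data directory is hypothetical and no subtask names are assumed:

    # `task` is a WinobiasTask constructed as in the Lakretz sketch above.
    corpora = task.initialize("data/winobias")  # defaults to the full set of subtasks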