sciwing.datasets.seq_labeling

base_seq_labeling

class sciwing.datasets.seq_labeling.base_seq_labeling.BaseSeqLabelingDataset(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Bases: object

__init__(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Base sequence labeling dataset to be inherited by all sequence labeling datasets

Parameters:
  • filename (str) – Path of the file where the sequence labeling dataset is stored. Ideally this file holds example text together with its labels. Handling the different ways in which the file could be structured is left to the specific dataset
  • tokenizers (Dict[str, BaseTokenizer]) – A mapping from namespace to the tokenizer
get_lines_labels() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.seq_label.SeqLabel])

A list of lines from the file and a list of corresponding labels

This method is to be implemented by a new dataset. The decision on the implementation logic is left to the new class. Datasets come in all shapes and sizes.

Returns: A list of lines from the file and a list of the corresponding labels
Return type: (List[Line], List[SeqLabel])
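
The contract described above, two parallel lists with one label entry per line, can be sketched roughly as follows. This is only an illustration: Line and SeqLabel here are simplified stand-ins defined in the snippet, not the real sciwing.data classes, and BaseDataset, SlashTaggedDataset and the "word/TAG" input format are hypothetical. The real constructor takes a filename; the sketch takes in-memory strings to stay self-contained.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Line:
    """Simplified stand-in for sciwing.data.line.Line (assumption)."""
    text: str


@dataclass
class SeqLabel:
    """Simplified stand-in for sciwing.data.seq_label.SeqLabel (assumption)."""
    labels: List[str]


class BaseDataset:
    """Hypothetical base class: subclasses decide how examples are parsed."""

    def get_lines_labels(self) -> Tuple[List[Line], List[SeqLabel]]:
        raise NotImplementedError


class SlashTaggedDataset(BaseDataset):
    """Hypothetical dataset whose examples look like 'the/DET cat/NOUN'."""

    def __init__(self, examples: List[str]):
        self.examples = examples

    def get_lines_labels(self) -> Tuple[List[Line], List[SeqLabel]]:
        lines: List[Line] = []
        labels: List[SeqLabel] = []
        for example in self.examples:
            # Split each item on its last "/" into a (word, tag) pair.
            pairs = [item.rsplit("/", 1) for item in example.split()]
            lines.append(Line(text=" ".join(word for word, _ in pairs)))
            labels.append(SeqLabel(labels=[tag for _, tag in pairs]))
        return lines, labels
```

Whatever the file layout, the two returned lists stay aligned: the entry at index i of the labels list holds the labels for the line at index i.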

seq_labelling_dataset

class sciwing.datasets.seq_labeling.seq_labelling_dataset.SeqLabellingDataset(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Bases: sciwing.datasets.seq_labeling.base_seq_labeling.BaseSeqLabelingDataset

This represents a dataset in which every line has the form

word1###label1 word2###label2 word3###label3
word1###label1 word2###label2 word3###label3
word1###label1 word2###label2 word3###label3
...
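
As a minimal sketch of how one line in this format could be split into tokens and labels: the snippet below is plain Python for illustration, not SciWING's own parsing code. parse_line is a hypothetical helper, and the "###" delimiter is taken from the format shown above.

```python
from typing import List, Tuple


def parse_line(line: str, delimiter: str = "###") -> Tuple[List[str], List[str]]:
    """Split one `word###label` line into parallel token and label lists."""
    tokens: List[str] = []
    labels: List[str] = []
    for pair in line.split():
        # rpartition splits on the last delimiter, so a word that itself
        # contains "###" keeps everything before the final delimiter.
        word, sep, label = pair.rpartition(delimiter)
        if not sep:
            raise ValueError(f"missing {delimiter!r} in {pair!r}")
        tokens.append(word)
        labels.append(label)
    return tokens, labels


tokens, labels = parse_line("word1###label1 word2###label2 word3###label3")
print(tokens)
print(labels)
```

Raising on a pair without the delimiter is a design choice of this sketch; a real dataset class might skip or log such lines instead.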

get_lines_labels() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.seq_label.SeqLabel])

A list of lines from the file and a list of corresponding labels

This method is to be implemented by a new dataset. The decision on the implementation logic is left to the new class. Datasets come in all shapes and sizes.

Returns: A list of lines from the file and a list of the corresponding labels
Return type: (List[Line], List[SeqLabel])
class sciwing.datasets.seq_labeling.seq_labelling_dataset.SeqLabellingDatasetManager(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)

Bases: sciwing.data.datasets_manager.DatasetsManager

__init__(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)
Parameters:
  • train_filename (str) – The path where the train file is stored
  • dev_filename (str) – The path where the dev file is stored
  • test_filename (str) – The path where the test file is stored
  • tokenizers (Dict[str, BaseTokenizer]) – A mapping from namespace to the tokenizer
  • namespace_vocab_options (Dict[str, Dict[str, Any]]) – A mapping from namespace to the vocabulary options for that namespace
  • namespace_numericalizer_map (Dict[str, BaseNumericalizer]) – Every namespace can have a different numericalizer specified
  • batch_size (int) – The batch size of the data returned