sciwing.datasets.seq_labeling

base_seq_labeling

class sciwing.datasets.seq_labeling.base_seq_labeling.BaseSeqLabelingDataset(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Bases: object

__init__(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Base sequence labeling dataset, to be inherited by all sequence labeling datasets.

Parameters:

filename (str) – Path of the file where the sequence labeling dataset is stored. Ideally, each example in the file pairs the text with its labels, separated by a space, but it is left to the specific dataset to handle the different ways in which the file could be structured
tokenizers (Dict[str, BaseTokenizer]) – A mapping from namespace to the tokenizer

get_lines_labels() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.seq_label.SeqLabel])

A list of lines from the file and a list of corresponding labels

This method is to be implemented by a new dataset. The decision on the implementation logic is left to the new class. Datasets come in all shapes and sizes.

Returns:	A list of text examples and a list of corresponding labels
Return type:	(List[Line], List[SeqLabel])
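
Because the parsing logic is left to each subclass, a concrete dataset mostly comes down to overriding get_lines_labels(). The sketch below illustrates that pattern only: SeqLabelingDatasetSketch is a stand-in base class written for this example (it is not the sciwing class), ConllStyleDataset and its one-token-per-line file layout are hypothetical, and it returns plain lists of strings rather than Line and SeqLabel objects.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class SeqLabelingDatasetSketch(ABC):
    """Stand-in base class for illustration; NOT the sciwing implementation."""

    def __init__(self, filename: str):
        self.filename = filename

    @abstractmethod
    def get_lines_labels(self) -> Tuple[List[List[str]], List[List[str]]]:
        """Return the tokens of every example and the labels of every example."""


class ConllStyleDataset(SeqLabelingDatasetSketch):
    """Hypothetical subclass: one "token label" pair per line,
    with a blank line separating consecutive sentences."""

    def get_lines_labels(self) -> Tuple[List[List[str]], List[List[str]]]:
        lines: List[List[str]] = []
        labels: List[List[str]] = []
        tokens: List[str] = []
        tags: List[str] = []
        with open(self.filename, encoding="utf-8") as fp:
            for raw in fp:
                raw = raw.strip()
                if not raw:
                    # A blank line closes the current sentence, if any.
                    if tokens:
                        lines.append(tokens)
                        labels.append(tags)
                        tokens, tags = [], []
                    continue
                # The last whitespace-separated field is taken as the label.
                token, tag = raw.rsplit(maxsplit=1)
                tokens.append(token)
                tags.append(tag)
        # Flush the final sentence when the file does not end with a blank line.
        if tokens:
            lines.append(tokens)
            labels.append(tags)
        return lines, labels
```

A line that holds a token without a label raises ValueError in this sketch; a real dataset would decide for itself how to treat malformed input.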

seq_labelling_dataset

class sciwing.datasets.seq_labeling.seq_labelling_dataset.SeqLabellingDataset(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Bases: sciwing.datasets.seq_labeling.base_seq_labeling.BaseSeqLabelingDataset

This represents a dataset in which every line has the form:

word1###label1 word2###label2 word3###label3
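
To make the word###label layout concrete, here is a minimal sketch of splitting one such line into parallel word and label lists. The function split_labeled_line and its sep parameter are hypothetical names introduced for this example; this is not the parser that sciwing itself uses.

```python
from typing import List, Tuple


def split_labeled_line(line: str, sep: str = "###") -> Tuple[List[str], List[str]]:
    """Split 'w1###l1 w2###l2 ...' into ([w1, w2, ...], [l1, l2, ...])."""
    words: List[str] = []
    labels: List[str] = []
    for item in line.split():
        # Split on the last separator so a word may itself contain "###".
        word, label = item.rsplit(sep, 1)
        words.append(word)
        labels.append(label)
    return words, labels


words, labels = split_labeled_line("word1###label1 word2###label2 word3###label3")
print(words)
print(labels)
```

An item without the separator raises ValueError here, which surfaces formatting problems early instead of silently misaligning words and labels.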

get_lines_labels() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.seq_label.SeqLabel])

A list of lines from the file and a list of corresponding labels

This method is to be implemented by a new dataset. The decision on the implementation logic is left to the new class. Datasets come in all shapes and sizes.

Returns:	A list of text examples and a list of corresponding labels
Return type:	(List[Line], List[SeqLabel])

class sciwing.datasets.seq_labeling.seq_labelling_dataset.SeqLabellingDatasetManager(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)

Bases: sciwing.data.datasets_manager.DatasetsManager

__init__(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)

Parameters:

train_filename (str) – The path where the train file is stored
dev_filename (str) – The path where the dev file is stored
test_filename (str) – The path where the test file is stored
tokenizers (Dict[str, BaseTokenizer]) – A mapping from namespace to the tokenizer
namespace_vocab_options (Dict[str, Dict[str, Any]]) – A mapping from the namespace to the vocabulary options for that namespace
namespace_numericalizer_map (Dict[str, BaseNumericalizer]) – Every namespace can have a different numericalizer specified
batch_size (int) – The batch size of the data returned
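
As a usage sketch, the snippet below writes three tiny split files in the word###label format and shows how the manager could then be constructed with the keyword arguments from the signature above. The helpers write_split and build_manager are hypothetical names introduced for this example. build_manager is defined but not called, since it needs sciwing installed, and it assumes the remaining arguments can stay at their defaults of None, which the signature suggests but this section does not confirm.

```python
import os
import tempfile
from typing import List, Tuple


def write_split(path: str, sentences: List[List[Tuple[str, str]]]) -> None:
    """Write one sentence per line as 'word###label word###label ...'."""
    with open(path, "w", encoding="utf-8") as fp:
        for sentence in sentences:
            fp.write(" ".join(f"{word}###{label}" for word, label in sentence) + "\n")


def build_manager(train_path: str, dev_path: str, test_path: str, batch_size: int = 10):
    """Construct the manager; requires sciwing, so the import is deferred."""
    from sciwing.datasets.seq_labeling.seq_labelling_dataset import (
        SeqLabellingDatasetManager,
    )

    # Assumption: tokenizers, vocab options and numericalizers keep their defaults.
    return SeqLabellingDatasetManager(
        train_filename=train_path,
        dev_filename=dev_path,
        test_filename=test_path,
        batch_size=batch_size,
    )


# Toy data, made up for this example only.
toy_sentences = [[("word1", "label1"), ("word2", "label2")]]

data_dir = tempfile.mkdtemp()
paths = {}
for split in ("train", "dev", "test"):
    paths[split] = os.path.join(data_dir, f"{split}.txt")
    write_split(paths[split], toy_sentences)
```

With sciwing available, calling build_manager(paths["train"], paths["dev"], paths["test"], batch_size=2) would be the next step.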