sciwing.datasets.seq_labeling
base_seq_labeling
-
class
sciwing.datasets.seq_labeling.base_seq_labeling.
BaseSeqLabelingDataset
(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer]) Bases:
object
-
__init__
(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer]) Base sequence labeling dataset, to be inherited by all sequence labeling datasets.
Parameters: - filename (str) – Path of the file where the dataset is stored. Ideally, each line should hold one example together with its labels. How a differently structured file is handled is left to the specific dataset.
- tokenizers (Dict[str, BaseTokenizer]) – A mapping from namespace to the tokenizer
-
get_lines_labels
() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.seq_label.SeqLabel]) Returns a list of lines from the file and a list of the corresponding labels.
This method is to be implemented by each new dataset. The implementation logic is left to the new class, since datasets come in all shapes and sizes.
Returns: A list of lines and a list of the corresponding labels. Return type: (List[Line], List[SeqLabel])
-
seq_labelling_dataset
-
class
sciwing.datasets.seq_labeling.seq_labelling_dataset.
SeqLabellingDataset
(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer]) Bases:
sciwing.datasets.seq_labeling.base_seq_labeling.BaseSeqLabelingDataset
This represents a dataset in which each line is one example and every word is joined to its label by ###:
word1###label1 word2###label2 word3###label3
word1###label1 word2###label2 word3###label3
word1###label1 word2###label2 word3###label3
...
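The file format above can be illustrated with a small standalone parser. This is a minimal sketch in plain Python, not the library's own implementation: the function name `parse_seq_label_line` and the `sep` parameter are hypothetical, and it returns plain strings rather than sciwing's `Line` and `SeqLabel` objects.

```python
from typing import List, Tuple


def parse_seq_label_line(line: str, sep: str = "###") -> Tuple[List[str], List[str]]:
    """Split one 'word###label word###label ...' line into words and labels.

    Hypothetical helper for illustration only; it is not part of sciwing.
    """
    words: List[str] = []
    labels: List[str] = []
    for item in line.split():
        # Split on the LAST separator so a word may itself contain '###'.
        word, _, label = item.rpartition(sep)
        if not word or not label:
            raise ValueError(f"Malformed token {item!r}: expected word{sep}label")
        words.append(word)
        labels.append(label)
    return words, labels


words, labels = parse_seq_label_line("Deep###B-TASK learning###I-TASK works###O")
print(words)   # ['Deep', 'learning', 'works']
print(labels)  # ['B-TASK', 'I-TASK', 'O']
```

The words and labels stay aligned by position, which is the property a sequence labeling dataset relies on.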
-
get_lines_labels
() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.seq_label.SeqLabel]) Returns a list of lines from the file and a list of the corresponding labels.
This method is to be implemented by each new dataset. The implementation logic is left to the new class, since datasets come in all shapes and sizes.
Returns: A list of lines and a list of the corresponding labels. Return type: (List[Line], List[SeqLabel])
-
-
class
sciwing.datasets.seq_labeling.seq_labelling_dataset.
SeqLabellingDatasetManager
(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10) Bases:
sciwing.data.datasets_manager.DatasetsManager
-
__init__
(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)
Parameters: - train_filename (str) – The path where the train file is stored
- dev_filename (str) – The path where the dev file is stored
- test_filename (str) – The path where the test file is stored
- tokenizers (Dict[str, BaseTokenizer]) – A mapping from namespace to the tokenizer
- namespace_vocab_options (Dict[str, Dict[str, Any]]) – A mapping from namespace to the vocabulary options for that namespace
- namespace_numericalizer_map (Dict[str, BaseNumericalizer]) – Every namespace can have a different numericalizer specified
- batch_size (int) – The batch size of the data returned
-
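The constructor parameters above can be exercised with three small files in the word###label format. The following is a hedged sketch: `write_split`, `make_toy_splits` and `build_manager` are hypothetical helpers, the toy sentence and labels are made up, and `build_manager` assumes that sciwing is installed and that the defaults (`tokenizers=None` and so on) are acceptable for such a file. Only the import path and keyword names are taken from the signature documented here.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple

Sentence = List[Tuple[str, str]]  # (word, label) pairs


def write_split(path: Path, sentences: List[Sentence], sep: str = "###") -> None:
    """Write one example per line as 'word###label word###label ...'."""
    lines = [" ".join(f"{word}{sep}{label}" for word, label in s) for s in sentences]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def make_toy_splits(directory: Path) -> Dict[str, str]:
    """Create train/dev/test files and return them keyed by constructor argument."""
    toy: List[Sentence] = [[("Deep", "B-TASK"), ("learning", "I-TASK"), ("works", "O")]]
    paths: Dict[str, str] = {}
    for split in ("train", "dev", "test"):
        file_path = directory / f"{split}.txt"
        write_split(file_path, toy)
        paths[f"{split}_filename"] = str(file_path)
    return paths


def build_manager(paths: Dict[str, str], batch_size: int = 10):
    """Assumes sciwing is installed; imported lazily so the rest runs without it."""
    from sciwing.datasets.seq_labeling.seq_labelling_dataset import (
        SeqLabellingDatasetManager,
    )

    return SeqLabellingDatasetManager(**paths, batch_size=batch_size)


with tempfile.TemporaryDirectory() as tmp:
    split_paths = make_toy_splits(Path(tmp))
    print(sorted(split_paths))  # ['dev_filename', 'test_filename', 'train_filename']
    # manager = build_manager(split_paths)  # uncomment when sciwing is available
```

Keeping the three paths in a dictionary keyed by argument name makes it easy to pass them straight to the constructor with `**paths`.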