sciwing.datasets.classification

base_text_classification

class sciwing.datasets.classification.base_text_classification.BaseTextClassification(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Bases: object

__init__(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Base Text Classification Dataset to be inherited by all text classification datasets

Parameters:
  • filename (str) – Full path of the file where the classification dataset is stored
  • tokenizers (Dict[str, BaseTokenizer]) – The mapping between namespace and a tokenizer
get_lines_labels() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.label.Label])

A list of lines from the file and a list of corresponding labels

This method must be implemented by every new dataset; the implementation logic is left to the subclass, since datasets come in all shapes and sizes.

Returns: A list of text examples and the corresponding labels
Return type: (List[Line], List[Label])
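The contract above can be sketched with plain-Python stand-ins. Here `Line`, `Label`, the base class, and the `TsvClassificationDataset` subclass are simplified placeholders written for illustration, not the real sciwing classes; the point is that each subclass supplies its own file-reading logic for get_lines_labels:

```python
import tempfile
from typing import Dict, List, Tuple


class Line:
    """Stand-in for sciwing.data.line.Line (attribute name assumed)."""
    def __init__(self, text: str):
        self.text = text


class Label:
    """Stand-in for sciwing.data.label.Label (attribute name assumed)."""
    def __init__(self, text: str):
        self.text = text


class BaseTextClassification:
    """Mirrors the documented contract of the sciwing base class."""
    def __init__(self, filename: str, tokenizers: Dict[str, object]):
        self.filename = filename
        self.tokenizers = tokenizers

    def get_lines_labels(self) -> Tuple[List[Line], List[Label]]:
        # Each concrete dataset decides how to read its own file format.
        raise NotImplementedError


class TsvClassificationDataset(BaseTextClassification):
    """A hypothetical subclass for tab-separated text<TAB>label files."""
    def get_lines_labels(self) -> Tuple[List[Line], List[Label]]:
        lines, labels = [], []
        with open(self.filename) as fp:
            for row in fp:
                text, label = row.rstrip("\n").split("\t")
                lines.append(Line(text))
                labels.append(Label(label))
        return lines, labels


# Write a tiny TSV file and read it back through the dataset.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as fp:
    fp.write("a fine paper\taccept\nweak baselines\treject\n")
    path = fp.name

dataset = TsvClassificationDataset(filename=path, tokenizers={})
lines, labels = dataset.get_lines_labels()
```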

text_classification_dataset

class sciwing.datasets.classification.text_classification_dataset.TextClassificationDataset(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = <sciwing.tokenizers.word_tokenizer.WordTokenizer object>)

Bases: sciwing.datasets.classification.base_text_classification.BaseTextClassification

This represents a dataset of the form:

line1###label1
line2###label2
line3###label3
...

get_lines_labels() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.label.Label])

A list of lines from the file and a list of corresponding labels

This implementation reads the file, splits every row on the ### delimiter, and returns the resulting lines and labels.

Returns: A list of text examples and the corresponding labels
Return type: (List[Line], List[Label])
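The ###-delimited layout above can be parsed with ordinary string splitting. A minimal sketch of what get_lines_labels does for this format (not the library's actual implementation; the sample rows are made up):

```python
# Parse the line###label layout that TextClassificationDataset expects.
raw = """the paper is well written###accept
the method lacks novelty###reject
results are strong###accept"""

lines, labels = [], []
for row in raw.splitlines():
    # Split on the last ### so the text itself may contain the delimiter.
    text, _, label = row.rpartition("###")
    lines.append(text)
    labels.append(label)
```

Splitting on the last occurrence of the delimiter (rpartition) keeps the example robust even if ### appears inside the text.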
labels
lines
class sciwing.datasets.classification.text_classification_dataset.TextClassificationDatasetManager(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)

Bases: sciwing.data.datasets_manager.DatasetsManager, sciwing.utils.class_nursery.ClassNursery

__init__(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)
Parameters:
  • train_filename (str) – The path where the train file is stored
  • dev_filename (str) – The path where the dev file is stored
  • test_filename (str) – The path where the test file is stored
  • tokenizers (Dict[str, BaseTokenizer]) – A mapping from namespace to the tokenizer
  • namespace_vocab_options (Dict[str, Dict[str, Any]]) – A mapping from a vocabulary namespace to its options
  • namespace_numericalizer_map (Dict[str, BaseNumericalizer]) – Every namespace can have a different numericalizer specified
  • batch_size (int) – The batch size of the data returned
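The batch_size parameter controls how many examples each returned batch contains. A self-contained sketch of that chunking (plain Python, not sciwing's actual DataLoader machinery; `make_batches` is a hypothetical helper):

```python
from typing import List, Tuple


def make_batches(examples: List[Tuple[str, str]],
                 batch_size: int = 10) -> List[List[Tuple[str, str]]]:
    """Chunk (line, label) pairs into fixed-size batches, as the
    manager's batch_size parameter implies. The final batch may be
    smaller when the dataset size is not a multiple of batch_size."""
    return [examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)]


# 25 toy examples with the default batch_size of 10.
examples = [(f"line {i}", "label") for i in range(25)]
batches = make_batches(examples, batch_size=10)
```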