sciwing.datasets.classification

base_text_classification

class sciwing.datasets.classification.base_text_classification.BaseTextClassification(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Bases: object

__init__(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Base Text Classification Dataset to be inherited by all text classification datasets

Parameters:
  • filename (str) – Full path of the file where the classification dataset is stored
  • tokenizers (Dict[str, BaseTokenizer]) – The mapping between namespace and a tokenizer
get_lines_labels() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.label.Label])

A list of lines from the file and a list of corresponding labels

This method must be implemented by every new dataset; the implementation logic is left to the subclass, since datasets come in all shapes and sizes.

Returns: A list of text examples and the corresponding labels
Return type: (List[Line], List[Label])
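The contract above can be sketched with plain-Python stand-ins. Here `Line`, `Label`, the base class, and the `TsvClassificationDataset` subclass are simplified placeholders written for illustration, not the real sciwing classes; the point is that each subclass supplies its own file-reading logic for get_lines_labels:

```python
import tempfile
from typing import Dict, List, Tuple


class Line:
    """Stand-in for sciwing.data.line.Line (attribute name assumed)."""
    def __init__(self, text: str):
        self.text = text


class Label:
    """Stand-in for sciwing.data.label.Label (attribute name assumed)."""
    def __init__(self, text: str):
        self.text = text


class BaseTextClassification:
    """Mirrors the documented contract of the sciwing base class."""
    def __init__(self, filename: str, tokenizers: Dict[str, object]):
        self.filename = filename
        self.tokenizers = tokenizers

    def get_lines_labels(self) -> Tuple[List[Line], List[Label]]:
        # Each concrete dataset decides how to read its own file format.
        raise NotImplementedError


class TsvClassificationDataset(BaseTextClassification):
    """A hypothetical subclass for tab-separated text<TAB>label files."""
    def get_lines_labels(self) -> Tuple[List[Line], List[Label]]:
        lines, labels = [], []
        with open(self.filename) as fp:
            for row in fp:
                text, label = row.rstrip("\n").split("\t")
                lines.append(Line(text))
                labels.append(Label(label))
        return lines, labels


# Write a tiny TSV file and read it back through the dataset.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as fp:
    fp.write("a fine paper\taccept\nweak baselines\treject\n")
    path = fp.name

dataset = TsvClassificationDataset(filename=path, tokenizers={})
lines, labels = dataset.get_lines_labels()
```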

text_classification_dataset

class sciwing.datasets.classification.text_classification_dataset.TextClassificationDataset(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = <sciwing.tokenizers.word_tokenizer.WordTokenizer object>)

Bases: sciwing.datasets.classification.base_text_classification.BaseTextClassification

This represents a dataset of the form:

line1###label1
line2###label2
line3###label3
...

get_lines_labels() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.label.Label])

A list of lines from the file and a list of corresponding labels

This implementation reads the file, splits every row on the ### delimiter, and returns the resulting lines and labels.

Returns: A list of text examples and the corresponding labels
Return type: (List[Line], List[Label])
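The ###-delimited layout above can be parsed with ordinary string splitting. A minimal sketch of what get_lines_labels does for this format (not the library's actual implementation; the sample rows are made up):

```python
# Parse the line###label layout that TextClassificationDataset expects.
raw = """the paper is well written###accept
the method lacks novelty###reject
results are strong###accept"""

lines, labels = [], []
for row in raw.splitlines():
    # Split on the last ### so the text itself may contain the delimiter.
    text, _, label = row.rpartition("###")
    lines.append(text)
    labels.append(label)
```

Splitting on the last occurrence of the delimiter (rpartition) keeps the example robust even if ### appears inside the text.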
labels
lines
class sciwing.datasets.classification.text_classification_dataset.TextClassificationDatasetManager(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)

Bases: sciwing.data.datasets_manager.DatasetsManager, sciwing.utils.class_nursery.ClassNursery

__init__(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)
Parameters:
  • train_filename (str) – The path where the train file is stored
  • dev_filename (str) – The path where the dev file is stored
  • test_filename (str) – The path where the test file is stored
  • tokenizers (Dict[str, BaseTokenizer]) – A mapping from namespace to the tokenizer
  • namespace_vocab_options (Dict[str, Dict[str, Any]]) – A mapping from a vocabulary namespace to its options
  • namespace_numericalizer_map (Dict[str, BaseNumericalizer]) – Every namespace can have a different numericalizer specified
  • batch_size (int) – The batch size of the data returned
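The batch_size parameter controls how many examples each returned batch contains. A self-contained sketch of that chunking (plain Python, not sciwing's actual DataLoader machinery; `make_batches` is a hypothetical helper):

```python
from typing import List, Tuple


def make_batches(examples: List[Tuple[str, str]],
                 batch_size: int = 10) -> List[List[Tuple[str, str]]]:
    """Chunk (line, label) pairs into fixed-size batches, as the
    manager's batch_size parameter implies. The final batch may be
    smaller when the dataset size is not a multiple of batch_size."""
    return [examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)]


# 25 toy examples with the default batch_size of 10.
examples = [(f"line {i}", "label") for i in range(25)]
batches = make_batches(examples, batch_size=10)
```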