Welcome to SciWing’s documentation!

SciWING is a modular, easy-to-extend framework for experimenting with modern techniques for Scholarly Document Processing. It makes it easy to add datasets and models, and provides tools to experiment with them.

SciWING is a framework from WING-NUS that facilitates Scientific Document Processing. It is built on PyTorch, is designed for modularity from the ground up, and offers an easy-to-use interface. SciWING includes many pre-trained models for fundamental tasks in Scientific Document Processing for practitioners. It has the following advantages:

  • Modularity - The framework embraces modularity from the ground up. SciWING helps you create new models by combining multiple reusable modules, so you can experiment with new approaches easily.
  • Pre-trained Models - SciWING has many pre-trained models for fundamental tasks, such as the Logical Section Classifier for scientific documents and citation string parsing (take a look at other projects related to citation parsing: Parscit, Neural_Parscit). Easy access to the pre-trained models is available through web APIs.
  • Run from Config File - SciWING enables you to declare datasets, models and experiment hyper-parameters in a TOML file. The models declared in a TOML file have a one-to-one correspondence with their respective class declarations in a Python file. SciWING parses the model declaration into a Directed Acyclic Graph (DAG) and instantiates the model using the DAG’s topological ordering.
  • Extensible - SciWING enables easy addition of new datasets and provides command line tools for it. It also supports custom modules, which are PyTorch modules.
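
To make the "Run from Config File" idea concrete, here is a minimal sketch of how a topological ordering lets components be instantiated only after the components they depend on. It is not SciWING's own parser: the `topological_order` helper, the `model_declaration` dictionary and the component names are all hypothetical, and the ordering uses plain Kahn's algorithm.

```python
from collections import deque
from typing import Dict, List


def topological_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Order nodes so that each node comes after all of its dependencies."""
    # Number of dependencies that have not been placed in the order yet.
    remaining = {node: len(deps) for node, deps in dependencies.items()}
    # Reverse edges: which nodes are waiting on a given node.
    dependents: Dict[str, List[str]] = {node: [] for node in dependencies}
    for node, deps in dependencies.items():
        for dep in deps:
            dependents[dep].append(node)

    ready = deque(sorted(node for node, count in remaining.items() if count == 0))
    order: List[str] = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in sorted(dependents[node]):
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    if len(order) != len(dependencies):
        raise ValueError("the declaration contains a cycle")
    return order


# Hypothetical declaration: each component lists the components it needs.
model_declaration = {
    "classifier": ["encoder"],
    "encoder": ["embedder"],
    "embedder": [],
}

# Components would be instantiated in this order.
print(topological_order(model_declaration))
```

With this ordering, the embedder is built first, then the encoder that wraps it, and finally the classifier that wraps the encoder.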

Usage

Installation and Getting Started

The first step in using SciWING is to install the package on your local system. Once the package is installed, you can directly access SciWING's functionality. SciWING downloads the pre-trained models, embeddings and other information required to run the models on demand.

On this page, we provide basic tutorials on installing and using SciWING.

Installation from Pip

SciWING currently supports only Python 3.7. Installing with pip, the Python package manager, is the default way to install SciWING. Make sure that pip itself runs under Python 3.7 as well. We recommend using virtualenv, which helps keep the development environment clean. To set up a virtual environment, simply run

virtualenv -ppython3.7 .venv
source .venv/bin/activate

To install SciWING, just run

pip install sciwing

This installs all the dependencies required to run SciWING like PyTorch.

If you want to install SciWING for the current user only, you can use

pip install --user sciwing

Building from source

  • Clone from git and change into the repository directory
git clone https://github.com/abhinavkashyap/sciwing.git
cd sciwing
  • Install the module in development mode
pip install -e .
  • Download spacy models
python -m spacy download en
  • Create directories where SciWING’s data are stored and embeddings/data are downloaded
sciwing develop makedirs
sciwing develop download

The download step is optional. It downloads the data and embeddings required for development. If you skip it, they are downloaded later upon first request.

  • SciWING uses pytest for testing. You can use the following command to run tests
pytest tests -n auto --dist=loadfile

The test suite is large and takes some time to run. We will work on reducing the test time in upcoming iterations.

Running API Services

The APIs are built using FastAPI. We have APIs for citation string parsing, citation intent classification and many other models. To run the APIs, navigate into the api folder of this repository and run

uvicorn api:app --reload

Note

Navigate to http://localhost:8000/docs to access the SwaggerUI. The UI enables you to try the different APIs using a web interface.

Running the Demos

The demos are built using Streamlit and make use of the APIs. Please make sure that the APIs are running before starting the demos. Navigate to the app folder and run a demo using streamlit (installed along with the package). For example, this command runs all the demos.

streamlit run all_apps.py

Note

The demos download the models and the embeddings if they have not been downloaded already, so the first run on your local machine might take time and memory. We have tested this on a 16GB MacBook Pro, where it works well. All the demos currently run on the CPU and do not make use of a GPU, even when one is present.

Accessing Models

SciWING comes with many pre-trained scientific document processing models that are easily accessible using a few lines of Python code. SciWING provides a consistent interface for all of its models. You can access these models immediately after installation. The required model parameters, embeddings, etc. are downloaded and initialized.

Note

The first access to these models takes time, since they need to be downloaded. Allow about 60s for the downloads to complete. Subsequent access to the models is faster.

Citation String Parsing

Neural Parscit is a citation parsing model. A citation string contains information like the author, the title of the publication, the conference/journal, the year of publication, etc. Neural Parscit extracts such information from references.

from sciwing.models.neural_parscit import NeuralParscit

# Instantiate the citation parsing model
neural_parscit = NeuralParscit()

# Predict on a reference string
neural_parscit.predict_for_text("Calzolari, N. (1982) Towards the organization of lexical definitions on a database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles University, Prague, pp.61-64.")

# Predict on a file - the file should contain one reference string per line
neural_parscit.predict_for_file("/path/to/file")

Tutorials

If you have not installed SciWING on your local machine yet, head over to our Installation and Getting Started section first. Here we provide more in-depth tutorials on accessing the pre-trained models, introspecting them, building models step by step, and more.

Examples

If you would like to see examples of how SciWING is used to train models for different tasks, the Python code for various tasks is given in the examples folder of our GitHub repo. Instructions for running the code are provided within every example.

Pre-trained Models

Note

If this is your first time using the package, it takes time to download the pre-trained models. Subsequent access to the models is faster.

Neural Parscit

Neural Parscit is a citation parsing model. A citation string contains information like the author, the title of the publication, the conference/journal, the year of publication, etc. Neural Parscit extracts such information from references.

>> from sciwing.models.neural_parscit import NeuralParscit

# Instantiate the citation parsing model
>> neural_parscit = NeuralParscit()

# Predict on a reference string
>> neural_parscit.predict_for_text("Calzolari, N. (1982) Towards the organization of lexical definitions on a database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles University, Prague, pp.61-64.")

# Predict on a file - the file should contain one reference string per line
>> neural_parscit.predict_for_file("/path/to/file")
Citation Intent Classification

Identifying the intent behind citing another scholarly document helps in fine-grained analysis of documents. Some citations refer to the methodology in another document, some refer to other works for background knowledge, and some compare and contrast their methods with another work. Citation Intent Classification models classify such intents.

>> from sciwing.models.citation_intent_clf import CitationIntentClassification

# instantiate an object
>> citation_intent_clf = CitationIntentClassification()

# predict the intention of the citation
>> citation_intent_clf.predict_for_text("Abu-Jbara et al. (2013) relied on lexical, structural, and syntactic features and a linear SVM for classification.")
I2B2 Clinical Notes Tagging

Clinical Natural Language Processing helps in identifying salient information from clinical notes. Here, we have trained a neural network model on the i2b2 (Informatics for Integrating Biology and the Bedside) dataset. This dataset has manual annotations for the problems identified and the treatments and tests suggested.

>> from sciwing.models.i2b2 import I2B2NER

>> i2b2ner = I2B2NER()

>> i2b2ner.predict_for_text("Chest x - ray showed no evidence of cardiomegaly")
Extracting Abstracts

You can extract the abstract from a single PDF file, or the abstracts from all the PDF files in a folder.

>> from sciwing.models.sectlabel import SectLabel

>> sectlabel = SectLabel()

# extract abstract for file
>> sectlabel.extract_abstract_for_file("/path/to/pdf/file")

# extract abstract for all the files in the folder
>> sectlabel.extract_abstract_for_folder("/path/to/folder")
Identifying Different Logical Sections

Identifying the different logical sections of a document is a fundamental task in scientific document processing. The SectLabel model of SciWING is used to obtain information about different sections of a research article.

SectLabel assigns every line of the document one of many labels, such as title, author, bodyText, etc., which can then be used for many downstream applications.

>> from sciwing.models.sectlabel import SectLabel

>> sectlabel = SectLabel()

# label all the lines in a document
>> sectlabel.predict_for_file("/path/to/pdf")

You can also get the abstract, section headers and the embedded references in the document using the same model, as follows

>> from sciwing.models.sectlabel import SectLabel

>> sectlabel = SectLabel()

>> sectlabel.predict_for_file("/path/to/pdf")

>> info = sectlabel.extract_all_info("/path/to/pdf")

>> abstract = info["abstract"]

>> section_headers = info["section_headers"]

>> references = info["references"]
Normalising Section Headers

Different research papers use different section headers. However, to identify the logical flow of a research paper, it helps to normalize the different section headers to a pre-defined set of headers. This model performs that classification.

>> from sciwing.models.generic_sect import GenericSect

>> generic_sect = GenericSect()

>> generic_sect.predict_for_text("experiments and results")
## evaluation

Interacting with Models

SciWING allows you to interact with pre-trained models even without writing code. You can interact with all the pre-trained models using a command line application. Upon installation, the sciwing command is available. One of its sub-commands is the interact command. Let us see an example

sciwing interact neural-parscit

This will run the inference of the best model on test data and prepare the model for interaction.

Note

The inference time again depends on whether you have a CPU or GPU. By default, we assume that you are running the model on a CPU.

1. See-Confusion-Matrix
2. See-examples-of-Classifications
3. See-prf-table
4. Enter text

1. The first option shows the confusion matrix for the different classes of Neural ParsCit.

2. The second option shows examples where one class is misclassified as another. For example, enter 4 5 to show examples where tags belonging to class 4 are misclassified as class 5.

3. The third option shows the precision, recall and F-measure for the test dataset, along with the macro and micro F-scores.

4. The fourth option lets you enter a reference string and see the results.
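
The quantities behind options 1 and 3 can be computed from two parallel lists of labels. The following is a small illustrative sketch, not the code SciWING runs: the helper names (`prf_table`, `macro_f1`, `micro_f1`) and the example tags are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def prf_table(true: List[str], pred: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Per-class (precision, recall, F1) from parallel lists of labels."""
    # The confusion matrix as a sparse mapping: (true, predicted) -> count.
    confusion = Counter(zip(true, pred))
    table = {}
    for label in sorted(set(true) | set(pred)):
        tp = confusion[(label, label)]
        fp = sum(n for (t, p), n in confusion.items() if p == label and t != label)
        fn = sum(n for (t, p), n in confusion.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        table[label] = (precision, recall, f1)
    return table


def macro_f1(table: Dict[str, Tuple[float, float, float]]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in table.values()) / len(table)


def micro_f1(true: List[str], pred: List[str]) -> float:
    """Pooled F1. With one label per item, every error is both a false
    positive and a false negative, so this equals the accuracy."""
    correct = sum(t == p for t, p in zip(true, pred))
    return correct / len(true)


# Hypothetical gold and predicted tags for five tokens.
true_tags = ["author", "author", "title", "title", "year"]
pred_tags = ["author", "title", "title", "title", "year"]

table = prf_table(true_tags, pred_tags)
for label, (precision, recall, f1) in table.items():
    print(f"{label:8s} P={precision:.2f} R={recall:.2f} F={f1:.2f}")
print(f"macro F={macro_f1(table):.2f} micro F={micro_f1(true_tags, pred_tags):.2f}")
```

The macro score weighs every class equally, while the micro score is dominated by frequent classes, which is why both are worth inspecting.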

PDF Pipelines

Note

Under Construction: This will allow you to provide path to a PDF file and extract all the information with respect to the pdf file. The information includes abstract, title, author, section headers, normalized section headers, embedded references, parses of the references etc.

Package Documentation

sciwing.api

sciwing.api.routers

citation_intent_clf
sciwing.api.routers.citation_intent_clf.classify_citation_intent(citation: str)

End point to classify a citation intent into `Background`, `Method`, `Result Comparison`

Parameters:citation (str) – String containing the citation to another work
Returns:{"tags": Predicted tag for the citation, "citation": the citation itself}
Return type:JSON
i2b2
sciwing.api.routers.i2b2.return_tags(text: str)

Tags the text that you send according to i2b2 model with problem, treatment and tests

Parameters:text (str) – The text to be tagged
Returns:{tags: Predicted tags, text_tokens: Tokens in the text }
Return type:JSON
parscit
sciwing.api.routers.parscit.tag_citation_string(citation: str)

End point to parse a reference string into its constituent parts.

Parameters:citation (str) – The reference string to be parsed.
Returns:{"tags": Predicted tags, "text_tokens": Tokenized citation string}
Return type:JSON
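
A response of this shape can be turned into labelled fields by merging consecutive tokens that share a tag. The sketch below assumes that `tags` and `text_tokens` are parallel lists of equal length, which the documentation above does not state; the sample response and the tag names are made up for illustration.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def group_fields(response: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Merge runs of consecutive tokens that share a tag into (tag, text) pairs."""
    pairs = zip(response["tags"], response["text_tokens"])
    return [
        (tag, " ".join(token for _, token in run))
        for tag, run in groupby(pairs, key=lambda pair: pair[0])
    ]


# Hypothetical response; real tag names and tokenization may differ.
sample_response = {
    "tags": ["author", "author", "date", "title", "title"],
    "text_tokens": ["Calzolari,", "N.", "(1982)", "Towards", "the"],
}

for tag, text in group_fields(sample_response):
    print(f"{tag}: {text}")
```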
pdf_pipeline
sciwing.api.routers.pdf_pipeline.pdf_pipeline(file)

Parses the file and returns various analytics about the pdf

Parameters:file (File) – A File stream
Returns:Returns a JSON where the key can be a section in the document with value as the text of the document. It can also be other information such as parsed reference strings in the document, or normalised section headers of the document. This is a feature in development. Be careful in using this.
Return type:JSON
sectlabel
sciwing.api.routers.sectlabel.extract_pdf(file)

Extracts the abstract from a scholarly article

Parameters:file (uploadFile) – Byte Stream of a file uploaded.
Returns:{"abstract": The abstract found in the scholarly document}
Return type:JSON
sciwing.api.routers.sectlabel.process_pdf(file)

Classifies every line in the document to the logical section of the document. The logical section can be title, author, email, section header, subsection header etc

Parameters:file (File) – The Bytestream of a file to be uploaded
Returns:{"labels": [(line, label)]}
Return type:JSON

sciwing.cli

sciwing_interact

class sciwing.cli.sciwing_interact.SciWINGInteract(infer_client: sciwing.infer.interface_client_base.BaseInterfaceClient)

Bases: object

This cli helps in interacting with different models of sciwing

interact()

Interact with the user to explore different models

This method provides various options for exploration of the different models.

  • See-Confusion-Matrix shows the confusion matrix on the test dataset.
  • See-Examples-of-Classification lets you explore correct classifications and misclassifications. You can provide two class numbers, as in 2 3, and it shows examples in the test dataset where text that belongs to class 2 is classified as class 3.
  • See-prf-table shows the precision, recall and F-measure per class.
  • See-text - Manually enter text and look at the classification results.

s3_mv_cli

class sciwing.cli.s3_mv_cli.S3OutputMove(foldername: str)

Bases: object

__init__(foldername: str)

Provides an interactive way to move some folders to s3

Parameters:foldername (str) – The folder name which will be moved to S3 bucket
static ask_deletion() → str

Since this is deletion, we want confirmation, just to be sure whether to keep the deleted folder locally or to remove it

Returns:A yes or no answer to the question
Return type:str
get_folder_choice()

Goes through the folder and gets the choice on which folder should be moved

Returns:The folder which is chosen to be moved
Return type:str
interact()

Interacts with the user by providing various options

sciwing.commands

validators

Utility functions for validation.

sciwing.commands.validators.is_file_exist(name: str)

Indicates whether file name exists or not

Parameters:name (str) – String representing filename
Returns:True when filename indicated by name exists, False otherwise
Return type:bool
sciwing.commands.validators.is_valid_python_classname(name: str)

Indicates whether name is a valid Python identifier

Parameters:name (str) – A string representing a class name
Returns:True when name is a valid python identifier, False otherwise
Return type:bool
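
Checks of this kind can be written with the standard library alone. The following is a sketch under assumptions rather than the package's implementation: the function names `file_exists` and `looks_like_classname` are hypothetical, and rejecting reserved keywords is an extra step that the description above does not mention.

```python
import keyword
import tempfile
from pathlib import Path


def file_exists(name: str) -> bool:
    """True when `name` points to an existing regular file."""
    return Path(name).is_file()


def looks_like_classname(name: str) -> bool:
    """True when `name` is a valid identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


print(looks_like_classname("SectLabel"))  # a plain identifier
print(looks_like_classname("2fast"))      # starts with a digit
print(looks_like_classname("class"))      # reserved keyword

with tempfile.NamedTemporaryFile() as handle:
    # The temporary file exists for as long as the handle is open.
    print(file_exists(handle.name))
```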

sciwing.datasets

sciwing.datasets.classification

base_text_classification
class sciwing.datasets.classification.base_text_classification.BaseTextClassification(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Bases: object

__init__(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Base Text Classification Dataset to be inherited by all text classification datasets

Parameters:
  • filename (str) – Full path of the filename where classification dataset is stored
  • tokenizers (Dict[str, BaseTokenizer]) – The mapping between namespace and a tokenizer
get_lines_labels() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.label.Label])

A list of lines from the file and a list of corresponding labels

This method is to be implemented by a new dataset. The decision on the implementation logic is left to the new class. Datasets come in all shapes and sizes.

Returns:Returns a list of text examples and corresponding labels
Return type:(List[Line], List[Label])
text_classification_dataset
class sciwing.datasets.classification.text_classification_dataset.TextClassificationDataset(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = <sciwing.tokenizers.word_tokenizer.WordTokenizer object>)

Bases: sciwing.datasets.classification.base_text_classification.BaseTextClassification, sphinx.ext.autodoc.importer._MockObject

This represents a dataset that is of the form

line1###label1

line2###label2

line3###label3 . . .
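
As an illustration of this file format, the sketch below splits each row into its text and its label. It is not the dataset class itself: `parse_classification_rows` is a hypothetical helper, and splitting on the last `###` (so that the text may itself contain the separator) is an assumption.

```python
from typing import Iterable, List, Tuple


def parse_classification_rows(
    rows: Iterable[str], separator: str = "###"
) -> Tuple[List[str], List[str]]:
    """Split `text###label` rows into parallel lists of texts and labels."""
    texts: List[str] = []
    labels: List[str] = []
    for row in rows:
        row = row.strip()
        if not row:
            continue  # skip blank rows
        # Split on the last separator only; a row without one raises ValueError.
        text, label = row.rsplit(separator, 1)
        texts.append(text.strip())
        labels.append(label.strip())
    return texts, labels


# Hypothetical rows in the format described above.
example_rows = [
    "We use a linear SVM for classification###Method",
    "",
    "Prior work studied citation parsing###Background",
]
texts, labels = parse_classification_rows(example_rows)
print(texts)
print(labels)
```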

get_lines_labels() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.label.Label])

A list of lines from the file and a list of corresponding labels

This method is to be implemented by a new dataset. The decision on the implementation logic is left to the new class. Datasets come in all shapes and sizes.

Returns:Returns a list of text examples and corresponding labels
Return type:(List[Line], List[Label])
labels
lines
class sciwing.datasets.classification.text_classification_dataset.TextClassificationDatasetManager(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)

Bases: sciwing.data.datasets_manager.DatasetsManager, sciwing.utils.class_nursery.ClassNursery

__init__(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)
Parameters:
  • train_filename (str) – The path where the train file is stored
  • dev_filename (str) – The path where the dev file is stored
  • test_filename (str) – The path where the test file is stored
  • tokenizers (Dict[str, BaseTokenizer]) – A mapping from namespace to the tokenizer
  • namespace_vocab_options (Dict[str, Dict[str, Any]]) – A mapping from the name to options
  • namespace_numericalizer_map (Dict[str, BaseNumericalizer]) – Every namespace can have a different numericalizer specified
  • batch_size (int) – The batch size of the data returned

sciwing.datasets.seq_labeling

base_seq_labeling
class sciwing.datasets.seq_labeling.base_seq_labeling.BaseSeqLabelingDataset(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Bases: object

__init__(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Base sequence labeling dataset to be inherited by all sequence labeling datasets

Parameters:
  • filename (str) – Path of the file where the sequence labeling dataset is stored. Ideally this should have an example text and label separated by space. But it is left to the specific dataset to handle the different ways in which the file could be structured
  • tokenizers (Dict[str, BaseTokenizer]) – The mapping between namespace and a tokenizer
get_lines_labels() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.seq_label.SeqLabel])

A list of lines from the file and a list of corresponding labels

This method is to be implemented by a new dataset. The decision on the implementation logic is left to the new class. Datasets come in all shapes and sizes.

Returns:Returns a list of text examples and corresponding labels
Return type:(List[Line], List[SeqLabel])
seq_labelling_dataset
class sciwing.datasets.seq_labeling.seq_labelling_dataset.SeqLabellingDataset(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])

Bases: sciwing.datasets.seq_labeling.base_seq_labeling.BaseSeqLabelingDataset, sphinx.ext.autodoc.importer._MockObject

This represents a dataset that is of the form

word1###label1 word2###label2 word3###label3

word1###label1 word2###label2 word3###label3

word1###label1 word2###label2 word3###label3

.

.

.
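
In this format every whitespace-separated item carries its own label, so one row yields a word sequence and a label sequence of the same length. The sketch below shows one way to read such a row; `parse_sequence_row` and the example labels are hypothetical and are not taken from the package.

```python
from typing import List, Tuple


def parse_sequence_row(row: str, separator: str = "###") -> Tuple[List[str], List[str]]:
    """Split a `word###label word###label ...` row into words and labels."""
    words: List[str] = []
    labels: List[str] = []
    for item in row.split():
        # Each item must contain the separator; otherwise unpacking fails.
        word, label = item.rsplit(separator, 1)
        words.append(word)
        labels.append(label)
    return words, labels


# Hypothetical row; the label names are illustrative only.
words, labels = parse_sequence_row("Calzolari,###author N.###author (1982)###date")
print(list(zip(words, labels)))
```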

get_lines_labels() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.seq_label.SeqLabel])

A list of lines from the file and a list of corresponding labels

This method is to be implemented by a new dataset. The decision on the implementation logic is left to the new class. Datasets come in all shapes and sizes.

Returns:Returns a list of text examples and corresponding labels
Return type:(List[Line], List[SeqLabel])
class sciwing.datasets.seq_labeling.seq_labelling_dataset.SeqLabellingDatasetManager(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)

Bases: sciwing.data.datasets_manager.DatasetsManager

__init__(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)
Parameters:
  • train_filename (str) – The path where the train file is stored
  • dev_filename (str) – The path where the dev file is stored
  • test_filename (str) – The path where the test file is stored
  • tokenizers (Dict[str, BaseTokenizer]) – A mapping from namespace to the tokenizer
  • namespace_vocab_options (Dict[str, Dict[str, Any]]) – A mapping from the name to options
  • namespace_numericalizer_map (Dict[str, BaseNumericalizer]) – Every namespace can have a different numericalizer specified
  • batch_size (int) – The batch size of the data returned

sciwing.engine

engine

class sciwing.engine.engine.Engine(model: nn.Module, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, optimizer: torch.optim.Optimizer, batch_size: int, save_dir: str, num_epochs: int, save_every: int, log_train_metrics_every: int, train_metric: sciwing.metrics.BaseMetric.BaseMetric, validation_metric: sciwing.metrics.BaseMetric.BaseMetric, test_metric: sciwing.metrics.BaseMetric.BaseMetric, experiment_name: Optional[str] = None, experiment_hyperparams: Optional[Dict[str, Any]] = None, tensorboard_logdir: str = None, track_for_best: str = 'loss', collate_fn=<class 'list'>, device: Union[torch.device, str] = torch.device(...), gradient_norm_clip_value: Optional[float] = 5.0, lr_scheduler: Optional[torch.optim.lr_scheduler] = None, use_wandb: bool = False, sample_proportion: float = 1.0, seeds: Dict[str, int] = None)

Bases: sciwing.utils.class_nursery.ClassNursery

__init__(model: nn.Module, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, optimizer: torch.optim.Optimizer, batch_size: int, save_dir: str, num_epochs: int, save_every: int, log_train_metrics_every: int, train_metric: sciwing.metrics.BaseMetric.BaseMetric, validation_metric: sciwing.metrics.BaseMetric.BaseMetric, test_metric: sciwing.metrics.BaseMetric.BaseMetric, experiment_name: Optional[str] = None, experiment_hyperparams: Optional[Dict[str, Any]] = None, tensorboard_logdir: str = None, track_for_best: str = 'loss', collate_fn=<class 'list'>, device: Union[torch.device, str] = torch.device(...), gradient_norm_clip_value: Optional[float] = 5.0, lr_scheduler: Optional[torch.optim.lr_scheduler] = None, use_wandb: bool = False, sample_proportion: float = 1.0, seeds: Dict[str, int] = None)

Engine runs the models end to end. It iterates through the train dataset and passes it through the model. During training, it tracks many parameters of the run and saves them. It also reports validation and test parameters from time to time. Many utilities required for running the model end to end are found here.

Parameters:
  • model (nn.Module) – A pytorch module defining a model to be run
  • datasets_manager (DatasetsManager) – A datasets manager that handles all the different datasets
  • optimizer (torch.optim) – Any Optimizer object instantiated using torch.optim
  • batch_size (int) – Batch size for the dataset. The same batch size is used for train, valid and test dataset
  • save_dir (str) – The experiments are saved in save_dir. We save checkpoints, the best model, logs and other information into the save dir
  • num_epochs (int) – The number of epochs to run the training
  • save_every (int) – The model will be checkpointed every save_every number of iterations
  • log_train_metrics_every (int) – The train metrics will be reported every log_train_metrics_every iterations during training
  • train_metric (BaseMetric) – Anything that is an instance of BaseMetric for calculating training metrics
  • validation_metric (BaseMetric) – Anything that is an instance of BaseMetric for calculating validation metrics
  • test_metric (BaseMetric) – Anything that is an instance of BaseMetric for calculating test metrics
  • experiment_name (str) – The experiment should be given a name for ease of tracking. If an experiment name is not given, we generate a unique 10 digit sha for the experiment.
  • experiment_hyperparams (Dict[str, Any]) – This is mostly used for tracking the different hyper-params of the experiment being run. This may be used by wandb to save the hyper-params
  • tensorboard_logdir (str) – The directory where all the tensorboard runs are stored. If None is passed then it defaults to the tensorboard default of storing the log in the current directory.
  • track_for_best (str) – Which metric should be tracked for deciding the best model. Anything that the metric emits and is a single value can be used for tracking. The default value is loss. If it is loss, then the best value will be the lowest one. For some other metrics like macro_fscore, the best metric might be the one that has the highest value
  • collate_fn (Callable[[List[Any]], List[Any]]) – Collates the different examples into a single batch of examples. This is the same terminology as the one adopted by PyTorch; there is no difference
  • device (torch.device) – The device on which the model will be placed. If this is “cpu”, then the model and the tensors will all be on cpu. If this is “cuda:0”, then the model and the tensors will be placed on cuda device 0. You can mention any other cuda device that is suitable for your environment
  • gradient_norm_clip_value (float) – To avoid gradient explosion, the gradients will be clipped if the gradient norm exceeds this value
  • lr_scheduler (torch.optim.lr_scheduler) – Any pytorch lr_scheduler can be used for reducing the learning rate if the performance on the validation set reduces.
  • use_wandb (bool) – wandb or weights and biases is a tool that is used to track experiments online. Sciwing comes with inbuilt functionality to track experiments on weights and biases
  • seeds (Dict[str, int]) – The dict of seeds to be set. Set the random_seed, pytorch_seed and numpy_seed. The approach is found in https://github.com/allenai/allennlp/blob/master/allennlp/common/util.py
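
To illustrate what gradient_norm_clip_value controls, here is a framework-free sketch of clipping by the global gradient norm: when the norm of all gradients taken together exceeds the threshold, every gradient is rescaled by the same factor. This shows the general technique on plain Python lists; it is not the Engine's code, and `clip_by_global_norm` is a hypothetical helper.

```python
import math
from typing import List


def clip_by_global_norm(gradients: List[List[float]], max_norm: float) -> List[List[float]]:
    """Rescale all gradients so that their joint L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if total_norm <= max_norm:
        # Already within the limit: return an unchanged copy.
        return [list(grad) for grad in gradients]
    scale = max_norm / total_norm
    return [[g * scale for g in grad] for grad in gradients]


# The joint norm of these gradients is 10, so they are scaled down by half.
print(clip_by_global_norm([[6.0], [8.0]], max_norm=5.0))
# These gradients have norm 5, which does not exceed the limit.
print(clip_by_global_norm([[3.0, 4.0]], max_norm=5.0))
```

Because all gradients share one scale factor, the direction of the update is preserved and only its length is limited.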
static get_iter(loader: DataLoader) → Iterator

Returns the iterator for a pytorch data loader.

The loader is a pytorch DataLoader that iterates over the dataset in batches and employs many strategies to do so. We want an iterator that returns the dataset in batches. The end of the iterator would signify the end of an epoch and then we can use that information to perform house-keeping.

Parameters:loader (DataLoader) – a pytorch data loader
Returns:An iterator over the data loader
Return type:Iterator
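
The idea of using the end of an iterator as the end-of-epoch signal can be shown without PyTorch. In this sketch a simple generator stands in for the data loader; the names `batches` and `run_epoch` are hypothetical and the house-keeping step is reduced to a comment.

```python
from typing import Iterator, List, Sequence


def batches(examples: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `examples`, each holding up to `batch_size` items."""
    for start in range(0, len(examples), batch_size):
        yield list(examples[start:start + batch_size])


def run_epoch(examples: Sequence[str], batch_size: int) -> int:
    """Consume one epoch of batches and return how many batches were seen."""
    iterator = iter(batches(examples, batch_size))
    num_batches = 0
    while True:
        try:
            batch = next(iterator)
        except StopIteration:
            # The iterator is exhausted: the epoch is over and
            # end-of-epoch house-keeping would run here.
            break
        num_batches += 1  # a real loop would process `batch` here
    return num_batches


print(run_epoch(["a", "b", "c", "d", "e", "f", "g"], batch_size=3))
```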
get_loader(dataset: Dataset) → DataLoader

Returns the DataLoader for the Dataset

Parameters:dataset (Dataset) –
Returns:A pytorch DataLoader
Return type:DataLoader
get_test_dataset()

Returns the test dataset of the experiment

Returns:Anything that conforms to the pytorch style dataset.
Return type:Dataset
get_train_dataset()

Returns the train dataset of the experiment

Returns:Anything that conforms to the pytorch style dataset.
Return type:Dataset
get_validation_dataset()

Returns the validation dataset of the experiment

Returns:Anything that conforms to the pytorch style dataset.
Return type:Dataset
is_best_higher(current_best=None)

Returns True if the current value of the metric is HIGHER than the best metric. This is useful for tracking metrics like FSCORE where, higher the value, the better it is

Parameters:current_best (float) – The current value for the metric that is being tracked
Returns:
Return type:bool
is_best_lower(current_best=None)

Returns True if the current value of the metric is lower than the best metric. This is useful for tracking metrics like loss where, lower the value, the better it is

Parameters:current_best (float) – The current value for the metric that is being tracked
Returns:
Return type:bool
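
The two comparisons above differ only in direction. The sketch below combines them in a small tracker that remembers the best value seen so far; `BestTracker` is a hypothetical class written for illustration, and treating every metric other than loss as higher-is-better is an assumption.

```python
from typing import Optional


class BestTracker:
    """Remember the best value of a tracked metric across epochs."""

    def __init__(self, track_for_best: str = "loss") -> None:
        # Assumption: only the loss improves by getting lower.
        self.lower_is_better = track_for_best == "loss"
        self.best: Optional[float] = None

    def is_better(self, value: float) -> bool:
        if self.best is None:
            return True  # the first value is always the best so far
        if self.lower_is_better:
            return value < self.best
        return value > self.best

    def update(self, value: float) -> bool:
        """Store `value` if it improves on the best; report whether it did."""
        if self.is_better(value):
            self.best = value
            return True
        return False


loss_tracker = BestTracker("loss")
for epoch_loss in [0.9, 1.2, 0.4]:
    improved = loss_tracker.update(epoch_loss)
    print(epoch_loss, "new best" if improved else "no improvement")
```

An improvement is the natural moment to save the best model, which is then reloaded for the test epoch.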
load_model_from_file(filename: str)
run()

Run the engine.

set_best_track_value(current_best=None)

Set the best value of the value being tracked

Parameters:current_best (float) – The current value that is best
test_epoch(epoch_num: int)

Runs the test epoch for epoch_num

Loads the best model that is saved during the training and runs the test dataset.

Parameters:epoch_num (int) – Zero-based epoch number for which the test dataset is run. This is after the last training epoch.
test_epoch_end(epoch_num: int)

Performs house-keeping at the end of the test epoch

It reports the metric that is being tracked at the end of the test epoch

Parameters:epoch_num (int) – Epoch num after which the test dataset is run
train_epoch(epoch_num: int)

Run the training for one epoch

Parameters:epoch_num (int) – The current epoch number

train_epoch_end(epoch_num: int)

Performs house-keeping at the end of a training epoch

At the end of the training epoch, it does some house-keeping. It reports the average loss, the average metric and other information.

Parameters:epoch_num (int) – The current epoch number (0 based)
validation_epoch(epoch_num: int)

Runs one validation epoch on the validation dataset

Parameters:epoch_num (int) – The epoch number (0-based)
validation_epoch_end(epoch_num: int)

Performs house-keeping at the end of validation epoch

Parameters:epoch_num (int) – The current epoch number

sciwing.infer

sciwing.infer.classification

BaseClassificationInference
class sciwing.infer.classification.BaseClassificationInference.BaseClassificationInference(model: nn.Module, model_filepath: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: Union[str, torch.device, None] = <sphinx.ext.autodoc.importer._MockObject object>)

Bases: object

Abstract Base Class for Classification Inference. The BaseClassificationInference provides a skeleton for concrete classes that would want to perform inference for a text classification task.

__init__(model: nn.Module, model_filepath: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: Union[str, torch.device, None] = <sphinx.ext.autodoc.importer._MockObject object>)
Parameters:
  • model (nn.Module) – A pytorch module
  • model_filepath (str) – The path where the parameters for the best model are stored. This is usually best_model.pt in an experiment directory
  • datasets_manager (DatasetsManager) – Any dataset that conforms to the pytorch Dataset specification
  • device (Optional[Union[str, torch.device]]) – This is either a string like cpu, cuda:0 or a torch.device object
get_misclassified_sentences(true_label_idx: int, pred_label_idx: int) → List[str]
get_true_label_indices_names(labels: List[sciwing.data.label.Label]) -> (typing.List[int], typing.List[str])

Given a list of labels, returns the indices and the names of the labels

Parameters:labels (List[Label]) – The list of labels returned by a dataset
Returns:A list of integers that represent the true class indices, and a list of strings that represent the true class names
Return type:(List[int], List[str])
infer_batch(lines: List[sciwing.data.line.Line])
load_model()

Loads the best_model from the model_filepath.

model_forward_on_lines(lines: List[sciwing.data.line.Line])

Performs the model forward pass on the given lines

Parameters:lines (List[Line]) – The lines to run through the model
model_output_dict_to_prediction_indices_names(model_output_dict: Dict[str, Any]) -> (typing.List[int], typing.List[str])

Given a model_output_dict, returns the predicted class indices and names

Parameters:model_output_dict (Dict[str, Any]) – Output dictionary from a model
Returns:A list of integers that represent the predicted class indices, and a list of strings that represent the predicted class names
Return type:(List[int], List[str])
on_user_input(line: sciwing.data.line.Line)
print_confusion_matrix()
report_metrics()

Reports the metrics for the test dataset

run_inference() → Dict[str, Any]

Should run inference on the test dataset

This method should run the model through the test dataset. It should perform inference and collect the metrics and data that are necessary for further use

Returns:The metrics and data collected during inference
Return type:Dict[str, Any]
run_test()
Classification Inference
class sciwing.infer.classification.classification_inference.ClassificationInference(model: nn.Module, model_filepath: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, tokens_namespace: str = 'tokens', normalized_probs_namespace: str = 'normalized_probs', device: str = 'cpu')

Bases: sciwing.infer.classification.BaseClassificationInference.BaseClassificationInference

The sciwing engine runs the test lines through the classifier and returns the predictions/probabilities for the different classes. At a later point in time, this class should be able to take lines from any context (for example, from a file) and produce the output.

This class also helps in interacting with the results on the test dataset. Some of its features are:

  • Show the confusion matrix
  • Investigate a particular example in the test dataset
  • Get the instances that were classified as 2 when their true label is 1, and similar cases

All it needs is the configuration file stored under every experiment and a vocab already stored in the experiment folder

generate_report_for_paper()

Generates just the fscore, to be used for reporting in print (for example, in a paper)

get_misclassified_sentences(true_label_idx: int, pred_label_idx: int)

Returns the sentences whose true label is true_label_idx but which were classified as pred_label_idx

Parameters:
  • true_label_idx (int) – The label index of the true class name
  • pred_label_idx (int) – The label index of the predicted class name
Returns:

A list of strings where the true class is classified as pred class.

Return type:

List[str]
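
The selection described above can be sketched in plain Python. This is only an illustration, not the class's implementation: the misclassified_sentences function and the three parallel lists it takes are hypothetical stand-ins for the data that the inference class collects on the test dataset.

```python
from typing import List


def misclassified_sentences(
    sentences: List[str],
    true_indices: List[int],
    pred_indices: List[int],
    true_label_idx: int,
    pred_label_idx: int,
) -> List[str]:
    """Keep sentences whose true class is true_label_idx but whose
    predicted class is pred_label_idx (hypothetical helper)."""
    return [
        sentence
        for sentence, true_idx, pred_idx in zip(sentences, true_indices, pred_indices)
        if true_idx == true_label_idx and pred_idx == pred_label_idx
    ]


# Toy data: the second sentence has true class 1 but was predicted as class 0.
sentences = ["We propose a model", "Results are shown", "Related work"]
true_indices = [0, 1, 1]
pred_indices = [0, 0, 1]
errors = misclassified_sentences(sentences, true_indices, pred_indices, 1, 0)
```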

get_true_label_indices_names(labels: List[sciwing.data.label.Label]) -> (typing.List[int], typing.List[str])

Given a list of labels, returns the indices and the names of the labels

Parameters:labels (List[Label]) – The list of labels returned by a dataset
Returns:A list of integers that represent the true class indices, and a list of strings that represent the true class names
Return type:(List[int], List[str])
infer_batch(lines: List[str]) → List[str]

Runs inference on a batch of lines. This method can be used by applications: it comes in handy when APIs are being developed to serve over the web, or when terminal applications are being written to read from files and infer

Parameters:lines (List[str]) – List of text spans to be inferred
Returns:The class names for all the sentences in the input
Return type:List[str]
model_forward_on_lines(lines: List[sciwing.data.line.Line])

Performs the model forward pass on the given lines

Parameters:lines (List[Line]) – The lines to run through the model
model_output_dict_to_prediction_indices_names(model_output_dict: Dict[str, Any]) -> (typing.List[int], typing.List[str])

Given a model_output_dict, returns the predicted class indices and names

Parameters:model_output_dict (Dict[str, Any]) – Output dictionary from a model
Returns:A list of integers that represent the predicted class indices, and a list of strings that represent the predicted class names
Return type:(List[int], List[str])
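
A minimal sketch of this conversion, assuming that the predicted class is the one with the highest normalized probability and that the probabilities live under the normalized_probs key (the default value of normalized_probs_namespace). The probs_to_indices_and_names function and the idx2labelname mapping are hypothetical, and plain lists stand in for tensors.

```python
from typing import Dict, List, Tuple


def probs_to_indices_and_names(
    normalized_probs: List[List[float]], idx2labelname: Dict[int, str]
) -> Tuple[List[int], List[str]]:
    """Pick the most probable class per row and map it to its name."""
    # Assumption: the prediction is the argmax over the class dimension.
    indices = [max(range(len(row)), key=row.__getitem__) for row in normalized_probs]
    names = [idx2labelname[idx] for idx in indices]
    return indices, names


# Toy model output for a batch of two lines and two classes.
model_output_dict = {"normalized_probs": [[0.1, 0.9], [0.7, 0.3]]}
idx2labelname = {0: "introduction", 1: "method"}
pred_indices, pred_names = probs_to_indices_and_names(
    model_output_dict["normalized_probs"], idx2labelname
)
```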
on_user_input(line: str) → str

Runs the inference when the user inputs a single sentence either on the terminal or some other application

Parameters:line (str) – The line entered by the user
Returns:The class label that is inferred for the user input
Return type:str
print_confusion_matrix() → None

Prints the confusion matrix for the test dataset

report_metrics()

Reports the metrics for the test dataset

run_inference() → Dict[str, Any]

Runs inference on the test dataset

This method runs the model through the test dataset. It performs inference and collects the metrics and data that are necessary for further use

Returns:The metrics and data collected during inference
Return type:Dict[str, Any]
run_test()

Runs inference and reports test metrics

sciwing.infer.seq_label_inference

BaseSeqLabelInference
class sciwing.infer.seq_label_inference.BaseSeqLabelInference.BaseSeqLabelInference(model: nn.Module, model_filepath: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: Union[str, torch.device, None] = <sphinx.ext.autodoc.importer._MockObject object>)

Bases: object

Abstract Base Class for Sequence Labeling Inference. The BaseSeqLabelInference provides a skeleton for concrete classes that would want to perform inference for a sequence labeling task.

__init__(model: nn.Module, model_filepath: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: Union[str, torch.device, None] = <sphinx.ext.autodoc.importer._MockObject object>)
Parameters:
  • model (nn.Module) – A pytorch module
  • model_filepath (str) – The path where the parameters for the best models are stored. This is usually the best_model.pt while in an experiment directory
  • datasets_manager (DatasetsManager) – Any dataset that conforms to the pytorch Dataset specification
  • device (Optional[Union[str, torch.device]]) – This is either a string like cpu, cuda:0 or a torch.device object
get_misclassified_sentences(true_label_idx: int, pred_label_idx: int) → List[str]
get_true_label_indices_names(labels: List[sciwing.data.seq_label.SeqLabel]) -> (typing.Dict[str, typing.List[int]], typing.Dict[str, typing.List[str]])

Given a list of labels, returns the indices and the names of the labels

Parameters:labels (List[SeqLabel]) – The list of sequence labels returned by a dataset
Returns:A mapping between a label namespace and a list of integers that represent the true class indices, and a mapping between a label namespace and a list of strings that represent the true class names
Return type:(Dict[str, List[int]], Dict[str, List[str]])
infer_batch(lines: Union[List[sciwing.data.line.Line], List[str]]) → Dict[str, List[str]]
load_model()

Loads the best_model from the model_filepath.

model_forward_on_lines(lines: List[sciwing.data.line.Line])

Performs the model forward pass on the given lines

Parameters:lines (List[Line]) – The lines to run through the model
model_output_dict_to_prediction_indices_names(model_output_dict: Dict[str, Any]) -> (typing.List[int], typing.List[str])

Given a model_output_dict, returns the predicted class indices and names

Parameters:model_output_dict (Dict[str, Any]) – Output dictionary from a model
Returns:A list of integers that represent the predicted class indices, and a list of strings that represent the predicted class names
Return type:(List[int], List[str])
on_user_input(line: Union[sciwing.data.line.Line, str]) → Dict[str, List[str]]
print_confusion_matrix()
report_metrics()

Reports the metrics for the test dataset

run_inference()

Should run inference on the test dataset

This method should run the model through the test dataset. It should perform inference and collect the metrics and data that are necessary for further use

Returns:The metrics and data collected during inference
Return type:Dict[str, Any]
run_test()
CONLL Inference
class sciwing.infer.seq_label_inference.conll_inference.Conll2003Inference(model: nn.Module, model_filepath: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: Union[str, torch.device, None] = <sphinx.ext.autodoc.importer._MockObject object>, predicted_tags_namespace_prefix: str = 'predicted_tags')

Bases: sciwing.infer.seq_label_inference.seq_label_inference.SequenceLabellingInference

generate_predictions_for(task: str, test_filename: str, output_filename: str)
Parameters:
  • task (str) – Can be one of pos, dep or ner. The task for which the predictions are made using the current model
  • test_filename (str) – This is the eng.testb of the CoNLL 2003 dataset
  • output_filename (str) – The file where you want to store predictions
Returns:None. Writes the predictions to output_filename. The output file is meant to be used with the conlleval.perl script (./conlleval < output_filename). The file expects the correct tag and the predicted tag to be in the last two columns, in that order. The first column is the token for which the prediction is made.
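
The column layout described above can be illustrated with a short sketch. It is not the method's implementation: the format_conll_predictions function is hypothetical, and the space separator and the blank line between sentences are assumptions based on the usual conlleval input conventions.

```python
from typing import List


def format_conll_predictions(
    tokens: List[List[str]],
    true_tags: List[List[str]],
    pred_tags: List[List[str]],
) -> str:
    """Build text with one token per line: token, true tag, predicted tag."""
    lines = []
    for sent_tokens, sent_true, sent_pred in zip(tokens, true_tags, pred_tags):
        for token, true_tag, pred_tag in zip(sent_tokens, sent_true, sent_pred):
            # The correct tag and the predicted tag are the last two columns.
            lines.append(f"{token} {true_tag} {pred_tag}")
        # Assumption: a blank line separates consecutive sentences.
        lines.append("")
    return "\n".join(lines)


output = format_conll_predictions(
    tokens=[["EU", "rejects"]],
    true_tags=[["B-ORG", "O"]],
    pred_tags=[["B-ORG", "O"]],
)
```

Writing the returned string to output_filename would give a file in the shape that the evaluation script reads.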

SeqLabel Inference
class sciwing.infer.seq_label_inference.seq_label_inference.SequenceLabellingInference(model: nn.Module, model_filepath: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: Union[str, torch.device, None] = <sphinx.ext.autodoc.importer._MockObject object>, predicted_tags_namespace_prefix: str = 'predicted_tags')

Bases: sciwing.infer.seq_label_inference.BaseSeqLabelInference.BaseSeqLabelInference

generate_scienceie_prediction_folder(dev_folder: pathlib.Path, pred_folder: pathlib.Path)

Generates the predicted folder for the dataset in the test folder for ScienceIE. This is very specific to ScienceIE and is not meant to be used with other tasks

ScienceIE is a SemEval Task that needs the files to be written into a folder and it reports metrics by reading files from that folder. This method generates the predicted folder given the dev folder

Parameters:
  • dev_folder (pathlib.Path) – The path where the dev files are present
  • pred_folder (pathlib.Path) – The path where the predicted files will be written
get_misclassified_sentences(true_label_idx: int, pred_label_idx: int)
get_true_label_indices_names(labels: List[sciwing.data.seq_label.SeqLabel]) -> (typing.Dict[str, typing.List[int]], typing.Dict[str, typing.List[str]])

Given a list of labels, returns the indices and the names of the labels

Parameters:labels (List[SeqLabel]) – The list of sequence labels returned by a dataset
Returns:A mapping between a label namespace and a list of integers that represent the true class indices, and a mapping between a label namespace and a list of strings that represent the true class names
Return type:(Dict[str, List[int]], Dict[str, List[str]])
infer_batch(lines: Union[List[sciwing.data.line.Line], List[str]]) → Dict[str, List[str]]
model_forward_on_lines(lines: List[sciwing.data.line.Line])

Performs the model forward pass on the given lines

Parameters:lines (List[Line]) – The lines to run through the model
model_output_dict_to_prediction_indices_names(model_output_dict: Dict[str, Any]) -> (typing.Dict[str, typing.List[int]], typing.Dict[str, typing.List[str]])

Given a model_output_dict, returns the predicted class indices and names

Parameters:model_output_dict (Dict[str, Any]) – Output dictionary from a model
Returns:A mapping between a label namespace and a list of integers that represent the predicted class indices, and a mapping between a label namespace and a list of strings that represent the predicted class names
Return type:(Dict[str, List[int]], Dict[str, List[str]])
on_user_input(line: Union[sciwing.data.line.Line, str]) → Dict[str, List[str]]
print_confusion_matrix()

Prints the confusion matrix for the entire dataset

Return type:None

report_metrics()

Reports the metrics for the test dataset

run_inference()

Runs inference on the test dataset

This method runs the model through the test dataset. It performs inference and collects the metrics and data that are necessary for further use

Returns:The metrics and data collected during inference
Return type:Dict[str, Any]
run_test()

sciwing.meters

loss_meter

class sciwing.meters.loss_meter.LossMeter

Bases: object

add_loss(avg_batch_loss: float, num_instances: int) → None

Adds the average batch loss and the number of instances in that batch to the meter

Parameters:
  • avg_batch_loss (float) – Average batch loss
  • num_instances (int) – Number of instances from the batch
get_average() → float

Returns the average loss over all the batches at this point in time

Returns:Average loss
Return type:float
reset()

Resets all the losses and batch sizes that are accumulated
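
The three methods above can be sketched as a small running-average class. This is a hedged illustration, not the LossMeter source: the SimpleLossMeter name is hypothetical, and the sketch assumes that the average weights every batch loss by its number of instances.

```python
from typing import List


class SimpleLossMeter:
    """Hypothetical stand-in that accumulates batch losses and sizes."""

    def __init__(self) -> None:
        self.losses: List[float] = []
        self.batch_sizes: List[int] = []

    def add_loss(self, avg_batch_loss: float, num_instances: int) -> None:
        self.losses.append(avg_batch_loss)
        self.batch_sizes.append(num_instances)

    def get_average(self) -> float:
        # Assumption: batches are weighted by their number of instances.
        total = sum(loss * size for loss, size in zip(self.losses, self.batch_sizes))
        return total / sum(self.batch_sizes)

    def reset(self) -> None:
        self.losses = []
        self.batch_sizes = []


meter = SimpleLossMeter()
meter.add_loss(avg_batch_loss=0.5, num_instances=10)
meter.add_loss(avg_batch_loss=1.0, num_instances=30)
average = meter.get_average()  # (0.5 * 10 + 1.0 * 30) / 40
meter.reset()
```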

sciwing.metrics

BaseMetric

class sciwing.metrics.BaseMetric.BaseMetric(datasets_manager: sciwing.data.datasets_manager.DatasetsManager)

Bases: object

calc_metric(lines: List[sciwing.data.line.Line], labels: List[sciwing.data.label.Label], model_forward_dict: Dict[str, Any]) → None

Calculates the metric using the lines and labels returned by any dataset and the model_forward_dict of a model. This is usually called for a batch of inputs after a forward pass. The metric should retain its state across an epoch, until the reset method is called and all the metric-related data is reset for a new epoch

Parameters:
  • lines (List[Line]) – A list of lines
  • labels (List[Label]) – A list of labels
  • model_forward_dict (Dict[str, Any]) – The dictionary obtained after a forward pass
get_metric() → Dict[str, Any]

Returns the value of different metrics being tracked

Returns anything that is being tracked by the metric, as a dictionary that outside methods can use for reporting purposes

Returns:Metric/values being tracked by the metric
Return type:Dict[str, Any]
report_metrics(report_type: str = None) → Any

A method to report the tracked metrics in a suitable form

Parameters:report_type (str) – The type of report that will be returned by the method
Returns:This method can return any suitable format for reporting. If the report is to be printed, return a suitable string. The method may also save the report to a file.
Return type:Any
reset()

Should reset all the metrics/values being tracked by this metric. This method is generally used at the end of a training/validation epoch to reset the values before starting another epoch

classification_metrics_utils

class sciwing.metrics.classification_metrics_utils.ClassificationMetricsUtils

Bases: object

Classification metrics like accuracy, precision, recall and fmeasure are often used in supervised learning. This class provides a few utilities that help in calculating them.

generate_table_report_from_counters(tp_counter: Dict[int, int], fp_counter: Dict[int, int], fn_counter: Dict[int, int], idx2labelname_mapping: Dict[int, str] = None) → str

Returns a table representation for Precision Recall and FMeasure

Parameters:
  • tp_counter (Dict[int, int]) – The mapping between class index and true positive count
  • fp_counter (Dict[int, int]) – The mapping between class index and false positive count
  • fn_counter (Dict[int, int]) – The mapping between class index and false negative count
  • idx2labelname_mapping (Dict[int, str]) – The mapping between idx and label name
Returns:

Returns a string representing the table of precision recall and fmeasure for every class in the dataset

Return type:

str

static get_confusion_matrix_and_labels(predicted_tag_indices: List[List[int]], true_tag_indices: List[List[int]], true_masked_label_indices: List[List[int]], pred_labels_mask: List[List[int]] = None) -> (<sphinx.ext.autodoc.importer._MockObject object at 0x7f6c14b9f350>, typing.List[int])

Gets the confusion matrix and the list of classes for which the confusion matrix is generated

Parameters:
  • predicted_tag_indices (List[List[int]]) – Predicted tag indices for a batch
  • true_tag_indices (List[List[int]]) – True tag indices for a batch
  • true_masked_label_indices (List[List[int]]) – Every integer is either a 0 or 1, where 1 will indicate that the label in true_tag_indices will be ignored
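
The role of the mask can be shown with a small sketch. It is an illustration only, not the static method itself: the confusion_matrix_with_mask function and its num_classes argument are hypothetical, nested lists stand in for the returned matrix, and the layout (rows are true classes, columns are predicted classes) is an assumption.

```python
from typing import List


def confusion_matrix_with_mask(
    predicted_tag_indices: List[List[int]],
    true_tag_indices: List[List[int]],
    true_masked_label_indices: List[List[int]],
    num_classes: int,
) -> List[List[int]]:
    """Count (true, predicted) pairs, skipping masked positions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    rows = zip(predicted_tag_indices, true_tag_indices, true_masked_label_indices)
    for pred_row, true_row, mask_row in rows:
        for pred, true, mask in zip(pred_row, true_row, mask_row):
            if mask == 1:
                # A mask value of 1 means the true label is ignored.
                continue
            # Assumption: rows index the true class, columns the predicted class.
            matrix[true][pred] += 1
    return matrix


# One sentence with three tokens; the last token is masked out.
matrix = confusion_matrix_with_mask(
    predicted_tag_indices=[[0, 1, 1]],
    true_tag_indices=[[0, 1, 0]],
    true_masked_label_indices=[[0, 0, 1]],
    num_classes=2,
)
```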
static get_macro_prf_from_prf_dicts(precision_dict: Dict[int, int], recall_dict: Dict[int, int], fscore_dict: Dict[int, int]) -> (<class 'int'>, <class 'int'>, <class 'int'>)

Calculates Macro Precision, Recall and FMeasure

Parameters:
  • precision_dict (Dict[int, int]) – Dictionary mapping between the class index and precision values
  • recall_dict (Dict[int, int]) – Dictionary mapping between the class index and recall values
  • fscore_dict (Dict[int, int]) – Dictionary mapping between the class index and fscore values
Returns:

The macro precision, macro recall and macro fscore measures

Return type:

int, int, int

get_micro_prf_from_counters(tp_counter: Dict[int, int], fp_counter: Dict[int, int], fn_counter: Dict[int, int]) -> (<class 'int'>, <class 'int'>, <class 'int'>)

This calculates the micro precision recall and fmeasure from different counters. The counters contain a mapping from a class index to the particular number

Parameters:
  • tp_counter (Dict[int, int]) – Mapping from class index to true positive count
  • fp_counter (Dict[int, int]) – Mapping from class index to false positive count
  • fn_counter (Dict[int, int]) – Mapping from class index to false negative count
Returns:

Micro precision, Micro Recall and Micro fmeasure

Return type:

int, int, int
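
As a worked example of micro averaging, the sketch below pools the counts of all classes before computing precision, recall and fmeasure. It is a minimal sketch, not the library's code: the standalone micro_prf_from_counters function is written for illustration, it returns floats, and returning 0.0 on a zero denominator is an assumption.

```python
from typing import Dict, Tuple


def micro_prf_from_counters(
    tp_counter: Dict[int, int],
    fp_counter: Dict[int, int],
    fn_counter: Dict[int, int],
) -> Tuple[float, float, float]:
    """Pool counts over all classes, then compute P, R and F once."""
    tp = sum(tp_counter.values())
    fp = sum(fp_counter.values())
    fn = sum(fn_counter.values())
    # Assumption: a zero denominator yields 0.0 instead of raising.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


# Pooled counts: tp = 10, fp = 5, fn = 5.
micro_p, micro_r, micro_f = micro_prf_from_counters(
    tp_counter={0: 8, 1: 2}, fp_counter={0: 2, 1: 3}, fn_counter={0: 1, 1: 4}
)
```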

get_prf_from_counters(tp_counter: Dict[int, int], fp_counter: Dict[int, int], fn_counter: Dict[int, int])

This calculates the precision recall f-measure from different counters. The counters contain a mapping from a class index to the particular number

Parameters:
  • tp_counter (Dict[int, int]) – Mapping from class index to true positive count
  • fp_counter (Dict[int, int]) – Mapping from class index to false positive count
  • fn_counter (Dict[int, int]) – Mapping from class index to false negative count
Returns:

Three dictionaries representing the Precision Recall and Fmeasure for all the different classes

Return type:

Dict[int, int], Dict[int, int], Dict[int, int]
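
The per-class computation can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: the standalone prf_from_counters function returns float values, treats a class missing from a counter as a count of 0, and returns 0.0 when a denominator is zero.

```python
from typing import Dict, Tuple


def prf_from_counters(
    tp_counter: Dict[int, int],
    fp_counter: Dict[int, int],
    fn_counter: Dict[int, int],
) -> Tuple[Dict[int, float], Dict[int, float], Dict[int, float]]:
    """Compute precision, recall and fmeasure separately for each class."""
    precision: Dict[int, float] = {}
    recall: Dict[int, float] = {}
    fscore: Dict[int, float] = {}
    classes = set(tp_counter) | set(fp_counter) | set(fn_counter)
    for cls in sorted(classes):
        # Assumption: a class missing from a counter has a count of 0.
        tp = tp_counter.get(cls, 0)
        fp = fp_counter.get(cls, 0)
        fn = fn_counter.get(cls, 0)
        # Assumption: a zero denominator yields 0.0 instead of raising.
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precision[cls], recall[cls], fscore[cls] = p, r, f
    return precision, recall, fscore


precision, recall, fscore = prf_from_counters(
    tp_counter={0: 8, 1: 2}, fp_counter={0: 2, 1: 3}, fn_counter={0: 2, 1: 2}
)
```

Averaging the values of each returned dictionary over the classes would give macro scores, in contrast to the micro scores that pool the counts first.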

precision_recall_fmeasure

class sciwing.metrics.precision_recall_fmeasure.PrecisionRecallFMeasure(datasets_manager: sciwing.data.datasets_manager.DatasetsManager)

Bases: sciwing.metrics.BaseMetric.BaseMetric, sciwing.utils.class_nursery.ClassNursery

__init__(datasets_manager: sciwing.data.datasets_manager.DatasetsManager)
Parameters:datasets_manager (DatasetsManager) – The dataset manager managing the labels and other information
calc_metric(lines: List[sciwing.data.line.Line], labels: List[sciwing.data.label.Label], model_forward_dict: Dict[str, Any]) → None

Updates the values being tracked for calculating the metric

For Precision Recall FMeasure we update the true positive, false positive and false negative of the different classes being tracked

Parameters:
  • lines (List[Line]) – A list of lines
  • labels (List[Label]) – A list of labels. This has to be the label used for classification. Refer to the documentation of Label for more information
  • model_forward_dict (Dict[str, Any]) – The dictionary obtained after a forward pass. The model_forward_dict is expected to have normalized_probs, which is usually of the size [batch_size, num_classes]
get_metric() → Dict[str, Any]

Returns different values being tracked to calculate Precision Recall FMeasure

Returns:A dictionary with the following key-value pairs for every namespace
  • precision (Dict[str, float]) – The precision for different classes
  • recall (Dict[str, float]) – The recall values for different classes
  • fscore (Dict[str, float]) – The fscore values for different classes
  • num_tp (Dict[str, int]) – The number of true positives for different classes
  • num_fp (Dict[str, int]) – The number of false positives for different classes
  • num_fn (Dict[str, int]) – The number of false negatives for different classes
  • macro_precision (float) – The macro precision value considering all different classes
  • macro_recall (float) – The macro recall value considering all different classes
  • macro_fscore (float) – The macro fscore value considering all different classes
  • micro_precision (float) – The micro precision value considering all different classes
  • micro_recall (float) – The micro recall value considering all different classes
  • micro_fscore (float) – The micro fscore value considering all different classes
Return type:Dict[str, Any]
print_confusion_metrics(predicted_probs: torch.FloatTensor, labels: torch.FloatTensor, labels_mask: Optional[torch.ByteTensor] = None) → None

Prints confusion matrix

Parameters:
  • predicted_probs (torch.FloatTensor) – Predicted Probabilities [batch_size, num_classes]
  • labels (torch.FloatTensor) – True labels of the size [batch_size, 1]
  • labels_mask (Optional[torch.ByteTensor]) – Labels mask indicating 1 in those places where the true label is ignored, otherwise 0. It should be of the same size as labels
report_metrics(report_type='wasabi')

Reports metrics in a printable format

Parameters:report_type (str) – Select one of [wasabi, paper]. If wasabi, then we return a printable table that represents the precision, recall and fmeasure for different classes
reset() → None

Resets all the counters

Resets the tp_counter (the true positive counter), the fp_counter (the false positive counter), the fn_counter (the false negative counter) and the tn_counter (the true negative counter)

token_cls_accuracy

class sciwing.metrics.token_cls_accuracy.TokenClassificationAccuracy(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, predicted_tags_namespace_prefix='predicted_tags')

Bases: sciwing.metrics.BaseMetric.BaseMetric, sciwing.utils.class_nursery.ClassNursery

calc_metric(lines: List[sciwing.data.line.Line], labels: List[sciwing.data.seq_label.SeqLabel], model_forward_dict: Dict[str, Any]) → None
Parameters:
  • lines (List[Line]) – The list of lines
  • labels (List[SeqLabel]) – The list of sequence labels
  • model_forward_dict (Dict[str, Any]) – The model_forward_dict should have predicted tags for every namespace. The predicted_tags are the best possible predicted tags for the batch. They are List[List[int]] where the size is [batch_size, time_steps]. We expect that the predicted tags are
get_metric() → Dict[str, Union[Dict[str, float], float]]

Returns different values being tracked to calculate Precision Recall FMeasure

Returns:A dictionary with the following key-value pairs for every namespace
  • precision (Dict[str, float]) – The precision for different classes
  • recall (Dict[str, float]) – The recall values for different classes
  • fscore (Dict[str, float]) – The fscore values for different classes
  • num_tp (Dict[str, int]) – The number of true positives for different classes
  • num_fp (Dict[str, int]) – The number of false positives for different classes
  • num_fn (Dict[str, int]) – The number of false negatives for different classes
  • macro_precision (float) – The macro precision value considering all different classes
  • macro_recall (float) – The macro recall value considering all different classes
  • macro_fscore (float) – The macro fscore value considering all different classes
  • micro_precision (float) – The micro precision value considering all different classes
  • micro_recall (float) – The micro recall value considering all different classes
  • micro_fscore (float) – The micro fscore value considering all different classes
Return type:Dict[str, Any]
print_confusion_metrics(predicted_tag_indices: List[List[int]], true_tag_indices: List[List[int]], labels_mask: Optional[torch.ByteTensor] = None) → None

Prints the confusion matrix for a batch of tag indices. It assumes that the batch is padded and every instance is of the same length

Parameters:
  • predicted_tag_indices (List[List[int]]) – Predicted tag indices for a batch of sentences
  • true_tag_indices (List[List[int]]) – True tag indices for a batch of sentences
  • labels_mask (Optional[torch.ByteTensor]) – The labels mask, which has the same size as true_tag_indices. 0 in a position indicates that there is no masking; 1 indicates that there is masking
report_metrics(report_type='wasabi') → Any

Reports metrics in a printable format

Parameters:report_type (str) – Select one of [wasabi, paper]. If wasabi, then we return a printable table that represents the precision, recall and fmeasure for different classes
reset()

Should reset all the metrics/values being tracked by this metric. This method is generally used at the end of a training/validation epoch to reset the values before starting another epoch

sciwing.models

Simple Classifier

class sciwing.models.simpleclassifier.SimpleClassifier(encoder: nn.Module, encoding_dim: int, num_classes: int, classification_layer_bias: bool = True, label_namespace: str = 'label', datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[torch.device, str] = <sphinx.ext.autodoc.importer._MockObject object>)

Bases: sphinx.ext.autodoc.importer._MockObject, sciwing.utils.class_nursery.ClassNursery

__init__(encoder: nn.Module, encoding_dim: int, num_classes: int, classification_layer_bias: bool = True, label_namespace: str = 'label', datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[torch.device, str] = <sphinx.ext.autodoc.importer._MockObject object>)

SimpleClassifier is a linear classifier head on top of any encoder

Parameters:
  • encoder (nn.Module) – Any encoder that takes in lines and produces a single vector for every line.
  • encoding_dim (int) – The encoding dimension
  • num_classes (int) – The number of classes
  • classification_layer_bias (bool) – Whether to add a bias to the classification layer. This is set to False only for debugging purposes
  • label_namespace (str) – The namespace used for labels in the dataset
  • datasets_manager (DatasetsManager) – The datasets manager for the model
  • device (torch.device) – The device on which the model is run
forward(lines: List[sciwing.data.line.Line], labels: List[sciwing.data.label.Label] = None, is_training: bool = False, is_validation: bool = False, is_test: bool = False) → Dict[str, Any]
Parameters:
  • lines (List[Line]) – The lines from any dataset that will be passed on to the encoder
  • labels (List[Label]) – A list of labels for every instance
  • is_training (bool) – running forward on training dataset?
  • is_validation (bool) – running forward on validation dataset?
  • is_test (bool) – running forward on test dataset?
Returns:

logits: torch.FloatTensor

Un-normalized probabilities over all the classes of the shape [batch_size, num_classes]

normalized_probs: torch.FloatTensor

Normalized probabilities over all the classes of the shape [batch_size, num_classes]

loss: float

Loss value if this is a training forward pass or validation loss. There will be no loss if this is the test dataset

Return type:

Dict[str, Any]
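
The relation between logits and normalized_probs can be illustrated with a softmax over the class dimension. This is a sketch under an assumption: the documentation above only says that normalized_probs are normalized probabilities, so the use of softmax here is assumed, and plain lists stand in for tensors of shape [batch_size, num_classes].

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn un-normalized scores into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability.
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# A batch of two instances and two classes.
logits = [[2.0, 0.0], [0.0, 0.0]]
normalized_probs = [softmax(row) for row in logits]
```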

Simple Tagger

class sciwing.models.simple_tagger.SimpleTagger(rnn2seqencoder: sciwing.modules.lstm2seqencoder.Lstm2SeqEncoder, encoding_dim: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: torch.device = <sphinx.ext.autodoc.importer._MockObject object>, label_namespace: str = 'seq_label')

Bases: sphinx.ext.autodoc.importer._MockObject, sciwing.utils.class_nursery.ClassNursery

PyTorch module for Neural Parscit

__init__(rnn2seqencoder: sciwing.modules.lstm2seqencoder.Lstm2SeqEncoder, encoding_dim: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: torch.device = <sphinx.ext.autodoc.importer._MockObject object>, label_namespace: str = 'seq_label')
Parameters:
  • rnn2seqencoder (Lstm2SeqEncoder) – Lstm2SeqEncoder that encodes a set of instances to a sequence of hidden states
  • encoding_dim (int) – Hidden dimension of the lstm2seq encoder
forward(lines: List[sciwing.data.line.Line], labels: List[sciwing.data.seq_label.SeqLabel] = None, is_training: bool = False, is_validation: bool = False, is_test: bool = False)
Parameters:
  • lines (List[Line]) – A list of lines
  • labels (List[SeqLabel]) – A list of sequence labels
  • is_training (bool) – running forward on training dataset?
  • is_validation (bool) – running forward on validation dataset?
  • is_test (bool) – running forward on test dataset?
Returns:

logits: torch.FloatTensor

Un-normalized probabilities over all the classes of the shape [batch_size, num_classes]

predicted_tags: List[List[int]]

Set of predicted tags for the batch

loss: float

Loss value if this is a training forward pass or validation loss. There will be no loss if this is the test dataset

Return type:

Dict[str, Any]

Neural Parscit

class sciwing.models.neural_parscit.NeuralParscit(device: Optional[Tuple[<sphinx.ext.autodoc.importer._MockObject object at 0x7f6c1482dc90>, int]] = -1)

Bases: sphinx.ext.autodoc.importer._MockObject

It defines a neural parscit model. The model is used for citation string parsing. This class helps you use a pre-trained model whose architecture is fixed and which is trained by SciWING. You can also fine-tune the model on your own dataset.

For practitioners, we provide ways to obtain results quickly from a set of citations stored in a file or from a string. If you want to see the demo head over to our demo site.

interact()

Interact with the pretrained model. You can also interact from the command line using sciwing interact neural-parscit

predict_for_file(filename: str) → List[str]

Parse the references in a file where every line is a reference

Parameters:filename (str) – The filename where the references are stored
Returns:A list of parsed tags
Return type:List[str]
predict_for_text(text: str, show=True) → str

Parse the citation string for the given text

Parameters:
  • text (str) – reference string to parse
  • show (bool) – If True, the stylized string is printed; it uses a different color for each tag. If False, the stylized string is not printed
Returns:

The parsed citation string

Return type:

str

Citation Intent Classification

class sciwing.models.citation_intent_clf.CitationIntentClassification

Bases: sphinx.ext.autodoc.importer._MockObject

interact()

Interact with the pretrained model

predict_for_file(filename: str) → List[str]

Predict the intents for all the citations in the file. The citations should be stored one per line

Parameters:filename (str) – The filename where the citations are stored
Returns:Returns the intents for each line of citation
Return type:List[str]
predict_for_text(text: str) → str

Predict the intent for citation

Parameters:text (str) – The citation string
Returns:The predicted label for the citation
Return type:str

Generic Section Header Classification

class sciwing.models.generic_sect.GenericSect

Bases: object

interact()

Interact with the pretrained model

predict_for_file(filename: str) → List[str]

Make predictions for every line in the file

Parameters:filename (str) – The filename where section headers are stored one per line
Returns:A list of predictions
Return type:List[str]
predict_for_text(text: str, show=True) → str

Predicts the generic section headers of the text

Parameters:
  • text (str) – The section header string to be normalized
  • show (bool) – If True then we print the prediction.
Returns:

The prediction for the section header

Return type:

str

I2B2 NER

class sciwing.models.i2b2.I2B2NER

Bases: sphinx.ext.autodoc.importer._MockObject

It defines an I2B2 clinical NER model trained using SciWING

For practitioners, we provide ways to obtain results quickly from a set of citations stored in a file or from a string. If you want to see the demo head over to our demo site.

interact()
predict_for_file(filename: str) → List[str]
predict_for_text(text: str)

SectLabel

class sciwing.models.sectlabel.SectLabel(log_file: str = None, device: str = 'cpu')

Bases: object

dehyphenate(lines: List[str]) → List[str]

Dehyphenates a list of strings

Parameters:lines (List[str]) – A list of hyphenated strings
Returns:A list of dehyphenated strings
Return type:List[str]
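
To make the idea of dehyphenation concrete, here is a minimal sketch that rejoins words split by a trailing hyphen at a line break. The function name dehyphenate_sketch and its joining rule are assumptions made for illustration; SectLabel.dehyphenate may apply different rules.

```python
from typing import List


def dehyphenate_sketch(lines: List[str]) -> List[str]:
    """Join words that were split by a trailing hyphen at a line break.

    Hypothetical helper, not SciWING's implementation.
    """
    result: List[str] = []
    carry = ""  # word fragment moved down from the previous line
    for line in lines:
        line = carry + line
        carry = ""
        stripped = line.rstrip()
        if stripped.endswith("-") and len(stripped) > 1:
            # Drop the hyphen and move the trailing fragment to the next line.
            head, _, fragment = stripped[:-1].rpartition(" ")
            carry = fragment
            line = head
        result.append(line)
    if carry:
        # The last line ended in a hyphen: nothing to join with, so restore it.
        result[-1] = (result[-1] + " " + carry + "-").strip()
    return result


hyphenated = ["We study citation pars-", "ing for scholarly docu-", "ments."]
dehyphenated = dehyphenate_sketch(hyphenated)
```

A rule this simple also joins legitimate hyphens that happen to fall at the end of a line, so treat it only as an illustration of the input and output shapes.
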
extract_abstract_for_file(pdf_filename: pathlib.Path, dehyphenate: bool = True) → str

Extracts the abstract from a pdf using SectLabel. This is the programmatic Python version of the API. The APIs can be found in sciwing/api; see that for more information

Parameters:
  • pdf_filename (pathlib.Path) – The path where the pdf is stored
  • dehyphenate (bool) – Scientific documents are sometimes set in two columns, which introduces a lot of hyphenation. If True, the hyphens are removed from the extracted text
Returns:

The abstract of the pdf

Return type:

str

extract_abstract_for_folder(foldername: pathlib.Path, dehyphenate=True)

Extracts the abstracts for all the pdf files stored in a folder

Parameters:
  • foldername (pathlib.Path) – The path of the folder containing pdf files
  • dehyphenate (bool) – If True, the lines are dehyphenated. Useful if the pdfs are two-column research papers
Returns:

Writes the abstracts to files

Return type:

None

extract_all_info(pdf_filename: pathlib.Path)

Extracts information from the pdf file.

Parameters:pdf_filename (pathlib.Path) – The path of the pdf file
Returns:A dictionary containing information parsed from the pdf file
Return type:Dict[str, Any]
interact()

Interact with the pre-trained model

predict_for_file(filename: str) → List[str]

Predicts the logical sections for all the sentences in a file, with one sentence per line

Parameters:filename (str) – The path of the file
Returns:The predictions for each line.
Return type:List[str]
predict_for_pdf(pdf_filename: pathlib.Path) -> (typing.List[str], typing.List[str])

Predicts lines and labels given a pdf filename

Parameters:pdf_filename (pathlib.Path) – The location where pdf files are stored
Returns:The lines and labels inferred on the file
Return type:List[str], List[str]
predict_for_text(text: str) → str

Predicts the logical section that the line belongs to

Parameters:text (str) – A single line of text
Returns:The logical section of the text.
Return type:str
predict_for_text_batch(texts: List[str]) → List[str]

Predicts the logical section for a batch of text.

Parameters:texts (List[str]) – A batch of text
Returns:A batch of predictions
Return type:List[str]

sciwing.modules

sciwing.modules.embedders

bert_embedder
class sciwing.modules.embedders.bert_embedder.BertEmbedder(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, dropout_value: float = 0.0, aggregation_type: str = 'sum', bert_type: str = 'bert-base-uncased', word_tokens_namespace='tokens', device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154e3390>, str] = <sphinx.ext.autodoc.importer._MockObject object>)

Bases: sphinx.ext.autodoc.importer._MockObject, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, dropout_value: float = 0.0, aggregation_type: str = 'sum', bert_type: str = 'bert-base-uncased', word_tokens_namespace='tokens', device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154e3390>, str] = <sphinx.ext.autodoc.importer._MockObject object>)

Bert Embedder that embeds the given instance to BERT embeddings

Parameters:
  • dropout_value (float) – The amount of dropout to be added after the embedding
  • aggregation_type (str) –

    The kind of aggregation of different layers. BERT produces representations from different layers. This specifies the strategy for aggregating them. One of

    sum
    Sum the representations from all the layers
    average
    Average the representations from all the layers
  • bert_type (str) –

    The kind of BERT embedding to be used

    bert-base-uncased
    12 layer transformer trained on lowercased vocab
    bert-large-uncased
    24 layer transformer trained on lowercased vocab
    bert-base-cased
    12 layer transformer trained on cased vocab
    bert-large-cased
    24 layer transformer trained on cased vocab
    scibert-base-cased
    12 layer transformer trained on scientific documents on cased normal vocab
    scibert-sci-cased
    12 layer transformer trained on scientific documents on cased scientific vocab
    scibert-base-uncased
    12 layer transformer trained on scientific documents on uncased normal vocab
    scibert-sci-uncased
    12 layer transformer trained on scientific documents on uncased scientific vocab
  • word_tokens_namespace (str) – The namespace in the lines where the tokens are stored
  • device (Union[torch.device, str]) – The device on which the model is run.
forward(lines: List[sciwing.data.line.Line]) → torch.Tensor
Parameters:lines (List[Line]) – A list of lines
Returns:The BERT embeddings for all the words in the instances. The size of the returned embedding is [batch_size, max_len_word_tokens, emb_dim]
Return type:torch.Tensor
get_embedding_dimension() → int
bow_elmo_embedder
class sciwing.modules.embedders.bow_elmo_embedder.BowElmoEmbedder(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, layer_aggregation: str = 'sum', device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154f8e50>] = <sphinx.ext.autodoc.importer._MockObject object>, word_tokens_namespace='tokens')

Bases: sphinx.ext.autodoc.importer._MockObject, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, layer_aggregation: str = 'sum', device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154f8e50>] = <sphinx.ext.autodoc.importer._MockObject object>, word_tokens_namespace='tokens')

Bag of words Elmo Embedder which aggregates elmo embedding for every token

Parameters:
  • layer_aggregation (str) –

    You can choose one of [sum, average, last, first], which decides how to aggregate the different layers of ELMo. ELMo produces three layers of representations

    sum
    Representations from different layers are summed
    average
    Representations from different layers are averaged
    last
    Only the representation from the last layer is used
    first
    Only the representation from the first layer is used
  • device (Union[str, torch.device]) – device for running the model on
  • word_tokens_namespace (str) – Namespace where all the word tokens are stored
forward(lines: List[sciwing.data.line.Line]) → torch.Tensor
Parameters:lines (List[Line]) – Just a list of lines
Returns:Returns the representation for every token in the instance [batch_size, max_num_words, emb_dim]. In case of Elmo the emb_dim is 1024
Return type:torch.Tensor
get_embedding_dimension() → int
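
The four layer aggregation strategies can be illustrated on plain Python lists. This is a sketch for a single token with toy three-layer vectors; the function name aggregate_layers is hypothetical, and the real embedder operates on tensors rather than lists.

```python
from typing import List

Vector = List[float]


def aggregate_layers(layers: List[Vector], strategy: str = "sum") -> Vector:
    """Combine the per-layer vectors of one token into a single vector."""
    if strategy == "sum":
        return [sum(values) for values in zip(*layers)]
    if strategy == "average":
        return [sum(values) / len(layers) for values in zip(*layers)]
    if strategy == "first":
        return list(layers[0])
    if strategy == "last":
        return list(layers[-1])
    raise ValueError(f"unknown layer aggregation: {strategy}")


# Three toy layers of dimension 2 for a single token.
token_layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
summed = aggregate_layers(token_layers, "sum")
averaged = aggregate_layers(token_layers, "average")
```

Whatever the strategy, the result has the same dimension as a single layer, which is why the embedding dimension does not depend on the choice.
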
char_embedder
class sciwing.modules.embedders.char_embedder.CharEmbedder(char_embedding_dimension: int, hidden_dimension: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', char_tokens_namespace: str = 'char_tokens', device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154e3d90>] = <sphinx.ext.autodoc.importer._MockObject object>)

Bases: sphinx.ext.autodoc.importer._MockObject, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(char_embedding_dimension: int, hidden_dimension: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', char_tokens_namespace: str = 'char_tokens', device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154e3d90>] = <sphinx.ext.autodoc.importer._MockObject object>)

This is a character embedder that takes in lines and collates the character embeddings for all the tokens in the lines.

Parameters:
  • char_embedding_dimension (int) – The dimension of the character embedding
  • word_tokens_namespace (str) – The namespace where the word tokens are saved
  • char_tokens_namespace (str) – The namespace where the character tokens are saved
  • datasets_manager (DatasetsManager) – The dataset manager that handles all the datasets
  • hidden_dimension (int) – The hidden dimension of the LSTM which will be used to get character embeddings
forward(lines: List[sciwing.data.line.Line])
get_embedding_dimension() → int
concat_embedders
class sciwing.modules.embedders.concat_embedders.ConcatEmbedders(embedders: List[<sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154f8c10>], datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None)

Bases: sphinx.ext.autodoc.importer._MockObject, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(embedders: List[<sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154f8c10>], datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None)

Concatenates a set of embedders into a single embedder.

Parameters:embedders (List[nn.Module]) – A list of embedders that can be concatenated
forward(lines: List[sciwing.data.line.Line])
Parameters:lines (List[Line]) – A list of Lines.
Returns:Returns the concatenated embedding that is of the size [batch_size, time_steps, embedding_dimension] where the embedding_dimension is after the concatenation
Return type:torch.FloatTensor
get_embedding_dimension()
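
As a rough picture of what concatenation does to the shapes, the sketch below joins two toy embeddings of the same batch along the last axis, using nested lists in place of tensors. The function name concat_embeddings and the toy values are assumptions for illustration only.

```python
from typing import List

# One embedding: [batch_size][time_steps][embedding_dimension]
Embedding = List[List[List[float]]]


def concat_embeddings(embeddings: List[Embedding]) -> Embedding:
    """Concatenate several embeddings of the same batch along the last axis."""
    batch = []
    for sentences in zip(*embeddings):      # the same sentence from every embedder
        tokens = []
        for vectors in zip(*sentences):     # the same token from every embedder
            merged = [value for vector in vectors for value in vector]
            tokens.append(merged)
        batch.append(tokens)
    return batch


word_emb = [[[0.1, 0.2], [0.3, 0.4]]]  # shape [1, 2, 2]
char_emb = [[[1.0], [2.0]]]            # shape [1, 2, 1]
combined = concat_embeddings([word_emb, char_emb])  # shape [1, 2, 3]
```

The resulting embedding dimension is the sum of the individual embedding dimensions, here 2 + 1 = 3.
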
elmo_embedder
class sciwing.modules.embedders.elmo_embedder.ElmoEmbedder(dropout_value: float = 0.5, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154cd190> = <sphinx.ext.autodoc.importer._MockObject object>, fine_tune: bool = False)

Bases: sphinx.ext.autodoc.importer._MockObject, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

forward(lines: List[sciwing.data.line.Line])
get_embedding_dimension()
flair_embedder
class sciwing.modules.embedders.flair_embedder.FlairEmbedder(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c13f55f50>] = 'cpu', word_tokens_namespace: str = 'tokens')

Bases: sphinx.ext.autodoc.importer._MockObject, sciwing.utils.class_nursery.ClassNursery, sciwing.modules.embedders.base_embedders.BaseEmbedder

__init__(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c13f55f50>] = 'cpu', word_tokens_namespace: str = 'tokens')

Flair embeddings. These are used for Named Entity Recognition. Note: this only works if your tokens are produced by splitting on white space

Parameters:
  • embedding_type
  • datasets_manager
  • device
  • word_tokens_namespace
forward(lines: List[sciwing.data.line.Line])
get_embedding_dimension()
trainable_word_embedder
class sciwing.modules.embedders.trainable_word_embedder.TrainableWordEmbedder(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c14740410> = <sphinx.ext.autodoc.importer._MockObject object>)

Bases: sphinx.ext.autodoc.importer._MockObject, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c14740410> = <sphinx.ext.autodoc.importer._MockObject object>)

This represents trainable word embeddings which are trained along with the parameters of the network. The embeddings in the class WordEmbedder are not trainable. They are static

Parameters:
  • embedding_type (str) – The type of embedding that you would want
  • datasets_manager (DatasetsManager) – The datasets manager which is running your experiments
  • word_tokens_namespace (str) – The namespace where the word tokens are stored in your data
  • device (Union[torch.device, str]) – The device on which this embedder is run
forward(lines: List[sciwing.data.line.Line]) → <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c14740a50>
get_embedding_dimension() → int
word_embedder
class sciwing.modules.embedders.word_embedder.WordEmbedder(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace='tokens', device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154f8510>, str] = <sphinx.ext.autodoc.importer._MockObject object>)

Bases: sphinx.ext.autodoc.importer._MockObject, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace='tokens', device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154f8510>, str] = <sphinx.ext.autodoc.importer._MockObject object>)

Word Embedder embeds the tokens using the desired embeddings. These are static embeddings.

Parameters:
  • embedding_type (str) – The type of embedding that you would want
  • datasets_manager (DatasetsManager) – The datasets manager which is running your experiments
  • word_tokens_namespace (str) – The namespace where the word tokens are stored in your data
  • device (Union[torch.device, str]) – The device on which this embedder is run
forward(lines: List[sciwing.data.line.Line]) → torch.FloatTensor

This will only consider the “tokens” present in the line. The namespace for the tokens is set with the class instantiation

Parameters:lines (List[Line]) –
Returns:It returns the embedding of the size [batch_size, max_num_timesteps, embedding_dimension]
Return type:torch.FloatTensor
get_embedding_dimension() → int

bow_encoder

class sciwing.modules.bow_encoder.BOW_Encoder(embedder=None, dropout_value: float = 0, aggregation_type='sum', device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154e3f90>, str] = <sphinx.ext.autodoc.importer._MockObject object>)

Bases: sphinx.ext.autodoc.importer._MockObject, sciwing.utils.class_nursery.ClassNursery

__init__(embedder=None, dropout_value: float = 0, aggregation_type='sum', device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154e3f90>, str] = <sphinx.ext.autodoc.importer._MockObject object>)

Bag of Words Encoder

Parameters:
  • embedder (nn.Module) – Any embedder that you would want to use
  • dropout_value (float) – The input dropout value that you would want to use
  • aggregation_type (str) –
    The strategy for aggregating words
    sum
    Aggregate word embedding by summing them
    average
    Aggregate word embedding by averaging them
  • device (Union[torch.device, str]) – The device where the embeddings are stored
forward(lines: List[sciwing.data.line.Line]) → torch.FloatTensor
Parameters:lines (List[Line]) – A list of lines
Returns:The bag of words encoded embedding, either averaged or summed. The size is [batch_size, embedding_dimension]
Return type:torch.FloatTensor
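
The two aggregation types can be sketched with nested lists standing in for a [batch_size, num_words, embedding_dimension] tensor. The function name bow_encode is hypothetical, and the sketch assumes every sentence has at least one word and no padding.

```python
from typing import List


def bow_encode(
    batch: List[List[List[float]]], aggregation_type: str = "sum"
) -> List[List[float]]:
    """Collapse the word axis so each sentence becomes a single vector."""
    if aggregation_type not in ("sum", "average"):
        raise ValueError(f"unknown aggregation type: {aggregation_type}")
    encoded = []
    for words in batch:
        # Sum the word vectors component by component.
        vector = [sum(values) for values in zip(*words)]
        if aggregation_type == "average":
            vector = [value / len(words) for value in vector]
        encoded.append(vector)
    return encoded


# Two sentences, two words each, embedding dimension 2.
toy_batch = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [2.0, 3.0]]]
bow_sum = bow_encode(toy_batch, "sum")
bow_average = bow_encode(toy_batch, "average")
```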

charlstm_encoder

class sciwing.modules.charlstm_encoder.CharLSTMEncoder(char_embedder: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154d5050>, char_emb_dim: int, hidden_dim: int = 1024, bidirectional: bool = False, combine_strategy: str = 'concat', device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154d5090> = <sphinx.ext.autodoc.importer._MockObject object>)

Bases: sphinx.ext.autodoc.importer._MockObject, sciwing.utils.class_nursery.ClassNursery

__init__(char_embedder: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154d5050>, char_emb_dim: int, hidden_dim: int = 1024, bidirectional: bool = False, combine_strategy: str = 'concat', device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154d5090> = <sphinx.ext.autodoc.importer._MockObject object>)

Encodes character tokens using lstms

Parameters:
  • char_embedder (nn.Module) – An embedder that embeds character tokens
  • char_emb_dim (int) – The embedding dimension of the characters
  • hidden_dim (int) – Hidden dimension of the LSTM
  • bidirectional (bool) – Should the LSTM be bi-directional
  • combine_strategy (str) – Combine strategy for the lstm hidden dimensions
  • device (torch.device) – The device on which the LSTM will run
forward(iter_dict: Dict[str, Any])
Parameters:iter_dict (Dict[str, Any]) – expects char_tokens to be present in the iter_dict from any dataset
Returns:A tensor of shape [batch_size, num_time_steps, hidden_dim], where hidden_dim is the hidden dimension of the LSTM. If the LSTM is bidirectional and the combine strategy is concat, the last dimension is 2 * self.hidden_dim
Return type:torch.Tensor

lstm2seqencoder

class sciwing.modules.lstm2seqencoder.Lstm2SeqEncoder(embedder: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154c9990>, dropout_value: float = 0.0, hidden_dim: int = 1024, bidirectional: bool = False, num_layers: int = 1, combine_strategy: str = 'concat', rnn_bias: bool = False, device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154c9a50> = <sphinx.ext.autodoc.importer._MockObject object>, add_projection_layer: bool = True, projection_activation: str = 'Tanh')

Bases: sphinx.ext.autodoc.importer._MockObject, sciwing.utils.class_nursery.ClassNursery

__init__(embedder: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154c9990>, dropout_value: float = 0.0, hidden_dim: int = 1024, bidirectional: bool = False, num_layers: int = 1, combine_strategy: str = 'concat', rnn_bias: bool = False, device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154c9a50> = <sphinx.ext.autodoc.importer._MockObject object>, add_projection_layer: bool = True, projection_activation: str = 'Tanh')

Encodes a set of tokens to a set of hidden states.

Parameters:
  • embedder (nn.Module) – Any embedder can be used for this purpose
  • dropout_value (float) – The dropout value for the embedding
  • hidden_dim (int) – The hidden dimensions for the LSTM
  • bidirectional (bool) – Whether the LSTM is bidirectional
  • num_layers (int) – The number of layers of the LSTM
  • combine_strategy (str) –

    The strategy to combine the different layers of the LSTM. This can be one of

    sum
    Sum the different layers of the embedding
    concat
    Concat the layers of the embedding
  • rnn_bias (bool) – Set this to false only for debugging purposes
  • device (torch.device) –
  • add_projection_layer (bool) – Adds a projection layer after the lstm over the hidden activation
  • projection_activation (str) – Refer to torch.nn activations. Use any class name as a projection here
forward(lines: List[sciwing.data.line.Line], c0: torch.FloatTensor = None, h0: torch.FloatTensor = None) → torch.Tensor
Parameters:
  • lines (List[Line]) – A list of lines
  • c0 (torch.FloatTensor) – The initial state vector for the LSTM
  • h0 (torch.FloatTensor) – The initial hidden state for the LSTM
Returns:

Returns the vector encoding of the set of instances: [batch_size, seq_len, hidden_dim] if single direction, [batch_size, seq_len, 2*hidden_dim] if bidirectional

Return type:

torch.Tensor

get_initial_hidden(batch_size: int)

lstm2vecencoder

class sciwing.modules.lstm2vecencoder.LSTM2VecEncoder(embedder, dropout_value: float = 0.0, hidden_dim: int = 1024, bidirectional: bool = False, combine_strategy: str = 'concat', rnn_bias: bool = True, device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154c9490>] = <sphinx.ext.autodoc.importer._MockObject object>)

Bases: sphinx.ext.autodoc.importer._MockObject, sciwing.utils.class_nursery.ClassNursery

__init__(embedder, dropout_value: float = 0.0, hidden_dim: int = 1024, bidirectional: bool = False, combine_strategy: str = 'concat', rnn_bias: bool = True, device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c154c9490>] = <sphinx.ext.autodoc.importer._MockObject object>)

LSTM2Vec encoder that encodes a series of tokens to a single vector representation

Parameters:
  • embedder (nn.Module) – Any embedder can be passed
  • dropout_value (float) – The dropout value for input embeddings
  • hidden_dim (int) – The hidden dimension for the LSTM
  • bidirectional (bool) – Whether the LSTM is bidirectional or not
  • combine_strategy (str) – Strategy to combine the vectors from two different directions
  • rnn_bias (bool) – Whether to use the bias layer in RNN. Should be set to false only for debugging purposes
  • device (Union[str, torch.device]) – The device on which the model is run
forward(lines: List[sciwing.data.line.Line], c0: torch.FloatTensor = None, h0: torch.FloatTensor = None) → torch.Tensor
Parameters:
  • lines (List[Line]) – A list of lines to be encoded
  • c0 (torch.FloatTensor) – The initial state vector for the LSTM
  • h0 (torch.FloatTensor) – The initial hidden state for the LSTM
Returns:

Returns the vector encoding of the set of instances: [batch_size, hidden_dim] if single direction, [batch_size, 2*hidden_dim] if bidirectional

Return type:

torch.Tensor

get_initial_hidden(batch_size: int)

Gets the initial hidden states of the LSTM2Vec encoder

Parameters:batch_size (int) – The batch size of the current forward pass
Returns:
Return type:torch.Tensor, torch.Tensor
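
How the combine strategy affects the size of the encoding can be sketched as follows. Both function names are hypothetical. The concat case follows the shapes documented for the LSTM encoders; the sum case, which keeps the size at hidden_dim, is an assumption, and any projection layer is ignored.

```python
from typing import List


def combine_directions(
    forward: List[float], backward: List[float], combine_strategy: str = "concat"
) -> List[float]:
    """Combine the forward and backward hidden states of one time step."""
    if combine_strategy == "concat":
        return forward + backward
    if combine_strategy == "sum":
        return [f + b for f, b in zip(forward, backward)]
    raise ValueError(f"unknown combine strategy: {combine_strategy}")


def encoding_dim(
    hidden_dim: int, bidirectional: bool, combine_strategy: str = "concat"
) -> int:
    """Size of the last axis of the encoding (assumed, see the lead-in)."""
    if bidirectional and combine_strategy == "concat":
        return 2 * hidden_dim
    return hidden_dim


concatenated = combine_directions([1.0, 2.0], [3.0, 4.0], "concat")
summed_state = combine_directions([1.0, 2.0], [3.0, 4.0], "sum")
```

This is the value to keep in mind when choosing the input size of whatever layer consumes the encoder output.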

sciwing.numericalizers

numericalizer

class sciwing.numericalizers.numericalizer.Numericalizer(vocabulary: sciwing.vocab.vocab.Vocab = None)

Bases: sciwing.numericalizers.base_numericalizer.BaseNumericalizer

__init__(vocabulary: sciwing.vocab.vocab.Vocab = None)

Numericalizer converts tokens that are strings to numbers

Parameters:vocabulary (Vocab) – A vocabulary object that is built using a set of tokenized strings
get_mask_for_batch_instances(instances: List[List[int]]) → <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c14bcd590>
get_mask_for_instance(instance: List[int]) → <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c14bcd550>
numericalize_batch_instances(instances: List[List[str]]) → List[List[int]]

Numericalizes a batch of instances

Parameters:instances (List[List[str]]) – A list of tokenized sentences
Returns:A list of numericalized instances
Return type:List[List[int]]
numericalize_instance(instance: List[str]) → List[int]

Numericalize a single instance

Parameters:instance (List[str]) – An instance is a list of tokens
Returns:Numericalized instance
Return type:List[int]
pad_batch_instances(instances: List[List[int]], max_length: int, add_start_end_token: bool = True) → List[List[int]]

Pads a batch of instances according to the vocab object

Parameters:
  • instances (List[List[int]]) –
  • max_length (int) –
  • add_start_end_token (bool) –
Returns:

Return type:

List[List[int]]

pad_instance(numericalized_text: List[int], max_length: int, add_start_end_token: bool = True) → List[int]

Pads the instance according to the vocab object

Parameters:
  • numericalized_text (List[int]) – The numericalized instance to pad
  • max_length (int) – The maximum length to pad to
  • add_start_end_token (bool) – If true, start and end token will be added to the tokenized text
Returns:

Padded instance

Return type:

List[int]
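
The numericalize-then-pad flow can be sketched with a toy vocabulary. Everything here is hypothetical: the class ToyNumericalizer, the special token strings, their indices, and the choice to truncate so that the start and end tokens still fit. The real Numericalizer takes these details from its Vocab object.

```python
from typing import Dict, List

# Hypothetical special tokens; the real names live in the Vocab object.
PAD, UNK, START, END = "<PAD>", "<UNK>", "<SOS>", "<EOS>"


class ToyNumericalizer:
    def __init__(self, token2idx: Dict[str, int]):
        self.token2idx = token2idx

    def numericalize_instance(self, instance: List[str]) -> List[int]:
        # Unknown tokens fall back to the index of the UNK token.
        unk = self.token2idx[UNK]
        return [self.token2idx.get(token, unk) for token in instance]

    def pad_instance(
        self,
        numericalized_text: List[int],
        max_length: int,
        add_start_end_token: bool = True,
    ) -> List[int]:
        ids = list(numericalized_text)
        if add_start_end_token:
            # Reserve two slots for the start and end markers.
            ids = ids[: max(max_length - 2, 0)]
            ids = [self.token2idx[START]] + ids + [self.token2idx[END]]
        else:
            ids = ids[:max_length]
        padding = [self.token2idx[PAD]] * (max_length - len(ids))
        return ids + padding


vocab = {PAD: 0, UNK: 1, START: 2, END: 3, "neural": 4, "parsing": 5}
numericalizer = ToyNumericalizer(vocab)
ids = numericalizer.numericalize_instance(["neural", "citation", "parsing"])
padded = numericalizer.pad_instance(ids, max_length=7)
```

Here "citation" is not in the toy vocabulary, so it maps to the unknown index, and the padded instance is filled up to max_length with the padding index.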

vocabulary

transformer_numericalizer

class sciwing.numericalizers.transformer_numericalizer.NumericalizerForTransformer(vocab: sciwing.vocab.vocab.Vocab = None, tokenizer: sciwing.tokenizers.bert_tokenizer.TokenizerForBert = None)

Bases: sciwing.numericalizers.base_numericalizer.BaseNumericalizer

get_mask_for_batch_instances(instances: List[List[int]])
get_mask_for_instance(instance: List[int])
numericalize_batch_instances(instances: List[List[str]]) → List[int]
numericalize_instance(instance: Union[List[str], List[sciwing.data.token.Token]]) → List[int]
pad_batch_instances(instances: List[List[int]], max_length: int, add_start_end_token: bool = True)

Pads a batch of instances according to the vocab object

Parameters:
  • instances (List[List[int]]) –
  • max_length (int) –
  • add_start_end_token (bool) –
Returns:

Return type:

List[List[int]]

pad_instance(numericalized_text: List[int], max_length: int, add_start_end_token: bool = True) → List[int]

Pads the instance according to the vocab object

Parameters:
  • numericalized_text (List[int]) – The numericalized instance to pad
  • max_length (int) – The maximum length to pad to
  • add_start_end_token (bool) – If true, start and end token will be added to the tokenized text
Returns:

Padded instance

Return type:

List[int]

sciwing.preprocessing

instance_preprocessing

class sciwing.preprocessing.instance_preprocessing.InstancePreprocessing

Bases: object

This class implements some common pre-processing that may be applied on instances which are List[str]. For example, you can remove stop words, convert words to lower case, and so on. Most of the methods here accept an instance and return an instance

static indicate_capitalization(instance: List[str]) → List[str]

Indicates whether every word is all lowercase, all caps or capitalized

Parameters:instance (List[str]) – A list of tokens
Returns:Strings indicating capitalization
Return type:List[str]
static lowercase(instance: List[str]) → List[str]
remove_stop_words(instance: List[str]) → List[str]

Remove stop words if they are present. We use the stop-words package from pip: https://github.com/Alir3z4/python-stop-words

Parameters:instance (List[str]) – The list of tokens
Returns:The instance with stop words removed
Return type:List[str]
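
The "instance in, instance out" pattern can be sketched as below. The function names, the indicator strings and the tiny stop word set are all assumptions for illustration; InstancePreprocessing may return different strings and relies on the stop-words package rather than a hard-coded set.

```python
from typing import List

# Hypothetical, tiny stop word set used only for this sketch.
TOY_STOP_WORDS = {"the", "of", "a", "in"}


def indicate_capitalization_sketch(instance: List[str]) -> List[str]:
    """Return one capitalization indicator per token."""
    indicators = []
    for token in instance:
        if token.islower():
            indicators.append("all_lower")
        elif token.isupper():
            indicators.append("all_caps")
        elif token[:1].isupper():
            indicators.append("capitalized")
        else:
            indicators.append("unknown")  # e.g. numbers or mixed case
    return indicators


def lowercase_sketch(instance: List[str]) -> List[str]:
    return [token.lower() for token in instance]


def remove_stop_words_sketch(instance: List[str]) -> List[str]:
    return [token for token in instance if token.lower() not in TOY_STOP_WORDS]


tokens = ["the", "ACL", "Anthology", "2020"]
capitalization = indicate_capitalization_sketch(tokens)
without_stop_words = remove_stop_words_sketch(tokens)
```

Because every function maps a List[str] to a List[str], such steps can be chained in any order.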

sciwing.tokenizers

BaseTokenizer

class sciwing.tokenizers.BaseTokenizer.BaseTokenizer

Bases: object

tokenize(text: str) → List[str]
tokenize_batch(texts: List[str]) → List[List[str]]

bert_tokenizer

class sciwing.tokenizers.bert_tokenizer.TokenizerForBert(bert_type: str, do_basic_tokenize=True)

Bases: sciwing.tokenizers.BaseTokenizer.BaseTokenizer

tokenize(text: str) → List[str]
tokenize_batch(texts: List[str]) → List[List[str]]

character_tokenizer

class sciwing.tokenizers.character_tokenizer.CharacterTokenizer

Bases: sciwing.tokenizers.BaseTokenizer.BaseTokenizer

tokenize(text: str) → List[str]
tokenize_batch(texts: List[str]) → List[List[str]]

word_tokenizer

class sciwing.tokenizers.word_tokenizer.WordTokenizer(tokenizer: str = 'spacy')

Bases: sciwing.tokenizers.BaseTokenizer.BaseTokenizer

__init__(tokenizer: str = 'spacy')

WordTokenizers split the text into tokens

Parameters:tokenizer (str) –

The type of tokenizer.

spacy
Tokenizer from spaCy
nltk
NLTK based tokenizer
vanilla
Tokenize words according to space
spacy-whitespace
Same as vanilla but implemented using a custom white space tokenizer from spaCy
tokenize(text: str) → List[str]

Tokenize text into a set of tokens

Parameters:text (str) – A single instance that is tokenized to a set of tokens
Returns:A set of tokens
Return type:List[str]
tokenize_batch(texts: List[str]) → List[List[str]]

Tokenize a batch of sentences

Parameters:texts (List[str]) –
Returns:
Return type:List[List[str]]
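
The vanilla option is the easiest one to picture. The sketch below splits on white space with str.split and maps the same function over a batch. The function names are hypothetical, and the library may treat repeated or unusual white space differently.

```python
from typing import List


def vanilla_tokenize(text: str) -> List[str]:
    # str.split() with no argument splits on any run of white space.
    return text.split()


def vanilla_tokenize_batch(texts: List[str]) -> List[List[str]]:
    return [vanilla_tokenize(text) for text in texts]


single = vanilla_tokenize("Neural citation parsing")
batch = vanilla_tokenize_batch(["Related Work", "Experiments and  Results"])
```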

sciwing.utils

Amazon S3 Utils

class sciwing.utils.amazon_s3.S3Util(aws_cred_config_json_filename: str)

Bases: object

__init__(aws_cred_config_json_filename: str)

Some utilities that would be useful to upload folders/models to s3

Parameters:aws_cred_config_json_filename (str) –

You need to instantiate this class with an AWS configuration JSON file

The following will be the keys and values
aws_access_key_id : str
The access key id for the AWS account that you have
aws_access_secret : str
The access secret
region : str
The region in which your bucket is present
parsect_bucket_name : str
The name of the bucket where all the models/experiments will be stored
download_file(filename_s3: str, local_filename: str)

Downloads a file from s3

Parameters:
  • filename_s3 (str) – A filename in s3 that needs to be downloaded
  • local_filename (str) – The local filename that will be used
download_folder(folder_name_s3: str, download_only_best_checkpoint: bool = False, chkpoints_foldername: str = 'checkpoints', best_model_filename='best_model.pt', output_dir: str = '/home/docs/.sciwing.output_cache')

Downloads a folder from s3 recursively

Parameters:
  • folder_name_s3 (str) – The name of the folder in s3
  • download_only_best_checkpoint (bool) – If the folder being downloaded is an experiment folder, then you can download only the best model checkpoints for running test or inference
  • chkpoints_foldername (str) – The name of the checkpoints folder where the best model parameters are stored
  • best_model_filename (str) – The name of the file where the best model parameters are stored
get_client()

Returns boto3 client

Returns:The client object that manages all the AWS operations. The client is the low-level access to the connection with s3
Return type:boto3.client
get_resource()

Returns a high level manager for the aws bucket

Returns:Resource that manages connections with s3
Return type:boto3.resource
load_credentials() → NamedTuple

Read the credentials from the json file

Returns:a named tuple with access_key, access_secret, region and bucket_name as the keys and the corresponding values filled in
Return type:NamedTuple
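
The configuration keys listed for __init__ and the named tuple returned by load_credentials() can be illustrated with a small standalone sketch. It does not use S3Util or boto3; the Credentials class, the helper function and all values are hypothetical placeholders, and only the four JSON key names and the four tuple field names come from the descriptions above.

```python
import json
import tempfile
from pathlib import Path
from typing import NamedTuple


class Credentials(NamedTuple):
    # Hypothetical container; the field names follow the
    # load_credentials() description above.
    access_key: str
    access_secret: str
    region: str
    bucket_name: str


def load_credentials(config_filename: str) -> Credentials:
    """Read the four documented keys from a JSON file into a named tuple."""
    with open(config_filename) as fp:
        config = json.load(fp)
    return Credentials(
        access_key=config["aws_access_key_id"],
        access_secret=config["aws_access_secret"],
        region=config["region"],
        bucket_name=config["parsect_bucket_name"],
    )


# Placeholder values only; never commit real credentials.
example_config = {
    "aws_access_key_id": "EXAMPLE_KEY_ID",
    "aws_access_secret": "EXAMPLE_SECRET",
    "region": "ap-southeast-1",
    "parsect_bucket_name": "example-bucket",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "aws_cred.json"
    config_path.write_text(json.dumps(example_config))
    credentials = load_credentials(str(config_path))

print(credentials.region)
```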
search_folders_with(pattern)

Searches for folders in the s3 bucket with specific pattern

Parameters:pattern (str) – A regex pattern
Returns:The list of foldernames that match the pattern
Return type:List[str]
upload_file(filename: str, obj_name: str = None)
Parameters:
  • filename (str) – The filename in the local directory that needs to be uploaded to s3
  • obj_name (str) – The filename to be used in s3 bucket. If None then obj_name in s3 will be the same as the filename
upload_folder(folder_name: str, base_folder_name: str)

Recursively uploads a folder to s3

Parameters:
  • folder_name (str) – The name of the local folder that is uploaded
  • base_folder_name (str) – The name of the folder from which the folder being uploaded stems. This is needed to associate files and directories with their hierarchies within the folder

Class Nursery

class sciwing.utils.class_nursery.ClassNursery

Bases: object

ClassNursery is the place where all the classes in SciWING are nursed

SciWING needs to get a handle on the different classes that are being used. This is useful, for example, when we have to instantiate the appropriate classes when experiments are run from the TOML file.

This uses a Python 3.6 feature called __init_subclass__ that simplifies class creation. Whenever ClassNursery is mentioned as the parent class of a class, __init_subclass__ is called. In SciWING we use it as a plugin registry where the mapping between the different classes and their modules is stored.

class_nursery = {'Adam': <sphinx.ext.autodoc.importer._MockObject object>, 'BOW_Encoder': 'sciwing.modules.bow_encoder', 'BertEmbedder': 'sciwing.modules.embedders.bert_embedder', 'BowElmoEmbedder': 'sciwing.modules.embedders.bow_elmo_embedder', 'CharEmbedder': 'sciwing.modules.embedders.char_embedder', 'CharLSTMEncoder': 'sciwing.modules.charlstm_encoder', 'CoNLLDatasetManager': 'sciwing.datasets.seq_labeling.conll_dataset', 'ConcatEmbedders': 'sciwing.modules.embedders.concat_embedders', 'ElmoEmbedder': 'sciwing.modules.embedders.elmo_embedder', 'Engine': 'sciwing.engine.engine', 'FlairEmbedder': 'sciwing.modules.embedders.flair_embedder', 'LSTM2VecEncoder': 'sciwing.modules.lstm2vecencoder', 'Lstm2SeqEncoder': 'sciwing.modules.lstm2seqencoder', 'PrecisionRecallFMeasure': 'sciwing.metrics.precision_recall_fmeasure', 'RnnSeqCrfTagger': 'sciwing.models.rnn_seq_crf_tagger', 'SGD': <sphinx.ext.autodoc.importer._MockObject object>, 'SimpleClassifier': 'sciwing.models.simpleclassifier', 'SimpleTagger': 'sciwing.models.simple_tagger', 'TextClassificationDatasetManager': 'sciwing.datasets.classification.text_classification_dataset', 'TokenClassificationAccuracy': 'sciwing.metrics.token_cls_accuracy', 'TrainableWordEmbedder': 'sciwing.modules.embedders.trainable_word_embedder', 'WordEmbedder': 'sciwing.modules.embedders.word_embedder'}
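
The plugin-registry idea can be sketched in a few lines of plain Python. This is a minimal illustration of __init_subclass__, not the SciWING implementation: the names Registry, DuplicateClassError, ToyEncoder and ToyClassifier are hypothetical, and rejecting duplicate class names is modelled on the ClassInNurseryError description rather than on the library source.

```python
class DuplicateClassError(KeyError):
    """Hypothetical stand-in for an error raised on duplicate class names."""


class Registry:
    # Maps class name -> module name for every registered subclass.
    registry = {}

    def __init_subclass__(cls, **kwargs):
        # Called automatically whenever a class lists Registry as a parent.
        super().__init_subclass__(**kwargs)
        if cls.__name__ in Registry.registry:
            raise DuplicateClassError(f"{cls.__name__} is already registered")
        Registry.registry[cls.__name__] = cls.__module__


class ToyEncoder(Registry):
    pass


class ToyClassifier(Registry):
    pass


print(sorted(Registry.registry))
```

Storing the module name rather than the class object mirrors the name-to-module mapping shown in class_nursery above; the class itself can then be imported lazily when it is needed.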

Common Utils

sciwing.utils.common.cached_path(path: Union[pathlib.Path, str], url: str, unzip=True) → pathlib.Path
sciwing.utils.common.chunks(seq, n)

Yield successive n-sized chunks from seq.
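
A minimal sketch of n-sized chunking, assuming seq supports len() and slicing. It is written independently of the library source, and the last chunk may be shorter than n.

```python
from typing import Iterator, Sequence


def chunks(seq: Sequence, n: int) -> Iterator[Sequence]:
    """Yield successive n-sized slices of seq."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    for start in range(0, len(seq), n):
        yield seq[start:start + n]


batches = list(chunks([1, 2, 3, 4, 5], 2))
print(batches)
```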

sciwing.utils.common.convert_generic_sect_to_json(filename: str) → Dict[str, Any]

Converts the Generic sect data file into more readable json format

Parameters:filename (str) – The sectlabel file name available at WING-NUS website
Returns:
text
The text of the line
label
The label of the file
file_no
A unique file number
line_count
A line count within the file
Return type:Dict[str, Any]
sciwing.utils.common.convert_generic_sect_to_sciwing_clf_format(filename: str, out_dir: str)

Converts the generic sect original file to the sciwing classification format

Parameters:
  • filename (str) – The path of the file where the original generic section classification file is stored
  • out_dir (str) – The output path where the train, dev and test files are written
Return type:None

sciwing.utils.common.convert_parscit_to_conll(parscit_train_filepath: pathlib.Path) → List[Dict[str, Any]]

Convert the parscit data available at “https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/parsCit.train.data” to a CONLL dummy version. This is done so that we can use it with AllenNLP's built-in data reader called the conll2003 dataset reader

Parameters:parscit_train_filepath (pathlib.Path) – The path where the train file path is stored
sciwing.utils.common.convert_parscit_to_sciwing_seqlabel_format(parscit_train_filepath: pathlib.Path, output_dir: str)

Convert the parscit data available at “https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/parsCit.train.data” to the format required for sciwing sequential labelling

Parameters:
  • parscit_train_filepath (pathlib.Path) – The local path where the files are stored
  • output_dir (str) – The output dir where the train dev and test file will be written
sciwing.utils.common.convert_sectlabel_to_json(filename: str) → Dict[KT, VT]

Converts the secthead file into more readable json format

Parameters:filename (str) – The sectlabel file name available at WING-NUS website
Returns:
text
The text of the line
label
The label of the file
file_no
A unique file number
line_count
A line count within the file
Return type:Dict[str, Any]
sciwing.utils.common.convert_sectlabel_to_sciwing_clf_format(filename: str, out_dir: str)

Writes the file in the format required for sciwing text classification dataset

Parameters:
  • filename (str) – The path of the sectlabel original format file.
  • out_dir (str) – The path where the new files will be written
sciwing.utils.common.create_class(classname: str, module_name: str) → type

Given the classname and module, creates a class object and returns it

Parameters:
  • classname (str) – Class name to import
  • module_name (str) – The module in which the class is present
Return type:type
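
Looking up a class from its name and module name can be done with importlib. The sketch below is an assumption about how such a helper might work, not the library's code, and it uses a standard-library class for the demonstration.

```python
import importlib


def create_class(classname: str, module_name: str) -> type:
    """Import module_name and return the attribute called classname."""
    module = importlib.import_module(module_name)
    return getattr(module, classname)


# Demonstration with a standard-library class.
ordered_dict_cls = create_class("OrderedDict", "collections")
instance = ordered_dict_cls(a=1)
print(ordered_dict_cls.__name__)
```

Combined with a name-to-module mapping such as class_nursery, this kind of lookup lets a configuration file refer to classes by name only.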

sciwing.utils.common.download_file(url: str, dest_filename: str) → None

Download a file from the given url

Parameters:
  • url (str) – The url from which the file will be downloaded
  • dest_filename (str) – The destination filename
sciwing.utils.common.extract_tar(filename: str, destination_dir: str, mode='r')

Extracts tar, targz and other files

Parameters:
  • filename (str) – The tar zipped file
  • destination_dir (str) – The destination directory in which the files should be placed
  • mode (str) – A valid tar mode. You can refer to https://docs.python.org/3/library/tarfile.html for the different modes.
sciwing.utils.common.extract_zip(filename: str, destination_dir: str)

Extracts a zipped file

Parameters:
  • filename (str) – The zipped filename
  • destination_dir (str) – The directory where the extracted files will be placed
sciwing.utils.common.flatten(list_items: List[Any]) → List[Any]

Flattens an arbitrarily long nesting of lists

Parameters:list_items (List[Any]) – It can be an arbitrarily long nesting of lists
Returns:Flattened list
Return type:List
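
A minimal recursive sketch of flattening, assuming only list instances count as nesting (tuples and strings are kept as single items). It is not taken from the library source.

```python
from typing import Any, List


def flatten(list_items: List[Any]) -> List[Any]:
    """Flatten arbitrarily nested lists into one flat list."""
    flat: List[Any] = []
    for item in list_items:
        if isinstance(item, list):
            # Recurse into nested lists and splice their items in place.
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


flattened = flatten([1, [2, [3, [4]]], 5])
print(flattened)
```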
sciwing.utils.common.get_system_mem_in_gb()

Returns the total system memory in GB

Returns:Memory size in GB
Return type:float
sciwing.utils.common.get_train_dev_test_stratified_split(lines: List[str], labels: List[str], train_split: float = 0.8, dev_split: float = 0.1, test_split: float = 0.1, random_state: int = 1729) -> ((typing.List[str], typing.List[str]), (typing.List[str], typing.List[str]), (typing.List[str], typing.List[str]))

Splits the lines and labels into train, dev and test splits using a stratified random shuffle

Parameters:
  • lines (List[str]) – A list of lines
  • labels (List[str]) – A list of labels
  • train_split (float) – The proportion of lines to be used for training
  • dev_split (float) – The proportion of lines to be used for validation
  • test_split (float) – The proportion of lines to be used for testing
  • random_state (int) – The seed to be used for randomization. Good for reproducing the same splits. Passing None will cause the random number generator to be the RandomState used by np.random
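
The idea of a stratified split is that every label keeps roughly the same proportion in each of the three splits. The standalone sketch below illustrates this with the standard library only; the function name and the rounding rule are assumptions, and it will not reproduce the exact splits produced by the library function.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Split = Tuple[List[str], List[str]]


def stratified_split(
    lines: List[str],
    labels: List[str],
    train_split: float = 0.8,
    dev_split: float = 0.1,
    random_state: int = 1729,
) -> Tuple[Split, Split, Split]:
    """Split (lines, labels) so each label keeps its share in every split."""
    rng = random.Random(random_state)
    by_label = defaultdict(list)
    for line, label in zip(lines, labels):
        by_label[label].append(line)

    train, dev, test = ([], []), ([], []), ([], [])
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(round(len(group) * train_split))
        n_dev = int(round(len(group) * dev_split))
        # Whatever is left after train and dev goes to test.
        parts = (
            group[:n_train],
            group[n_train:n_train + n_dev],
            group[n_train + n_dev:],
        )
        for (split_lines, split_labels), part in zip((train, dev, test), parts):
            split_lines.extend(part)
            split_labels.extend([label] * len(part))
    return train, dev, test


example_lines = [f"line {i}" for i in range(20)]
example_labels = ["a"] * 10 + ["b"] * 10
(train_x, train_y), (dev_x, dev_y), (test_x, test_y) = stratified_split(
    example_lines, example_labels
)
print(len(train_x), len(dev_x), len(test_x))
```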
sciwing.utils.common.merge_dictionaries_with_sum(a: Dict[KT, VT], b: Dict[KT, VT]) → Dict[KT, VT]
sciwing.utils.common.pack_to_length(tokenized_text: List[str], max_length: int, pad_token: str = '<PAD>', add_start_end_token: bool = False, start_token: str = '<SOS>', end_token: str = '<EOS>') → List[str]

Packs tokenized text to maximum length

Parameters:
  • tokenized_text (List[str]) – A list of tokens
  • max_length (int) – The max length to pack to
  • pad_token (str) – The pad token to be used for the padding
  • add_start_end_token (bool) – Whether to add the start and end token to every sentence while packing
  • start_token (str) – The start token to be used if add_start_end_token is True
  • end_token (str) – The end token to be used if add_start_end_token is True
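
A minimal sketch of packing under one explicit assumption: when add_start_end_token is True, the start and end tokens count towards max_length, so the text is truncated to max_length - 2 tokens first. The library may handle this case differently; the sketch only illustrates truncate-then-pad.

```python
from typing import List


def pack_to_length(
    tokenized_text: List[str],
    max_length: int,
    pad_token: str = "<PAD>",
    add_start_end_token: bool = False,
    start_token: str = "<SOS>",
    end_token: str = "<EOS>",
) -> List[str]:
    """Truncate or pad tokenized_text so the result has max_length tokens."""
    if add_start_end_token:
        # Assumption: the start and end tokens count towards max_length.
        body = tokenized_text[: max(max_length - 2, 0)]
        packed = [start_token] + body + [end_token]
    else:
        packed = tokenized_text[:max_length]
    # A non-positive multiplier yields an empty list, so no padding is added
    # when the sequence is already long enough.
    return packed + [pad_token] * (max_length - len(packed))


padded = pack_to_length(["a", "b"], 4)
wrapped = pack_to_length(["a", "b", "c", "d"], 4, add_start_end_token=True)
print(padded)
print(wrapped)
```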
sciwing.utils.common.pairwise(iterable: Iterable[T_co]) → Iterator[T_co]

Return the overlapping pairwise elements of the iterable

Parameters:iterable (Iterable) – Anything that can be iterated
Returns:Iterator over the paired sequence
Return type:Iterator
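
Overlapping pairs can be produced with the well-known itertools.tee recipe. This sketch assumes that "overlapping pairwise elements" means (s0, s1), (s1, s2), and so on.

```python
from itertools import tee
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def pairwise(iterable: Iterable[T]) -> Iterator[Tuple[T, T]]:
    """Return (s0, s1), (s1, s2), (s2, s3), ... for the given iterable."""
    first, second = tee(iterable)
    # Advance the second iterator by one so the pairs overlap.
    next(second, None)
    return zip(first, second)


pairs = list(pairwise([1, 2, 3, 4]))
print(pairs)
```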
sciwing.utils.common.write_cora_to_conll_file(cora_conll_filepath: pathlib.Path) → None

Writes the cora file that is available at https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/cora.train to CONLL format

Parameters:cora_conll_filepath (pathlib.Path) – The destination filepath where the CORA data is written in CONLL format
sciwing.utils.common.write_nfold_parscit_train_test(parscit_train_filepath: pathlib.Path, output_train_filepath: pathlib.Path, output_test_filepath: pathlib.Path, nsplits: int = 2) → bool

Convert the parscit train folder into different folds. This is useful for n-fold cross validation on the dataset. This method can be iterated over to get all the different folds of the data contained in the parscit_train_filepath

Parameters:
  • parscit_train_filepath (pathlib.Path) – The path where the Parscit file is stored. The file is available at https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/cora.train
  • output_train_filepath (pathlib.Path) – The path where the train fold of the dataset will be stored
  • output_test_filepath (pathlib.Path) – The path where the test fold of the dataset will be stored
  • nsplits (int) – The number of splits in the dataset.
Returns:Indicates whether the particular fold has been written
Return type:bool

sciwing.utils.common.write_parscit_to_conll_file(parscit_conll_filepath: pathlib.Path) → None

Write Parscit file to CONLL file format

Parameters:parscit_conll_filepath (pathlib.Path) – The destination file where the parscit data is written to

Custom Spacy Tokenizers

This module implements custom spacy tokenizers. These can be useful for the custom tokenization that is required for the scientific domain.

class sciwing.utils.custom_spacy_tokenizers.CustomSpacyWhiteSpaceTokenizer(vocab)

Bases: object

__init__(vocab)

White space tokenizer tokenizes the word according to spaces.

Parameters:vocab (nlp.vocab) – Spacy vocab object

Custom Exceptions

exception sciwing.utils.exceptions.ClassInNurseryError

Bases: KeyError

The ClassNursery cannot have two classes of the same name. This error is raised when that happens

exception sciwing.utils.exceptions.DatasetPresentError(message: str)

Bases: Exception

exception sciwing.utils.exceptions.TOMLConfigurationError(message: str)

Bases: Exception

This error is raised for illegal configuration of TOML

Science IE Data Utils

class sciwing.utils.science_ie_data_utils.ScienceIEDataUtils(folderpath: pathlib.Path, ignore_warnings=False)

Bases: object

Science-IE is a SemEval Task that is aimed at extracting entities from scientific articles. This class is a utility for various operations on the competition's data files.

__init__(folderpath: pathlib.Path, ignore_warnings=False)

Given the folderpath where the ScienceIE data is stored, this class provides various utilities. For more information on the dataset you can refer to https://scienceie.github.io/

Parameters:
  • folderpath (pathlib.Path) – The path where the ScienceIEDataset is stored
  • ignore_warnings (bool) – If True, then all the warnings generated by this class for inconsistencies in the data are ignored
static _form_ann_line(idx: str, char_offset: Tuple[int, int, str], tag_name: str, doc: spacy.tokens.Doc)

Forms an ANN line that can be used to write the ANN files for CoNLL format

Parameters:
  • idx (int) – The index for the entity being written
  • char_offset (Tuple[int, int, str]) – The start, end and tag for the line
  • tag_name (str) – The tag to be used; one of [Task, Process, Material]
  • doc (str) – Spacy doc to query the appropriate characters
Returns:An ANN line that is formed.
Return type:str

_get_annotations_for_entity(file_id: str, entity: str) → List[Dict[str, Any]]
Parameters:
  • file_id (str) – A ScienceIE file id
  • entity (str) – One of [Task, Process, Material]
Returns:A list of annotations where every annotation is a dictionary with the following keys
start
The start character index of the annotation
end
The end character index of the annotation
words
The set of words between the start and the end index
entity_number
The entity number
tag
The tag associated with the set of words
Return type:List[Dict[str, Any]]

_get_bilou_lines_for_entity(text: str, annotations: List[Dict[str, Any]], entity: str) → List[str]

The list of BILOU lines for entity

Parameters:
  • text (str) – The text for which BILOU lines need to be returned
  • annotations (List[Dict[str, Any]]) – The list of annotations where every annotation is a dictionary
  • entity (str) – A particular entity for which the BILOU lines are returned
Returns:The list of BILOU tagged lines, where every line is a word, tag, tag, tag where the tag is decided by the entity.
Return type:List[str]
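
BILOU tagging marks the Beginning, Inside and Last tokens of a multi-token entity, tokens Outside any entity, and Unit-length entities. The sketch below shows the scheme on character-offset annotations. It is a simplification under stated assumptions: it tokenizes on whitespace instead of using spaCy, it treats the end index as exclusive, and it emits one "word tag" pair per line rather than the multi-tag lines described above.

```python
import re
from typing import Dict, List


def bilou_lines(text: str, annotations: List[Dict[str, int]], entity: str) -> List[str]:
    """Return one 'word tag' line per whitespace token of text."""
    # Whitespace tokens together with their character offsets.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for ann in annotations:
        # Indices of the tokens that lie fully inside the annotated span.
        inside = [
            i
            for i, (_, start, end) in enumerate(tokens)
            if start >= ann["start"] and end <= ann["end"]
        ]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = f"U-{entity}"
        else:
            tags[inside[0]] = f"B-{entity}"
            for i in inside[1:-1]:
                tags[i] = f"I-{entity}"
            tags[inside[-1]] = f"L-{entity}"
    return [f"{word} {tag}" for (word, _, _), tag in zip(tokens, tags)]


example_text = "We train a conditional random field on citations"
phrase_start = example_text.index("conditional")
word_start = example_text.index("citations")
example_annotations = [
    {"start": phrase_start, "end": phrase_start + len("conditional random field")},
    {"start": word_start, "end": word_start + len("citations")},
]
tagged = bilou_lines(example_text, example_annotations, "Process")
print("\n".join(tagged))
```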

get_bilou_lines_for_entity(file_id: str, entity: str)

Writes conll file for the entity type

Parameters:
  • file_id (str) – File id of the annotation file
  • entity (str) – The entity for which conll file is written
Returns:The list of BILOU lines for the entity
Return type:List[str]

get_file_ids() → List[str]

Get all the file ids from the folder

Returns:A List of File ids in the folder
Return type:List[str]
get_sentence_wise_bilou_lines(file_id: str, entity_type: str) → List[List[str]]

Get BILOU lines sentence-wise

Parameters:
  • file_id (str) – File id from ScienceIE Dataset
  • entity_type (str) – One of ['Task', 'Process', 'Material']
Returns:A list of sentences where every sentence is composed of its BILOU lines
Return type:List[List[str]]

get_sents(text: str) → List[span.Span]

Returns all the sentences in the text

Parameters:text (str) –
Returns:All the sentences in the text as a spacy span. A spacy span encodes more information within
Return type:List[span.Span]
get_text_from_fileid(file_id: str) → str

Given a file id return the text from the file

Parameters:file_id (str) – A ScienceIE data file id
Returns:Text read from the file
Return type:str
merge_files(task_filename: pathlib.Path, process_filename: pathlib.Path, material_filename: pathlib.Path, out_filename: pathlib.Path)

Merge different files to one conll file

Parameters:
  • task_filename (pathlib.Path) – The CONLL style file having TASK tags
  • process_filename (pathlib.Path) – The CONLL style file having Process tags
  • material_filename (pathlib.Path) – The CONLL style file having Material Tags
  • out_filename (pathlib.Path) – The output file where the different files will be merged and every line will consist of word Task-tag Process-tag Material-tag
write_ann_file_from_conll_file(conll_filepath: pathlib.Path, ann_filepath: pathlib.Path, text: str)
write_bilou_lines(out_filename: pathlib.Path, is_sentence_wise: bool = False)

Writes bilou lines in the out_filename for all the files in self.folderpath. The output file will contain every word on one line with their tag in BILOU format.

You can also opt to write the text sentence-wise. The text, which possibly consists of multiple sentences, is broken down into sentences and then written to the output file. Different sentences are separated by an empty line.

Parameters:
  • out_filename (pathlib.Path) – The output filename where the conll filename is written
  • is_sentence_wise (bool) – You can write the BILOU lines sentence wise. The text in all the ScienceIE files will be broken into sentences, and the sentences will be tagged with BILOU tags

Sciwing TOML Runner

class sciwing.utils.sciwing_toml_runner.SciWingTOMLRunner(toml_filename: pathlib.Path, infer: bool = False)

Bases: object

_form_dag(section_name: str, section: Dict[KT, VT], parent: str)

Forms a DAG of the model section for execution

The model can be a complex structure with various sub-components, where one depends on another, so the order of execution has to be decided. A DAG is a good abstract model for defining the dependence between different modules. This method builds a DAG given the section name and the TOML section that is being parsed, with a directed edge between the parent and the child.

Parameters:
  • section_name (str) – The name of the TOML section being parsed
  • section (Dict) – The details of the actual section
  • parent (str) – The node id of the parent graph
_instantiate_model_using_dag()

This is a key method that instantiates the DAG using topological order

The DAG from the TOML model section should be instantiated such that the submodules of a module are instantiated before the parent module. This method does it using topological sort. Topological sort is the ordering of the nodes of a DAG where, if there is an edge u -> v between two nodes, then u appears before v in the ordering.

We do exactly this for SciWING. We instantiate the children nodes that are used by parent nodes before we can instantiate the root node of the DAG that will represent the entire module.

Returns:The instantiation of the root node
Return type:nn.Module
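
The children-before-parents ordering can be sketched with a depth-first traversal over a parent-to-children mapping. Reading a parent -> child topological order in reverse gives the same kind of ordering. The graph, the node names and the helper function below are hypothetical and do not reflect how SciWING stores its DAG.

```python
from typing import Dict, List


def instantiation_order(children: Dict[str, List[str]], root: str) -> List[str]:
    """Return node names so that every child precedes its parents."""
    order: List[str] = []
    done = set()
    in_progress = set()

    def visit(node: str) -> None:
        if node in done:
            return
        if node in in_progress:
            raise ValueError(f"cycle detected at {node!r}")
        in_progress.add(node)
        for child in children.get(node, []):
            visit(child)
        in_progress.discard(node)
        done.add(node)
        # A node is appended only after all of its children.
        order.append(node)

    visit(root)
    return order


# Hypothetical model: a classifier uses an encoder, which uses an embedder.
example_children = {
    "classifier": ["encoder"],
    "encoder": ["embedder"],
    "embedder": [],
}
order = instantiation_order(example_children, "classifier")

# "Instantiate" each node from its already built children.
instances = {}
for name in order:
    instances[name] = (name, [instances[child] for child in example_children[name]])
print(order)
```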
_parse_toml_file()

Parses the toml file and returns the document

Returns:The dictionary by parsing the toml file
Return type:Dict[str, Any]
parse()

Parses the dataset, model and engine sections of a toml file

parse_dataset_section()

Parse the dataset section of the toml file and instantiate the dataset

Returns:The dataset manager for the experiment
Return type:DatasetManager
parse_engine_section()

Parses the engine section of the TOML file

Returns:Object of the Engine class
Return type:Engine
parse_model_section()

Parses the Model section of the toml file

Returns:A torch module representing the model
Return type:nn.Module
run()

Tensor Utils

sciwing.utils.tensor_utils.get_mask(batch_size: int, max_size: int, lengths: torch.LongTensor)

Returns mask given the lengths tensor. A convenience method

Given a lengths tensor as in

>> torch.LongTensor([3, 1, 2])

which often indicates the original length of the tensor without padding, get_mask() returns a tensor with 1 at positions where there is no padding and 0 where there is padding

Parameters:
  • batch_size (int) – Batch size of the tensors
  • max_size (int) – Maximum size or often Maximum number of time steps
  • lengths (torch.LongTensor) – The original length of the tensors in the batch without padding
Returns:Mask having 1 where there are no paddings and 0 where there are paddings
Return type:torch.LongTensor
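
The masking rule can be illustrated without PyTorch. The sketch below uses plain Python lists in place of tensors, so it shows the logic only and is not a drop-in replacement for get_mask().

```python
from typing import List


def get_mask(batch_size: int, max_size: int, lengths: List[int]) -> List[List[int]]:
    """Return a batch_size x max_size mask: 1 for real positions, 0 for padding."""
    if len(lengths) != batch_size:
        raise ValueError("lengths must contain one entry per batch item")
    return [
        [1 if position < length else 0 for position in range(max_size)]
        for length in lengths
    ]


mask = get_mask(batch_size=3, max_size=4, lengths=[3, 1, 2])
for row in mask:
    print(row)
```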

sciwing.utils.tensor_utils.has_tensor(obj) → bool

Given a possibly complex data structure, check if it has any torch.Tensors in it. From allennlp.nn.util

sciwing.utils.tensor_utils.move_to_device(obj, cuda_device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c11f3df50>)

Given a structure (possibly) containing Tensors on the CPU, move all the Tensors to the specified GPU (or do nothing, if they should be on the CPU). From allennlp.nn.util

NER Terminal Visualizer

class sciwing.utils.vis_seq_tags.VisTagging(colors: List[str] = None, colors_palette: str = None, tags: List[str] = None)

Bases: object

__init__(colors: List[str] = None, colors_palette: str = None, tags: List[str] = None)

Visualize Sequence Tagging

Parameters:
  • colors (List[str]) – The set of colors that will be used for tagging
  • colors_palette (str) – The color palette that should be used. We recommend For more information on color palettes you can refer to the documentation of the python package colorful
  • tags (List[str]) – The set of all labels that can be labelled. If this is not given, then the tags will be inferred using the labels during tagging
visualize_tags_from_json(json_annotation: Dict[str, Any], show_only_entities: List[str] = None)

Visualize the tags from json.

Parameters:
  • json_annotation (Dict[str, Any]) – You can send a json that has the following format {‘text’: str, ‘tags’: [{‘start’:int, ‘end’:int, ‘tag’: str}] }
  • show_only_entities (List[str]) – You can filter to show only these entities.
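
The annotation format can be illustrated with a plain-text renderer that wraps every tagged span in brackets. This is a sketch only: VisTagging colors the spans in the terminal, whereas the hypothetical render_tags() helper below assumes non-overlapping spans with an exclusive end index and marks them with brackets.

```python
from typing import Any, Dict, List, Optional


def render_tags(
    json_annotation: Dict[str, Any],
    show_only_entities: Optional[List[str]] = None,
) -> str:
    """Return the text with every selected span shown as [span](tag)."""
    text = json_annotation["text"]
    tags = sorted(json_annotation["tags"], key=lambda tag: tag["start"])
    pieces = []
    cursor = 0
    for tag in tags:
        label = tag["tag"]
        if show_only_entities is not None and label not in show_only_entities:
            continue
        span = text[tag["start"]:tag["end"]]
        # Copy the untagged text before the span, then the marked span.
        pieces.append(text[cursor:tag["start"]])
        pieces.append(f"[{span}]({label})")
        cursor = tag["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


example_text = "CRF models parse citations"
material_start = example_text.index("citations")
example_annotation = {
    "text": example_text,
    "tags": [
        {"start": 0, "end": 3, "tag": "Process"},
        {"start": material_start, "end": material_start + len("citations"), "tag": "Material"},
    ],
}
rendered_all = render_tags(example_annotation)
rendered_material = render_tags(example_annotation, show_only_entities=["Material"])
print(rendered_all)
print(rendered_material)
```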
visualize_tokens(text: List[str], labels: List[str]) → str

Visualizes sequential tagged data where the string is represented as a set of words and every word has a corresponding label. This can be extended to having different tagging schemes at a later point in time

Parameters:
  • text (List[str]) – String to be tagged represented as a list of strings
  • labels (List[str]) – The labels corresponding to each word in the string
Return type:None