Welcome to SciWING’s documentation!¶
SciWING is a modular and easy-to-extend framework that enables experimentation with modern techniques for Scholarly Document Processing. It makes it simple to add datasets and models, and provides tools to experiment with them.
SciWING is a modern framework from WING-NUS to facilitate Scientific Document Processing. It is built on PyTorch, embraces modularity from the ground up, and offers an easy-to-use interface. SciWING includes many pre-trained models for fundamental tasks in Scientific Document Processing for practitioners. It has the following advantages:
- Modularity - The framework embraces modularity from the ground up. SciWING helps you create new models by combining multiple reusable modules, so you can mix and match modules and experiment with new approaches easily.
- Pre-trained Models - SciWING has many pre-trained models for fundamental tasks like logical section classification for scientific documents and citation string parsing (take a look at some of the other projects related to citation parsing: Parscit, Neural_Parscit). Easy access to the pre-trained models is provided through web APIs.
- Run from Config File - SciWING enables you to declare datasets, models and experiment hyper-parameters in a TOML file. The models declared in a TOML file have a one-to-one correspondence with their respective class declarations in a Python file. SciWING parses the model declaration into a Directed Acyclic Graph and instantiates the model using the DAG’s topological ordering.
- Extensible - SciWING makes it easy to add new datasets and provides command line tools for doing so. It also allows the addition of custom modules, which are plain PyTorch modules.
Usage¶
Installation and Getting Started¶
The first step to using SciWING is to install the package on your local system. Once the package is installed, you can directly access the functionalities of SciWING. SciWING downloads the pre-trained models, embeddings and other information required to run the models on demand.
On this page, we provide basic tutorials on installing SciWING and on its basic usage.
Installation from Pip¶
SciWING currently only supports Python 3.7. The default, and only, way to install SciWING is using pip, the Python package manager. Make sure your pip is associated with Python 3.7 as well. We recommend using virtualenv, which helps keep the development environment clean. To set up a virtual environment, simply run
virtualenv -ppython3.7 .venv
source .venv/bin/activate
To install SciWING, just run
pip install sciwing
This installs all the dependencies required to run SciWING, such as PyTorch.
To upgrade SciWING to the latest version, run
pip install -U sciwing
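To verify that the installation succeeded, you can try importing the package (a quick sanity check that does not download any models)
python -c "import sciwing"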
Building from source¶
- Clone from git
git clone https://github.com/abhinavkashyap/sciwing.git
- cd sciwing
- Install the module in development mode
pip install -e .
- Download spacy models
python -m spacy download en
- Create the directories where SciWING’s data are stored and embeddings/data are downloaded
sciwing develop makedirs
sciwing develop download
This step is optional. It downloads the data and embeddings required for development. If you do not perform this step, they are downloaded later upon first request.
- SciWING uses pytest for testing. You can use the following command to run the tests
pytest tests -n auto --dist=loadfile
The test suite is large, so it will take some time to run. We will work on reducing the test time in future iterations.
Running API Services¶
The APIs are built using FastAPI. We have APIs for citation string parsing, citation intent classification and many other models. To run the APIs, navigate into the api folder of this repository and run
uvicorn api:app --reload
Note
Navigate to http://localhost:8000/docs to access the SwaggerUI. The UI enables you to try the different APIs using a web interface.
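As a rough sketch, the endpoints can also be called programmatically. The route path and parameter name below are assumptions for illustration; confirm the actual ones in the Swagger UI before use.
import requests

# Hypothetical route path -- check http://localhost:8000/docs for the actual one
response = requests.get(
    "http://localhost:8000/parscit/citation",
    params={"citation": "Calzolari, N. (1982) Towards the organization of lexical definitions on a database structure."},
)
# The citation parsing endpoint is documented to return {"tags": ..., "text_tokens": ...}
print(response.json())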
Running the Demos¶
The demos are built using Streamlit and make use of the APIs. Please make sure the APIs are running before starting the demos. Navigate to the app folder and run the demos using Streamlit (installed along with the package). For example, the command below runs all the demos.
Note
The demos download the models and the embeddings if they are not already present, so the first run on your local machine may take time and memory. We have tested this on a 16GB MacBook Pro and it works well. All the demos currently run on CPU and do not make use of a GPU, even when one is present.
streamlit run all_apps.py
Accessing Models¶
SciWING comes with many pre-trained scientific document processing models that are accessible using a few lines of Python code, and it provides a consistent interface for all of its models. You can access these models immediately after installation; the required model parameters, embeddings etc. are downloaded and initialized automatically.
Note
The first access of these models takes time since they need to be downloaded. Allow about 60 seconds for the downloads to complete. Subsequent access of the models is faster.
Citation String Parsing¶
Neural Parscit is a citation parsing model. A citation string contains information like the author, the title of the publication, the conference/journal, the year of publication etc. Neural Parscit extracts such information from references.
from sciwing.models.neural_parscit import NeuralParscit
# instantiate the model
neural_parscit = NeuralParscit()
# Predict on a reference string
neural_parscit.predict_for_text("Calzolari, N. (1982) Towards the organization of lexical definitions on a database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles University, Prague, pp.61-64.")
# Predict on a file - the file should contain one reference string per line
neural_parscit.predict_for_file("/path/to/file")
Tutorials¶
If you have not installed SciWING on your local machine yet, head over to our Installation and Getting Started section first. Here we provide more in-depth tutorials on accessing the pre-trained models, introspecting them, building models step by step, and more.
Examples¶
If you would like to see examples of how SciWING is used to train models for different tasks, the Python code for various tasks is given in the examples folder of our GitHub repo. The instructions to run each example are provided within the example itself.
Pre-trained Models¶
Note
If this is your first time using the package, it takes time to download the pre-trained models. Subsequent access to the models will be faster.
Neural Parscit¶
Neural Parscit is a citation parsing model. A citation string contains information like the author, the title of the publication, the conference/journal, the year of publication etc. Neural Parscit extracts such information from references.
>> from sciwing.models.neural_parscit import NeuralParscit
# instantiate the model
>> neural_parscit = NeuralParscit()
# Predict on a reference string
>> neural_parscit.predict_for_text("Calzolari, N. (1982) Towards the organization of lexical definitions on a database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles University, Prague, pp.61-64.")
# Predict on a file - the file should contain one reference string per line
>> neural_parscit.predict_for_file("/path/to/file")
Citation Intent Classification¶
Identifying the intent behind citing another scholarly document helps in fine-grained analysis of documents. Some citations refer to the methodology in another document, some refer to other works for background knowledge, and some compare and contrast their methods with another work. Citation Intent Classification models classify such intents.
>> from sciwing.models.citation_intent_clf import CitationIntentClassification
# instantiate an object
>> citation_intent_clf = CitationIntentClassification()
# predict the intention of the citation
>> citation_intent_clf.predict_for_text("Abu-Jbara et al. (2013) relied on lexical,structural, and syntactic features and a linear SVMfor classification.")
I2B2 Clinical Notes Tagging¶
Clinical Natural Language Processing helps in identifying salient information from clinical notes. Here, we have trained a neural network model on the i2b2 (Informatics for Integrating Biology and the Bedside) dataset. This dataset has manual annotations for the problems identified and the treatments and tests suggested.
>> from sciwing.models.i2b2 import I2B2NER
>> i2b2ner = I2B2NER()
>> i2b2ner.predict_for_text("Chest x - ray showed no evidency of cardiomegaly")
Extracting Abstracts¶
You can extract abstracts from PDF files, or from a folder containing PDF files.
>> from sciwing.models.sectlabel import SectLabel
>> sectlabel = SectLabel()
# extract abstract for file
>> sectlabel.extract_abstract_for_file("/path/to/pdf/file")
# extract abstract for all the files in the folder
>> sectlabel.extract_abstract_for_folder("/path/to/folder")
Identifying Different Logical Sections¶
Identifying the different logical sections of a document is a fundamental task in scientific document processing. The SectLabel model of SciWING is used to obtain information about the different sections of a research article. SectLabel can label every line of the document with one of many different labels like title, author, bodyText etc., which can then be used for many other downstream applications.
>> from sciwing.models.sectlabel import SectLabel
>> sectlabel = SectLabel()
# label all the lines in a document
>> sectlabel.predict_for_file("/path/to/pdf")
You can also get the abstract, the section headers and the embedded references in the document using the same model, as follows
>> from sciwing.models.sectlabel import SectLabel
>> sectlabel = SectLabel()
>> sectlabel.predict_for_file("/path/to/pdf")
>> info = sectlabel.extract_all_info("/path/to/pdf")
>> abstract = info["abstract"]
>> section_headers = info["section_headers"]
>> references = info["references"]
Normalising Section Headers¶
Different research papers use different section headers. However, in order to identify the logical flow of a research paper, it is helpful to normalize the different section headers to a pre-defined set of headers. This model performs such classification.
>> from sciwing.models.generic_sect import GenericSect
>> generic_sect = GenericSect()
>> generic_sect.predict_for_text("experiments and results")
# output: evaluation
Interacting with Models¶
SciWING allows you to interact with pre-trained models even without writing code. You can interact with all the pre-trained models using the command line application. Upon installation, the command sciwing is available to the user. One of its sub-commands is the interact command. Let us see an example
sciwing interact neural-parscit
This will run the inference of the best model on test data and prepare the model for interaction.
Note
The inference time again depends on whether you have a CPU or GPU. By default, we assume that you are running the model on a CPU.
1. See-Confusion-Matrix
2. See-examples-of-Classifications
3. See-prf-table
4. Enter text
1. The first option shows the confusion matrix for the different classes of Neural ParsCit.
2. The second option shows examples where one class is misclassified as another. For example, enter 4 5 to see examples where tags belonging to class 4 are misclassified as class 5.
3. The third option shows the precision, recall and F-measure for the test dataset, along with the macro and micro F-scores.
4. The fourth option lets you enter a reference string and see the results.
PDF Pipelines¶
Note
Under construction: this will allow you to provide the path to a PDF file and extract all of its information, including the abstract, title, authors, section headers, normalized section headers, embedded references, parses of the references etc.
Package Documentation¶
sciwing.api¶
sciwing.api.routers¶
citation_intent_clf¶
-
sciwing.api.routers.citation_intent_clf.
classify_citation_intent
(citation: str)¶ Endpoint to classify a citation intent into
`Background`, `Method`, `Result Comparison`
Parameters: citation (str) – String containing the citation to another work Returns: {"tags": Predicted tag for the citation, "citation": the citation itself}
Return type: JSON
i2b2¶
Tags the text that you send according to the i2b2 model with
problem, treatment and tests
Parameters: text (str) – The text to be tagged Returns: {tags: Predicted tags, text_tokens: Tokens in the text }
Return type: JSON
parscit¶
-
sciwing.api.routers.parscit.
tag_citation_string
(citation: str)¶ Endpoint to parse a reference string into its constituent parts.
Parameters: citation (str) – The reference string to be parsed. Returns: {"tags": Predicted tags, "text_tokens": Tokenized citation string}
Return type: JSON
pdf_pipeline¶
-
sciwing.api.routers.pdf_pipeline.
pdf_pipeline
(file: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2c31550> = <sphinx.ext.autodoc.importer._MockObject object>)¶ Parses the file and returns various analytics about the pdf
Parameters: file (File) – A File stream Returns: Returns a JSON where the key can be a section in the document with value as the text of the document. It can also be other information such as parsed reference strings in the document, or normalised section headers of the document. This is a feature in development. Be careful in using this. Return type: JSON
sectlabel¶
-
sciwing.api.routers.sectlabel.
extract_pdf
(file: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2b12a90> = <sphinx.ext.autodoc.importer._MockObject object>)¶ Extracts the abstract from a scholarly article
Parameters: file (uploadFile) – Byte Stream of a file uploaded. Returns: {"abstract": The abstract found in the scholarly document}
Return type: JSON
-
sciwing.api.routers.sectlabel.
process_pdf
(file: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2b12a90> = <sphinx.ext.autodoc.importer._MockObject object>)¶ Classifies every line in the document to the logical section of the document. The logical section can be title, author, email, section header, subsection header etc
Parameters: file (File) – The Bytestream of a file to be uploaded Returns: {"labels": [(line, label)]}
Return type: JSON
sciwing.cli¶
sciwing_interact¶
-
class
sciwing.cli.sciwing_interact.
SciWINGInteract
(infer_client: sciwing.infer.interface_client_base.BaseInterfaceClient)¶ Bases:
object
This cli helps in interacting with different models of sciwing
-
interact
()¶ Interact with the user to explore different models
This method provides various options for exploration of the different models.
See-Confusion-Matrix shows the confusion matrix on the test dataset.
See-Examples-of-Classification explores correct classifications and mis-classifications. You can provide two class numbers, as in 2 3, and it shows examples in the test dataset where text belonging to class 2 is classified as class 3.
See-prf-table shows the precision, recall and fmeasure per class.
See-text - manually enter text and look at the classification results.
-
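A minimal usage sketch; in practice the sciwing interact command constructs the infer_client for you, so building one by hand as assumed below is only illustrative.
from sciwing.cli.sciwing_interact import SciWINGInteract

# `my_infer_client` is assumed to be an instance of a BaseInterfaceClient subclass
cli = SciWINGInteract(infer_client=my_infer_client)
cli.interact()  # opens the interactive menu shown above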
s3_mv_cli¶
-
class
sciwing.cli.s3_mv_cli.
S3OutputMove
(foldername: str)¶ Bases:
object
-
__init__
(foldername: str)¶ Provides an interactive way to move some folders to s3
Parameters: foldername (str) – The folder name which will be moved to S3 bucket
-
static
ask_deletion
() → str¶ Since this is deletion, we want confirmation, just to be sure whether to keep the deleted folder locally or to remove it
Returns: A yes or no answer to the question Return type: str
-
get_folder_choice
()¶ Goes through the folder and gets the choice on which folder should be moved
Returns: The folder which is chosen to be moved Return type: str
-
interact
()¶ Interacts with the user by providing various options
-
sciwing.commands¶
validators¶
Utility functions for validation.
-
sciwing.commands.validators.
is_file_exist
(name: str)¶ Indicates whether file name exists or not
Parameters: name (str) – String representing filename Returns: True when filename indicated by name exists, False otherwise Return type: bool
-
sciwing.commands.validators.
is_valid_python_classname
(name: str)¶ Indicates whether name is a valid Python identifier
Parameters: name (str) – A string representing a class name Returns: True when name is a valid python identifier, False otherwise Return type: bool
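A small usage sketch of the validators; the expected return values in the comments follow from the documented behaviour.
from sciwing.commands.validators import is_file_exist, is_valid_python_classname

is_valid_python_classname("MyDataset")      # True - a valid Python identifier
is_valid_python_classname("2fast2furious")  # False - identifiers cannot start with a digit
is_file_exist("/path/that/does/not/exist")  # False when the file is absent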
sciwing.datasets¶
sciwing.datasets.classification¶
base_text_classification¶
-
class
sciwing.datasets.classification.base_text_classification.
BaseTextClassification
(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])¶ Bases:
object
-
__init__
(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])¶ Base Text Classification Dataset to be inherited by all text classification datasets
Parameters: - filename (str) – Full path of the filename where classification dataset is stored
- tokenizers (Dict[str, BaseTokenizer]) – The mapping between namespace and a tokenizer
-
get_lines_labels
() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.label.Label])¶ A list of lines from the file and a list of corresponding labels
This method is to be implemented by a new dataset. The decision on the implementation logic is left to the new class. Datasets come in all shapes and sizes.
Returns: Returns a list of text examples and corresponding labels Return type: (List[str], List[str])
-
text_classification_dataset¶
-
class
sciwing.datasets.classification.text_classification_dataset.
TextClassificationDataset
(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = <sciwing.tokenizers.word_tokenizer.WordTokenizer object>)¶ Bases:
sciwing.datasets.classification.base_text_classification.BaseTextClassification
,sphinx.ext.autodoc.importer._MockObject
This represents a dataset that is of the form
line1###label1
line2###label2
line3###label3 . . .
-
get_lines_labels
() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.label.Label])¶ A list of lines from the file and a list of corresponding labels
This method is to be implemented by a new dataset. The decision on the implementation logic is left to the new class. Datasets come in all shapes and sizes.
Returns: Returns a list of text examples and corresponding labels Return type: (List[str], List[str])
-
labels
¶
-
lines
¶
-
-
class
sciwing.datasets.classification.text_classification_dataset.
TextClassificationDatasetManager
(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)¶ Bases:
sciwing.data.datasets_manager.DatasetsManager
,sciwing.utils.class_nursery.ClassNursery
-
__init__
(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)¶ Parameters: - train_filename (str) – The path where the train file is stored
- dev_filename (str) – The path where the dev file is stored
- test_filename (str) – The path where the test file is stored
- tokenizers (Dict[str, BaseTokenizer]) – A mapping from namespace to the tokenizer
- namespace_vocab_options (Dict[str, Dict[str, Any]]) – A mapping from the name to options
- namespace_numericalizer_map (Dict[str, BaseNumericalizer]) – Every namespace can have a different numericalizer specified
- batch_size (int) – The batch size of the data returned
-
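A minimal sketch of setting up the dataset manager. The file paths are hypothetical and each file is assumed to follow the line###label format shown above; all other arguments keep their defaults.
from sciwing.datasets.classification.text_classification_dataset import TextClassificationDatasetManager

data_manager = TextClassificationDatasetManager(
    train_filename="data/train.txt",
    dev_filename="data/dev.txt",
    test_filename="data/test.txt",
    batch_size=32,
)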
sciwing.datasets.seq_labeling¶
base_seq_labeling¶
-
class
sciwing.datasets.seq_labeling.base_seq_labeling.
BaseSeqLabelingDataset
(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])¶ Bases:
object
-
__init__
(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])¶ Base Sequence Labeling Dataset to be inherited by all sequence labeling datasets
Parameters: - filename (str) – Path of the file where the sequence labeling dataset is stored. Ideally this should have an example text and label separated by a space, but it is left to the specific dataset to handle the different ways in which the file could be structured
- tokenizers (Dict[str, BaseTokenizer]) –
-
get_lines_labels
() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.seq_label.SeqLabel])¶ A list of lines from the file and a list of corresponding labels
This method is to be implemented by a new dataset. The decision on the implementation logic is left to the new class. Datasets come in all shapes and sizes.
Returns: Returns a list of text examples and corresponding labels Return type: (List[str], List[str])
-
seq_labelling_dataset¶
-
class
sciwing.datasets.seq_labeling.seq_labelling_dataset.
SeqLabellingDataset
(filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer])¶ Bases:
sciwing.datasets.seq_labeling.base_seq_labeling.BaseSeqLabelingDataset
,sphinx.ext.autodoc.importer._MockObject
This represents a dataset that is of the form
word1###label1 word2###label2 word3###label3
word1###label1 word2###label2 word3###label3
word1###label1 word2###label2 word3###label3
.
.
.
-
get_lines_labels
() -> (typing.List[sciwing.data.line.Line], typing.List[sciwing.data.seq_label.SeqLabel])¶ A list of lines from the file and a list of corresponding labels
This method is to be implemented by a new dataset. The decision on the implementation logic is left to the new class. Datasets come in all shapes and sizes.
Returns: Returns a list of text examples and corresponding labels Return type: (List[str], List[str])
-
-
class
sciwing.datasets.seq_labeling.seq_labelling_dataset.
SeqLabellingDatasetManager
(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)¶ Bases:
sciwing.data.datasets_manager.DatasetsManager
-
__init__
(train_filename: str, dev_filename: str, test_filename: str, tokenizers: Dict[str, sciwing.tokenizers.BaseTokenizer.BaseTokenizer] = None, namespace_vocab_options: Dict[str, Dict[str, Any]] = None, namespace_numericalizer_map: Dict[str, sciwing.numericalizers.base_numericalizer.BaseNumericalizer] = None, batch_size: int = 10)¶ Parameters: - train_filename (str) – The path where the train file is stored
- dev_filename (str) – The path where the dev file is stored
- test_filename (str) – The path where the test file is stored
- tokenizers (Dict[str, BaseTokenizer]) – A mapping from namespace to the tokenizer
- namespace_vocab_options (Dict[str, Dict[str, Any]]) – A mapping from the name to options
- namespace_numericalizer_map (Dict[str, BaseNumericalizer]) – Every namespace can have a different numericalizer specified
- batch_size (int) – The batch size of the data returned
-
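Analogously, a minimal sketch for sequence labelling data; the file paths are hypothetical and each file is assumed to follow the word###label format shown above.
from sciwing.datasets.seq_labeling.seq_labelling_dataset import SeqLabellingDatasetManager

data_manager = SeqLabellingDatasetManager(
    train_filename="data/train.conll.txt",
    dev_filename="data/dev.conll.txt",
    test_filename="data/test.conll.txt",
    batch_size=16,
)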
sciwing.engine¶
engine¶
-
class
sciwing.engine.engine.
Engine
(model: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f21a4ad0>, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, optimizer: sphinx.ext.autodoc.importer.<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f07cafd0>, batch_size: int, save_dir: str, num_epochs: int, save_every: int, log_train_metrics_every: int, train_metric: sciwing.metrics.BaseMetric.BaseMetric, validation_metric: sciwing.metrics.BaseMetric.BaseMetric, test_metric: sciwing.metrics.BaseMetric.BaseMetric, experiment_name: Optional[str] = None, experiment_hyperparams: Optional[Dict[str, Any]] = None, tensorboard_logdir: str = None, track_for_best: str = 'loss', collate_fn=<class 'list'>, device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f22bb5d0>, str] = <sphinx.ext.autodoc.importer._MockObject object>, gradient_norm_clip_value: Optional[float] = 5.0, lr_scheduler: Optional[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f22bb410>] = None, use_wandb: bool = False, sample_proportion: float = 1.0, seeds: Dict[str, int] = None)¶ Bases:
sciwing.utils.class_nursery.ClassNursery
-
__init__
(model: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f21a4ad0>, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, optimizer: sphinx.ext.autodoc.importer.<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f07caed0>, batch_size: int, save_dir: str, num_epochs: int, save_every: int, log_train_metrics_every: int, train_metric: sciwing.metrics.BaseMetric.BaseMetric, validation_metric: sciwing.metrics.BaseMetric.BaseMetric, test_metric: sciwing.metrics.BaseMetric.BaseMetric, experiment_name: Optional[str] = None, experiment_hyperparams: Optional[Dict[str, Any]] = None, tensorboard_logdir: str = None, track_for_best: str = 'loss', collate_fn=<class 'list'>, device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f22bb5d0>, str] = <sphinx.ext.autodoc.importer._MockObject object>, gradient_norm_clip_value: Optional[float] = 5.0, lr_scheduler: Optional[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f22bb410>] = None, use_wandb: bool = False, sample_proportion: float = 1.0, seeds: Dict[str, int] = None)¶ Engine runs the models end to end. It iterates through the train dataset and passes it through the model. During training it helps in tracking a lot of parameters for the run and saving the parameters. It also reports validation and test parameters from time to time. Many utilities required for end-end running of the model is here.
Parameters: - model (nn.Module) – A pytorch module defining a model to be run
- datasets_manager (DatasetsManager) – A datasets manager that handles all the different datasets
- optimizer (torch.optim) – Any Optimizer object instantiated using torch.optim
- batch_size (int) – Batch size for the dataset. The same batch size is used for the train, valid and test datasets
- save_dir (str) – The experiments are saved in save_dir. We save checkpoints, the best model, logs and other information into the save dir
- num_epochs (int) – The number of epochs to run the training
- save_every (int) – The model will be checkpointed every save_every number of iterations
- log_train_metrics_every (int) – The train metrics will be reported every log_train_metrics_every iterations during training
- train_metric (BaseMetric) – Anything that is an instance of BaseMetric for calculating training metrics
- validation_metric (BaseMetric) – Anything that is an instance of BaseMetric for calculating validation metrics
- test_metric (BaseMetric) – Anything that is an instance of BaseMetric for calculating test metrics
- experiment_name (str) – The experiment should be given a name for ease of tracking. If an experiment name is not given, we generate a unique 10-digit sha for the experiment.
- experiment_hyperparams (Dict[str, Any]) – This is mostly used for tracking the different hyper-parameters of the experiment being run. This may be used by wandb to save the hyper-parameters
- tensorboard_logdir (str) – The directory where all the tensorboard runs are stored. If None is passed, it defaults to the tensorboard default of storing the logs in the current directory.
- track_for_best (str) – Which metric should be tracked for deciding the best model. Anything that the metric emits as a single value can be used for tracking. The default value is loss. If it is loss, the best value will be the lowest one. For some other metrics like macro_fscore, the best value might be the highest one.
- collate_fn (Callable[[List[Any]], List[Any]]) – Collates the different examples into a single batch of examples. This is the same terminology adopted from pytorch; there is no difference.
- device (torch.device) – The device on which the model will be placed. If this is "cpu", then the model and the tensors will all be on cpu. If this is "cuda:0", then the model and the tensors will be placed on cuda device 0. You can mention any other cuda device that is suitable for your environment
- gradient_norm_clip_value (float) – To avoid gradient explosion, the gradients will be clipped if the gradient norm exceeds this value
- lr_scheduler (torch.optim.lr_scheduler) – Any pytorch lr_scheduler can be used for reducing the learning rate if the performance on the validation set degrades.
- use_wandb (bool) – wandb, or Weights and Biases, is a tool used to track experiments online. SciWING comes with inbuilt functionality to track experiments on Weights and Biases
- seeds (Dict[str, int]) – The dict of seeds to be set. Sets the random_seed, pytorch_seed and numpy_seed. Found in https://github.com/allenai/allennlp/blob/master/allennlp/common/util.py
-
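A minimal construction sketch based on the parameters documented above. The model, data_manager, optimizer and metric objects are assumed to be defined elsewhere; only the required arguments and one optional argument are shown.
from sciwing.engine.engine import Engine

engine = Engine(
    model=model,                          # any pytorch nn.Module
    datasets_manager=data_manager,        # a DatasetsManager instance
    optimizer=optimizer,                  # e.g. an optimizer from torch.optim
    batch_size=32,
    save_dir="./output/my_experiment",
    num_epochs=10,
    save_every=1,
    log_train_metrics_every=50,
    train_metric=train_metric,            # any BaseMetric instance
    validation_metric=validation_metric,
    test_metric=test_metric,
    track_for_best="macro_fscore",        # optional; defaults to "loss"
)
engine.run()  # runs the experiment end to end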
static
get_iter
(loader: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f21a4110>) → Iterator[T_co]¶ Returns the iterator for a pytorch data loader.
The loader is a pytorch DataLoader that iterates over the dataset in batches and employs many strategies to do so. We want an iterator that returns the dataset in batches. The end of the iterator signifies the end of an epoch, and we can use that information to perform house-keeping.
Parameters: loader (DataLoader) – a pytorch data loader Returns: An iterator over the data loader Return type: Iterator
-
get_loader
(dataset: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f21a40d0>) → <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f21a4110>¶ Returns the DataLoader for the Dataset
Parameters: dataset (Dataset) – Returns: A pytorch DataLoader Return type: DataLoader
-
get_test_dataset
()¶ Returns the test dataset of the experiment
Returns: Anything that conforms to the pytorch style dataset. Return type: Dataset
-
get_train_dataset
()¶ Returns the train dataset of the experiment
Returns: Anything that conforms to the pytorch style dataset. Return type: Dataset
-
get_validation_dataset
()¶ Returns the validation dataset of the experiment
Returns: Anything that conforms to the pytorch style dataset. Return type: Dataset
-
is_best_higher
(current_best=None)¶ Returns
True
if the current value of the metric is HIGHER than the best metric. This is useful for tracking metrics like FSCORE where, higher the value, the better it isParameters: current_best (float) – The current value for the metric that is being tracked Returns: Return type: bool
-
is_best_lower
(current_best=None)¶ Returns True if the current value of the metric is lower than the best metric. This is useful for tracking metrics like loss where, lower the value, the better it is
Parameters: current_best (float) – The current value for the metric that is being tracked Returns: Return type: bool
-
load_model_from_file
(filename: str)¶
-
run
()¶ Run the engine
-
set_best_track_value
(current_best=None)¶ Set the best value of the value being tracked
Parameters: current_best (float) – The current value that is best
-
test_epoch
(epoch_num: int)¶ Runs the test epoch for epoch_num
Loads the best model that is saved during the training and runs the test dataset.
Parameters: epoch_num (int) – zero based epoch number for which the test dataset is run This is after the last training epoch.
-
test_epoch_end
(epoch_num: int)¶ Performs house-keeping at the end of the test epoch
It reports the metric that is being traced at the end of the test epoch
Parameters: epoch_num (int) – Epoch num after which the test dataset is run
-
train_epoch
(epoch_num: int)¶ Run the training for one epoch
Parameters: epoch_num (int) – The current epoch number
-
train_epoch_end
(epoch_num: int)¶ Performs house-keeping at the end of a training epoch
At the end of the training epoch, it does some house-keeping. It reports the average loss, the average metric and other information.
Parameters: epoch_num (int) – The current epoch number (0 based)
-
validation_epoch
(epoch_num: int)¶ Runs one validation epoch on the validation dataset
Parameters: epoch_num (int) – The epoch number (0-based)
-
validation_epoch_end
(epoch_num: int)¶ Performs house-keeping at the end of validation epoch
Parameters: epoch_num (int) – The current epoch number
-
sciwing.infer¶
sciwing.infer.classification¶
BaseClassificationInference¶
-
class
sciwing.infer.classification.BaseClassificationInference.
BaseClassificationInference
(model: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3043590>, model_filepath: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3043250>, None] = <sphinx.ext.autodoc.importer._MockObject object>)¶ Bases:
object
Abstract Base Class for Classification Inference. The BaseClassificationInference class provides a skeleton for concrete classes that want to perform inference for a text classification task.
-
__init__
(model: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3043590>, model_filepath: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3043250>, None] = <sphinx.ext.autodoc.importer._MockObject object>)¶ Parameters: - model (nn.Module) – A pytorch module
- model_filepath (str) – The path where the parameters for the best model are stored. This is usually the best_model.pt file in an experiment directory
- datasets_manager (DatasetsManager) – Any dataset that conforms to the pytorch Dataset specification
- device (Optional[Union[str, torch.device]]) – This is either a string like cpu or cuda:0, or a torch.device object
-
get_misclassified_sentences
(true_label_idx: int, pred_label_idx: int) → List[str]¶
-
get_true_label_indices_names
(labels: List[sciwing.data.label.Label]) -> (typing.List[int], typing.List[str])¶ Given a list of labels, it returns the indices and the names of the labels
Parameters: labels (Dict[str, Any]) – iter_dict
returned by a datasetReturns: List of integers that represent the true class List of strings that represent the true class Return type: (List[int], List[str])
-
infer_batch
(lines: List[sciwing.data.line.Line])¶
-
load_model
()¶ Loads the best_model from the model_filepath.
-
model_forward_on_lines
(lines: List[sciwing.data.line.Line])¶ Perform the model forward pass given an
iter_dict
Parameters: lines (List[Line]) –
-
model_output_dict_to_prediction_indices_names
(model_output_dict: Dict[str, Any]) -> (typing.List[int], typing.List[str])¶ Given an
model_output_dict
, it returns the predicted class indices and namesParameters: model_output_dict (Dict[str, Any]) – output dictionary from a model Returns: List of integers that represent the predicted class List of strings that represent the predicted class Return type: (List[int], List[str])
-
on_user_input
(line: sciwing.data.line.Line)¶
-
print_confusion_matrix
()¶
-
report_metrics
()¶ Reports the metrics for returning the dataset
-
run_inference
() → Dict[str, Any]¶ Should Run inference on the test dataset
This method should run the model through the test dataset. It should perform inference and collect the appropriate metrics and data that is necessary for further use
Returns: Returns Return type: Dict[str, Any]
-
run_test
()¶
-
Classification Inference¶
-
class
sciwing.infer.classification.classification_inference.
ClassificationInference
(model: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3043210>, model_filepath: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, tokens_namespace: str = 'tokens', normalized_probs_namespace: str = 'normalized_probs', device: str = 'cpu')¶ Bases:
sciwing.infer.classification.BaseClassificationInference.BaseClassificationInference
The sciwing engine runs the test lines through the classifier and returns the predictions/probabilities for the different classes. At a later point in time, this method should be able to take any collection of lines (possibly from a file) and produce the output.
This class also helps in performing various interactions with the results on the test dataset. Some features are 1) Show confusion matrix 2) Investigate a particular example in the test dataset 3) Get instances that were classified as 2 when their true label is 1 and others
All it needs is the configuration file stored under every experiment to have a vocab already stored in the experiment folder
-
generate_report_for_paper
()¶ Generates just the fscore to be used in reporting on print
-
get_misclassified_sentences
(true_label_idx: int, pred_label_idx: int)¶ Returns the sentences where the true label index was misclassified as the predicted label index
Parameters: - true_label_idx (int) – The label index of the true class name
- pred_label_idx (int) – The label index of the predicted class name
Returns: A list of strings where the true class is classified as pred class.
Return type: List[str]
-
get_true_label_indices_names
(labels: List[sciwing.data.label.Label]) -> (typing.List[int], typing.List[str])¶ Given a list of labels, it returns the indices and the names of the labels
Parameters: labels (Dict[str, Any]) – iter_dict
returned by a datasetReturns: List of integers that represent the true class List of strings that represent the true class Return type: (List[int], List[str])
-
infer_batch
(lines: List[str]) → List[str]¶ Runs inference on a batch of lines. This method can be used for applications. When APIs are being developed to serve over the web, or when terminal applications are being written to read from files and infer, this method comes in handy
Parameters: lines (List[str]) – List of text spans to be inferred Returns: Returns the class names for all the sentences in the input Return type: List[str]
-
model_forward_on_lines
(lines: List[sciwing.data.line.Line])¶ Perform the model forward pass given an
iter_dict
Parameters: lines (List[Line]) –
-
model_output_dict_to_prediction_indices_names
(model_output_dict: Dict[str, Any]) -> (typing.List[int], typing.List[str])¶ Given an
model_output_dict
, it returns the predicted class indices and namesParameters: model_output_dict (Dict[str, Any]) – output dictionary from a model Returns: List of integers that represent the predicted class List of strings that represent the predicted class Return type: (List[int], List[str])
-
on_user_input
(line: str) → str¶ Runs the inference when the user inputs a single sentence either on the terminal or some other application
Parameters: line (str) – The line entered by the user Returns: The class label that is inferred for the user input Return type: str
-
print_confusion_matrix
() → None¶ Prints the confusion matrix for the test dataset
-
report_metrics
()¶ Reports the metrics for returning the dataset
-
run_inference
() → Dict[str, Any]¶ Should Run inference on the test dataset
This method should run the model through the test dataset. It should perform inference and collect the appropriate metrics and data that is necessary for further use
Returns: Returns Return type: Dict[str, Any]
-
run_test
()¶ Runs inference and reports test metrics
-
sciwing.infer.seq_label_inference¶
BaseSeqLabelInference¶
-
class
sciwing.infer.seq_label_inference.BaseSeqLabelInference.
BaseSeqLabelInference
(model: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2c04fd0>, model_filepath: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2c04810>, None] = <sphinx.ext.autodoc.importer._MockObject object>)¶ Bases:
object
Abstract Base Class for Sequence Labeling Inference. The BaseSeqLabelInference class provides a skeleton for concrete classes that want to perform inference for a sequence labeling task.
-
__init__
(model: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2c04fd0>, model_filepath: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2c04810>, None] = <sphinx.ext.autodoc.importer._MockObject object>)¶ Parameters: - model (nn.Module) – A pytorch module
- model_filepath (str) – The path where the parameters for the best model are stored. This is usually the best_model.pt file in an experiment directory
- datasets_manager (DatasetsManager) – Any dataset that conforms to the pytorch Dataset specification
- device (Optional[Union[str, torch.device]]) – This is either a string like cpu or cuda:0, or a torch.device object
-
get_misclassified_sentences
(true_label_idx: int, pred_label_idx: int) → List[str]¶
-
get_true_label_indices_names
(labels: List[sciwing.data.seq_label.SeqLabel]) -> (typing.Dict[str, typing.List[int]], typing.Dict[str, typing.List[str]])¶ Given a list of labels, it returns the indices and the names of the labels
Parameters: labels (Dict[str, Any]) – iter_dict
returned by a datasetReturns: A mapping between a label namespace and List of integers that represent the true class A mapping between a label namespace and a List of strings that represent the true class Return type: (Dict[str, List[int]], Dict[str, List[str]])
-
infer_batch
(lines: Union[List[sciwing.data.line.Line], List[str]]) → Dict[str, List[str]]¶
-
load_model
()¶ Loads the best_model from the model_filepath.
-
model_forward_on_lines
(lines: List[sciwing.data.line.Line])¶ Perform the model forward pass given an
iter_dict
Parameters: lines (List[Line]) – iter_dict
returned by a dataset
-
model_output_dict_to_prediction_indices_names
(model_output_dict: Dict[str, Any]) -> (typing.List[int], typing.List[str])¶ Given an
model_output_dict
, it returns the predicted class indices and namesParameters: model_output_dict (Dict[str, Any]) – output dictionary from a model Returns: List of integers that represent the predicted class List of strings that represent the predicted class Return type: (List[int], List[str])
-
on_user_input
(line: Union[sciwing.data.line.Line, str]) → Dict[str, List[str]]¶
-
print_confusion_matrix
()¶
-
report_metrics
()¶ Reports the metrics for returning the dataset
-
run_inference
()¶ Should Run inference on the test dataset
This method should run the model through the test dataset. It should perform inference and collect the appropriate metrics and data that is necessary for further use
Returns: Returns Return type: Dict[str, Any]
-
run_test
()¶
-
CONLL Inference¶
-
class
sciwing.infer.seq_label_inference.conll_inference.
Conll2003Inference
(model: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f063d850>, model_filepath: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f063dd10>, None] = <sphinx.ext.autodoc.importer._MockObject object>, predicted_tags_namespace_prefix: str = 'predicted_tags')¶ Bases:
sciwing.infer.seq_label_inference.seq_label_inference.SequenceLabellingInference
-
generate_predictions_for
(task: str, test_filename: str, output_filename: str)¶ Parameters: - task (str) – Can be one of pos, dep or ner. The task for which the predictions are made using the current model
- test_filename (str) – This is the eng.testb of the CoNLL 2003 dataset
- output_filename (str) – The file where you want to store predictions
Returns: - None – Writes the predictions to the output_filename
- The output file is meant to be used with conlleval.perl script
- ./conlleval < output_filename
- The file expects the correct tag and the predicted tag to be in the last
- two columns in that order
- The first column is the token for which the prediction is made
-
SeqLabel Inference¶
-
class
sciwing.infer.seq_label_inference.seq_label_inference.
SequenceLabellingInference
(model: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2c0bb10>, model_filepath: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2c0b7d0>, None] = <sphinx.ext.autodoc.importer._MockObject object>, predicted_tags_namespace_prefix: str = 'predicted_tags')¶ Bases:
sciwing.infer.seq_label_inference.BaseSeqLabelInference.BaseSeqLabelInference
-
generate_scienceie_prediction_folder
(dev_folder: pathlib.Path, pred_folder: pathlib.Path)¶ Generates the predicted folder for the dataset in the test folder for ScienceIE. This is very specific to ScienceIE. Not meant to use with other tasks
ScienceIE is a SemEval Task that needs the files to be written into a folder and it reports metrics by reading files from that folder. This method generates the predicted folder given the dev folder
Parameters: - dev_folder (pathlib.Path) – The path where the dev files are present
- pred_folder (pathlib.Path) – The path where the predicted files will be written
-
get_misclassified_sentences
(true_label_idx: int, pred_label_idx: int)¶
-
get_true_label_indices_names
(labels: List[sciwing.data.seq_label.SeqLabel]) -> (typing.Dict[str, typing.List[int]], typing.Dict[str, typing.List[str]])¶ Given a list of labels, it returns the indices and the names of the labels
Parameters: labels (Dict[str, Any]) – iter_dict
returned by a datasetReturns: A mapping between a label namespace and List of integers that represent the true class A mapping between a label namespace and a List of strings that represent the true class Return type: (Dict[str, List[int]], Dict[str, List[str]])
-
infer_batch
(lines: Union[List[sciwing.data.line.Line], List[str]]) → Dict[str, List[str]]¶
-
model_forward_on_lines
(lines: List[sciwing.data.line.Line])¶ Perform the model forward pass given an
iter_dict
Parameters: lines (List[Line]) – iter_dict
returned by a dataset
-
model_output_dict_to_prediction_indices_names
(model_output_dict: Dict[str, Any]) -> (typing.Dict[str, typing.List[int]], typing.Dict[str, typing.List[str]])¶ Given an
model_output_dict
, it returns the predicted class indices and namesParameters: model_output_dict (Dict[str, Any]) – output dictionary from a model Returns: List of integers that represent the predicted class List of strings that represent the predicted class Return type: (List[int], List[str])
-
on_user_input
(line: Union[sciwing.data.line.Line, str]) → Dict[str, List[str]]¶
-
print_confusion_matrix
()¶ Prints the confusion matrix for the entire dataset Return type: None
-
report_metrics
()¶ Reports the metrics for returning the dataset
-
run_inference
()¶ Should Run inference on the test dataset
This method should run the model through the test dataset. It should perform inference and collect the appropriate metrics and data that is necessary for further use
Returns: Returns Return type: Dict[str, Any]
-
run_test
()¶
-
sciwing.meters¶
loss_meter¶
-
class
sciwing.meters.loss_meter.
LossMeter
¶ Bases:
object
-
add_loss
(avg_batch_loss: float, num_instances: int) → None¶ Adds the average batch loss and the number of instances in that batch to the meter
Parameters: - avg_batch_loss (float) – Average batch loss
- num_instances (int) – Number of instances from the batch
-
get_average
() → float¶ Returns the average loss over all the batches at this point in time
Returns: Average loss Return type: float
-
reset
()¶ Resets all the losses and batch sizes that are accumulated
-
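A short usage sketch based on the methods documented above.
from sciwing.meters.loss_meter import LossMeter

meter = LossMeter()
meter.add_loss(avg_batch_loss=0.53, num_instances=32)  # first batch
meter.add_loss(avg_batch_loss=0.41, num_instances=32)  # second batch
print(meter.get_average())  # average loss over all instances seen so far
meter.reset()               # clear the accumulated losses for a new epoch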
sciwing.metrics¶
BaseMetric¶
-
class
sciwing.metrics.BaseMetric.
BaseMetric
(datasets_manager: sciwing.data.datasets_manager.DatasetsManager)¶ Bases:
object
-
calc_metric
(lines: List[sciwing.data.line.Line], labels: List[sciwing.data.label.Label], model_forward_dict: Dict[str, Any]) → None¶ Calculates the metric using the lines and labels returned by any dataset and
model_forward_dict
of a model. This is usually called for a batch of inputs and a forward pass. The state of the different metrics should be retained by the metric across an epoch, before the reset
method is called and all the metric-related data is reset for a new epoch. Parameters: - lines (List[Line]) –
- labels (List[Label]) –
- model_forward_dict (Dict[str, Any]) –
-
get_metric
() → Dict[str, Any]¶ Returns the value of different metrics being tracked
Return anything that is being tracked by the metric. Return it as a dictionary that can be used by outside method for reporting purposes or repurposing it for the sake of reporting
Returns: Metric/values being tracked by the metric Return type: Dict[str, Any]
-
report_metrics
(report_type: str = None) → Any¶ A method to report the tracked metrics in a suitable form
Parameters: report_type (str) – The type of report that will be returned by the method Returns: This method can return any suitable format for reporting. If it is ought to be printed, return a suitable string. If the report needs to be saved to a file, go ahead. Return type: Any
-
reset
()¶ Should reset all the metrics/value being tracked by this metric This method is generally used at the end of a training/validation epoch to reset the values before starting another epoch
-
classification_metrics_utils¶
-
class
sciwing.metrics.classification_metrics_utils.
ClassificationMetricsUtils
¶ Bases:
object
The classification metrics like accuracy, precision, recall and fmeasure are often used in supervised learning. This class provides a few utilities that help in calculating these.
-
generate_table_report_from_counters
(tp_counter: Dict[int, int], fp_counter: Dict[int, int], fn_counter: Dict[int, int], idx2labelname_mapping: Dict[int, str] = None) → str¶ Returns a table representation for Precision Recall and FMeasure
Parameters: - tp_counter (Dict[int, int]) – The mapping between class index and true positive count
- fp_counter (Dict[int, int]) – The mapping between class index and false positive count
- fn_counter (Dict[int, int]) – The mapping between class index and false negative count
- idx2labelname_mapping (Dict[int, str]) – The mapping between idx and label name
Returns: Returns a string representing the table of precision recall and fmeasure for every class in the dataset
Return type: str
-
static
get_confusion_matrix_and_labels
(predicted_tag_indices: List[List[int]], true_tag_indices: List[List[int]], true_masked_label_indices: List[List[int]], pred_labels_mask: List[List[int]] = None) -> (<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3065310>, typing.List[int])¶ Gets the confusion matrix and the list of classes for which the confusion matrix is generated
Parameters: - predicted_tag_indices (List[List[int]]) – Predicted tag indices for a batch
- true_tag_indices (List[List[int]]) – True tag indices for a batch
- true_masked_label_indices (List[List[int]]) – Every integer is either a 0 or 1, where 1 will indicate that the label in true_tag_indices will be ignored
-
static
get_macro_prf_from_prf_dicts
(precision_dict: Dict[int, int], recall_dict: Dict[int, int], fscore_dict: Dict[int, int]) -> (<class 'int'>, <class 'int'>, <class 'int'>)¶ Calculates Macro Precision, Recall and FMeasure
Parameters: - precision_dict (Dict[int, int]) – Dictionary mapping betwen the class index and precision values
- recall_dict (Dict[int, int]) – Dictionary mapping between the class index and recall values
- fscore_dict (Dict[int, int]) – Dictionary mapping between the class index and fscore values
Returns: The macro precision, macro recall and macro fscore measures
Return type: int, int, int
-
get_micro_prf_from_counters
(tp_counter: Dict[int, int], fp_counter: Dict[int, int], fn_counter: Dict[int, int]) -> (<class 'int'>, <class 'int'>, <class 'int'>)¶ This calculates the micro precision recall and fmeasure from different counters. The counters contain a mapping from a class index to the particular number
Parameters: - tp_counter (Dict[int, int]) – Mapping from class index to true positive count
- fp_counter (Dict[int, int]) – Mapping from class index to false positive count
- fn_counter (Dict[int, int]) – Mapping from class index to false negative count
Returns: Micro precision, Micro Recall and Micro fmeasure
Return type: int, int, int
-
get_prf_from_counters
(tp_counter: Dict[int, int], fp_counter: Dict[int, int], fn_counter: Dict[int, int])¶ This calculates the precision recall f-measure from different counters. The counters contain a mapping from a class index to the particular number
Parameters: - tp_counter (Dict[int, int]) – Mapping from class index to true positive count
- fp_counter (Dict[int, int]) – Mapping from class index to false positive count
- fn_counter (Dict[int, int]) – Mapping from class index to false negative count
Returns: Three dictionaries representing the Precision Recall and Fmeasure for all the different classes
Return type: Dict[int, int], Dict[int, int], Dict[int, int]
-
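A small sketch of computing per-class and micro-averaged scores from counts; the counter values are made up purely for illustration.
from sciwing.metrics.classification_metrics_utils import ClassificationMetricsUtils

metrics_utils = ClassificationMetricsUtils()
tp_counter = {0: 8, 1: 5}  # true positives per class index
fp_counter = {0: 2, 1: 1}  # false positives per class index
fn_counter = {0: 1, 1: 3}  # false negatives per class index

precision, recall, fscore = metrics_utils.get_prf_from_counters(tp_counter, fp_counter, fn_counter)
micro_p, micro_r, micro_f = metrics_utils.get_micro_prf_from_counters(tp_counter, fp_counter, fn_counter)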
precision_recall_measure¶
-
class
sciwing.metrics.precision_recall_fmeasure.
PrecisionRecallFMeasure
(datasets_manager: sciwing.data.datasets_manager.DatasetsManager)¶ Bases:
sciwing.metrics.BaseMetric.BaseMetric
,sciwing.utils.class_nursery.ClassNursery
-
__init__
(datasets_manager: sciwing.data.datasets_manager.DatasetsManager)¶ Parameters: datasets_manager (DatasetsManager) – The dataset manager managing the labels and other information
-
calc_metric
(lines: List[sciwing.data.line.Line], labels: List[sciwing.data.label.Label], model_forward_dict: Dict[str, Any]) → None¶ Updates the values being tracked for calculating the metric
For Precision Recall FMeasure we update the true positive, false positive and false negative of the different classes being tracked
Parameters: - lines (List[Line]) – A list of lines
- labels (List[Label]) – A list of labels. This has to be the label used for classification Refer to the documentation of Label for more information
- model_forward_dict (Dict[str, Any]) – The dictionary obtained after a forward pass
The model_forward_dict is expected to have normalized_probs, which is usually of size [batch_size, num_classes]
-
get_metric
() → Dict[str, Any]¶ Returns different values being tracked to calculate Precision Recall FMeasure
Returns: Returns a dictionary with the following key value pairs for every namespace - precision: Dict[str, float]
- The precision for different classes
- recall: Dict[str, float]
- The recall values for different classes
- fscore: Dict[str, float]
- The fscore values for different classes,
- num_tp: Dict[str, int]
- The number of true positives for different classes,
- num_fp: Dict[str, int]
- The number of false positives for different classes,
- num_fn: Dict[str, int]
- The number of false negatives for different classes
- ”macro_precision”: float
- The macro precision value considering all different classes,
- macro_recall: float
- The macro recall value considering all different classes
- macro_fscore: float
- The macro fscore value considering all different classes
- micro_precision: float
- The micro precision value considering all different classes,
- micro_recall: float
- The micro recall value considering all different classes.
- micro_fscore: float
- The micro fscore value considering all different classes
Return type: Dict[str, Any]
-
print_confusion_metrics
(predicted_probs: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39b8450>, labels: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39b89d0>, labels_mask: Optional[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39b8a90>] = None) → None¶ Prints confusion matrix
Parameters: - predicted_probs (torch.FloatTensor) – Predicted probabilities of size [batch_size, num_classes]
- labels (torch.FloatTensor) – True labels of size [batch_size, 1]
- labels_mask (Optional[torch.ByteTensor]) – Labels mask indicating 1 in those places where the true label is ignored, otherwise 0. It should be of the same size as labels
-
report_metrics
(report_type='wasabi')¶ Reports metrics in a printable format
Parameters: report_type (type) – Select one of [wasabi, paper]
If wasabi, then we return a printable table that represents the precision recall and fmeasures for different classes
-
reset
() → None¶ Resets all the counters
Resets the tp_counter (true positive counter), the fp_counter (false positive counter), the fn_counter (false negative counter) and the tn_counter (true negative counter)
-
token_cls_accuracy¶
-
class
sciwing.metrics.token_cls_accuracy.
TokenClassificationAccuracy
(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, predicted_tags_namespace_prefix='predicted_tags')¶ Bases:
sciwing.metrics.BaseMetric.BaseMetric
,sciwing.utils.class_nursery.ClassNursery
-
calc_metric
(lines: List[sciwing.data.line.Line], labels: List[sciwing.data.seq_label.SeqLabel], model_forward_dict: Dict[str, Any]) → None¶ Parameters: - lines (List[Line]) – The list of lines
- labels (List[Label]) – The list of sequence labels
- model_forward_dict (Dict[str, Any]) – The model_forward_dict should have predicted tags for every namespace
The predicted_tags are the best possible predicted tags for the batch. They are List[List[int]] where the size is [batch_size, time_steps]
-
get_metric
() → Dict[str, Union[Dict[str, float], float]]¶ Returns different values being tracked to calculate Precision Recall FMeasure. Returns a dictionary with the following key value pairs for every namespace
- precision: Dict[str, float]
- The precision for different classes
- recall: Dict[str, float]
- The recall values for different classes
- fscore: Dict[str, float]
- The fscore values for different classes,
- num_tp: Dict[str, int]
- The number of true positives for different classes,
- num_fp: Dict[str, int]
- The number of false positives for different classes,
- num_fn: Dict[str, int]
- The number of false negatives for different classes
- macro_precision: float
- The macro precision value considering all different classes,
- macro_recall: float
- The macro recall value considering all different classes
- macro_fscore: float
- The macro fscore value considering all different classes
- micro_precision: float
- The micro precision value considering all different classes,
- micro_recall: float
- The micro recall value considering all different classes.
- micro_fscore: float
- The micro fscore value considering all different classes
Return type: Dict[str, Any]
-
print_confusion_metrics
(predicted_tag_indices: List[List[int]], true_tag_indices: List[List[int]], labels_mask: Optional[torch.ByteTensor] = None) → None¶ Prints confusion metrics for a batch of tag indices. It assumes that the batch is padded and every instance is of similar length
Parameters: - predicted_tag_indices (List[List[int]]) – Predicted tag indices for a batch of sentences
- true_tag_indices (List[List[int]]) – True tag indices for a batch of sentences
- labels_mask (Optional[torch.ByteTensor]) – The labels mask which has the same size as true_tag_indices. 0 in a position indicates that there is no masking, 1 indicates that there is masking
-
report_metrics
(report_type='wasabi') → Any¶ Reports metrics in a printable format
Parameters: report_type (type) – Select one of [wasabi, paper]
If wasabi, then we return a printable table that represents the precision recall and fmeasures for different classes
-
reset
()¶ Should reset all the metrics/values being tracked by this metric. This method is generally used at the end of a training/validation epoch to reset the values before starting another epoch
-
sciwing.models¶
Simple Classifier¶
-
class
sciwing.models.simpleclassifier.
SimpleClassifier
(encoder: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3a54450>, encoding_dim: int, num_classes: int, classification_layer_bias: bool = True, label_namespace: str = 'label', datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39e08d0>, str] = <sphinx.ext.autodoc.importer._MockObject object>)¶ Bases:
sphinx.ext.autodoc.importer._MockObject
,sciwing.utils.class_nursery.ClassNursery
-
__init__
(encoder: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3a54450>, encoding_dim: int, num_classes: int, classification_layer_bias: bool = True, label_namespace: str = 'label', datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39e08d0>, str] = <sphinx.ext.autodoc.importer._MockObject object>)¶ SimpleClassifier is a linear classifier head on top of any encoder
Parameters: - encoder (nn.Module) – Any encoder that takes in lines and produces a single vector for every line.
- encoding_dim (int) – The encoding dimension
- num_classes (int) – The number of classes
- classification_layer_bias (bool) – Whether to add a classification layer bias or not. This is set to False only for debugging purposes
- label_namespace (str) – The namespace used for labels in the dataset
- datasets_manager (DatasetsManager) – The datasets manager for the model
- device (torch.device) – The device on which the model is run
-
forward
(lines: List[sciwing.data.line.Line], labels: List[sciwing.data.label.Label] = None, is_training: bool = False, is_validation: bool = False, is_test: bool = False) → Dict[str, Any]¶ Parameters: - lines (List[Line]) –
A list of lines from any dataset that will be passed on to the encoder
- labels (List[Label]) – A list of labels for every instance
- is_training (bool) – running forward on training dataset?
- is_validation (bool) – running forward on validation dataset?
- is_test (bool) – running forward on test dataset?
Returns: - logits: torch.FloatTensor
Un-normalized probabilities over all the classes of the shape
[batch_size, num_classes]
- normalized_probs: torch.FloatTensor
Normalized probabilities over all the classes of the shape
[batch_size, num_classes]
- loss: float
Loss value if this is a training or validation forward pass. There will be no loss if this is the test dataset
Return type: Dict[str, Any]
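A minimal composition sketch for the classifier above. The embedding_type value, the dimensions and the number of classes are illustrative, and datasets_manager, lines and labels are assumed to come from your own dataset manager.
from sciwing.modules.embedders.word_embedder import WordEmbedder
from sciwing.modules.lstm2vecencoder import LSTM2VecEncoder
from sciwing.models.simpleclassifier import SimpleClassifier

embedder = WordEmbedder(embedding_type="glove_6B_100", datasets_manager=datasets_manager)
encoder = LSTM2VecEncoder(embedder=embedder, hidden_dim=256, bidirectional=True,
                          combine_strategy="concat")
model = SimpleClassifier(encoder=encoder, encoding_dim=2 * 256, num_classes=23,
                         datasets_manager=datasets_manager)
output = model(lines=lines, labels=labels, is_training=True)
# output["normalized_probs"] has the shape [batch_size, num_classes]; output["loss"] is the training loss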
-
Simple Tagger¶
-
class
sciwing.models.simple_tagger.
SimpleTagger
(rnn2seqencoder: sciwing.modules.lstm2seqencoder.Lstm2SeqEncoder, encoding_dim: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2422710> = <sphinx.ext.autodoc.importer._MockObject object>, label_namespace: str = 'seq_label')¶ Bases:
sphinx.ext.autodoc.importer._MockObject
,sciwing.utils.class_nursery.ClassNursery
PyTorch module for Neural Parscit
-
__init__
(rnn2seqencoder: sciwing.modules.lstm2seqencoder.Lstm2SeqEncoder, encoding_dim: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2422710> = <sphinx.ext.autodoc.importer._MockObject object>, label_namespace: str = 'seq_label')¶ Parameters: - rnn2seqencoder (Lstm2SeqEncoder) – Lstm2SeqEncoder that encodes a set of instances to a sequence of hidden states
- encoding_dim (int) – Hidden dimension of the lstm2seq encoder
-
forward
(lines: List[sciwing.data.line.Line], labels: List[sciwing.data.seq_label.SeqLabel] = None, is_training: bool = False, is_validation: bool = False, is_test: bool = False)¶ Parameters: - lines (List[Line]) – A list of lines
- labels (List[SeqLabel]) – A list of sequence labels
- is_training (bool) – running forward on training dataset?
- is_validation (bool) – running forward on validation dataset?
- is_test (bool) – running forward on test dataset?
Returns: - logits: torch.FloatTensor
Un-normalized probabilities over all the classes of the shape
[batch_size, num_classes]
- predicted_tags: List[List[int]]
Set of predicted tags for the batch
- loss: float
Loss value if this is a training or validation forward pass. There will be no loss if this is the test dataset
Return type: Dict[str, Any]
-
Neural Parscit¶
-
class
sciwing.models.neural_parscit.
NeuralParscit
(device: Optional[Tuple[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2cfacd0>, int]] = -1)¶ Bases:
sphinx.ext.autodoc.importer._MockObject
It defines a neural parscit model. The model is used for citation string parsing. This model helps you use a pre-trained model whose architecture is fixed and which is trained by SciWING. You can also fine-tune the model on your own dataset.
For practitioners, we provide ways to obtain results quickly from a set of citations stored in a file or from a string. If you want to see the demo head over to our demo site.
-
interact
()¶ Interact with the pretrained model. You can also interact from the command line using sciwing interact neural-parscit
-
predict_for_file
(filename: str) → List[str]¶ Parse the references in a file where every line is a reference
Parameters: filename (str) – The filename where the references are stored Returns: A list of parsed tags Return type: List[str]
-
predict_for_text
(text: str, show=True) → str¶ Parse the citation string for the given text
Parameters: - text (str) – reference string to parse
- show (bool) – If True, then we print the stylized string, where different tags are shown in different colors. If False, then we do not print the stylized string
Returns: The parsed citation string
Return type: str
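A short usage sketch for the methods above; the citation string and file name are illustrative inputs.
from sciwing.models.neural_parscit import NeuralParscit

parscit = NeuralParscit()                       # pre-trained weights are fetched on demand
parsed = parscit.predict_for_text("Calzolari, N. (1982) Towards the organization of lexical definitions on a database structure.")
# parsed_refs = parscit.predict_for_file("references.txt")   # one reference per line
# parscit.interact()                                         # or interact from the terminal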
-
Citation Intent Classification¶
-
class
sciwing.models.citation_intent_clf.
CitationIntentClassification
¶ Bases:
sphinx.ext.autodoc.importer._MockObject
-
interact
()¶ Interact with the pretrained model
-
predict_for_file
(filename: str) → List[str]¶ Predict the intents for all the citations in the filename The citations should be contained one per line
Parameters: filename (str) – The filename where the citations are stored Returns: Returns the intents for each line of citation Return type: List[str]
-
predict_for_text
(text: str) → str¶ Predict the intent for citation
Parameters: text (str) – The citation string Returns: The predicted label for the citation Return type: str
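A short usage sketch for the class above; the citation sentence and file name are illustrative.
from sciwing.models.citation_intent_clf import CitationIntentClassification

clf = CitationIntentClassification()
intent = clf.predict_for_text("We use the method of Smith et al. (2010) to tokenize the corpus.")
# intents = clf.predict_for_file("citations.txt")   # one citation per line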
-
Generic Section Header Classification¶
-
class
sciwing.models.generic_sect.
GenericSect
¶ Bases:
object
-
interact
()¶ Interact with the pretrained model
-
predict_for_file
(filename: str) → List[str]¶ Make predictions for every line in the file
Parameters: filename (str) – The filename where section headers are stored one per line Returns: A list of predictions Return type: List[str]
-
predict_for_text
(text: str, show=True) → str¶ Predicts the generic section headers of the text
Parameters: - text (str) – The section header string to be normalized
- show (bool) – If True then we print the prediction.
Returns: The prediction for the section header
Return type: str
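A short usage sketch for the class above; the section header string and file name are illustrative.
from sciwing.models.generic_sect import GenericSect

generic_sect = GenericSect()
normalized = generic_sect.predict_for_text("4. EXPERIMENTS AND RESULTS")
# predictions = generic_sect.predict_for_file("section_headers.txt")   # one header per line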
-
I2B2 NER¶
-
class
sciwing.models.i2b2.
I2B2NER
¶ Bases:
sphinx.ext.autodoc.importer._MockObject
It defines an I2B2 clinical NER model trained using SciWING
For practitioners, we provide ways to obtain results quickly from a set of citations stored in a file or from a string. If you want to see the demo head over to our demo site.
-
interact
()¶
-
predict_for_file
(filename: str) → List[str]¶
-
predict_for_text
(text: str)¶
-
SectLabel¶
-
class
sciwing.models.sectlabel.
SectLabel
(log_file: str = None, device: str = 'cpu')¶ Bases:
object
-
dehyphenate
(lines: List[str]) → List[str]¶ Dehyphenates a list of strings
Parameters: lines (List[str]) – A list of hyphenated strings Returns: A list of dehyphenated strings Return type: List[str]
-
extract_abstract_for_file
(pdf_filename: pathlib.Path, dehyphenate: bool = True) → str¶ Extracts abstracts from a pdf using sectlabel. This is the python programmatic version of the API. The APIs can be found in sciwing/api. You can see that for more information
Parameters: - pdf_filename (pathlib.Path) – The path where the pdf is stored
- dehyphenate (bool) – Scientific documents are sometimes two-column and a lot of hyphenation is introduced. If this is True, we remove the hyphens from the text
Returns: The abstract of the pdf
Return type: str
-
extract_abstract_for_folder
(foldername: pathlib.Path, dehyphenate=True)¶ Extracts the abstracts for all the pdf files stored in a folder
Parameters: - foldername (pathlib.Path) – The path of the folder containing pdf files
- dehyphenate (bool) – We will try to dehyphenate the lines. Useful if the pdfs are two-column research papers
Returns: Writes the abstracts to files
Return type: None
-
extract_all_info
(pdf_filename: pathlib.Path)¶ Extracts information from the pdf file.
Parameters: pdf_filename (pathlib.Path) – The path of the pdf file Returns: A dictionary containing information parsed from the pdf file Return type: Dict[str, Any]
-
interact
()¶ Interact with the pre-trained model
-
predict_for_file
(filename: str) → List[str]¶ Predicts the logical sections for all the sentences in a file, with one sentence per line
Parameters: filename (str) – The path of the file Returns: The predictions for each line. Return type: List[str]
-
predict_for_pdf
(pdf_filename: pathlib.Path) -> (typing.List[str], typing.List[str])¶ Predicts lines and labels given a pdf filename
Parameters: pdf_filename (pathlib.Path) – The location where pdf files are stored Returns: The lines and labels inferred on the file Return type: List[str], List[str]
-
predict_for_text
(text: str) → str¶ Predicts the logical section that the line belongs to
Parameters: text (str) – A single line of text Returns: The logical section of the text. Return type: str
-
predict_for_text_batch
(texts: List[str]) → List[str]¶ Predicts the logical section for a batch of text.
Parameters: texts (List[str]) – A batch of text Returns: A batch of predictions Return type: List[str]
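A short usage sketch for the SectLabel model above; paper.pdf is a placeholder path.
import pathlib
from sciwing.models.sectlabel import SectLabel

sectlabel = SectLabel(device="cpu")
abstract = sectlabel.extract_abstract_for_file(pathlib.Path("paper.pdf"), dehyphenate=True)
lines, labels = sectlabel.predict_for_pdf(pathlib.Path("paper.pdf"))
section = sectlabel.predict_for_text("1 Introduction")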
-
sciwing.modules¶
sciwing.modules.embedders¶
bert_embedder¶
-
class
sciwing.modules.embedders.bert_embedder.
BertEmbedder
(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, dropout_value: float = 0.0, aggregation_type: str = 'sum', bert_type: str = 'bert-base-uncased', word_tokens_namespace='tokens', device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39a8350>, str] = <sphinx.ext.autodoc.importer._MockObject object>)¶ Bases:
sphinx.ext.autodoc.importer._MockObject
,sciwing.modules.embedders.base_embedders.BaseEmbedder
,sciwing.utils.class_nursery.ClassNursery
-
__init__
(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, dropout_value: float = 0.0, aggregation_type: str = 'sum', bert_type: str = 'bert-base-uncased', word_tokens_namespace='tokens', device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39a8350>, str] = <sphinx.ext.autodoc.importer._MockObject object>)¶ Bert Embedder that embeds the given instance to BERT embeddings
Parameters: - dropout_value (float) – The amount of dropout to be added after the embedding
- aggregation_type (str) –
The kind of aggregation of different layers. BERT produces representations from different layers. This specifies the strategy for aggregating them. One of
- sum
- Sum the representations from all the layers
- average
- Average the representations from all the layers
- bert_type (type) –
The kind of BERT embedding to be used
- bert-base-uncased
- 12 layer transformer trained on lowercased vocab
- bert-large-uncased:
- 24 layer transformer trained on lowercased vocab
- bert-base-cased:
- 12 layer transformer trained on cased vocab
- bert-large-cased:
- 24 layer transformer trained on cased vocab
- scibert-base-cased
- 12 layer transformer trained on scientific documents on cased normal vocab
- scibert-sci-cased
- 12 layer transformer trained on scientific documents on cased scientific vocab
- scibert-base-uncased
- 12 layer transformer trained on scientific documents on uncased normal vocab
- scibert-sci-uncased
- 12 layer transformer trained on scientific documents on uncased scientific vocab
- word_tokens_namespace (str) – The namespace in the lines where the tokens are stored
- device (Union[torch.device, str]) – The device on which the model is run.
-
forward
(lines: List[sciwing.data.line.Line]) → <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39a85d0>¶ Parameters: lines (List[Line]) – A list of lines Returns: The bert embeddings for all the words in the instances The size of the returned embedding is [batch_size, max_len_word_tokens, emb_dim]
Return type: torch.Tensor
-
get_embedding_dimension
() → int¶
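A construction sketch for the embedder above. The forward pass expects a list of sciwing Line objects coming from a dataset, which is assumed here.
from sciwing.modules.embedders.bert_embedder import BertEmbedder

embedder = BertEmbedder(bert_type="scibert-base-cased", aggregation_type="average", device="cpu")
# embeddings = embedder(lines)   # -> [batch_size, max_len_word_tokens, emb_dim]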
-
bow_elmo_embedder¶
-
class
sciwing.modules.embedders.bow_elmo_embedder.
BowElmoEmbedder
(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, layer_aggregation: str = 'sum', device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39bddd0>] = <sphinx.ext.autodoc.importer._MockObject object>, word_tokens_namespace='tokens')¶ Bases:
sphinx.ext.autodoc.importer._MockObject
,sciwing.modules.embedders.base_embedders.BaseEmbedder
,sciwing.utils.class_nursery.ClassNursery
-
__init__
(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, layer_aggregation: str = 'sum', device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39bddd0>] = <sphinx.ext.autodoc.importer._MockObject object>, word_tokens_namespace='tokens')¶ Bag of words Elmo Embedder which aggregates elmo embedding for every token
Parameters: - layer_aggregation (str) –
You can choose one of
[sum, average, last, first]
which decides how to aggregate the different layers of ELMO. ELMO produces three layers of representations.
- sum
- Representations from different layers are summed
- average
- Representations from different layers are averaged
- last
- Representations from the last layer are considered
- first
- Representations from the first layer are considered
- device (Union[str, torch.device]) – device for running the model on
- word_tokens_namespace (str) – Namespace where all the word tokens are stored
-
forward
(lines: List[sciwing.data.line.Line]) → <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39a60d0>¶ Parameters: lines (List[Line]) – Just a list of lines Returns: Returns the representation for every token in the instance [batch_size, max_num_words, emb_dim]
. In case of Elmo the emb_dim is 1024. Return type: torch.Tensor
-
get_embedding_dimension
() → int¶
-
char_embedder¶
-
class
sciwing.modules.embedders.char_embedder.
CharEmbedder
(char_embedding_dimension: int, hidden_dimension: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', char_tokens_namespace: str = 'char_tokens', device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39a8d50>] = <sphinx.ext.autodoc.importer._MockObject object>)¶ Bases:
sphinx.ext.autodoc.importer._MockObject
,sciwing.modules.embedders.base_embedders.BaseEmbedder
,sciwing.utils.class_nursery.ClassNursery
-
__init__
(char_embedding_dimension: int, hidden_dimension: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', char_tokens_namespace: str = 'char_tokens', device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39a8d50>] = <sphinx.ext.autodoc.importer._MockObject object>)¶ This is a character embedder that takes in lines and collates the character embeddings for all the tokens in the lines.
Parameters: - char_embedding_dimension (int) – The dimension of the character embedding
- word_tokens_namespace (str) – The namespace where the words are saved
- char_tokens_namespace (str) – The namespace where the character tokens are saved
- datasets_manager (DatasetsManager) – The dataset manager that handles all the datasets
- hidden_dimension (int) – The hidden dimension of the LSTM which will be used to get character embeddings
-
forward
(lines: List[sciwing.data.line.Line])¶
-
get_embedding_dimension
() → int¶
-
concat_embedders¶
-
class
sciwing.modules.embedders.concat_embedders.
ConcatEmbedders
(embedders: List[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39bdbd0>], datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None)¶ Bases:
sphinx.ext.autodoc.importer._MockObject
,sciwing.modules.embedders.base_embedders.BaseEmbedder
,sciwing.utils.class_nursery.ClassNursery
-
__init__
(embedders: List[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39bdbd0>], datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None)¶ Concatenates a set of embedders into a single embedder.
Parameters: embedders (List[nn.Module]) – A list of embedders that can be concatenated
-
forward
(lines: List[sciwing.data.line.Line])¶ Parameters: lines (List[Line]) – A list of Lines. Returns: Returns the concatenated embedding that is of the size [batch_size, time_steps, embedding_dimension]
where the embedding_dimension is the dimension after the concatenation. Return type: torch.FloatTensor
-
get_embedding_dimension
()¶
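A sketch of combining two embedders into one, as is commonly done before feeding an encoder. The embedding_type value and dimensions are illustrative, and datasets_manager is assumed to be your DatasetsManager.
from sciwing.modules.embedders.word_embedder import WordEmbedder
from sciwing.modules.embedders.char_embedder import CharEmbedder
from sciwing.modules.embedders.concat_embedders import ConcatEmbedders

word_embedder = WordEmbedder(embedding_type="glove_6B_100", datasets_manager=datasets_manager)
char_embedder = CharEmbedder(char_embedding_dimension=25, hidden_dimension=50,
                             datasets_manager=datasets_manager)
embedder = ConcatEmbedders([word_embedder, char_embedder])
# embedder.get_embedding_dimension() is the dimension after concatenation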
-
elmo_embedder¶
-
class
sciwing.modules.embedders.elmo_embedder.
ElmoEmbedder
(dropout_value: float = 0.5, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3990110> = <sphinx.ext.autodoc.importer._MockObject object>, fine_tune: bool = False)¶ Bases:
sphinx.ext.autodoc.importer._MockObject
,sciwing.modules.embedders.base_embedders.BaseEmbedder
,sciwing.utils.class_nursery.ClassNursery
-
forward
(lines: List[sciwing.data.line.Line])¶
-
get_embedding_dimension
()¶
-
flair_embedder¶
-
class
sciwing.modules.embedders.flair_embedder.
FlairEmbedder
(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f0495c90>] = 'cpu', word_tokens_namespace: str = 'tokens')¶ Bases:
sphinx.ext.autodoc.importer._MockObject
,sciwing.utils.class_nursery.ClassNursery
,sciwing.modules.embedders.base_embedders.BaseEmbedder
-
__init__
(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f0495c90>] = 'cpu', word_tokens_namespace: str = 'tokens')¶ Flair Embeddings. These are commonly used for Named Entity Recognition. Note: this only works if your tokens are produced by splitting based on white space
Parameters: - embedding_type –
- datasets_manager –
- device –
- word_tokens_namespace –
-
forward
(lines: List[sciwing.data.line.Line])¶
-
get_embedding_dimension
()¶
-
trainable_word_embedder¶
-
class
sciwing.modules.embedders.trainable_word_embedder.
TrainableWordEmbedder
(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2c04210> = <sphinx.ext.autodoc.importer._MockObject object>)¶ Bases:
sphinx.ext.autodoc.importer._MockObject
,sciwing.modules.embedders.base_embedders.BaseEmbedder
,sciwing.utils.class_nursery.ClassNursery
-
__init__
(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2c04210> = <sphinx.ext.autodoc.importer._MockObject object>)¶ This represents trainable word embeddings which are trained along with the parameters of the network. The embeddings in the class WordEmbedder are not trainable. They are static
Parameters: embedding_type (str) – The type of embedding that you would want - datasets_manager: DatasetsManager
- The datasets manager which is running your experiments
- word_tokens_namespace: str
- The namespace where the word tokens are stored in your data
- device: Union[torch.device, str]
- The device on which this embedder is run
-
forward
(lines: List[sciwing.data.line.Line]) → <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2c04850>¶
-
get_embedding_dimension
() → int¶
-
word_embedder¶
-
class
sciwing.modules.embedders.word_embedder.
WordEmbedder
(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace='tokens', device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39bd4d0>, str] = <sphinx.ext.autodoc.importer._MockObject object>)¶ Bases:
sphinx.ext.autodoc.importer._MockObject
,sciwing.modules.embedders.base_embedders.BaseEmbedder
,sciwing.utils.class_nursery.ClassNursery
-
__init__
(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace='tokens', device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39bd4d0>, str] = <sphinx.ext.autodoc.importer._MockObject object>)¶ Word Embedder embeds the tokens using the desired embeddings. These are static embeddings.
Parameters: - embedding_type (str) – The type of embedding that you would want
- datasets_manager (DatasetsManager) – The datasets manager which is running your experiments
- word_tokens_namespace (str) – The namespace where the word tokens are stored in your data
- device (Union[torch.device, str]) – The device on which this embedder is run
-
forward
(lines: List[sciwing.data.line.Line]) → <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39bd810>¶ This will only consider the “tokens” present in the line. The namespace for the tokens is set with the class instantiation
Parameters: lines (List[Line]) – Returns: It returns the embedding of the size [batch_size, max_num_timesteps, embedding_dimension]
Return type: torch.FloatTensor
-
get_embedding_dimension
() → int¶
-
bow_encoder¶
-
class
sciwing.modules.bow_encoder.
BOW_Encoder
(embedder=None, dropout_value: float = 0, aggregation_type='sum', device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39a8f50>, str] = <sphinx.ext.autodoc.importer._MockObject object>)¶ Bases:
sphinx.ext.autodoc.importer._MockObject
,sciwing.utils.class_nursery.ClassNursery
-
__init__
(embedder=None, dropout_value: float = 0, aggregation_type='sum', device: Union[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39a8f50>, str] = <sphinx.ext.autodoc.importer._MockObject object>)¶ Bag of Words Encoder
Parameters: - embedder (nn.Module) – Any embedder that you would want to use
- dropout_value (float) – The input dropout value that you would want to use
- aggregation_type (str) –
- The strategy for aggregating words
- sum
- Aggregate word embedding by summing them
- average
- Aggregate word embedding by averaging them
- device (Union[torch.device, str]) – The device where the embeddings are stored
-
forward
(lines: List[sciwing.data.line.Line]) → <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3992090>¶ Parameters: lines (List[Line]) – A list of lines from the dataset Returns: The bag of words encoded embedding, either averaged or summed. The size is [batch_size, embedding_dimension] Return type: torch.FloatTensor
-
charlstm_encoder¶
-
class
sciwing.modules.charlstm_encoder.
CharLSTMEncoder
(char_embedder: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3999110>, char_emb_dim: int, hidden_dim: int = 1024, bidirectional: bool = False, combine_strategy: str = 'concat', device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3999050> = <sphinx.ext.autodoc.importer._MockObject object>)¶ Bases:
sphinx.ext.autodoc.importer._MockObject
,sciwing.utils.class_nursery.ClassNursery
-
__init__
(char_embedder: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3999110>, char_emb_dim: int, hidden_dim: int = 1024, bidirectional: bool = False, combine_strategy: str = 'concat', device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3999050> = <sphinx.ext.autodoc.importer._MockObject object>)¶ Encodes character tokens using lstms
Parameters: - char_embedder (nn.Module) – An embedder that embeds character tokens
- char_emb_dim (int) – The embedding dimension of the characters
- hidden_dim (int) – Hidden dimension of the LSTM
- bidirectional (bool) – Should the LSTM be bi-directional
- combine_strategy (str) – Combine strategy for the lstm hidden dimensions
- device (torch.device("cpu)) – The device on which the lstm will run
-
forward
(iter_dict: Dict[str, Any])¶ Parameters: iter_dict (Dict[str, Any]) – expects char_tokens to be present in the iter_dict
from any datasetReturns: [batch_size, num_time_steps, hidden_dim]
The hidden dimension is the hidden dimension of the LSTM. If it is bidirectional and concat, then hidden_dim will be 2 * self.hidden_dim. Return type: torch.Tensor
-
lstm2seqencoder¶
-
class
sciwing.modules.lstm2seqencoder.
Lstm2SeqEncoder
(embedder: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3992950>, dropout_value: float = 0.0, hidden_dim: int = 1024, bidirectional: bool = False, num_layers: int = 1, combine_strategy: str = 'concat', rnn_bias: bool = False, device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3992a10> = <sphinx.ext.autodoc.importer._MockObject object>, add_projection_layer: bool = True, projection_activation: str = 'Tanh')¶ Bases:
sphinx.ext.autodoc.importer._MockObject
,sciwing.utils.class_nursery.ClassNursery
-
__init__
(embedder: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3992950>, dropout_value: float = 0.0, hidden_dim: int = 1024, bidirectional: bool = False, num_layers: int = 1, combine_strategy: str = 'concat', rnn_bias: bool = False, device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3992a10> = <sphinx.ext.autodoc.importer._MockObject object>, add_projection_layer: bool = True, projection_activation: str = 'Tanh')¶ Encodes a set of tokens to a set of hidden states.
Parameters: - embedder (nn.Module) – Any embedder can be used for this purpose
- dropout_value (float) – The dropout value for the embedding
- hidden_dim (int) – The hidden dimensions for the LSTM
- bidirectional (bool) – Whether the LSTM is bidirectional
- num_layers (int) – The number of layers of the LSTM
- combine_strategy (str) –
The strategy to combine the different layers of the LSTM This can be one of
- sum
- Sum the different layers of the embedding
- concat
- Concat the layers of the embedding
- rnn_bias (bool) – Set this to false only for debugging purposes
- device (torch.device) –
- add_projection_layer (bool) – Adds a projection layer after the lstm over the hidden activation
- projection_activation (str) – Refer to torch.nn activations. Use any class name as a projection here
-
forward
(lines: List[sciwing.data.line.Line], c0: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3992b90> = None, h0: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3992bd0> = None) → <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f39927d0>¶ Parameters: - lines (List[Line]) – A list of lines
- c0 (torch.FloatTensor) – The initial state vector for the LSTM
- h0 (torch.FloatTensor) – The initial hidden state for the LSTM
Returns: Returns the vector encoding of the set of instances [batch_size, seq_len, hidden_dim] if single direction [batch_size, seq_len, 2*hidden_dim] if bidirectional
Return type: torch.Tensor
-
lstm2vecencoder¶
-
class
sciwing.modules.lstm2vecencoder.
LSTM2VecEncoder
(embedder, dropout_value: float = 0.0, hidden_dim: int = 1024, bidirectional: bool = False, combine_strategy: str = 'concat', rnn_bias: bool = True, device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3992450>] = <sphinx.ext.autodoc.importer._MockObject object>)¶ Bases:
sphinx.ext.autodoc.importer._MockObject
,sciwing.utils.class_nursery.ClassNursery
-
__init__
(embedder, dropout_value: float = 0.0, hidden_dim: int = 1024, bidirectional: bool = False, combine_strategy: str = 'concat', rnn_bias: bool = True, device: Union[str, <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3992450>] = <sphinx.ext.autodoc.importer._MockObject object>)¶ LSTM2Vec encoder that encodes a series of tokens to a single vector representation
Parameters: - embedder (nn.Module) – Any embedder can be passed
- dropout_value (float) – The dropout value for input embeddings
- hidden_dim (int) – The hidden dimension for the LSTM
- bidirectional (bool) – Whether the LSTM is bidirectional or not
- combine_strategy (str) – Strategy to combine the vectors from two different directions
- rnn_bias (bool) – Whether to use the bias layer in RNN. Should be set to False only for debugging purposes
- device (Union[str, torch.device]) – The device on which the model is run
-
forward
(lines: List[sciwing.data.line.Line], c0: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3992250> = None, h0: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3992190> = None) → <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3992690>¶ Parameters: - lines (List[Line]) – A list of lines to be encoded
- c0 (torch.FloatTensor) – The initial state vector for the LSTM
- h0 (torch.FloatTensor) – The initial hidden state for the LSTM
Returns: Returns the vector encoding of the set of instances [batch_size, hidden_dim] if single direction [batch_size, 2*hidden_dim] if bidirectional
Return type: torch.Tensor
Gets the initial hidden states of the LSTM2Vec encoder
Parameters: batch_size (int) – The batch size of the current forward pass Returns: Return type: torch.Tensor, torch.Tensor
-
sciwing.numericalizer¶
numericalizer¶
-
class
sciwing.numericalizers.numericalizer.
Numericalizer
(vocabulary: sciwing.vocab.vocab.Vocab = None)¶ Bases:
sciwing.numericalizers.base_numericalizer.BaseNumericalizer
-
__init__
(vocabulary: sciwing.vocab.vocab.Vocab = None)¶ Numericalizer converts tokens that are strings to numbers
Parameters: vocabulary (Vocab) – A vocabulary object that is built using a set of tokenized strings
-
get_mask_for_batch_instances
(instances: List[List[int]]) → <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3092550>¶
-
get_mask_for_instance
(instance: List[int]) → <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f3092510>¶
-
numericalize_batch_instances
(instances: List[List[str]]) → List[List[int]]¶ Numericalizes a batch of instances
Parameters: instances (List[List[str]]) – A list of tokenized sentences Returns: A list of numericalized instances Return type: List[List[int]]
-
numericalize_instance
(instance: List[str]) → List[int]¶ Numericalize a single instance
Parameters: instance (List[str]) – An instance is a list of tokens Returns: Numericalized instance Return type: List[int]
-
pad_batch_instances
(instances: List[List[int]], max_length: int, add_start_end_token: bool = True) → List[List[int]]¶ Pads a batch of instances according to the vocab object
Parameters: - instances (List[List[int]]) –
- max_length (int) –
- add_start_end_token (bool) –
Returns: Return type: List[List[int]]
-
pad_instance
(numericalized_text: List[int], max_length: int, add_start_end_token: bool = True) → List[int]¶ Pads the instance according to the vocab object
Parameters: - numericalized_text (List[int]) – Pads a numericalized instance
- max_length (int) – The maximum length to pad to
- add_start_end_token (bool) – If true, start and end token will be added to the tokenized text
Returns: Padded instance
Return type: List[int]
-
vocabulary
¶
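A minimal sketch, assuming vocab is a sciwing.vocab.vocab.Vocab built from your tokenized corpus.
from sciwing.numericalizers.numericalizer import Numericalizer

numericalizer = Numericalizer(vocabulary=vocab)
ids = numericalizer.numericalize_instance(["Towards", "automatic", "citation", "parsing"])
padded = numericalizer.pad_instance(ids, max_length=10, add_start_end_token=True)
mask = numericalizer.get_mask_for_instance(padded)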
-
transformer_numericalizer¶
-
class
sciwing.numericalizers.transformer_numericalizer.
NumericalizerForTransformer
(vocab: sciwing.vocab.vocab.Vocab = None, tokenizer: sciwing.tokenizers.bert_tokenizer.TokenizerForBert = None)¶ Bases:
sciwing.numericalizers.base_numericalizer.BaseNumericalizer
-
get_mask_for_batch_instances
(instances: List[List[int]])¶
-
get_mask_for_instance
(instance: List[int])¶
-
numericalize_batch_instances
(instances: List[List[str]]) → List[int]¶
-
numericalize_instance
(instance: Union[List[str], List[sciwing.data.token.Token]]) → List[int]¶
-
pad_batch_instances
(instances: List[List[int]], max_length: int, add_start_end_token: bool = True)¶ Pads a batch of instances according to the vocab object
Parameters: - instances (List[List[int]]) –
- max_length (int) –
- add_start_end_token (bool) –
Returns: Return type: List[List[int]]
-
pad_instance
(numericalized_text: List[int], max_length: int, add_start_end_token: bool = True) → List[int]¶ Pads the instance according to the vocab object
Parameters: - numericalized_text (List[int]) – Pads a numericalized instance
- max_length (int) – The maximum length to pad to
- add_start_end_token (bool) – If true, start and end token will be added to the tokenized text
Returns: Padded instance
Return type: List[int]
-
sciwing.preprocessing¶
instance_preprocessing¶
-
class
sciwing.preprocessing.instance_preprocessing.
InstancePreprocessing
¶ Bases:
object
This class implements some common pre-processing that may be applied on instances, which are List[str]. For example, you can remove stop words, convert words into lower case, and so on. Most of the methods here accept an instance and return an instance
-
static
indicate_capitalization
(instance: List[str]) → List[str]¶ Indicates whether every word is all lowercase, all caps or capitalized
Parameters: instance (List[str]) – A list of tokens Returns: Strings indicating capitalization Return type: List[str]
-
static
lowercase
(instance: List[str]) → List[str]¶
-
remove_stop_words
(instance: List[str]) → List[str]¶ Remove stop words if they are present. We use the stop-words package from pip: https://github.com/Alir3z4/python-stop-words
Parameters: instance (List[str]) – The list of tokens Returns: The instance with stop words removed Return type: List[str]
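A short usage sketch for the pre-processing utilities above.
from sciwing.preprocessing.instance_preprocessing import InstancePreprocessing

preprocess = InstancePreprocessing()
instance = ["The", "CRF", "Improves", "Citation", "Parsing"]
lowercased = InstancePreprocessing.lowercase(instance)
without_stop_words = preprocess.remove_stop_words(lowercased)
capitalization = InstancePreprocessing.indicate_capitalization(instance)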
-
sciwing.tokenizers¶
BaseTokenizer¶
bert_tokenizer¶
-
class
sciwing.tokenizers.bert_tokenizer.
TokenizerForBert
(bert_type: str, do_basic_tokenize=True)¶ Bases:
sciwing.tokenizers.BaseTokenizer.BaseTokenizer
-
tokenize
(text: str) → List[str]¶
-
tokenize_batch
(texts: List[str]) → List[List[str]]¶
-
character_tokenizer¶
-
class
sciwing.tokenizers.character_tokenizer.
CharacterTokenizer
¶ Bases:
sciwing.tokenizers.BaseTokenizer.BaseTokenizer
-
tokenize
(text: str) → List[str]¶
-
tokenize_batch
(texts: List[str]) → List[List[str]]¶
-
word_tokenizer¶
-
class
sciwing.tokenizers.word_tokenizer.
WordTokenizer
(tokenizer: str = 'spacy')¶ Bases:
sciwing.tokenizers.BaseTokenizer.BaseTokenizer
-
__init__
(tokenizer: str = 'spacy')¶ WordTokenizers split the text into tokens
Parameters: tokenizer (str) – The type of tokenizer.
- spacy
- Tokenizer from spacy
- nltk
- NLTK based tokenizer
- vanilla
- Tokenize words by splitting on spaces
- spacy-whitespace
- Same as vanilla but implemented using a custom white space tokenizer from spacy
-
tokenize
(text: str) → List[str]¶ Tokenize text into a set of tokens
Parameters: text (str) – A single instance that is tokenized to a set of tokens Returns: A set of tokens Return type: List[str]
-
tokenize_batch
(texts: List[str]) → List[List[str]]¶ Tokenize a batch of sentences
Parameters: texts (List[str]) – Returns: Return type: List[List[str]]
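A short usage sketch for the tokenizer above.
from sciwing.tokenizers.word_tokenizer import WordTokenizer

tokenizer = WordTokenizer(tokenizer="spacy")
tokens = tokenizer.tokenize("SciWING parses citation strings.")
token_batches = tokenizer.tokenize_batch(["first sentence", "second sentence"])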
-
sciwing.utils¶
Amazon S3 Utils¶
-
class
sciwing.utils.amazon_s3.
S3Util
(aws_cred_config_json_filename: str)¶ Bases:
object
-
__init__
(aws_cred_config_json_filename: str)¶ Some utilities that would be useful to upload folders/models to s3
Parameters: aws_cred_config_json_filename (str) – You need to instantiate this class with an aws configuration json file
- The following will be the keys and values
- aws_access_key_id : str
- The access key id for the AWS account that you have
- aws_access_secret : str
- The access secret
- region : str
- The region in which your bucket is present
- parsect_bucket_name : str
- The name of the bucket where all the models/experiments will be stored
-
download_file
(filename_s3: str, local_filename: str)¶ Downloads a file from s3
Parameters: - filename_s3 (str) – A filename in s3 that needs to be downloaded
- local_filename (str) – The local filename that will be used
-
download_folder
(folder_name_s3: str, download_only_best_checkpoint: bool = False, chkpoints_foldername: str = 'checkpoints', best_model_filename='best_model.pt', output_dir: str = '/home/docs/.sciwing.output_cache')¶ Downloads a folder from s3 recursively
Parameters: - folder_name_s3 (str) – The name of the folder in s3
- download_only_best_checkpoint (bool) – If the folder being downloaded is an experiment folder, then you can download only the best model checkpoints for running test or inference
- chkpoints_foldername (str) – The name of the checkpoints folder where the best model parameters are stored
- best_model_filename (str) – The name of the file where the best model parameters are stored
-
get_client
()¶ Returns boto3 client
Returns: The client object that manages all the aws operations The client is the low level access to the connection with s3 Return type: boto3.client
-
get_resource
()¶ Returns a high level manager for the aws bucket
Returns: Resource that manages connections with s3 Return type: boto3.resource
-
load_credentials
() → NamedTuple¶ Read the credentials from the json file
Returns: a named tuple with access_key, access_secret, region and bucket_name as the keys and the corresponding values filled in Return type: NamedTuple
-
search_folders_with
(pattern)¶ Searches for folders in the s3 bucket with specific pattern
Parameters: pattern (str) – A regex pattern Returns: The list of foldernames that match the pattern Return type: List[str]
-
upload_file
(filename: str, obj_name: str = None)¶ Parameters: - filename (str) – The filename in the local directory that needs to be uploaded to s3
- obj_name (str) – The filename to be used in s3 bucket. If None then obj_name in s3 will be the same as the filename
-
upload_folder
(folder_name: str, base_folder_name: str)¶ Recursively uploads a folder to s3
Parameters: - folder_name (str) – The name of the local folder that is uploaded
- base_folder_name (str) – The name of the folder from which the current folder being uploaded stems from. This is needed to associate appropriate files and directories to their hierarchies within the folder
-
Class Nursery¶
-
class
sciwing.utils.class_nursery.
ClassNursery
¶ Bases:
object
ClassNursery is the place where all the classes in SciWING are nursed
SciWING needs to get a handle on the different classes that are being used. This is further useful, for example, when we have to instantiate appropriate classes when the experiments are run from the TOML file
This uses a python 3.6 feature called __init_subclass__ that simplifies class creation. Whenever ClassNursery is mentioned as the parent class of a class, then __init_subclass__ is called. In SciWING we use it as a plugin registry where the mapping between the different classes and their modules is stored.
-
class_nursery
= {'Adam': <sphinx.ext.autodoc.importer._MockObject object>, 'BOW_Encoder': 'sciwing.modules.bow_encoder', 'BertEmbedder': 'sciwing.modules.embedders.bert_embedder', 'BowElmoEmbedder': 'sciwing.modules.embedders.bow_elmo_embedder', 'CharEmbedder': 'sciwing.modules.embedders.char_embedder', 'CharLSTMEncoder': 'sciwing.modules.charlstm_encoder', 'CoNLLDatasetManager': 'sciwing.datasets.seq_labeling.conll_dataset', 'ConcatEmbedders': 'sciwing.modules.embedders.concat_embedders', 'ElmoEmbedder': 'sciwing.modules.embedders.elmo_embedder', 'Engine': 'sciwing.engine.engine', 'FlairEmbedder': 'sciwing.modules.embedders.flair_embedder', 'LSTM2VecEncoder': 'sciwing.modules.lstm2vecencoder', 'Lstm2SeqEncoder': 'sciwing.modules.lstm2seqencoder', 'PrecisionRecallFMeasure': 'sciwing.metrics.precision_recall_fmeasure', 'RnnSeqCrfTagger': 'sciwing.models.rnn_seq_crf_tagger', 'SGD': <sphinx.ext.autodoc.importer._MockObject object>, 'SimpleClassifier': 'sciwing.models.simpleclassifier', 'SimpleTagger': 'sciwing.models.simple_tagger', 'TextClassificationDatasetManager': 'sciwing.datasets.classification.text_classification_dataset', 'TokenClassificationAccuracy': 'sciwing.metrics.token_cls_accuracy', 'TrainableWordEmbedder': 'sciwing.modules.embedders.trainable_word_embedder', 'WordEmbedder': 'sciwing.modules.embedders.word_embedder'}¶
-
Common Utils¶
-
sciwing.utils.common.
cached_path
(path: Union[pathlib.Path, str], url: str, unzip=True) → pathlib.Path¶
-
sciwing.utils.common.
chunks
(seq, n)¶ Yield successive n-sized chunks from seq.
-
sciwing.utils.common.
convert_generic_sect_to_json
(filename: str) → Dict[str, Any]¶ Converts the Generic sect data file into more readable json format
Parameters: filename (str) – The sectlabel file name available at WING-NUS website Returns: - text
- The text of the line
- label
- The label of the file
- file_no
- A unique file number
- line_count
- A line count within the file
Return type: Dict[str, Any]
-
sciwing.utils.common.
convert_generic_sect_to_sciwing_clf_format
(filename: str, out_dir: str)¶ Converts the generic sect original file to the sciwing classification format
Parameters: - filename (str) – The path of the file where the original generic section classification file is stored
- out_dir (str) – The output path where the train, dev and test files are written
Returns: Return type: None
-
sciwing.utils.common.
convert_parscit_to_conll
(parscit_train_filepath: pathlib.Path) → List[Dict[str, Any]]¶ Convert the parscit data available at “https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/parsCit.train.data” to a CONLL dummy version. This is done so that we can use it with AllenNLP's built-in data reader called the conll2013 dataset reader
Parameters: parscit_train_filepath (pathlib.Path) – The path where the train file path is stored
-
sciwing.utils.common.
convert_parscit_to_sciwing_seqlabel_format
(parscit_train_filepath: pathlib.Path, output_dir: str)¶ Convert the parscit data available at “https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/parsCit.train.data” to the format required for sciwing sequential labelling
Parameters: - parscit_train_filepath (pathlib.Path) – The local path where the files are stored
- output_dir (str) – The output dir where the train, dev and test files will be written
-
sciwing.utils.common.
convert_sectlabel_to_json
(filename: str) → Dict[KT, VT]¶ Converts the secthead file into more readable json format
Parameters: filename (str) – The sectlabel file name available at WING-NUS website Returns: - text
- The text of the line
- label
- The label of the file
- file_no
- A unique file number
- line_count
- A line count within the file
Return type: Dict[str, Any]
-
sciwing.utils.common.
convert_sectlabel_to_sciwing_clf_format
(filename: str, out_dir: str)¶ Writes the file in the format required for sciwing text classification dataset
Parameters: - filename (str) – The path of the sectlabel original format file.
- out_dir (str) – The path where the new files will be written
-
sciwing.utils.common.
create_class
(classname: str, module_name: str) → type¶ Given the classname and module, creates a class object and returns it
Parameters: - classname (str) – Class name to import
- module_name (str) – The module in which the class is present
Returns: Return type: type
-
sciwing.utils.common.
download_file
(url: str, dest_filename: str) → None¶ Download a file from the given url
Parameters: - url (str) – The url from which the file will be downloaded
- dest_filename (str) – The destination filename
-
sciwing.utils.common.
extract_tar
(filename: str, destination_dir: str, mode='r')¶ Extracts tar, targz and other files
Parameters: - filename (str) – The tar zipped file
- destination_dir (str) – The destination directory in which the files should be placed
- mode (str) – A valid tar mode. You can refer to https://docs.python.org/3/library/tarfile.html for the different modes.
-
sciwing.utils.common.
extract_zip
(filename: str, destination_dir: str)¶ Extracts a zipped file
Parameters: - filename (str) – The zipped filename
- destination_dir (str) – The directory where the extracted files will be placed
-
sciwing.utils.common.
flatten
(list_items: List[Any]) → List[Any]¶ Flattens an arbitrarily long nesting of lists
Parameters: list_items (List[Any]) – It can be an arbitrarily long nesting of lists Returns: Flattened list Return type: List
-
sciwing.utils.common.
get_system_mem_in_gb
()¶ Returns the total system memory in GB
Returns: Memory size in GB Return type: float
-
sciwing.utils.common.
get_train_dev_test_stratified_split
(lines: List[str], labels: List[str], train_split: float = 0.8, dev_split: float = 0.1, test_split: float = 0.1, random_state: int = 1729) -> ((typing.List[str], typing.List[str]), (typing.List[str], typing.List[str]), (typing.List[str], typing.List[str]))¶ Splits the lines and labels into train, dev and test splits using a stratified random shuffle
Parameters: - lines (List[str]) – A list of lines
- labels (List[str]) – A list of labels
- train_split (float) – The proportion of lines to be used for training
- dev_split (float) – The proportion of lines to be used for validation
- test_split (float) – The proportion of lines to be used for testing
- random_state (int) – The seed to be used for randomization. Good for reproducing the same splits. Passing None will cause the random number generator to be the RandomState instance used by np.random
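A usage sketch; lines and labels are assumed to be your own parallel lists of texts and labels.
from sciwing.utils.common import get_train_dev_test_stratified_split

(train_lines, train_labels), (dev_lines, dev_labels), (test_lines, test_labels) = \
    get_train_dev_test_stratified_split(lines=lines, labels=labels,
                                        train_split=0.8, dev_split=0.1, test_split=0.1)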
-
sciwing.utils.common.
merge_dictionaries_with_sum
(a: Dict[KT, VT], b: Dict[KT, VT]) → Dict[KT, VT]¶
-
sciwing.utils.common.
pack_to_length
(tokenized_text: List[str], max_length: int, pad_token: str = '<PAD>', add_start_end_token: bool = False, start_token: str = '<SOS>', end_token: str = '<EOS>') → List[str]¶ Packs tokenized text to maximum length
Parameters: - tokenized_text (List[str]) – A list of tokens
- max_length (int) – The max length to pack to
- pad_token (str) – The pad token to be used for the padding
- add_start_end_token (bool) – Whether to add the start and end token to every sentence while packing
- start_token (str) – The start token to be used if add_start_end_token is True.
- end_token (str) – The end token to be used if add_start_end_token is True
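A short usage sketch for pack_to_length.
from sciwing.utils.common import pack_to_length

packed = pack_to_length(tokenized_text=["citation", "parsing"], max_length=5,
                        pad_token="<PAD>", add_start_end_token=True,
                        start_token="<SOS>", end_token="<EOS>")
# packed is the token list padded out with <PAD>, <SOS> and <EOS> tokens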
-
sciwing.utils.common.
pairwise
(iterable: Iterable[T_co]) → Iterator[T_co]¶ Return the overlapping pairwise elements of the iterable
Parameters: iterable (Iterable) – Anything that can be iterated Returns: Iterator over the paired sequence Return type: Iterator
-
sciwing.utils.common.
write_cora_to_conll_file
(cora_conll_filepath: pathlib.Path) → None¶ Writes the cora file that is available at https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/cora.train to CONLL format
Parameters: cora_conll_filepath (The destination filepath where the CORA is converted to CONLL format) –
-
sciwing.utils.common.
write_nfold_parscit_train_test
(parscit_train_filepath: pathlib.Path, output_train_filepath: pathlib.Path, output_test_filepath: pathlib.Path, nsplits: int = 2) → bool¶ Convert the parscit train folder into different folds. This is useful for n-fold cross validation on the dataset. This method can be iterated over to get all the different folds of the data contained in the
parscit_train_filepath
Parameters: - parscit_train_filepath (pathlib.Path) – The path where the Parscit file is stored The file is available at https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/cora.train
- output_train_filepath (pathlib.Path) – The path where the train fold of the dataset will be stored
- output_test_filepath (pathlib.Path) – The path where the test fold of the dataset will be stored
- nsplits (int) – The number of splits in the dataset.
Returns: Indicates whether the particular fold has been written
Return type: bool
-
sciwing.utils.common.
write_parscit_to_conll_file
(parscit_conll_filepath: pathlib.Path) → None¶ Write Parscit file to CONLL file format
Parameters: parscit_conll_filepath (pathlib.Path) – The destination file where the parscit data is written to
Custom Spacy Tokenizers¶
This module implements custom spacy tokenizers if needed. This can be useful for custom tokenization that is required for the scientific domain
Custom Exceptions¶
-
exception
sciwing.utils.exceptions.
ClassInNurseryError
¶ Bases:
KeyError
The ClassNursery cannot have two classes of the same name. This error is raised when that happens
-
exception
sciwing.utils.exceptions.
DatasetPresentError
(message: str)¶ Bases:
Exception
-
exception
sciwing.utils.exceptions.
TOMLConfigurationError
(message: str)¶ Bases:
Exception
This error is raised for illegal configuration of TOML
Science IE Data Utils¶
-
class
sciwing.utils.science_ie_data_utils.
ScienceIEDataUtils
(folderpath: pathlib.Path, ignore_warnings=False)¶ Bases:
object
Science-IE is a SemEval Task that is aimed at extracting entities from scientific articles. This class is a utility for various operations on the competition's data files.
-
__init__
(folderpath: pathlib.Path, ignore_warnings=False)¶ Given the folderpath where the ScienceIE data is stored, this class provides various utilities. For more information on the dataset you can refer to https://scienceie.github.io/
Parameters: - folderpath (pathlib.Path) – The path where the ScienceIEDataset is stored
- ignore_warnings (bool) – If True, then all the warnings generated by this class for inconsistencies in the data are ignored
-
static
_form_ann_line
(idx: str, char_offset: Tuple[int, int, str], tag_name: str, doc: <sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2c0bf50>)¶ Forms an ann line that can be used to write the ANN files for CoNLL format
Parameters: - idx (int) – The index for the entity being written
- char_offset (Tuple[int, int, str]) – The start, end, tag for the line
- tag_name (str) – The tag to be used and is one of
[Task, Process, Material]
- doc (str) – Spacy doc to query the appropriate characters
Returns: An ANN line that is formed.
Return type: str
-
_get_annotations_for_entity
(file_id: str, entity: str) → List[Dict[str, Any]]¶ Parameters: - file_id (str) – A ScienceIE file id
- entity (str) – One of
[Task, Process, Material]
Returns: - A list of annotations where every annotation is
- start
The start character index of the annotation
- end
The end character index of the annotation
- words
The set of words between the start and the end index
- entity_number
The entity number
- tag
The tag associated with the set of words
Return type: List[Dict[str, Any]]
-
_get_bilou_lines_for_entity
(text: str, annotations: List[Dict[str, Any]], entity: str) → List[str]¶ The list of BILOU lines for entity
Parameters: - text (str) – The text for which BILOU lines need to be returned
- annotations (List[Dict[str, Any]]) – The list of annotations where every annotation is a dictionary
- entity (str) – A particular entity for which the BILOU lines are returned
Returns: The list of BILOU tagged lines, where every line is a
word, tag, tag, tag
where the tag is decided by the entity. Return type: List[str]
-
get_bilou_lines_for_entity
(file_id: str, entity: str)¶ Writes conll file for the entity type
Parameters: - file_id (str) – File id of the annotation file
- entity (str) – The entity for which conll file is written
Returns: The list of BILOU lines for the entity
Return type: List[str]
-
get_file_ids
() → List[str]¶ Get all the file ids from the folder
Returns: A List of File ids in the folder Return type: List[str]
-
get_sentence_wise_bilou_lines
(file_id: str, entity_type: str) → List[List[str]]¶ Get BILOU lines sentence-wise
Parameters: - file_id (str) – File id from ScienceIE Dataset
- entity_type (str) – One of
['Task', 'Process', 'Material']
Returns: A list of sentences where every sentence is composed
Return type: List[List[str]]
-
get_sents
(text: str) → List[<sphinx.ext.autodoc.importer._MockObject object at 0x7f32f2c0b990>]¶ Returns all the sentences in the text
Parameters: text (str) – Returns: All the sentences in the text as a spacy span. A spacy span encodes more information within Return type: List[span.Span]
-
get_text_from_fileid
(file_id: str) → str¶ Given a file id return the text from the file
Parameters: file_id (str) – A ScienceIE data file id Returns: Text read from the file Return type: str
-
merge_files
(task_filename: pathlib.Path, process_filename: pathlib.Path, material_filename: pathlib.Path, out_filename: pathlib.Path)¶ Merge different files to one conll file
Parameters: - task_filename (pathlib.Path) – The CONLL style file having TASK tags
- process_filename (pathlib.Path) – The CONLL style file having Process tags
- material_filename (pathlib.Path) – The CONLL style file having Material Tags
- out_filename (pathlib.Path) – The output file where the different files will be merged
and every line will consist of
word Task-tag Process-tag Material-tag
-
write_ann_file_from_conll_file
(conll_filepath: pathlib.Path, ann_filepath: pathlib.Path, text: str)¶
-
write_bilou_lines
(out_filename: pathlib.Path, is_sentence_wise: bool = False)¶ Writes bilou lines in the out_filename for all the files in
self.folderpath
. The output file will contain every word on one line with its tag in BILOU format. You can even opt to write the text sentence-wise. The text, which is possibly of multiple sentences, is broken down into sentences and then written into the output filename. Different sentences are separated by an empty line.
Parameters: - out_filename (pathlib.Path) – The output filename where the conll filename is written
- is_sentence_wise (bool) – You can write the BILOU lines sentence wise. The text in all the ScienceIE files will be broken into sentences, and the sentences will be tagged with BILOU tags
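A short sketch; the folder and output paths are placeholders for your local copy of the ScienceIE data.
import pathlib
from sciwing.utils.science_ie_data_utils import ScienceIEDataUtils

utils = ScienceIEDataUtils(folderpath=pathlib.Path("scienceie_train"), ignore_warnings=True)
file_ids = utils.get_file_ids()
utils.write_bilou_lines(out_filename=pathlib.Path("scienceie_task.conll"), is_sentence_wise=True)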
-
Sciwing TOML Runner¶
-
class
sciwing.utils.sciwing_toml_runner.
SciWingTOMLRunner
(toml_filename: pathlib.Path, infer: bool = False)¶ Bases:
object
-
_form_dag
(section_name: str, section: Dict[KT, VT], parent: str)¶ Forms a DAG of the model section for execution
The model can be a complex structure with various other sub-components. One component depends on another and the order of execution has to be decided. A DAG is a good abstract model to define the dependence between different modules. This method instantiates a DAG given the section name and the TOML section that is being parsed, with a directed edge between the parent and the child
Parameters: - section_name (str) – The name of the TOML section being parsed
- section (Dict) – The details of the actual section
- parent (str) – The node id of the parent graph
-
_instantiate_model_using_dag
()¶ This is a key method that instantiates the DAG using topological ordering
The DAG built from the TOML model section has to be instantiated such that the submodules of a module are instantiated before the module itself. This method does so using topological sort. A topological sort is an ordering of the nodes of a DAG in which, for every edge u -> v, u appears before v.
We do exactly this for SciWING: the child nodes used by parent nodes are instantiated before the root node of the DAG, which represents the entire module.
Returns: The instantiation of the root node Return type: nn.Module
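As an illustration of this idea (a sketch, not SciWING's internal code), the snippet below builds a small parent-to-child DAG with networkx and walks it in reverse topological order, so that children are built before the modules that contain them.
import networkx as nx

# hypothetical model section: a model containing an encoder, which contains an embedder
dag = nx.DiGraph()
dag.add_edge("model", "encoder")      # directed edge from parent to child
dag.add_edge("encoder", "embedder")

# topological order places parents before children; reverse it to build children first
build_order = list(reversed(list(nx.topological_sort(dag))))  # ['embedder', 'encoder', 'model']

instantiated = {}
for node in build_order:
    children = [instantiated[child] for child in dag.successors(node)]
    # in SciWING the instantiated children would be passed to the parent's constructor
    instantiated[node] = {"name": node, "children": children}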
-
_parse_toml_file
()¶ Parses the toml file and returns the document
Returns: The dictionary obtained by parsing the toml file Return type: Dict[str, Any]
-
parse
()¶ Parses the dataset, model and engine sections of a TOML file
-
parse_dataset_section
()¶ Parse the dataset section of the toml file and instantiate the dataset
Returns: The dataset manager for the experiment Return type: DatasetManager
-
parse_engine_section
()¶ Parses the engine section of the TOML file
Returns: Object of the Engine class Return type: Engine
-
parse_model_section
()¶ Parses the Model section of the toml file
Returns: A torch module representing the model Return type: nn.Module
-
run
()¶
-
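A minimal usage sketch based on the signatures documented above; the filename is a placeholder, and the TOML file is expected to declare the dataset, model and engine sections parsed by the methods above.
from pathlib import Path
from sciwing.utils.sciwing_toml_runner import SciWingTOMLRunner

# "experiment.toml" is a placeholder path to a config declaring dataset, model and engine
runner = SciWingTOMLRunner(toml_filename=Path("experiment.toml"), infer=False)
runner.parse()  # instantiates the dataset manager, the model (via the DAG) and the engine
runner.run()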
Tensor Utils¶
-
sciwing.utils.tensor_utils.
get_mask
(batch_size: int, max_size: int, lengths: torch.LongTensor)¶ A convenience method that returns a mask given a lengths tensor
Given a lengths tensor as in
>> torch.LongTensor([3, 1, 2])
which often indicates the original lengths of the tensors without padding, get_mask() returns a tensor with 1 in positions where there is no padding and 0 where there is padding
Parameters: - batch_size (int) – Batch size of the tensors
- max_size (int) – Maximum size or often Maximum number of time steps
- lengths (torch.LongTensor) – The original length of the tensors in the batch without padding
Returns: Mask having 1 where there is no padding and 0 where there is padding
Return type: torch.LongTensor
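For example, following the documented behaviour (1 for real positions, 0 for padding):
import torch
from sciwing.utils.tensor_utils import get_mask

lengths = torch.LongTensor([3, 1, 2])
mask = get_mask(batch_size=3, max_size=3, lengths=lengths)
# expected mask, one row per instance in the batch:
# tensor([[1, 1, 1],
#         [1, 0, 0],
#         [1, 1, 0]])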
-
sciwing.utils.tensor_utils.
has_tensor
(obj) → bool¶ Given a possibly complex data structure, check if it has any torch.Tensors in it. From
allennlp.nn.util
-
sciwing.utils.tensor_utils.
move_to_device
(obj, cuda_device: torch.device)¶ Given a structure (possibly) containing Tensors on the CPU, move all the Tensors to the specified GPU (or do nothing, if they should be on the CPU). From
allennlp.nn.util
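A small sketch of how these two helpers are typically used together; the batch dictionary is hypothetical and the sketch assumes cuda_device accepts a torch.device.
import torch
from sciwing.utils.tensor_utils import has_tensor, move_to_device

batch = {"tokens": torch.zeros(2, 5, dtype=torch.long), "labels": [0, 1]}  # hypothetical batch

if has_tensor(batch) and torch.cuda.is_available():
    # move every tensor inside the (possibly nested) structure to the first GPU
    batch = move_to_device(batch, torch.device("cuda:0"))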
NER Terminal Visualizer¶
Bases:
object
Visualize Sequence Tagging
Parameters: - colors (List[str]) – The set of colors that will be used for tagging
- colors_palette (str) – The color palette that should be used. For more information on color palettes, you can refer to the documentation of the python package colorful
- tags (List[str]) – The set of all tags that can be assigned. If this is not given, then the tags will be inferred from the labels during tagging
Visualize the tags from json.
Parameters: - json_annotation (str) – You can send a json that has the following format {‘text’: str, ‘tags’: [{‘start’:int, ‘end’:str, ‘tag’: str}] }
- show_only_entities (List[str]) – You can filter to show only these entities.
Visualizes sequentially tagged data, where the string is represented as a sequence of words and every word has a corresponding label. This can be extended to different tagging schemes at a later point in time
Parameters: - text (List[str]) – The string to be tagged, represented as a list of words
- labels (List[str]) – The labels corresponding to each word in the string
Returns: Return type: None
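The class and method names in the sketch below are assumptions, since the corresponding autodoc headers are not shown here; only the parameters (text and labels as lists of strings, plus the constructor options above) are taken from this page. The sketch assumes a VisTagging class in sciwing.utils.vis_seq_tags with a visualize_tokens method.
# NOTE: the class name, module path and method name below are assumptions
from sciwing.utils.vis_seq_tags import VisTagging

visualizer = VisTagging(tags=["O", "B-Task", "I-Task", "L-Task", "U-Task"])
words = ["We", "propose", "a", "new", "citation", "parser"]
labels = ["O", "O", "O", "O", "B-Task", "L-Task"]
visualizer.visualize_tokens(text=words, labels=labels)  # prints colour-coded tags in the terminal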