sciwing.models

Simple Classifier

class sciwing.models.simpleclassifier.SimpleClassifier(encoder: torch.nn.Module, encoding_dim: int, num_classes: int, classification_layer_bias: bool = True, label_namespace: str = 'label', datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[torch.device, str] = torch.device('cpu'))

Bases: torch.nn.Module, sciwing.utils.class_nursery.ClassNursery

__init__(encoder: torch.nn.Module, encoding_dim: int, num_classes: int, classification_layer_bias: bool = True, label_namespace: str = 'label', datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[torch.device, str] = torch.device('cpu'))

SimpleClassifier is a linear classifier head on top of any encoder

Parameters:
  • encoder (nn.Module) – Any encoder that takes in lines and produces a single vector for every line.
  • encoding_dim (int) – The encoding dimension
  • num_classes (int) – The number of classes
  • classification_layer_bias (bool) – Whether to add a bias to the classification layer. This is set to False only for debugging purposes.
  • label_namespace (str) – The namespace used for labels in the dataset
  • datasets_manager (DatasetsManager) – The datasets manager for the model
  • device (torch.device) – The device on which the model is run
forward(lines: List[sciwing.data.line.Line], labels: List[sciwing.data.label.Label] = None, is_training: bool = False, is_validation: bool = False, is_test: bool = False) → Dict[str, Any]
Parameters:
  • lines (List[Line]) – A list of Line instances from the dataset that will be passed on to the encoder
  • labels (List[Label]) – A list of labels for every instance
  • is_training (bool) – running forward on training dataset?
  • is_validation (bool) – running forward on validation dataset?
  • is_test (bool) – running forward on test dataset?
Returns:

logits: torch.FloatTensor

Un-normalized probabilities over all the classes of the shape [batch_size, num_classes]

normalized_probs: torch.FloatTensor

Normalized probabilities over all the classes of the shape [batch_size, num_classes]

loss: float

The loss value if this is a training or validation forward pass. No loss is returned for the test dataset.

Return type:

Dict[str, Any]
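
A minimal usage sketch follows. The WordEmbedder and BOW_Encoder constructions, the embedding dimension, and the data_manager, lines and labels variables are assumptions for illustration; substitute the encoder and DatasetsManager you actually use.

    from sciwing.modules.bow_encoder import BOW_Encoder
    from sciwing.modules.embedders.word_embedder import WordEmbedder
    from sciwing.models.simpleclassifier import SimpleClassifier

    # Assumed components: a word embedder and a bag-of-words encoder that
    # produce one vector per Line. Check the module signatures in your version.
    embedder = WordEmbedder(embedding_type="glove_6B_50")
    encoder = BOW_Encoder(embedder=embedder)

    classifier = SimpleClassifier(
        encoder=encoder,
        encoding_dim=50,                 # must match the encoder's output dimension
        num_classes=3,
        classification_layer_bias=True,
        datasets_manager=data_manager,   # a DatasetsManager prepared for your dataset
    )

    # lines: List[Line], labels: List[Label] come from your dataset
    output = classifier(lines=lines, labels=labels, is_training=True)
    print(output["normalized_probs"].shape)  # [batch_size, num_classes]
    print(output["loss"])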

Simple Tagger

class sciwing.models.simple_tagger.SimpleTagger(rnn2seqencoder: sciwing.modules.lstm2seqencoder.Lstm2SeqEncoder, encoding_dim: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: torch.device = torch.device('cpu'), label_namespace: str = 'seq_label')

Bases: torch.nn.Module, sciwing.utils.class_nursery.ClassNursery

PyTorch module for sequence tagging (used, for example, by Neural Parscit)

__init__(rnn2seqencoder: sciwing.modules.lstm2seqencoder.Lstm2SeqEncoder, encoding_dim: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: torch.device = torch.device('cpu'), label_namespace: str = 'seq_label')
Parameters:
  • rnn2seqencoder (Lstm2SeqEncoder) – Lstm2SeqEncoder that encodes a set of instances to a sequence of hidden states
  • encoding_dim (int) – Hidden dimension of the lstm2seq encoder
  • datasets_manager (DatasetsManager) – The datasets manager for the model
  • device (torch.device) – The device on which the model is run
  • label_namespace (str) – The namespace used for the sequence labels in the dataset
forward(lines: List[sciwing.data.line.Line], labels: List[sciwing.data.seq_label.SeqLabel] = None, is_training: bool = False, is_validation: bool = False, is_test: bool = False)
Parameters:
  • lines (List[Line]) – A list of lines
  • labels (List[SeqLabel]) – A list of sequence labels
  • is_training (bool) – running forward on training dataset?
  • is_validation (bool) – running forward on validation dataset?
  • is_test (bool) – running forward on test dataset?
Returns:

logits: torch.FloatTensor

Un-normalized probabilities over all the classes of the shape [batch_size, num_classes]

predicted_tags: List[List[int]]

Set of predicted tags for the batch

loss: float

The loss value if this is a training or validation forward pass. No loss is returned for the test dataset.

Return type:

Dict[str, Any]
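
A usage sketch, under the assumption that an embedder, a data_manager, and lines/seq_labels are already prepared; the Lstm2SeqEncoder keyword arguments below are illustrative and may differ in your version.

    from sciwing.modules.lstm2seqencoder import Lstm2SeqEncoder
    from sciwing.models.simple_tagger import SimpleTagger

    # Assumed: `embedder` and `data_manager` are built elsewhere; the encoder
    # keyword arguments are illustrative.
    encoder = Lstm2SeqEncoder(embedder=embedder, hidden_dim=128, bidirectional=True)

    tagger = SimpleTagger(
        rnn2seqencoder=encoder,
        encoding_dim=256,            # 2 * hidden_dim for a bidirectional encoder
        datasets_manager=data_manager,
    )

    output = tagger(lines=lines, labels=seq_labels, is_training=True)
    print(output["predicted_tags"])  # one list of tag indices per input line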

Neural Parscit

class sciwing.models.neural_parscit.NeuralParscit(device: Union[torch.device, int] = -1)

Bases: torch.nn.Module

It defines a Neural Parscit model. The model is used for citation string parsing. It lets you use a pre-trained model whose architecture is fixed and which was trained by SciWING. You can also fine-tune the model on your own dataset.

For practitioners, we provide ways to obtain results quickly from a set of citations stored in a file or from a string. If you want to see the demo, head over to our demo site.

interact()

Interact with the pretrained model. You can also interact from the command line using sciwing interact neural-parscit

predict_for_file(filename: str) → List[str]

Parse the references in a file where every line is a reference

Parameters:filename (str) – The filename where the references are stored
Returns:A list of parsed tags
Return type:List[str]
predict_for_text(text: str, show=True) → str

Parse the citation string for the given text

Parameters:
  • text (str) – reference string to parse
  • show (bool) – If True, we print a stylized string in which different tags are shown in different colors. If False, we do not print the stylized string.
Returns:

The parsed citation string

Return type:

str
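
A quick usage sketch; the reference string and filename are illustrative.

    from sciwing.models.neural_parscit import NeuralParscit

    parscit = NeuralParscit()  # loads the pre-trained Neural Parscit model

    # Parse a single reference string (printed with colored tags when show=True)
    parscit.predict_for_text(
        "Devlin J, Chang M, Lee K, Toutanova K. BERT: Pre-training of deep "
        "bidirectional transformers for language understanding. NAACL 2019."
    )

    # Parse a file with one reference per line (path is illustrative)
    tags = parscit.predict_for_file("references.txt")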

Citation Intent Classification

class sciwing.models.citation_intent_clf.CitationIntentClassification

Bases: torch.nn.Module

interact()

Interact with the pretrained model

predict_for_file(filename: str) → List[str]

Predict the intents for all the citations in the file. The citations should be stored one per line.

Parameters:filename (str) – The filename where the citations are stored
Returns:The predicted intent for each citation line
Return type:List[str]
predict_for_text(text: str) → str

Predict the intent for citation

Parameters:text (str) – The citation string
Returns:The predicted label for the citation
Return type:str
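
A usage sketch; the citation string and filename are illustrative.

    from sciwing.models.citation_intent_clf import CitationIntentClassification

    clf = CitationIntentClassification()  # loads the pre-trained model

    # Predict the intent of a single citation
    intent = clf.predict_for_text("We follow the evaluation setup of Smith et al. (2018).")
    print(intent)

    # Predict intents for a file with one citation per line
    intents = clf.predict_for_file("citations.txt")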

Generic Section Header Classification

class sciwing.models.generic_sect.GenericSect

Bases: object

interact()

Interact with the pretrained model

predict_for_file(filename: str) → List[str]

Make predictions for every line in the file

Parameters:filename (str) – The filename where section headers are stored one per line
Returns:A list of predictions
Return type:List[str]
predict_for_text(text: str, show=True) → str

Predicts the generic section headers of the text

Parameters:
  • text (str) – The section header string to be normalized
  • show (bool) – If True then we print the prediction.
Returns:

The prediction for the section header

Return type:

str
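
A usage sketch; the header string and filename are illustrative.

    from sciwing.models.generic_sect import GenericSect

    generic_sect = GenericSect()  # loads the pre-trained model

    # Normalize a single section header
    label = generic_sect.predict_for_text("4. Experimental Results", show=True)
    print(label)

    # Normalize every header stored one per line in a file
    labels = generic_sect.predict_for_file("section_headers.txt")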

I2B2 NER

class sciwing.models.i2b2.I2B2NER

Bases: torch.nn.Module

It defines an I2B2 clinical NER model trained using SciWING

For practitioners, we provide ways to obtain results quickly from a set of texts stored in a file or from a string. If you want to see the demo, head over to our demo site.

interact()
predict_for_file(filename: str) → List[str]
predict_for_text(text: str)
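
A usage sketch; the sentence and filename are illustrative.

    from sciwing.models.i2b2 import I2B2NER

    i2b2_ner = I2B2NER()  # loads the pre-trained clinical NER model

    # Tag the clinical entities in a single sentence
    i2b2_ner.predict_for_text("The patient was started on metformin 500 mg twice daily.")

    # Tag every sentence in a file, one sentence per line
    predictions = i2b2_ner.predict_for_file("clinical_notes.txt")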

SectLabel

class sciwing.models.sectlabel.SectLabel(log_file: str = None, device: str = 'cpu')

Bases: object

dehyphenate(lines: List[str]) → List[str]

Dehyphenates a list of strings

Parameters:lines (List[str]) – A list of hyphenated strings
Returns:A list of dehyphenated strings
Return type:List[str]
extract_abstract_for_file(pdf_filename: pathlib.Path, dehyphenate: bool = True) → str

Extracts the abstract from a pdf using SectLabel. This is the Python programmatic version of the API; the corresponding web APIs can be found in sciwing/api.

Parameters:
  • pdf_filename (pathlib.Path) – The path where the pdf is stored
  • dehyphenate (bool) – Scientific documents are sometimes typeset in two columns, which introduces a lot of hyphenation. If this is True, we remove the hyphens from the extracted text.
Returns:

The abstract of the pdf

Return type:

str
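
A sketch of programmatic abstract extraction; the pdf path is illustrative.

    import pathlib
    from sciwing.models.sectlabel import SectLabel

    sect_label = SectLabel()

    # Extract the abstract of a single pdf (path is illustrative)
    abstract = sect_label.extract_abstract_for_file(
        pdf_filename=pathlib.Path("papers/example_paper.pdf"),
        dehyphenate=True,
    )
    print(abstract)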

extract_abstract_for_folder(foldername: pathlib.Path, dehyphenate=True)

Extracts the abstracts for all the pdf files stored in a folder

Parameters:
  • foldername (pathlib.Path) – The path of the folder containing pdf files
  • dehyphenate (bool) – We will try to dehyphenate the lines. Useful if the pdfs are two-column research papers
Returns:

Writes the abstracts to files

Return type:

None

extract_all_info(pdf_filename: pathlib.Path)

Extracts information from the pdf file.

Parameters:pdf_filename (pathlib.Path) – The path of the pdf file
Returns:A dictionary containing information parsed from the pdf file
Return type:Dict[str, Any]
interact()

Interact with the pre-trained model

predict_for_file(filename: str) → List[str]

Predicts the logical sections for all the sentences in a file, with one sentence per line

Parameters:filename (str) – The path of the file
Returns:The predictions for each line.
Return type:List[str]
predict_for_pdf(pdf_filename: pathlib.Path) → Tuple[List[str], List[str]]

Predicts lines and labels given a pdf filename

Parameters:pdf_filename (pathlib.Path) – The path of the pdf file
Returns:The lines and labels inferred on the file
Return type:List[str], List[str]
predict_for_text(text: str) → str

Predicts the logical section that the line belongs to

Parameters:text (str) – A single line of text
Returns:The logical section of the text.
Return type:str
predict_for_text_batch(texts: List[str]) → List[str]

Predicts the logical section for a batch of text.

Parameters:texts (List[str]) – A batch of text
Returns:A batch of predictions
Return type:List[str]
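
A sketch of line-level predictions with SectLabel; file paths and strings are illustrative.

    import pathlib
    from sciwing.models.sectlabel import SectLabel

    sect_label = SectLabel()

    # Label every line extracted from a pdf
    lines, labels = sect_label.predict_for_pdf(pathlib.Path("papers/example_paper.pdf"))
    for line, label in zip(lines, labels):
        print(f"{label}\t{line}")

    # Label a single line of text
    print(sect_label.predict_for_text("1 Introduction"))

    # Label a batch of lines
    print(sect_label.predict_for_text_batch(["Abstract", "1 Introduction", "References"]))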