sciwing.models

Simple Classifier

class sciwing.models.simpleclassifier.SimpleClassifier(encoder: torch.nn.Module, encoding_dim: int, num_classes: int, classification_layer_bias: bool = True, label_namespace: str = 'label', datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[torch.device, str] = torch.device('cpu'))

Bases: torch.nn.Module, sciwing.utils.class_nursery.ClassNursery

__init__(encoder: torch.nn.Module, encoding_dim: int, num_classes: int, classification_layer_bias: bool = True, label_namespace: str = 'label', datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[torch.device, str] = torch.device('cpu'))

SimpleClassifier is a linear classifier head on top of any encoder

Parameters:
  • encoder (nn.Module) – Any encoder that takes in lines and produces a single vector for every line.
  • encoding_dim (int) – The encoding dimension
  • num_classes (int) – The number of classes
  • classification_layer_bias (bool) – Whether to add a bias to the classification layer. This is set to False only for debugging purposes.
  • label_namespace (str) – The namespace used for labels in the dataset
  • datasets_manager (DatasetsManager) – The datasets manager for the model
  • device (torch.device) – The device on which the model is run
forward(lines: List[sciwing.data.line.Line], labels: List[sciwing.data.label.Label] = None, is_training: bool = False, is_validation: bool = False, is_test: bool = False) → Dict[str, Any]
Parameters:
  • lines (List[Line]) – A list of Line instances from the dataset that will be passed on to the encoder
  • labels (List[Label]) – A list of labels for every instance
  • is_training (bool) – running forward on training dataset?
  • is_validation (bool) – running forward on validation dataset?
  • is_test (bool) – running forward on test dataset?
Returns:

logits: torch.FloatTensor

Un-normalized probabilities over all the classes of the shape [batch_size, num_classes]

normalized_probs: torch.FloatTensor

Normalized probabilities over all the classes of the shape [batch_size, num_classes]

loss: float

The loss value if this is a training or validation forward pass. No loss is returned for the test dataset.

Return type:

Dict[str, Any]
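
A minimal usage sketch follows. The WordEmbedder and BOW_Encoder constructions, the embedding dimension, and the data_manager, lines and labels variables are assumptions for illustration; substitute the encoder and DatasetsManager you actually use.

    from sciwing.modules.bow_encoder import BOW_Encoder
    from sciwing.modules.embedders.word_embedder import WordEmbedder
    from sciwing.models.simpleclassifier import SimpleClassifier

    # Assumed components: a word embedder and a bag-of-words encoder that
    # produce one vector per Line. Check the module signatures in your version.
    embedder = WordEmbedder(embedding_type="glove_6B_50")
    encoder = BOW_Encoder(embedder=embedder)

    classifier = SimpleClassifier(
        encoder=encoder,
        encoding_dim=50,                 # must match the encoder's output dimension
        num_classes=3,
        classification_layer_bias=True,
        datasets_manager=data_manager,   # a DatasetsManager prepared for your dataset
    )

    # lines: List[Line], labels: List[Label] come from your dataset
    output = classifier(lines=lines, labels=labels, is_training=True)
    print(output["normalized_probs"].shape)  # [batch_size, num_classes]
    print(output["loss"])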

Simple Tagger

class sciwing.models.simple_tagger.SimpleTagger(rnn2seqencoder: sciwing.modules.lstm2seqencoder.Lstm2SeqEncoder, encoding_dim: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: torch.device = torch.device('cpu'), label_namespace: str = 'seq_label')

Bases: torch.nn.Module, sciwing.utils.class_nursery.ClassNursery

PyTorch module for sequence tagging (used, for example, by Neural Parscit)

__init__(rnn2seqencoder: sciwing.modules.lstm2seqencoder.Lstm2SeqEncoder, encoding_dim: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager, device: torch.device = torch.device('cpu'), label_namespace: str = 'seq_label')
Parameters:
  • rnn2seqencoder (Lstm2SeqEncoder) – Lstm2SeqEncoder that encodes a set of instances to a sequence of hidden states
  • encoding_dim (int) – Hidden dimension of the lstm2seq encoder
  • datasets_manager (DatasetsManager) – The datasets manager for the model
  • device (torch.device) – The device on which the model is run
  • label_namespace (str) – The namespace used for the sequence labels in the dataset
forward(lines: List[sciwing.data.line.Line], labels: List[sciwing.data.seq_label.SeqLabel] = None, is_training: bool = False, is_validation: bool = False, is_test: bool = False)
Parameters:
  • lines (List[Line]) – A list of lines
  • labels (List[SeqLabel]) – A list of sequence labels
  • is_training (bool) – running forward on training dataset?
  • is_validation (bool) – running forward on validation dataset?
  • is_test (bool) – running forward on test dataset?
Returns:

logits: torch.FloatTensor

Un-normalized probabilities over all the classes of the shape [batch_size, num_classes]

predicted_tags: List[List[int]]

Set of predicted tags for the batch

loss: float

The loss value if this is a training or validation forward pass. No loss is returned for the test dataset.

Return type:

Dict[str, Any]
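
A usage sketch, under the assumption that an embedder, a data_manager, and lines/seq_labels are already prepared; the Lstm2SeqEncoder keyword arguments below are illustrative and may differ in your version.

    from sciwing.modules.lstm2seqencoder import Lstm2SeqEncoder
    from sciwing.models.simple_tagger import SimpleTagger

    # Assumed: `embedder` and `data_manager` are built elsewhere; the encoder
    # keyword arguments are illustrative.
    encoder = Lstm2SeqEncoder(embedder=embedder, hidden_dim=128, bidirectional=True)

    tagger = SimpleTagger(
        rnn2seqencoder=encoder,
        encoding_dim=256,            # 2 * hidden_dim for a bidirectional encoder
        datasets_manager=data_manager,
    )

    output = tagger(lines=lines, labels=seq_labels, is_training=True)
    print(output["predicted_tags"])  # one list of tag indices per input line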

Neural Parscit

class sciwing.models.neural_parscit.NeuralParscit(device: Union[torch.device, int] = -1)

Bases: torch.nn.Module

It defines a Neural Parscit model. The model is used for citation string parsing. It lets you use a pre-trained model whose architecture is fixed and which was trained by SciWING. You can also fine-tune the model on your own dataset.

For practitioners, we provide ways to obtain results quickly from a set of citations stored in a file or from a string. If you want to see the demo, head over to our demo site.

interact()

Interact with the pretrained model. You can also interact from the command line using sciwing interact neural-parscit

predict_for_file(filename: str) → List[str]

Parse the references in a file where every line is a reference

Parameters:filename (str) – The filename where the references are stored
Returns:A list of parsed tags
Return type:List[str]
predict_for_text(text: str, show=True) → str

Parse the citation string for the given text

Parameters:
  • text (str) – reference string to parse
  • show (bool) – If True, we print a stylized string in which different tags are shown in different colors. If False, we do not print the stylized string.
Returns:

The parsed citation string

Return type:

str
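
A quick usage sketch; the reference string and filename are illustrative.

    from sciwing.models.neural_parscit import NeuralParscit

    parscit = NeuralParscit()  # loads the pre-trained Neural Parscit model

    # Parse a single reference string (printed with colored tags when show=True)
    parscit.predict_for_text(
        "Devlin J, Chang M, Lee K, Toutanova K. BERT: Pre-training of deep "
        "bidirectional transformers for language understanding. NAACL 2019."
    )

    # Parse a file with one reference per line (path is illustrative)
    tags = parscit.predict_for_file("references.txt")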

Citation Intent Classification

class sciwing.models.citation_intent_clf.CitationIntentClassification

Bases: torch.nn.Module

interact()

Interact with the pretrained model

predict_for_file(filename: str) → List[str]

Predict the intents for all the citations in the file. The citations should be stored one per line.

Parameters:filename (str) – The filename where the citations are stored
Returns:The predicted intent for each citation line
Return type:List[str]
predict_for_text(text: str) → str

Predict the intent for citation

Parameters:text (str) – The citation string
Returns:The predicted label for the citation
Return type:str
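
A usage sketch; the citation string and filename are illustrative.

    from sciwing.models.citation_intent_clf import CitationIntentClassification

    clf = CitationIntentClassification()  # loads the pre-trained model

    # Predict the intent of a single citation
    intent = clf.predict_for_text("We follow the evaluation setup of Smith et al. (2018).")
    print(intent)

    # Predict intents for a file with one citation per line
    intents = clf.predict_for_file("citations.txt")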

Generic Section Header Classification

class sciwing.models.generic_sect.GenericSect

Bases: object

interact()

Interact with the pretrained model

predict_for_file(filename: str) → List[str]

Make predictions for every line in the file

Parameters:filename (str) – The filename where section headers are stored one per line
Returns:A list of predictions
Return type:List[str]
predict_for_text(text: str, show=True) → str

Predicts the generic section headers of the text

Parameters:
  • text (str) – The section header string to be normalized
  • show (bool) – If True then we print the prediction.
Returns:

The prediction for the section header

Return type:

str
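
A usage sketch; the header string and filename are illustrative.

    from sciwing.models.generic_sect import GenericSect

    generic_sect = GenericSect()  # loads the pre-trained model

    # Normalize a single section header
    label = generic_sect.predict_for_text("4. Experimental Results", show=True)
    print(label)

    # Normalize every header stored one per line in a file
    labels = generic_sect.predict_for_file("section_headers.txt")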

I2B2 NER

class sciwing.models.i2b2.I2B2NER

Bases: torch.nn.Module

It defines an I2B2 clinical NER model trained using SciWING

For practitioners, we provide ways to obtain results quickly from a set of texts stored in a file or from a string. If you want to see the demo, head over to our demo site.

interact()
predict_for_file(filename: str) → List[str]
predict_for_text(text: str)
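
A usage sketch; the sentence and filename are illustrative.

    from sciwing.models.i2b2 import I2B2NER

    i2b2_ner = I2B2NER()  # loads the pre-trained clinical NER model

    # Tag the clinical entities in a single sentence
    i2b2_ner.predict_for_text("The patient was started on metformin 500 mg twice daily.")

    # Tag every sentence in a file, one sentence per line
    predictions = i2b2_ner.predict_for_file("clinical_notes.txt")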

SectLabel

class sciwing.models.sectlabel.SectLabel(log_file: str = None, device: str = 'cpu')

Bases: object

dehyphenate(lines: List[str]) → List[str]

Dehyphenates a list of strings

Parameters:lines (List[str]) – A list of hyphenated strings
Returns:A list of dehyphenated strings
Return type:List[str]
extract_abstract_for_file(pdf_filename: pathlib.Path, dehyphenate: bool = True) → str

Extracts the abstract from a pdf using SectLabel. This is the Python programmatic version of the API; the corresponding web APIs can be found in sciwing/api.

Parameters:
  • pdf_filename (pathlib.Path) – The path where the pdf is stored
  • dehyphenate (bool) – Scientific documents are sometimes typeset in two columns, which introduces a lot of hyphenation. If this is True, we remove the hyphens from the extracted text.
Returns:

The abstract of the pdf

Return type:

str
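
A sketch of programmatic abstract extraction; the pdf path is illustrative.

    import pathlib
    from sciwing.models.sectlabel import SectLabel

    sect_label = SectLabel()

    # Extract the abstract of a single pdf (path is illustrative)
    abstract = sect_label.extract_abstract_for_file(
        pdf_filename=pathlib.Path("papers/example_paper.pdf"),
        dehyphenate=True,
    )
    print(abstract)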

extract_abstract_for_folder(foldername: pathlib.Path, dehyphenate=True)

Extracts the abstracts for all the pdf files stored in a folder

Parameters:
  • foldername (pathlib.Path) – The path of the folder containing pdf files
  • dehyphenate (bool) – We will try to dehyphenate the lines. Useful if the pdfs are two-column research papers
Returns:

Writes the abstracts to files

Return type:

None

extract_all_info(pdf_filename: pathlib.Path)

Extracts information from the pdf file.

Parameters:pdf_filename (pathlib.Path) – The path of the pdf file
Returns:A dictionary containing information parsed from the pdf file
Return type:Dict[str, Any]
interact()

Interact with the pre-trained model

predict_for_file(filename: str) → List[str]

Predicts the logical sections for all the sentences in a file, with one sentence per line

Parameters:filename (str) – The path of the file
Returns:The predictions for each line.
Return type:List[str]
predict_for_pdf(pdf_filename: pathlib.Path) → Tuple[List[str], List[str]]

Predicts lines and labels given a pdf filename

Parameters:pdf_filename (pathlib.Path) – The path of the pdf file
Returns:The lines and labels inferred on the file
Return type:List[str], List[str]
predict_for_text(text: str) → str

Predicts the logical section that the line belongs to

Parameters:text (str) – A single line of text
Returns:The logical section of the text.
Return type:str
predict_for_text_batch(texts: List[str]) → List[str]

Predicts the logical section for a batch of text.

Parameters:texts (List[str]) – A batch of text
Returns:A batch of predictions
Return type:List[str]
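
A sketch of line-level predictions with SectLabel; file paths and strings are illustrative.

    import pathlib
    from sciwing.models.sectlabel import SectLabel

    sect_label = SectLabel()

    # Label every line extracted from a pdf
    lines, labels = sect_label.predict_for_pdf(pathlib.Path("papers/example_paper.pdf"))
    for line, label in zip(lines, labels):
        print(f"{label}\t{line}")

    # Label a single line of text
    print(sect_label.predict_for_text("1 Introduction"))

    # Label a batch of lines
    print(sect_label.predict_for_text_batch(["Abstract", "1 Introduction", "References"]))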