sciwing.utils¶

Amazon S3 Utils¶

class sciwing.utils.amazon_s3.S3Util(aws_cred_config_json_filename: str)¶

Bases: object

__init__(aws_cred_config_json_filename: str)¶

Some utilities that would be useful to upload folders/models to s3

Parameters:

aws_cred_config_json_filename (str) –

You need to instantiate this class with an AWS configuration JSON file

The file should contain the following keys and values

aws_access_key_id : str: The access key id for the AWS account that you have
aws_access_secret : str: The access secret
region : str: The region in which your bucket is present
parsect_bucket_name : str: The name of the bucket where all the models/experiments will be stored
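
As a minimal sketch, a credentials file with these keys could be written and checked as shown here. The key names come from the list above; every value, the helper write_aws_config and the file name aws_cred_config.json are placeholders chosen for illustration, not part of SciWING.

```python
import json
import tempfile
from pathlib import Path

# Keys are the ones listed above; all values are placeholders.
aws_config = {
    "aws_access_key_id": "YOUR_ACCESS_KEY_ID",
    "aws_access_secret": "YOUR_ACCESS_SECRET",
    "region": "us-east-1",
    "parsect_bucket_name": "your-bucket-name",
}

REQUIRED_KEYS = {
    "aws_access_key_id",
    "aws_access_secret",
    "region",
    "parsect_bucket_name",
}


def write_aws_config(config: dict, filename: Path) -> Path:
    """Write the configuration to a JSON file after checking its keys."""
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    filename.write_text(json.dumps(config, indent=2))
    return filename


# Hypothetical file name inside a temporary directory.
config_path = write_aws_config(
    aws_config, Path(tempfile.mkdtemp()) / "aws_cred_config.json"
)
```

The resulting path is the kind of value that would be passed as aws_cred_config_json_filename when constructing S3Util.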

download_file(filename_s3: str, local_filename: str)¶

Downloads a file from s3

Parameters:

filename_s3 (str) – A filename in s3 that needs to be downloaded
local_filename (str) – The local filename that will be used

download_folder(folder_name_s3: str, download_only_best_checkpoint: bool = False, chkpoints_foldername: str = 'checkpoints', best_model_filename='best_model.pt', output_dir: str = '/home/docs/.sciwing.output_cache')¶

Downloads a folder from s3 recursively

Parameters:

folder_name_s3 (str) – The name of the folder in s3
download_only_best_checkpoint (bool) – If the folder being downloaded is an experiment folder, then you can download only the best model checkpoints for running test or inference
chkpoints_foldername (str) – The name of the checkpoints folder where the best model parameters are stored
best_model_filename (str) – The name of the file where the best model parameters are stored

get_client()¶

Returns boto3 client

Returns:	The client object that manages all the AWS operations. The client is the low-level access to the connection with s3
Return type:	boto3.client

get_resource()¶

Returns a high level manager for the aws bucket

Returns:	Resource that manages connections with s3
Return type:	boto3.resource

load_credentials() → NamedTuple¶

Read the credentials from the json file

Returns:	a named tuple with access_key, access_secret, region and bucket_name as the keys and the corresponding values filled in
Return type:	NamedTuple

search_folders_with(pattern)¶

Searches for folders in the s3 bucket with a specific pattern

Parameters:	pattern (str) – A regex pattern
Returns:	The list of foldernames that match the pattern
Return type:	List[str]
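
To illustrate what matching folder names against a regex pattern looks like, here is a small sketch that filters a plain list of names. The helper filter_folders and the folder names are hypothetical, and the use of re.search (match anywhere in the name) is an assumption; the documentation above does not say how the library anchors the match.

```python
import re
from typing import List


def filter_folders(folder_names: List[str], pattern: str) -> List[str]:
    """Return the folder names that match a regex pattern.

    Assumption: re.search semantics, so the pattern may match anywhere
    in the name unless it is anchored explicitly.
    """
    compiled = re.compile(pattern)
    return [name for name in folder_names if compiled.search(name)]


# Hypothetical folder names, as they might appear in a bucket.
folders = ["bow_random_emb_lc_10e", "bow_elmo_emb_lc_20e", "lstm_crf_parscit"]
matching = filter_folders(folders, r"^bow_.*_emb")
```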

upload_file(filename: str, obj_name: str = None)¶

Parameters:

filename (str) – The filename in the local directory that needs to be uploaded to s3
obj_name (str) – The filename to be used in s3 bucket. If None then obj_name in s3 will be the same as the filename

upload_folder(folder_name: str, base_folder_name: str)¶

Recursively uploads a folder to s3

Parameters:

folder_name (str) – The name of the local folder that is uploaded
base_folder_name (str) – The name of the folder from which the folder currently being uploaded stems. This is needed to associate files and directories with their place in the hierarchy within the folder
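
The role of a base folder in a recursive upload can be sketched without touching S3 at all. The helper plan_uploads below is hypothetical: it only lists which local file would map to which object name, and the choice to keep the base folder's own name at the front of every object name is an assumption, not documented behaviour of upload_folder.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def plan_uploads(folder_name: str, base_folder_name: str) -> List[Tuple[str, str]]:
    """List (local path, object name) pairs for every file under folder_name.

    Object names are taken relative to the parent of base_folder_name, so
    the base folder's own name stays at the front of every object name.
    """
    base = Path(base_folder_name)
    pairs = []
    for path in sorted(Path(folder_name).rglob("*")):
        if path.is_file():
            obj_name = path.relative_to(base.parent).as_posix()
            pairs.append((str(path), obj_name))
    return pairs


# Build a tiny hypothetical experiment folder to walk over.
root = Path(tempfile.mkdtemp()) / "experiment"
(root / "checkpoints").mkdir(parents=True)
(root / "config.json").write_text("{}")
(root / "checkpoints" / "best_model.pt").write_text("weights")

uploads = plan_uploads(str(root), str(root))
object_names = [obj_name for _, obj_name in uploads]
```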

Class Nursery¶

class sciwing.utils.class_nursery.ClassNursery¶

Bases: object

ClassNursery is the place where all the classes in SciWING are nursed

SciWING needs to get a handle on the different classes that are being used. This is useful, for example, when the appropriate classes have to be instantiated as experiments are run from a TOML file.

This uses a Python 3.6 feature called __init_subclass__ that simplifies class creation. Whenever ClassNursery is listed as the parent class of a class, __init_subclass__ is called. In SciWING we use it as a plugin registry that stores the mapping between each class and its module.

class_nursery = {'Adam': <sphinx.ext.autodoc.importer._MockObject object>, 'BOW_Encoder': 'sciwing.modules.bow_encoder', 'BertEmbedder': 'sciwing.modules.embedders.bert_embedder', 'BowElmoEmbedder': 'sciwing.modules.embedders.bow_elmo_embedder', 'CharEmbedder': 'sciwing.modules.embedders.char_embedder', 'CharLSTMEncoder': 'sciwing.modules.charlstm_encoder', 'CoNLLDatasetManager': 'sciwing.datasets.seq_labeling.conll_dataset', 'ConcatEmbedders': 'sciwing.modules.embedders.concat_embedders', 'ElmoEmbedder': 'sciwing.modules.embedders.elmo_embedder', 'Engine': 'sciwing.engine.engine', 'FlairEmbedder': 'sciwing.modules.embedders.flair_embedder', 'LSTM2VecEncoder': 'sciwing.modules.lstm2vecencoder', 'Lstm2SeqEncoder': 'sciwing.modules.lstm2seqencoder', 'PrecisionRecallFMeasure': 'sciwing.metrics.precision_recall_fmeasure', 'RnnSeqCrfTagger': 'sciwing.models.rnn_seq_crf_tagger', 'SGD': <sphinx.ext.autodoc.importer._MockObject object>, 'SimpleClassifier': 'sciwing.models.simpleclassifier', 'SimpleTagger': 'sciwing.models.simple_tagger', 'TextClassificationDatasetManager': 'sciwing.datasets.classification.text_classification_dataset', 'TokenClassificationAccuracy': 'sciwing.metrics.token_cls_accuracy', 'TrainableWordEmbedder': 'sciwing.modules.embedders.trainable_word_embedder', 'WordEmbedder': 'sciwing.modules.embedders.word_embedder'}¶
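
The registry mechanism described above can be sketched in a few lines. This is an illustration of the __init_subclass__ pattern only: the names Nursery and DuplicateClassError and the two example subclasses are stand-ins invented for the sketch, not SciWING's own code.

```python
class DuplicateClassError(KeyError):
    """Raised when two registered classes share a name (hypothetical)."""


class Nursery:
    """A stand-in for a plugin registry built on __init_subclass__."""

    registry = {}

    def __init_subclass__(cls, **kwargs):
        # Called automatically whenever a class lists Nursery as a parent.
        super().__init_subclass__(**kwargs)
        if cls.__name__ in Nursery.registry:
            raise DuplicateClassError(f"{cls.__name__} is already registered")
        # Map the class name to the module that defines it.
        Nursery.registry[cls.__name__] = cls.__module__


class BowEncoder(Nursery):
    pass


class SimpleClassifier(Nursery):
    pass
```

Defining a second class named BowEncoder that also inherits from Nursery would raise DuplicateClassError in this sketch, mirroring the rule that a registry cannot hold two classes of the same name.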

Common Utils¶

sciwing.utils.common.cached_path(path: Union[pathlib.Path, str], url: str, unzip=True) → pathlib.Path¶

sciwing.utils.common.chunks(seq, n)¶

Yield successive n-sized chunks from seq.
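
A generator with this behaviour can be written as follows. This is a sketch of the idea under the name chunks_sketch, not necessarily the library's implementation; in particular, letting the last chunk be shorter than n is an assumption.

```python
from typing import Iterator, Sequence


def chunks_sketch(seq: Sequence, n: int) -> Iterator[Sequence]:
    """Yield successive n-sized chunks from seq; the last may be shorter."""
    for start in range(0, len(seq), n):
        yield seq[start : start + n]


batches = list(chunks_sketch([1, 2, 3, 4, 5], 2))
```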

sciwing.utils.common.convert_generic_sect_to_json(filename: str) → Dict[str, Any]¶

Converts the Generic sect data file into more readable json format

Parameters:	filename (str) – The sectlabel file name available at WING-NUS website
Returns:

text: The text of the line
label: The label of the file
file_no: A unique file number
line_count: A line count within the file

Return type:

Dict[str, Any]

sciwing.utils.common.convert_generic_sect_to_sciwing_clf_format(filename: str, out_dir: str)¶

Converts the generic sect original file to the sciwing classification format

Parameters:

filename (str) – The path of the file where the original generic section classification file is stored
out_dir (str) – The output path where the train, dev and test files are written
Returns:
Return type:	None

sciwing.utils.common.convert_parscit_to_conll(parscit_train_filepath: pathlib.Path) → List[Dict[str, Any]]¶

Convert the parscit data available at “https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/parsCit.train.data” to a dummy CONLL version. This is done so that it can be used with AllenNLP's built-in data reader called the conll2003 dataset reader.

Parameters:	parscit_train_filepath (pathlib.Path) – The path where the train file path is stored

sciwing.utils.common.convert_parscit_to_sciwing_seqlabel_format(parscit_train_filepath: pathlib.Path, output_dir: str)¶

Convert the parscit data available at “https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/parsCit.train.data” to the format required for sciwing sequential labelling

Parameters:

parscit_train_filepath (pathlib.Path) – The local path where the files are stored
output_dir (str) – The output dir where the train, dev and test files will be written

sciwing.utils.common.convert_sectlabel_to_json(filename: str) → Dict[KT, VT]¶

Converts the secthead file into more readable json format

Parameters:	filename (str) – The sectlabel file name available at WING-NUS website
Returns:

text: The text of the line
label: The label of the file
file_no: A unique file number
line_count: A line count within the file

Return type:

Dict[str, Any]

sciwing.utils.common.convert_sectlabel_to_sciwing_clf_format(filename: str, out_dir: str)¶

Writes the file in the format required for sciwing text classification dataset

Parameters:

filename (str) – The path of the sectlabel original format file.
out_dir (str) – The path where the new files will be written

sciwing.utils.common.create_class(classname: str, module_name: str) → type¶

Given the classname and module, creates a class object and returns it

Parameters:

classname (str) – Class name to import
module_name (str) – The module in which the class is present
Returns:
Return type:	type
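
Turning a class name and a module name into a class object is commonly done with importlib. The sketch below shows that approach under the hypothetical name create_class_sketch; whether create_class is implemented this way is an assumption.

```python
import importlib


def create_class_sketch(classname: str, module_name: str) -> type:
    """Import module_name and return the attribute named classname."""
    module = importlib.import_module(module_name)
    return getattr(module, classname)


# A standard-library class is used here so the example runs anywhere.
ordered_dict_cls = create_class_sketch("OrderedDict", "collections")
instance = ordered_dict_cls(a=1)
```

Combined with a name-to-module mapping such as the class_nursery dictionary, this kind of lookup lets a configuration file refer to classes by name only.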

sciwing.utils.common.download_file(url: str, dest_filename: str) → None¶

Download a file from the given url

Parameters:

url (str) – The url from which the file will be downloaded
dest_filename (str) – The destination filename

sciwing.utils.common.extract_tar(filename: str, destination_dir: str, mode='r')¶

Extracts tar, targz and other files

Parameters:

filename (str) – The tar zipped file
destination_dir (str) – The destination directory in which the files should be placed
mode (str) – A valid tar mode. You can refer to https://docs.python.org/3/library/tarfile.html for the different modes.

sciwing.utils.common.extract_zip(filename: str, destination_dir: str)¶

Extracts a zipped file

Parameters:

filename (str) – The zipped filename
destination_dir (str) – The directory where the extracted files will be placed

sciwing.utils.common.flatten(list_items: List[Any]) → List[Any]¶

Flattens an arbitrarily long nesting of lists

Parameters:	list_items (List[Any]) – It can be an arbitrarily long nesting of lists
Returns:	Flattened list
Return type:	List
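
Flattening an arbitrarily deep nesting of lists is naturally recursive. A minimal sketch, under the hypothetical name flatten_sketch and assuming that only list instances are unpacked (tuples and strings are kept as items):

```python
from typing import Any, List


def flatten_sketch(list_items: List[Any]) -> List[Any]:
    """Recursively unpack nested lists into one flat list."""
    flat = []
    for item in list_items:
        if isinstance(item, list):
            flat.extend(flatten_sketch(item))
        else:
            flat.append(item)
    return flat


nested = [1, [2, [3, [4, 5]], 6], [[7]]]
flattened = flatten_sketch(nested)
```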

sciwing.utils.common.get_system_mem_in_gb()¶

Returns the total system memory in GB

Returns:	Memory size in GB
Return type:	float

sciwing.utils.common.get_train_dev_test_stratified_split(lines: List[str], labels: List[str], train_split: float = 0.8, dev_split: float = 0.1, test_split: float = 0.1, random_state: int = 1729) -> ((typing.List[str], typing.List[str]), (typing.List[str], typing.List[str]), (typing.List[str], typing.List[str]))¶

Splits the lines and labels into train, dev and test splits using a stratified random shuffle

Parameters:

lines (List[str]) – A list of lines
labels (List[str]) – A list of labels
train_split (float) – The proportion of lines to be used for training
dev_split (float) – The proportion of lines to be used for validation
test_split (float) – The proportion of lines to be used for testing
random_state (int) – The seed to be used for randomization, which is useful for reproducing the same splits. Passing None will cause the random number generator to be the RandomState used by np.random
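
The idea of a stratified split can be sketched with the standard library alone: shuffle the lines of each label separately, then cut each group by the requested proportions so that every split keeps roughly the same label distribution. The function stratified_split_sketch is hypothetical and gives the test split whatever remains after train and dev; the library may well rely on a different routine and produce different splits for the same seed.

```python
import random
from collections import defaultdict


def stratified_split_sketch(lines, labels, train_split=0.8, dev_split=0.1,
                            random_state=1729):
    """Split per label so each split keeps roughly the label proportions."""
    rng = random.Random(random_state)
    by_label = defaultdict(list)
    for line, label in zip(lines, labels):
        by_label[label].append(line)

    # Each split is a (lines, labels) pair, as in the signature above.
    train, dev, test = ([], []), ([], []), ([], [])
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * train_split)
        n_dev = int(len(group) * dev_split)
        parts = (
            group[:n_train],
            group[n_train : n_train + n_dev],
            group[n_train + n_dev :],  # the remainder goes to test
        )
        for (split_lines, split_labels), part in zip((train, dev, test), parts):
            split_lines.extend(part)
            split_labels.extend([label] * len(part))
    return train, dev, test


# Hypothetical data: 10 lines for each of two labels.
lines = [f"line {i}" for i in range(20)]
labels = ["title"] * 10 + ["abstract"] * 10
(train_lines, train_labels), (dev_lines, dev_labels), (test_lines, test_labels) = (
    stratified_split_sketch(lines, labels)
)
```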

sciwing.utils.common.merge_dictionaries_with_sum(a: Dict[KT, VT], b: Dict[KT, VT]) → Dict[KT, VT]¶

sciwing.utils.common.pack_to_length(tokenized_text: List[str], max_length: int, pad_token: str = '<PAD>', add_start_end_token: bool = False, start_token: str = '<SOS>', end_token: str = '<EOS>') → List[str]¶

Packs tokenized text to maximum length

Parameters:

tokenized_text (List[str]) – A list of tokens
max_length (int) – The max length to pack to
pad_token (str) – The pad token to be used for the padding
add_start_end_token (bool) – Whether to add the start and end token to every sentence while packing
start_token (str) – The start token to be used if add_start_end_token is True.
end_token (str) – The end token to be used if add_start_end_token is True
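
Packing amounts to truncating sequences that are too long and padding those that are too short. The sketch below, named pack_to_length_sketch to mark it as an illustration, assumes that the start and end tokens count toward max_length; the documentation above does not state how the library treats them.

```python
from typing import List


def pack_to_length_sketch(
    tokenized_text: List[str],
    max_length: int,
    pad_token: str = "<PAD>",
    add_start_end_token: bool = False,
    start_token: str = "<SOS>",
    end_token: str = "<EOS>",
) -> List[str]:
    """Truncate or pad a token list so that it has exactly max_length items."""
    tokens = list(tokenized_text)  # copy, so the caller's list is untouched
    if add_start_end_token:
        # Assumption: start and end tokens count toward max_length.
        max_length = max(max_length, 2)
        tokens = [start_token] + tokens[: max_length - 2] + [end_token]
    else:
        tokens = tokens[:max_length]
    return tokens + [pad_token] * (max_length - len(tokens))


padded = pack_to_length_sketch(["a", "b"], 4)
truncated = pack_to_length_sketch(["a", "b", "c", "d", "e"], 3)
wrapped = pack_to_length_sketch(["a", "b", "c"], 4, add_start_end_token=True)
```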

sciwing.utils.common.pairwise(iterable: Iterable[T_co]) → Iterator[T_co]¶

Return the overlapping pairwise elements of the iterable

Parameters:	iterable (Iterable) – Anything that can be iterated
Returns:	Iterator over the paired sequence
Return type:	Iterator
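
Overlapping pairs are a standard itertools recipe: duplicate the iterator, advance one copy by a single element and zip the two. The sketch below uses the hypothetical name pairwise_sketch, and the tag strings are made-up sample data.

```python
from itertools import tee
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def pairwise_sketch(iterable: Iterable[T]) -> Iterator[Tuple[T, T]]:
    """Return (s0, s1), (s1, s2), (s2, s3), ... for the given iterable."""
    first, second = tee(iterable)
    next(second, None)  # advance the second iterator by one element
    return zip(first, second)


pairs = list(pairwise_sketch(["B-Task", "I-Task", "L-Task", "O"]))
```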

sciwing.utils.common.write_cora_to_conll_file(cora_conll_filepath: pathlib.Path) → None¶

Writes the cora file that is available at https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/cora.train to CONLL format

Parameters:	cora_conll_filepath (pathlib.Path) – The destination filepath where the CORA is converted to CONLL format

sciwing.utils.common.write_nfold_parscit_train_test(parscit_train_filepath: pathlib.Path, output_train_filepath: pathlib.Path, output_test_filepath: pathlib.Path, nsplits: int = 2) → bool¶

Convert the parscit train folder into different folds. This is useful for n-fold cross validation on the dataset. This method can be iterated over to get all the different folds of the data contained in the parscit_train_filepath

Parameters:

parscit_train_filepath (pathlib.Path) – The path where the Parscit file is stored. The file is available at https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/cora.train
output_train_filepath (pathlib.Path) – The path where the train fold of the dataset will be stored
output_test_filepath (pathlib.Path) – The path where the test fold of the dataset will be stored
nsplits (int) – The number of splits in the dataset.
Returns:	Indicates whether the particular fold has been written
Return type:	bool

sciwing.utils.common.write_parscit_to_conll_file(parscit_conll_filepath: pathlib.Path) → None¶

Write Parscit file to CONLL file format

Parameters:	parscit_conll_filepath (pathlib.Path) – The destination file where the parscit data is written to

Custom Spacy Tokenizers¶

This module implements custom spacy tokenizers if needed. This can be useful for custom tokenization that is required for the scientific domain.

class sciwing.utils.custom_spacy_tokenizers.CustomSpacyWhiteSpaceTokenizer(vocab)¶

Bases: object

__init__(vocab)¶

White space tokenizer tokenizes the word according to spaces.

Parameters:	vocab (nlp.vocab) – Spacy vocab object

Custom Exceptions¶

exception sciwing.utils.exceptions.ClassInNurseryError¶

Bases: KeyError

The ClassNursery cannot have two classes of the same name. This error is raised when that happens

exception sciwing.utils.exceptions.DatasetPresentError(message: str)¶

Bases: Exception

exception sciwing.utils.exceptions.TOMLConfigurationError(message: str)¶

Bases: Exception

This error is raised for illegal configuration of TOML

Science IE Data Utils¶

class sciwing.utils.science_ie_data_utils.ScienceIEDataUtils(folderpath: pathlib.Path, ignore_warnings=False)¶

Bases: object

Science-IE is a SemEval Task that is aimed at extracting entities from scientific articles. This class is a utility for various operations on the competition's data files.

__init__(folderpath: pathlib.Path, ignore_warnings=False)¶

Given the folderpath where the ScienceIE data is stored, this class provides various utilities. For more information on the dataset you can refer to https://scienceie.github.io/

Parameters:

folderpath (pathlib.Path) – The path where the ScienceIEDataset is stored
ignore_warnings (bool) – If True, then all the warnings generated by this class for inconsistencies in the data are ignored

static _form_ann_line(idx: str, char_offset: Tuple[int, int, str], tag_name: str, doc: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c14681c50>)¶

Forms an ANN line that can be used to write the ANN files for CoNLL format

Parameters:

idx (str) – The index for the entity being written
char_offset (Tuple[int, int, str]) – The start, end, tag for the line
tag_name (str) – The tag to be used; one of `[Task, Process, Material]`
doc (str) – Spacy doc to query the appropriate characters
Returns:	An ANN line that is formed.
Return type:	str

_get_annotations_for_entity(file_id: str, entity: str) → List[Dict[str, Any]]¶

Parameters:

file_id (str) – A ScienceIE file id
entity (str) – One of [Task, Process, Material]

Returns:

A list of annotations where every annotation is

start: The start character index of the annotation
end: The end character index of the annotation
words: The set of words between the start and the end index
entity_number: The entity number
tag: The tag associated with the set of words

Return type:

List[Dict[str, Any]]

_get_bilou_lines_for_entity(text: str, annotations: List[Dict[str, Any]], entity: str) → List[str]¶

The list of BILOU lines for entity

Parameters:

text (str) – The text for which BILOU lines need to be returned
annotations (List[Dict[str, Any]]) – The list of annotations where every annotation is a dictionary
entity (str) – A particular entity for which the BILOU lines are returned
Returns:	The list of BILOU tagged lines, where every line is a `word, tag, tag, tag` where the tag is decided by the entity.
Return type:	List[str]
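
BILOU tagging marks the Beginning, Inside and Last token of a multi-token entity, a Unit-length entity, and everything Outside an entity. The sketch below derives such tags from character-offset annotations with start and end keys, as in the annotation format described above. It is hypothetical in several respects: it splits on whitespace rather than using spaCy, it emits a single tag per line rather than the `word, tag, tag, tag` layout, and the sample text and offsets are made up.

```python
import re
from typing import Any, Dict, List


def bilou_lines_sketch(
    text: str, annotations: List[Dict[str, Any]], entity: str
) -> List[str]:
    """Return 'word tag' lines; tokens are whitespace-separated (an assumption)."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for ann in annotations:
        # Indices of tokens lying fully inside the annotated character span.
        inside = [
            i
            for i, (_, start, end) in enumerate(tokens)
            if start >= ann["start"] and end <= ann["end"]
        ]
        if len(inside) == 1:
            tags[inside[0]] = f"U-{entity}"
        elif len(inside) > 1:
            tags[inside[0]] = f"B-{entity}"
            tags[inside[-1]] = f"L-{entity}"
            for i in inside[1:-1]:
                tags[i] = f"I-{entity}"
    return [f"{word} {tag}" for (word, _, _), tag in zip(tokens, tags)]


text = "We study named entity recognition with transformers"
annotations = [{"start": 9, "end": 33}, {"start": 39, "end": 51}]
bilou_lines = bilou_lines_sketch(text, annotations, "Task")
```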

get_bilou_lines_for_entity(file_id: str, entity: str)¶

Writes conll file for the entity type

Parameters:

file_id (str) – File id of the annotation file
entity (str) – The entity for which conll file is written
Returns:	The list of BILOU lines for the entity
Return type:	List[str]

get_file_ids() → List[str]¶

Get all the file ids from the folder

Returns:	A List of File ids in the folder
Return type:	List[str]

get_sentence_wise_bilou_lines(file_id: str, entity_type: str) → List[List[str]]¶

Get BILOU lines sentence-wise

Parameters:

file_id (str) – File id from ScienceIE Dataset
entity_type (str) – One of `['Task', 'Process', 'Material']`
Returns:	A list of sentences, where every sentence is a list of BILOU lines
Return type:	List[List[str]]

get_sents(text: str) → List[<sphinx.ext.autodoc.importer._MockObject object at 0x7f6c14681050>]¶

Returns all the sentences in the text

Parameters:	text (str) –
Returns:	All the sentences in the text as a spacy span. A spacy span encodes more information within
Return type:	List[span.Span]

get_text_from_fileid(file_id: str) → str¶

Given a file id return the text from the file

Parameters:	file_id (str) – A ScienceIE data file id
Returns:	Text read from the file
Return type:	str

merge_files(task_filename: pathlib.Path, process_filename: pathlib.Path, material_filename: pathlib.Path, out_filename: pathlib.Path)¶

Merge different files to one conll file

Parameters:

task_filename (pathlib.Path) – The CONLL style file having TASK tags
process_filename (pathlib.Path) – The CONLL style file having Process tags
material_filename (pathlib.Path) – The CONLL style file having Material Tags
out_filename (pathlib.Path) – The output file where the different files will be merged and every line will consist of word Task-tag Process-tag Material-tag

write_ann_file_from_conll_file(conll_filepath: pathlib.Path, ann_filepath: pathlib.Path, text: str)¶

write_bilou_lines(out_filename: pathlib.Path, is_sentence_wise: bool = False)¶

Writes bilou lines in the out_filename for all the files in self.folderpath. The output file will contain every word on one line with their tag in BILOU format.

You can also opt to write the text sentence-wise. The text, which possibly spans multiple sentences, is broken down into sentences and then written to the output file. Different sentences are separated by an empty line.

Parameters:

out_filename (pathlib.Path) – The output filename where the conll filename is written
is_sentence_wise (bool) – You can write the BILOU lines sentence wise. The text in all the ScienceIE files will be broken into sentences, and the sentences will be tagged with BILOU tags

Sciwing TOML Runner¶

class sciwing.utils.sciwing_toml_runner.SciWingTOMLRunner(toml_filename: pathlib.Path, infer: bool = False)¶

Bases: object

_form_dag(section_name: str, section: Dict[KT, VT], parent: str)¶

Forms a DAG of the model section for execution

The model can be a complex structure with various sub-components. One depends on the other, so the order of execution has to be decided. A DAG is a good abstract model to define the dependence between different modules. This method builds the DAG given the section name and the TOML section that is being parsed, with a directed edge between the parent and the child.

Parameters:

section_name (str) – The name of the TOML section being parsed
section (Dict) – The details of the actual section
parent (str) – The node id of the parent graph

_instantiate_model_using_dag()¶

This is a key method that instantiates the DAG using topological order

The DAG from the TOML model section should be instantiated so that the submodules of a module are instantiated before the parent module. This method does it using topological sort. Topological sort is an ordering of the nodes of a DAG where, if there is an edge from u to v, then u appears before v in the ordering.

We do exactly this for SciWING. We instantiate the children nodes that are used by parent nodes before we can instantiate the root node of the DAG that will represent the entire module.

Returns:	The instantiation of the root node
Return type:	nn.Module
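
The ordering described above can be sketched with graphlib from the standard library (Python 3.9 or later). The section names and the plain dictionaries standing in for instantiated modules are hypothetical; the real runner builds its graph from the TOML model section and may use a different graph library.

```python
from graphlib import TopologicalSorter

# Hypothetical model section: each node maps to the submodules it depends on.
dependencies = {
    "model": {"encoder"},
    "encoder": {"embedder"},
    "embedder": set(),
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())

instantiated = {}
for name in order:
    # By construction, every child has been instantiated before its parent.
    children = {child: instantiated[child] for child in dependencies[name]}
    # A plain dict stands in for the instantiated module.
    instantiated[name] = {"name": name, "children": children}

root = instantiated["model"]
```

Because children always come first in the order, the last node processed is the root, which then holds the fully assembled structure.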

_parse_toml_file()¶

Parses the toml file and returns the document

Returns:	The dictionary by parsing the toml file
Return type:	Dict[str, Any]

parse()¶

Parses the dataset, model and engine sections of a toml file

parse_dataset_section()¶

Parse the dataset section of the toml file and instantiate the dataset

Returns:	The dataset manager for the experiment
Return type:	DatasetManager

parse_engine_section()¶

Parses the engine section of the TOML file

Returns:	Object of the Engine class
Return type:	Engine

parse_model_section()¶

Parses the Model section of the toml file

Returns:	A torch module representing the model
Return type:	nn.Module

run()¶

Tensor Utils¶

sciwing.utils.tensor_utils.get_mask(batch_size: int, max_size: int, lengths: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c11f3ded0>)¶

Returns mask given the lengths tensor. A convenience method

Given a lengths tensor as in

>> torch.LongTensor([3, 1, 2])

which often indicates the original length of the tensor without padding, get_mask() returns a tensor with 1 in positions where there is no padding and 0 where there is padding

Parameters:

batch_size (int) – Batch size of the tensors
max_size (int) – Maximum size or often Maximum number of time steps
lengths (torch.LongTensor) – The original length of the tensors in the batch without padding
Returns:	Mask having 1 where there are no paddings and 0 where there are paddings
Return type:	torch.LongTensor
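
The masking rule is easy to see with plain Python lists in place of tensors. The sketch below is a list-based analogue under the hypothetical name get_mask_sketch; it does not use torch and is not the library's implementation.

```python
from typing import List


def get_mask_sketch(
    batch_size: int, max_size: int, lengths: List[int]
) -> List[List[int]]:
    """Return one row per sequence: 1 for real positions, 0 for padding."""
    if len(lengths) != batch_size:
        raise ValueError("lengths must have one entry per batch item")
    return [
        [1 if step < length else 0 for step in range(max_size)]
        for length in lengths
    ]


mask = get_mask_sketch(batch_size=3, max_size=4, lengths=[3, 1, 2])
```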

sciwing.utils.tensor_utils.has_tensor(obj) → bool¶

Given a possibly complex data structure, check if it has any torch.Tensors in it. From allennlp.nn.util

sciwing.utils.tensor_utils.move_to_device(obj, cuda_device: <sphinx.ext.autodoc.importer._MockObject object at 0x7f6c11f3df50>)¶

Given a structure (possibly) containing Tensors on the CPU, move all the Tensors to the specified GPU (or do nothing, if they should be on the CPU). From allennlp.nn.util

NER Terminal Visualizer¶

class sciwing.utils.vis_seq_tags.VisTagging(colors: List[str] = None, colors_palette: str = None, tags: List[str] = None)¶

Bases: object

__init__(colors: List[str] = None, colors_palette: str = None, tags: List[str] = None)¶

Visualize Sequence Tagging

Parameters:

colors (List[str]) – The set of colors that will be used for tagging
colors_palette (str) – The color palette that should be used. We recommend For more information on color palettes you can refer to the documentation of the python package colorful
tags (List[str]) – The set of all labels that can be labelled. If this is not given, then the tags will be inferred using the labels during tagging

visualize_tags_from_json(json_annotation: Dict[str, Any], show_only_entities: List[str] = None)¶

Visualize the tags from json.

Parameters:

json_annotation (Dict[str, Any]) – You can send a json that has the following format: {'text': str, 'tags': [{'start': int, 'end': str, 'tag': str}]}
show_only_entities (List[str]) – You can filter to show only these entities.
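
To make the annotation format concrete, the sketch below builds one such dictionary and extracts the tagged substrings, optionally filtered by entity. The helper tagged_spans and the sample sentence are hypothetical, and the end offsets are integers here (an assumption made so that slicing works); the sketch does no colouring and is not how visualize_tags_from_json renders its output.

```python
from typing import Any, Dict, List, Optional, Tuple


def tagged_spans(
    json_annotation: Dict[str, Any],
    show_only_entities: Optional[List[str]] = None,
) -> List[Tuple[str, str]]:
    """Return (substring, tag) pairs in order of their start offset."""
    text = json_annotation["text"]
    spans = []
    for tag in sorted(json_annotation["tags"], key=lambda t: t["start"]):
        if show_only_entities is None or tag["tag"] in show_only_entities:
            spans.append((text[tag["start"] : tag["end"]], tag["tag"]))
    return spans


# Hypothetical annotation with character offsets into the text.
annotation = {
    "text": "Graphene is used for energy storage",
    "tags": [
        {"start": 0, "end": 8, "tag": "Material"},
        {"start": 21, "end": 35, "tag": "Task"},
    ],
}

all_spans = tagged_spans(annotation)
only_tasks = tagged_spans(annotation, show_only_entities=["Task"])
```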

visualize_tokens(text: List[str], labels: List[str]) → str¶

Visualizes sequential tagged data where the string is represented as a set of words and every word has a corresponding label. This can be extended to having different tagging schemes at a later point in time

Parameters:

text (List[str]) – String to be tagged, represented as a list of strings
labels (List[str]) – The labels corresponding to each word in the string
Returns:
Return type:	None