sciwing.utils

Amazon S3 Utils

class sciwing.utils.amazon_s3.S3Util(aws_cred_config_json_filename: str)

Bases: object

__init__(aws_cred_config_json_filename: str)

Some utilities that would be useful to upload folders/models to s3

Parameters:aws_cred_config_json_filename (str) –

You need to instantiate this class with an AWS configuration JSON file

The following will be the keys and values
aws_access_key_id : str
The access key id for the AWS account that you have
aws_access_secret : str
The access secret
region : str
The region in which your bucket is present
parsect_bucket_name : str
The name of the bucket where all the models/experiments will be stored
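
As an illustration, a configuration file with the four keys listed above could be written and read back as sketched here. The file name aws_cred.json and every value are placeholders (assumptions for this example, not real credentials), and the snippet uses only the standard json module rather than S3Util itself.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values; replace them with your own credentials and bucket.
aws_config = {
    "aws_access_key_id": "YOUR_ACCESS_KEY_ID",
    "aws_access_secret": "YOUR_ACCESS_SECRET",
    "region": "us-east-1",
    "parsect_bucket_name": "my-sciwing-bucket",
}

# Hypothetical location for the configuration file.
config_path = Path(tempfile.mkdtemp()) / "aws_cred.json"
config_path.write_text(json.dumps(aws_config, indent=2))

# Reading the file back yields the same four keys.
loaded = json.loads(config_path.read_text())
```

The path of such a file is what would be passed as aws_cred_config_json_filename.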
download_file(filename_s3: str, local_filename: str)

Downloads a file from s3

Parameters:
  • filename_s3 (str) – A filename in s3 that needs to be downloaded
  • local_filename (str) – The local filename that will be used
download_folder(folder_name_s3: str, download_only_best_checkpoint: bool = False, chkpoints_foldername: str = 'checkpoints', best_model_filename='best_model.pt', output_dir: str = '/home/docs/.sciwing.output_cache')

Downloads a folder from s3 recursively

Parameters:
  • folder_name_s3 (str) – The name of the folder in s3
  • download_only_best_checkpoint (bool) – If the folder being downloaded is an experiment folder, then you can download only the best model checkpoints for running test or inference
  • chkpoints_foldername (str) – The name of the checkpoints folder where the best model parameters are stored
  • best_model_filename (str) – The name of the file where the best model parameters are stored
get_client()

Returns boto3 client

Returns:The client object that manages all the AWS operations. The client is the low-level access to the connection with s3
Return type:boto3.client
get_resource()

Returns a high level manager for the aws bucket

Returns:Resource that manages connections with s3
Return type:boto3.resource
load_credentials() → NamedTuple

Read the credentials from the json file

Returns:a named tuple with access_key, access_secret, region and bucket_name as the keys and the corresponding values filled in
Return type:NamedTuple
search_folders_with(pattern)

Searches for folders in the s3 bucket with specific pattern

Parameters:pattern (str) – A regex pattern
Returns:The list of foldernames that match the pattern
Return type:List[str]
upload_file(filename: str, obj_name: str = None)
Parameters:
  • filename (str) – The filename in the local directory that needs to be uploaded to s3
  • obj_name (str) – The filename to be used in s3 bucket. If None then obj_name in s3 will be the same as the filename
upload_folder(folder_name: str, base_folder_name: str)

Recursively uploads a folder to s3

Parameters:
  • folder_name (str) – The name of the local folder that is uploaded
  • base_folder_name (str) – The name of the folder from which the folder being uploaded stems. This is needed to associate files and directories with their place in the hierarchy within the folder

Class Nursery

class sciwing.utils.class_nursery.ClassNursery

Bases: object

ClassNursery is the place where all the classes in SciWING are nursed

SciWING needs a handle on the different classes that are being used. This is useful, for example, when we have to instantiate the appropriate classes when experiments are run from a TOML file

This uses a Python 3.6 feature called __init_subclass__ that simplifies class creation. Whenever ClassNursery is listed as the parent class of a class, __init_subclass__ is called. In SciWING we use it as a plugin registry that stores the mapping between each class and its module.
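
The registry idea can be sketched in a few lines. This is a minimal illustration of __init_subclass__ used as a plugin registry, not the SciWING source; the names PluginRegistry, MyEncoder and MyTagger are hypothetical.

```python
class PluginRegistry:
    """Hypothetical base class that records every subclass it sees."""

    registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Map the subclass name to the module that defines it.
        PluginRegistry.registry[cls.__name__] = cls.__module__


class MyEncoder(PluginRegistry):
    pass


class MyTagger(PluginRegistry):
    pass
```

After both class statements have run, PluginRegistry.registry holds one entry per subclass, without either subclass calling any registration function explicitly.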

class_nursery = {'Adam': <sphinx.ext.autodoc.importer._MockObject object>, 'BOW_Encoder': 'sciwing.modules.bow_encoder', 'BertEmbedder': 'sciwing.modules.embedders.bert_embedder', 'BowElmoEmbedder': 'sciwing.modules.embedders.bow_elmo_embedder', 'CharEmbedder': 'sciwing.modules.embedders.char_embedder', 'CharLSTMEncoder': 'sciwing.modules.charlstm_encoder', 'CoNLLDatasetManager': 'sciwing.datasets.seq_labeling.conll_dataset', 'ConcatEmbedders': 'sciwing.modules.embedders.concat_embedders', 'ElmoEmbedder': 'sciwing.modules.embedders.elmo_embedder', 'Engine': 'sciwing.engine.engine', 'FlairEmbedder': 'sciwing.modules.embedders.flair_embedder', 'LSTM2VecEncoder': 'sciwing.modules.lstm2vecencoder', 'Lstm2SeqEncoder': 'sciwing.modules.lstm2seqencoder', 'PrecisionRecallFMeasure': 'sciwing.metrics.precision_recall_fmeasure', 'RnnSeqCrfTagger': 'sciwing.models.rnn_seq_crf_tagger', 'SGD': <sphinx.ext.autodoc.importer._MockObject object>, 'SimpleClassifier': 'sciwing.models.simpleclassifier', 'SimpleTagger': 'sciwing.models.simple_tagger', 'TextClassificationDatasetManager': 'sciwing.datasets.classification.text_classification_dataset', 'TokenClassificationAccuracy': 'sciwing.metrics.token_cls_accuracy', 'TrainableWordEmbedder': 'sciwing.modules.embedders.trainable_word_embedder', 'WordEmbedder': 'sciwing.modules.embedders.word_embedder'}

Common Utils

sciwing.utils.common.cached_path(path: Union[pathlib.Path, str], url: str, unzip=True) → pathlib.Path
sciwing.utils.common.chunks(seq, n)

Yield successive n-sized chunks from seq.
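
A generator with this behaviour might look like the following sketch. The name chunks_sketch is hypothetical, and the code assumes seq supports len() and slicing.

```python
from typing import Iterator, Sequence


def chunks_sketch(seq: Sequence, n: int) -> Iterator[Sequence]:
    """Yield successive n-sized slices of seq; the last one may be shorter."""
    for start in range(0, len(seq), n):
        yield seq[start:start + n]


batches = list(chunks_sketch([1, 2, 3, 4, 5], 2))
# batches == [[1, 2], [3, 4], [5]]
```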

sciwing.utils.common.convert_generic_sect_to_json(filename: str) → Dict[str, Any]

Converts the Generic sect data file into more readable json format

Parameters:filename (str) – The sectlabel file name available at WING-NUS website
Returns:
text
The text of the line
label
The label of the file
file_no
A unique file number
line_count
A line count within the file
Return type:Dict[str, Any]
sciwing.utils.common.convert_generic_sect_to_sciwing_clf_format(filename: str, out_dir: str)

Converts the generic sect original file to the sciwing classification format

Parameters:
  • filename (str) – The path of the file where the original generic section classification file is stored
  • out_dir (str) – The output path where the train, dev and test files are written
Returns:

Return type:

None

sciwing.utils.common.convert_parscit_to_conll(parscit_train_filepath: pathlib.Path) → List[Dict[str, Any]]

Convert the parscit data available at “https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/parsCit.train.data” to a dummy CoNLL version. This is done so that we can use it with AllenNLP's built-in data reader, the conll2003 dataset reader

Parameters:parscit_train_filepath (pathlib.Path) – The path where the train file path is stored
sciwing.utils.common.convert_parscit_to_sciwing_seqlabel_format(parscit_train_filepath: pathlib.Path, output_dir: str)

Convert the parscit data available at “https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/parsCit.train.data” to the format required for sciwing sequential labelling

Parameters:
  • parscit_train_filepath (pathlib.Path) – The local path where the files are stored
  • output_dir (str) – The output dir where the train dev and test file will be written
sciwing.utils.common.convert_sectlabel_to_json(filename: str) → Dict[KT, VT]

Converts the secthead file into more readable json format

Parameters:filename (str) – The sectlabel file name available at WING-NUS website
Returns:
text
The text of the line
label
The label of the file
file_no
A unique file number
line_count
A line count within the file
Return type:Dict[str, Any]
sciwing.utils.common.convert_sectlabel_to_sciwing_clf_format(filename: str, out_dir: str)

Writes the file in the format required for sciwing text classification dataset

Parameters:
  • filename (str) – The path of the sectlabel original format file.
  • out_dir (str) – The path where the new files will be written
sciwing.utils.common.create_class(classname: str, module_name: str) → type

Given the classname and module, creates a class object and returns it

Parameters:
  • classname (str) – Class name to import
  • module_name (str) – The module in which the class is present
Returns:

Return type:

type
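
Looking up a class from its name and module, as create_class describes, can be sketched with the standard importlib module. The function name create_class_sketch is hypothetical and this is not necessarily how SciWING implements it.

```python
import importlib


def create_class_sketch(classname: str, module_name: str) -> type:
    """Import module_name and return the attribute named classname."""
    module = importlib.import_module(module_name)
    return getattr(module, classname)


# Example with a standard-library class.
ordered_dict_cls = create_class_sketch("OrderedDict", "collections")
instance = ordered_dict_cls(a=1)
```

Combined with a name-to-module mapping such as the class nursery, a lookup like this would allow a class to be instantiated from a string found in a configuration file.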

sciwing.utils.common.download_file(url: str, dest_filename: str) → None

Download a file from the given url

Parameters:
  • url (str) – The url from which the file will be downloaded
  • dest_filename (str) – The destination filename
sciwing.utils.common.extract_tar(filename: str, destination_dir: str, mode='r')

Extracts tar, targz and other files

Parameters:
  • filename (str) – The tar zipped file
  • destination_dir (str) – The destination directory in which the files should be placed
  • mode (str) – A valid tar mode. You can refer to https://docs.python.org/3/library/tarfile.html for the different modes.
sciwing.utils.common.extract_zip(filename: str, destination_dir: str)

Extracts a zipped file

Parameters:
  • filename (str) – The zipped filename
  • destination_dir (str) – The directory where the extracted files will be placed
sciwing.utils.common.flatten(list_items: List[Any]) → List[Any]

Flattens an arbitrarily long nesting of lists

Parameters:list_items (List[Any]) – It can be an arbitrarily long nesting of lists
Returns:Flattened list
Return type:List
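
A recursive sketch of this behaviour is shown here. The name flatten_sketch is hypothetical, and the code assumes that only list instances are treated as nesting.

```python
from typing import Any, List


def flatten_sketch(list_items: List[Any]) -> List[Any]:
    """Flatten arbitrarily nested lists into one flat list."""
    flat = []
    for item in list_items:
        if isinstance(item, list):
            # Recurse into nested lists.
            flat.extend(flatten_sketch(item))
        else:
            flat.append(item)
    return flat


flat = flatten_sketch([1, [2, [3, [4]], 5]])
# flat == [1, 2, 3, 4, 5]
```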
sciwing.utils.common.get_system_mem_in_gb()

Returns the total system memory in GB

Returns:Memory size in GB
Return type:float
sciwing.utils.common.get_train_dev_test_stratified_split(lines: List[str], labels: List[str], train_split: float = 0.8, dev_split: float = 0.1, test_split: float = 0.1, random_state: int = 1729) -> ((typing.List[str], typing.List[str]), (typing.List[str], typing.List[str]), (typing.List[str], typing.List[str]))

Splits the lines and labels into train, dev and test splits using stratified and random shuffle

Parameters:
  • lines (List[str]) – A list of lines
  • labels (List[str]) – A list of labels
  • train_split (float) – The proportion of lines to be used for training
  • dev_split (float) – The proportion of lines to be used for validation
  • test_split (float) – The proportion of lines to be used for testing
  • random_state (int) – The seed to be used for randomization. Good for reproducing the same splits. Passing None will cause the random number generator to be the RandomState used by np.random
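
To illustrate what a stratified split does, here is a pure-Python sketch that splits each label group separately so that every split keeps roughly the original label proportions. It is not the library's implementation; the name stratified_split_sketch is hypothetical, and the test portion is simply whatever remains after the train and dev portions.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Split = Tuple[List[str], List[str]]


def stratified_split_sketch(
    lines: List[str],
    labels: List[str],
    train_split: float = 0.8,
    dev_split: float = 0.1,
    random_state: int = 1729,
) -> Tuple[Split, Split, Split]:
    rng = random.Random(random_state)
    by_label = defaultdict(list)
    for line, label in zip(lines, labels):
        by_label[label].append(line)

    train, dev, test = ([], []), ([], []), ([], [])
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)  # random shuffle within one label
        n_train = int(len(group) * train_split)
        n_dev = int(len(group) * dev_split)
        parts = (
            group[:n_train],
            group[n_train:n_train + n_dev],
            group[n_train + n_dev:],  # remainder goes to test
        )
        for (split_lines, split_labels), part in zip((train, dev, test), parts):
            split_lines.extend(part)
            split_labels.extend([label] * len(part))
    return train, dev, test


lines = [f"line {i}" for i in range(20)]
labels = ["a"] * 10 + ["b"] * 10
(train_lines, train_labels), (dev_lines, dev_labels), (test_lines, test_labels) = (
    stratified_split_sketch(lines, labels)
)
```

With ten lines per label, each label contributes eight lines to train, one to dev and one to test.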
sciwing.utils.common.merge_dictionaries_with_sum(a: Dict[KT, VT], b: Dict[KT, VT]) → Dict[KT, VT]
sciwing.utils.common.pack_to_length(tokenized_text: List[str], max_length: int, pad_token: str = '<PAD>', add_start_end_token: bool = False, start_token: str = '<SOS>', end_token: str = '<EOS>') → List[str]

Packs tokenized text to maximum length

Parameters:
  • tokenized_text (List[str]) – A list of tokens
  • max_length (int) – The max length to pack to
  • pad_token (str) – The pad token to be used for the padding
  • add_start_end_token (bool) – Whether to add the start and end token to every sentence while packing
  • start_token (str) – The start token to be used if add_start_end_token is True
  • end_token (str) – The end token to be used if add_start_end_token is True
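
The general idea of packing can be sketched as truncate-then-pad. The sketch below makes two assumptions that the text above does not confirm: that the start and end tokens count toward max_length, and that max_length is at least 2 when they are requested. The name pack_sketch is hypothetical.

```python
from typing import List


def pack_sketch(
    tokens: List[str],
    max_length: int,
    pad_token: str = "<PAD>",
    add_start_end_token: bool = False,
    start_token: str = "<SOS>",
    end_token: str = "<EOS>",
) -> List[str]:
    if add_start_end_token:
        # Assumption: reserve two slots for the start and end tokens.
        tokens = [start_token] + tokens[: max_length - 2] + [end_token]
    else:
        tokens = tokens[:max_length]
    # Pad on the right until the sequence has exactly max_length items.
    return tokens + [pad_token] * (max_length - len(tokens))


padded = pack_sketch(["a", "b"], 4)
wrapped = pack_sketch(["a", "b", "c", "d"], 5, add_start_end_token=True)
```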
sciwing.utils.common.pairwise(iterable: Iterable[T_co]) → Iterator[T_co]

Return the overlapping pairwise elements of the iterable

Parameters:iterable (Iterable) – Anything that can be iterated
Returns:Iterator over the paired sequence
Return type:Iterator
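
Overlapping pairs can be produced with the well-known itertools recipe, sketched here under the hypothetical name pairwise_sketch.

```python
from itertools import tee


def pairwise_sketch(iterable):
    """Return an iterator over (s0, s1), (s1, s2), (s2, s3), ..."""
    first, second = tee(iterable)
    next(second, None)  # advance the second copy by one element
    return zip(first, second)


pairs = list(pairwise_sketch("abcd"))
# pairs == [("a", "b"), ("b", "c"), ("c", "d")]
```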
sciwing.utils.common.write_cora_to_conll_file(cora_conll_filepath: pathlib.Path) → None

Writes cora file that is available at https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/cora.train to CONLL format

Parameters:cora_conll_filepath (pathlib.Path) – The destination filepath where the CORA data is written in CONLL format
sciwing.utils.common.write_nfold_parscit_train_test(parscit_train_filepath: pathlib.Path, output_train_filepath: pathlib.Path, output_test_filepath: pathlib.Path, nsplits: int = 2) → bool

Convert the parscit train folder into different folds. This is useful for n-fold cross validation on the dataset. This method can be iterated over to get all the different folds of the data contained in the parscit_train_filepath

Parameters:
  • parscit_train_filepath (pathlib.Path) – The path where the Parscit file is stored. The file is available at https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/cora.train
  • output_train_filepath (pathlib.Path) – The path where the train fold of the dataset will be stored
  • output_test_filepath (pathlib.Path) – The path where the test fold of the dataset will be stored
  • nsplits (int) – The number of splits in the dataset.
Returns:

Indicates whether the particular fold has been written

Return type:

bool

sciwing.utils.common.write_parscit_to_conll_file(parscit_conll_filepath: pathlib.Path) → None

Write Parscit file to CONLL file format

Parameters:parscit_conll_filepath (pathlib.Path) – The destination file where the parscit data is written to

Custom Spacy Tokenizers

This module implements custom spacy tokenizers if needed. This can be useful for custom tokenization that is required for the scientific domain

class sciwing.utils.custom_spacy_tokenizers.CustomSpacyWhiteSpaceTokenizer(vocab)

Bases: object

__init__(vocab)

White space tokenizer tokenizes the word according to spaces.

Parameters:vocab (nlp.vocab) – Spacy vocab object

Custom Exceptions

exception sciwing.utils.exceptions.ClassInNurseryError

Bases: KeyError

The ClassNursery cannot have two classes of the same name. This error is raised when that happens

exception sciwing.utils.exceptions.DatasetPresentError(message: str)

Bases: Exception

exception sciwing.utils.exceptions.TOMLConfigurationError(message: str)

Bases: Exception

This error is raised for illegal configuration of TOML

Science IE Data Utils

class sciwing.utils.science_ie_data_utils.ScienceIEDataUtils(folderpath: pathlib.Path, ignore_warnings=False)

Bases: object

Science-IE is a SemEval Task that is aimed at extracting entities from scientific articles. This class is a utility for various operations on the competition's data files.

__init__(folderpath: pathlib.Path, ignore_warnings=False)

Given the folderpath where the ScienceIE data is stored, this class provides various utilities. For more information on the dataset you can refer to https://scienceie.github.io/

Parameters:
  • folderpath (pathlib.Path) – The path where the ScienceIEDataset is stored
  • ignore_warnings (bool) – If True, then all the warnings generated by this class for inconsistencies in the data are ignored
static _form_ann_line(idx: str, char_offset: Tuple[int, int, str], tag_name: str, doc)

Forms an ANN line that can be used to write the ANN files for CoNLL format

Parameters:
  • idx (str) – The index for the entity being written
  • char_offset (Tuple[int, int, str]) – The start, end, tag for the line
  • tag_name (str) – The tag to be used; one of [Task, Process, Material]
  • doc (spaCy Doc) – Spacy doc used to query the appropriate characters
Returns:

An ANN line that is formed.

Return type:

str

_get_annotations_for_entity(file_id: str, entity: str) → List[Dict[str, Any]]
Parameters:
  • file_id (str) – A ScienceIE file id
  • entity (str) – One of [Task, Process, Material]
Returns:

A list of annotations where every annotation is
start

The start character index of the annotation

end

The end character index of the annotation

words

The set of words between the start and the end index

entity_number

The entity number

tag

The tag associated with the set of words

Return type:

List[Dict[str, Any]]

_get_bilou_lines_for_entity(text: str, annotations: List[Dict[str, Any]], entity: str) → List[str]

The list of BILOU lines for entity

Parameters:
  • text (str) – The text for which BILOU lines need to be returned
  • annotations (List[Dict[str, Any]]) – The list of annotations where every annotation is a dictionary
  • entity (str) – A particular entity for which the BILOU lines are returned
Returns:

The list of BILOU tagged lines, where every line is a word, tag, tag, tag where the tag is decided by the entity.

Return type:

List[str]
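
For readers unfamiliar with BILOU, the scheme marks the Beginning, Inside and Last token of a multi-token entity, Unit-length entities, and tokens Outside any entity. The sketch below illustrates the scheme only. It works on token indices and emits one tag per token, whereas the method above works from annotations over the text; the name bilou_tags_sketch and the example sentence are hypothetical.

```python
from typing import List, Tuple


def bilou_tags_sketch(
    num_tokens: int, spans: List[Tuple[int, int]], entity: str
) -> List[str]:
    """spans are inclusive (first_token, last_token) index pairs."""
    tags = ["O"] * num_tokens
    for first, last in spans:
        if first == last:
            tags[first] = f"U-{entity}"  # single-token entity
        else:
            tags[first] = f"B-{entity}"
            for i in range(first + 1, last):
                tags[i] = f"I-{entity}"
            tags[last] = f"L-{entity}"
    return tags


tokens = ["We", "use", "conditional", "random", "fields", "on", "text"]
tags = bilou_tags_sketch(len(tokens), [(2, 4), (6, 6)], "Process")
lines = [f"{token} {tag}" for token, tag in zip(tokens, tags)]
```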

get_bilou_lines_for_entity(file_id: str, entity: str)

Writes conll file for the entity type

Parameters:
  • file_id (str) – File id of the annotation file
  • entity (str) – The entity for which conll file is written
Returns:

The list of BILOU lines for the entity

Return type:

List[str]

get_file_ids() → List[str]

Get all the file ids from the folder

Returns:A List of File ids in the folder
Return type:List[str]
get_sentence_wise_bilou_lines(file_id: str, entity_type: str) → List[List[str]]

Get BILOU lines sentence-wise

Parameters:
  • file_id (str) – File id from ScienceIE Dataset
  • entity_type (str) – One of ['Task', 'Process', 'Material']
Returns:

A list of sentences where every sentence is composed of its BILOU lines

Return type:

List[List[str]]

get_sents(text: str) → List[span.Span]

Returns all the sentences in the text

Parameters:text (str) –
Returns:All the sentences in the text as a spacy span. A spacy span encodes more information within
Return type:List[span.Span]
get_text_from_fileid(file_id: str) → str

Given a file id return the text from the file

Parameters:file_id (str) – A ScienceIE data file id
Returns:Text read from the file
Return type:str
merge_files(task_filename: pathlib.Path, process_filename: pathlib.Path, material_filename: pathlib.Path, out_filename: pathlib.Path)

Merge different files to one conll file

Parameters:
  • task_filename (pathlib.Path) – The CONLL style file having TASK tags
  • process_filename (pathlib.Path) – The CONLL style file having Process tags
  • material_filename (pathlib.Path) – The CONLL style file having Material Tags
  • out_filename (pathlib.Path) – The output file where the different files will be merged and every line will consist of word Task-tag Process-tag Material-tag
write_ann_file_from_conll_file(conll_filepath: pathlib.Path, ann_filepath: pathlib.Path, text: str)
write_bilou_lines(out_filename: pathlib.Path, is_sentence_wise: bool = False)

Writes bilou lines in the out_filename for all the files in self.folderpath. The output file will contain every word on one line with their tag in BILOU format.

You can also opt to write the text sentence-wise. The text, which may contain multiple sentences, is broken down into sentences and then written to the output file. Different sentences are separated by an empty line.

Parameters:
  • out_filename (pathlib.Path) – The output filename where the conll filename is written
  • is_sentence_wise (bool) – You can write the BILOU lines sentence wise. The text in all the ScienceIE files will be broken into sentences, and the sentences will be tagged with BILOU tags

Sciwing TOML Runner

class sciwing.utils.sciwing_toml_runner.SciWingTOMLRunner(toml_filename: pathlib.Path, infer: bool = False)

Bases: object

_form_dag(section_name: str, section: Dict[KT, VT], parent: str)

Forms a DAG of the model section for execution

The model can be a complex structure with various sub-components, where one depends on another, so the order of execution has to be decided. A DAG is a good abstract model for defining the dependence between different modules. This method builds the DAG given the section name and the TOML section being parsed, adding a directed edge between the parent and the child.

Parameters:
  • section_name (str) – The name of the TOML section being parsed
  • section (Dict) – The details of the actual section
  • parent (str) – The node id of the parent graph
_instantiate_model_using_dag()

This is a key method that instantiates the DAG using topological order

The DAG from the TOML model section should be instantiated so that the submodules of a module are instantiated before the parent module. This method does so using topological sort. Topological sort is an ordering of the nodes of a DAG in which, if there is an edge u -> v, then u appears before v.

We do exactly this for SciWING. We instantiate the children nodes that are used by parent nodes before we can instantiate the root node of the DAG that will represent the entire module.

Returns:The instantiation of the root node
Return type:nn.Module
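
The ordering described above can be illustrated with Kahn's algorithm. This is a sketch of the technique and not the SciWING implementation; the function name, the node names, and the choice to treat each edge as running from a submodule to the module that uses it are all assumptions made for the example.

```python
from collections import deque
from typing import Dict, List


def instantiation_order(children: Dict[str, List[str]]) -> List[str]:
    """children maps each module name to the submodules it uses."""
    # A node becomes ready once all of its submodules have been emitted.
    remaining = {node: len(subs) for node, subs in children.items()}
    parents: Dict[str, List[str]] = {node: [] for node in children}
    for node, subs in children.items():
        for sub in subs:
            parents[sub].append(node)

    ready = deque(sorted(n for n, count in remaining.items() if count == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for parent in parents[node]:
            remaining[parent] -= 1
            if remaining[parent] == 0:
                ready.append(parent)
    if len(order) != len(children):
        raise ValueError("cycle detected; the graph is not a DAG")
    return order


# Hypothetical model section: model -> encoder -> embedder.
model_graph = {"model": ["encoder"], "encoder": ["embedder"], "embedder": []}
order = instantiation_order(model_graph)
# order == ["embedder", "encoder", "model"]
```

The root node comes last, so every submodule already exists by the time its parent is constructed.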
_parse_toml_file()

Parses the toml file and returns the document

Returns:The dictionary by parsing the toml file
Return type:Dict[str, Any]
parse()

Parses the dataset, model and engine section of a toml file

parse_dataset_section()

Parse the dataset section of the toml file and instantiate the dataset

Returns:The dataset manager for the experiment
Return type:DatasetManager
parse_engine_section()

Parses the engine section of the TOML file

Returns:Object of the Engine class
Return type:Engine
parse_model_section()

Parses the Model section of the toml file

Returns:A torch module representing the model
Return type:nn.Module
run()

Tensor Utils

sciwing.utils.tensor_utils.get_mask(batch_size: int, max_size: int, lengths: torch.LongTensor)

Returns mask given the lengths tensor. A convenience method

Given a lengths tensor as in

>> torch.LongTensor([3, 1, 2])

which often indicates the original length of the tensor without padding, get_mask() returns a tensor with 1 at positions where there is no padding and 0 where there is padding

Parameters:
  • batch_size (int) – Batch size of the tensors
  • max_size (int) – Maximum size or often Maximum number of time steps
  • lengths (torch.LongTensor) – The original length of the tensors in the batch without padding
Returns:

Mask having 1 where there are no paddings and 0 where there are paddings

Return type:

torch.LongTensor
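
The masking logic can be shown without any tensor library. In this sketch plain Python lists stand in for tensors, and the name get_mask_sketch is hypothetical; it illustrates the expected values rather than the actual implementation.

```python
from typing import List


def get_mask_sketch(
    batch_size: int, max_size: int, lengths: List[int]
) -> List[List[int]]:
    """Return 1 for real positions and 0 for padded positions."""
    if len(lengths) != batch_size:
        raise ValueError("lengths must contain one entry per batch item")
    return [
        [1 if step < length else 0 for step in range(max_size)]
        for length in lengths
    ]


mask = get_mask_sketch(batch_size=3, max_size=4, lengths=[3, 1, 2])
# mask == [[1, 1, 1, 0],
#          [1, 0, 0, 0],
#          [1, 1, 0, 0]]
```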

sciwing.utils.tensor_utils.has_tensor(obj) → bool

Given a possibly complex data structure, check if it has any torch.Tensors in it. From allennlp.nn.util

sciwing.utils.tensor_utils.move_to_device(obj, cuda_device)

Given a structure (possibly) containing Tensors on the CPU, move all the Tensors to the specified GPU (or do nothing, if they should be on the CPU). From allennlp.nn.util

NER Terminal Visualizer

class sciwing.utils.vis_seq_tags.VisTagging(colors: List[str] = None, colors_palette: str = None, tags: List[str] = None)

Bases: object

__init__(colors: List[str] = None, colors_palette: str = None, tags: List[str] = None)

Visualize Sequence Tagging

Parameters:
  • colors (List[str]) – The set of colors that will be used for tagging
  • colors_palette (str) – The color palette that should be used. We recommend For more information on color palettes you can refer to the documentation of the python package colorful
  • tags (List[str]) – The set of all labels that can be used. If this is not given, the tags will be inferred from the labels during tagging
visualize_tags_from_json(json_annotation: Dict[str, Any], show_only_entities: List[str] = None)

Visualize the tags from json.

Parameters:
  • json_annotation (Dict[str, Any]) – A JSON-style dictionary in the following format: {'text': str, 'tags': [{'start': int, 'end': int, 'tag': str}]}
  • show_only_entities (List[str]) – You can filter to show only these entities.
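
An annotation in the format described above might look like the following. The sentence, offsets and tags are made up for illustration, and the sketch assumes that start and end are character offsets with an exclusive end, which the text above does not state.

```python
text = "We train a CRF model on citation strings"

json_annotation = {
    "text": text,
    "tags": [
        {"start": 11, "end": 20, "tag": "Process"},
        {"start": 24, "end": 40, "tag": "Material"},
    ],
}

# Recover the tagged substrings from the character offsets.
spans = [
    (json_annotation["text"][t["start"]:t["end"]], t["tag"])
    for t in json_annotation["tags"]
]

# Keeping only some entity types, similar in spirit to show_only_entities.
only_process = [span for span in spans if span[1] in {"Process"}]
```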
visualize_tokens(text: List[str], labels: List[str]) → str

Visualizes sequential tagged data where the string is represented as a set of words and every word has a corresponding label. This can be extended to having different tagging schemes at a later point in time

Parameters:
  • text (List[str]) – String to be tagged, represented as a list of strings
  • labels (List[str]) – The labels corresponding to each word in the string
Returns:

Return type:

None