sciwing.utils¶
Amazon S3 Utils¶
class sciwing.utils.amazon_s3.S3Util(aws_cred_config_json_filename: str)¶
Bases: object
__init__(aws_cred_config_json_filename: str)¶
Some utilities that are useful for uploading folders/models to s3.
Parameters: aws_cred_config_json_filename (str) – The path to an aws configuration json file, which should have the following keys and values:
- aws_access_key_id : str – The access key id for your AWS account
- aws_access_secret : str – The access secret
- region : str – The region in which your bucket is present
- parsect_bucket_name : str – The name of the bucket where all the models/experiments will be stored
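A minimal instantiation sketch, assuming a credentials JSON file with exactly the keys listed above (the filename and the values here are hypothetical):
>>> from sciwing.utils.amazon_s3 import S3Util
>>> # aws_credentials.json (hypothetical contents):
>>> # {"aws_access_key_id": "...", "aws_access_secret": "...",
>>> #  "region": "us-east-1", "parsect_bucket_name": "my-sciwing-bucket"}
>>> s3_util = S3Util(aws_cred_config_json_filename="aws_credentials.json")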
download_file(filename_s3: str, local_filename: str)¶
Downloads a file from s3.
Parameters: - filename_s3 (str) – A filename in s3 that needs to be downloaded
- local_filename (str) – The local filename that will be used
download_folder(folder_name_s3: str, download_only_best_checkpoint: bool = False, chkpoints_foldername: str = 'checkpoints', best_model_filename='best_model.pt', output_dir: str = '/home/docs/.sciwing.output_cache')¶
Downloads a folder from s3 recursively.
Parameters: - folder_name_s3 (str) – The name of the folder in s3
- download_only_best_checkpoint (bool) – If the folder being downloaded is an experiment folder, you can download only the best model checkpoint for running tests or inference
- chkpoints_foldername (str) – The name of the checkpoints folder where the best model parameters are stored
- best_model_filename (str) – The name of the file where the best model parameters are stored
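For example, to fetch only the best checkpoint of an experiment folder (the folder name here is hypothetical):
>>> s3_util.download_folder(
...     "my_experiment",
...     download_only_best_checkpoint=True,
... )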
get_client()¶
Returns a boto3 client.
Returns: The client object that manages all the aws operations. The client is the low-level access to the connection with s3.
Return type: boto3.client
get_resource()¶
Returns a high-level manager for the aws bucket.
Returns: Resource that manages connections with s3
Return type: boto3.resource
load_credentials() → NamedTuple¶
Reads the credentials from the json file.
Returns: A named tuple with access_key, access_secret, region and bucket_name as the keys and the corresponding values filled in
Return type: NamedTuple
search_folders_with(pattern)¶
Searches for folders in the s3 bucket that match a specific pattern.
Parameters: pattern (str) – A regex pattern
Returns: The list of folder names that match the pattern
Return type: List[str]
upload_file(filename: str, obj_name: str = None)¶
Uploads a file to s3.
Parameters: - filename (str) – The filename in the local directory that needs to be uploaded to s3
- obj_name (str) – The filename to be used in the s3 bucket. If None, the obj_name in s3 will be the same as the filename
upload_folder(folder_name: str, base_folder_name: str)¶
Recursively uploads a folder to s3.
Parameters: - folder_name (str) – The name of the local folder that is uploaded
- base_folder_name (str) – The name of the folder from which the folder being uploaded stems. This is needed to associate files and directories with the appropriate hierarchies within the folder
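A short upload sketch; the paths are hypothetical, and base_folder_name is taken to be the root against which the folder hierarchy is computed, per the description above:
>>> s3_util.upload_file("metrics.json", obj_name="my_experiment/metrics.json")
>>> s3_util.upload_folder(folder_name="experiments/my_experiment",
...                       base_folder_name="experiments")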
Class Nursery¶
class sciwing.utils.class_nursery.ClassNursery¶
Bases: object
ClassNursery is the place where all the classes in SciWING are nursed.
SciWING needs to get a handle on the different classes that are being used. This is useful, for example, when the appropriate classes have to be instantiated while experiments are run from a TOML file.
This uses a Python 3.6 feature called __init_subclass__ that simplifies class creation. Whenever ClassNursery is declared as the parent class of a class, __init_subclass__ is called. In SciWING we use it as a plugin registry where the mapping between the different classes and their modules is stored.
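A minimal sketch of this mechanism, using __init_subclass__ as a plugin registry the way the paragraph above describes (simplified; not SciWING's exact code):
>>> class ClassNursery:
...     class_nursery = {}
...     def __init_subclass__(cls, **kwargs):
...         super().__init_subclass__(**kwargs)
...         if cls.__name__ in ClassNursery.class_nursery:
...             raise KeyError(f"{cls.__name__} is already in the nursery")
...         # record the mapping between the class name and its module
...         ClassNursery.class_nursery[cls.__name__] = cls.__module__
...
>>> class MyEncoder(ClassNursery):  # hypothetical component
...     pass
...
>>> ClassNursery.class_nursery["MyEncoder"]
'__main__'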
class_nursery = {'Adam': 'torch.optim.adam',
 'BOW_Encoder': 'sciwing.modules.bow_encoder',
 'BertEmbedder': 'sciwing.modules.embedders.bert_embedder',
 'BowElmoEmbedder': 'sciwing.modules.embedders.bow_elmo_embedder',
 'CharEmbedder': 'sciwing.modules.embedders.char_embedder',
 'CharLSTMEncoder': 'sciwing.modules.charlstm_encoder',
 'CoNLLDatasetManager': 'sciwing.datasets.seq_labeling.conll_dataset',
 'ConcatEmbedders': 'sciwing.modules.embedders.concat_embedders',
 'ElmoEmbedder': 'sciwing.modules.embedders.elmo_embedder',
 'Engine': 'sciwing.engine.engine',
 'FlairEmbedder': 'sciwing.modules.embedders.flair_embedder',
 'LSTM2VecEncoder': 'sciwing.modules.lstm2vecencoder',
 'Lstm2SeqEncoder': 'sciwing.modules.lstm2seqencoder',
 'PrecisionRecallFMeasure': 'sciwing.metrics.precision_recall_fmeasure',
 'RnnSeqCrfTagger': 'sciwing.models.rnn_seq_crf_tagger',
 'SGD': 'torch.optim.sgd',
 'SimpleClassifier': 'sciwing.models.simpleclassifier',
 'SimpleTagger': 'sciwing.models.simple_tagger',
 'TextClassificationDatasetManager': 'sciwing.datasets.classification.text_classification_dataset',
 'TokenClassificationAccuracy': 'sciwing.metrics.token_cls_accuracy',
 'TrainableWordEmbedder': 'sciwing.modules.embedders.trainable_word_embedder',
 'WordEmbedder': 'sciwing.modules.embedders.word_embedder'}¶
Common Utils¶
sciwing.utils.common.cached_path(path: Union[pathlib.Path, str], url: str, unzip=True) → pathlib.Path¶
sciwing.utils.common.chunks(seq, n)¶
Yield successive n-sized chunks from seq.
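For example, assuming the standard slicing recipe (the final chunk may be shorter):
>>> from sciwing.utils.common import chunks
>>> list(chunks([1, 2, 3, 4, 5], 2))
[[1, 2], [3, 4], [5]]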
sciwing.utils.common.convert_generic_sect_to_json(filename: str) → Dict[str, Any]¶
Converts the generic sect data file into a more readable json format.
Parameters: filename (str) – The generic sect file name available at the WING-NUS website
Returns: - text – The text of the line
- label – The label of the line
- file_no – A unique file number
- line_count – A line count within the file
Return type: Dict[str, Any]
sciwing.utils.common.convert_generic_sect_to_sciwing_clf_format(filename: str, out_dir: str)¶
Converts the generic sect original file to the sciwing classification format.
Parameters: - filename (str) – The path of the file where the original generic section classification file is stored
- out_dir (str) – The output path where the train, dev and test files are written
Return type: None
sciwing.utils.common.convert_parscit_to_conll(parscit_train_filepath: pathlib.Path) → List[Dict[str, Any]]¶
Converts the parscit data available at https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/parsCit.train.data to a CONLL dummy version. This is done so that it can be used with AllenNLP's built-in conll2003 dataset reader.
Parameters: parscit_train_filepath (pathlib.Path) – The path where the train file is stored
sciwing.utils.common.convert_parscit_to_sciwing_seqlabel_format(parscit_train_filepath: pathlib.Path, output_dir: str)¶
Converts the parscit data available at https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/parsCit.train.data to the format required for sciwing sequential labelling.
Parameters: - parscit_train_filepath (pathlib.Path) – The local path where the files are stored
- output_dir (str) – The output dir where the train, dev and test files will be written
sciwing.utils.common.convert_sectlabel_to_json(filename: str) → Dict[str, Any]¶
Converts the secthead file into a more readable json format.
Parameters: filename (str) – The sectlabel file name available at the WING-NUS website
Returns: - text – The text of the line
- label – The label of the line
- file_no – A unique file number
- line_count – A line count within the file
Return type: Dict[str, Any]
sciwing.utils.common.convert_sectlabel_to_sciwing_clf_format(filename: str, out_dir: str)¶
Writes the file in the format required for the sciwing text classification dataset.
Parameters: - filename (str) – The path of the sectlabel original format file
- out_dir (str) – The path where the new files will be written
sciwing.utils.common.create_class(classname: str, module_name: str) → type¶
Given the classname and module, creates a class object and returns it.
Parameters: - classname (str) – Class name to import
- module_name (str) – The module in which the class is present
Return type: type
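A sketch of how such a helper typically works with importlib, with a usage line against the standard library (this mirrors the behaviour described, not necessarily SciWING's exact implementation):
>>> import importlib
>>> def create_class(classname: str, module_name: str) -> type:
...     module = importlib.import_module(module_name)  # import the module by name
...     return getattr(module, classname)              # look up the class on it
...
>>> create_class("OrderedDict", "collections").__name__
'OrderedDict'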
sciwing.utils.common.download_file(url: str, dest_filename: str) → None¶
Downloads a file from the given url.
Parameters: - url (str) – The url from which the file will be downloaded
- dest_filename (str) – The destination filename
sciwing.utils.common.extract_tar(filename: str, destination_dir: str, mode='r')¶
Extracts tar, targz and other files.
Parameters: - filename (str) – The tar zipped file
- destination_dir (str) – The destination directory in which the files should be placed
- mode (str) – A valid tar mode. You can refer to https://docs.python.org/3/library/tarfile.html for the different modes.
sciwing.utils.common.extract_zip(filename: str, destination_dir: str)¶
Extracts a zipped file.
Parameters: - filename (str) – The zipped filename
- destination_dir (str) – The directory where the unzipped contents will be placed
sciwing.utils.common.flatten(list_items: List[Any]) → List[Any]¶
Flattens an arbitrarily long nesting of lists.
Parameters: list_items (List[Any]) – It can be an arbitrarily long nesting of lists
Returns: Flattened list
Return type: List
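For example, with an arbitrarily nested list:
>>> from sciwing.utils.common import flatten
>>> flatten([1, [2, [3, [4]]], 5])
[1, 2, 3, 4, 5]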
sciwing.utils.common.get_system_mem_in_gb()¶
Returns the total system memory in GB.
Returns: Memory size in GB
Return type: float
sciwing.utils.common.get_train_dev_test_stratified_split(lines: List[str], labels: List[str], train_split: float = 0.8, dev_split: float = 0.1, test_split: float = 0.1, random_state: int = 1729) → ((List[str], List[str]), (List[str], List[str]), (List[str], List[str]))¶
Splits the lines and labels into train, dev and test splits using a stratified random shuffle.
Parameters: - lines (List[str]) – A list of lines
- labels (List[str]) – A list of labels
- train_split (float) – The proportion of lines to be used for training
- dev_split (float) – The proportion of lines to be used for validation
- test_split (float) – The proportion of lines to be used for testing
- random_state (int) – The seed to be used for randomization; useful for reproducing the same splits. Passing None will use the RandomState instance used by np.random
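A usage sketch with hypothetical data; with 20 lines and the default proportions the splits should come out to roughly 16/2/2 while preserving the label ratios:
>>> from sciwing.utils.common import get_train_dev_test_stratified_split
>>> lines = [f"line {i}" for i in range(20)]
>>> labels = ["method"] * 10 + ["result"] * 10
>>> (train, train_labels), (dev, dev_labels), (test, test_labels) = (
...     get_train_dev_test_stratified_split(lines, labels)
... )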
sciwing.utils.common.merge_dictionaries_with_sum(a: Dict[KT, VT], b: Dict[KT, VT]) → Dict[KT, VT]¶
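No docstring is given; the name suggests a merge where the values of keys present in both dictionaries are summed. A sketch of that behaviour (an assumption) using collections.Counter:
>>> from collections import Counter
>>> dict(Counter({"a": 1, "b": 2}) + Counter({"b": 3}))
{'a': 1, 'b': 5}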
sciwing.utils.common.pack_to_length(tokenized_text: List[str], max_length: int, pad_token: str = '<PAD>', add_start_end_token: bool = False, start_token: str = '<SOS>', end_token: str = '<EOS>') → List[str]¶
Packs tokenized text to a maximum length.
Parameters: - tokenized_text (List[str]) – A list of tokens
- max_length (int) – The max length to pack to
- pad_token (str) – The pad token to be used for the padding
- add_start_end_token (bool) – Whether to add the start and end token to every sentence while packing
- start_token (str) – The start token to be used if add_start_end_token is True
- end_token (str) – The end token to be used if add_start_end_token is True
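An illustrative call; the exact output shown assumes that padding fills the sequence up to max_length:
>>> from sciwing.utils.common import pack_to_length
>>> pack_to_length(["hello", "world"], max_length=4)
['hello', 'world', '<PAD>', '<PAD>']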
sciwing.utils.common.pairwise(iterable: Iterable[T_co]) → Iterator[T_co]¶
Returns the overlapping pairwise elements of the iterable.
Parameters: iterable (Iterable) – Anything that can be iterated
Returns: Iterator over the paired sequence
Return type: Iterator
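Assuming the classic itertools recipe (s → (s0, s1), (s1, s2), …):
>>> from sciwing.utils.common import pairwise
>>> list(pairwise([1, 2, 3, 4]))
[(1, 2), (2, 3), (3, 4)]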
sciwing.utils.common.write_cora_to_conll_file(cora_conll_filepath: pathlib.Path) → None¶
Writes the cora file that is available at https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/cora.train to CONLL format.
Parameters: cora_conll_filepath (pathlib.Path) – The destination filepath where the CORA data converted to CONLL format is written
sciwing.utils.common.write_nfold_parscit_train_test(parscit_train_filepath: pathlib.Path, output_train_filepath: pathlib.Path, output_test_filepath: pathlib.Path, nsplits: int = 2) → bool¶
Converts the parscit train file into different folds. This is useful for n-fold cross validation on the dataset. This method can be iterated over to get all the different folds of the data contained in the parscit_train_filepath.
Parameters: - parscit_train_filepath (pathlib.Path) – The path where the Parscit file is stored. The file is available at https://github.com/knmnyn/ParsCit/blob/master/crfpp/traindata/cora.train
- output_train_filepath (pathlib.Path) – The path where the train fold of the dataset will be stored
- output_test_filepath (pathlib.Path) – The path where the test fold of the dataset will be stored
- nsplits (int) – The number of splits in the dataset.
Returns: Indicates whether the particular fold has been written
Return type: bool
sciwing.utils.common.write_parscit_to_conll_file(parscit_conll_filepath: pathlib.Path) → None¶
Writes a Parscit file to CONLL file format.
Parameters: parscit_conll_filepath (pathlib.Path) – The destination file where the parscit data is written to
Custom Spacy Tokenizers¶
This module implements custom spacy tokenizers if needed. This can be useful for the custom tokenization required in the scientific domain.
Custom Exceptions¶
exception sciwing.utils.exceptions.ClassInNurseryError¶
Bases: KeyError
The ClassNursery cannot have two classes of the same name. This error is raised when that happens.
exception sciwing.utils.exceptions.DatasetPresentError(message: str)¶
Bases: Exception
exception sciwing.utils.exceptions.TOMLConfigurationError(message: str)¶
Bases: Exception
This error is raised for an illegal TOML configuration.
Science IE Data Utils¶
class sciwing.utils.science_ie_data_utils.ScienceIEDataUtils(folderpath: pathlib.Path, ignore_warnings=False)¶
Bases: object
Science-IE is a SemEval task aimed at extracting entities from scientific articles. This class is a utility for various operations on the competition's data files.
__init__(folderpath: pathlib.Path, ignore_warnings=False)¶
Given the folderpath where the ScienceIE data is stored, this class provides various utilities. For more information on the dataset you can refer to https://scienceie.github.io/
Parameters: - folderpath (pathlib.Path) – The path where the ScienceIE dataset is stored
- ignore_warnings (bool) – If True, all the warnings generated by this class for inconsistencies in the data are ignored
static _form_ann_line(idx: str, char_offset: Tuple[int, int, str], tag_name: str, doc: spacy.tokens.Doc)¶
Forms an ann line that can be used to write the ANN files for the CoNLL format.
Parameters: - idx (int) – The index for the entity being written
- char_offset (Tuple[int, int, str]) – The start, end and tag for the line
- tag_name (str) – The tag to be used, one of [Task, Process, Material]
- doc (spacy.tokens.Doc) – Spacy doc to query the appropriate characters
Returns: An ANN line that is formed.
Return type: str
_get_annotations_for_entity(file_id: str, entity: str) → List[Dict[str, Any]]¶
Parameters: - file_id (str) – A ScienceIE file id
- entity (str) – One of [Task, Process, Material]
Returns: A list of annotations, where every annotation has
- start – The start character index of the annotation
- end – The end character index of the annotation
- words – The set of words between the start and the end index
- entity_number – The entity number
- tag – The tag associated with the set of words
Return type: List[Dict[str, Any]]
_get_bilou_lines_for_entity(text: str, annotations: List[Dict[str, Any]], entity: str) → List[str]¶
Returns the list of BILOU lines for an entity.
Parameters: - text (str) – The text for which BILOU lines need to be returned
- annotations (List[Dict[str, Any]]) – The list of annotations where every annotation is a dictionary
- entity (str) – A particular entity for which the BILOU lines are returned
Returns: The list of BILOU tagged lines, where every line is of the form word, tag, tag, tag and the tag is decided by the entity.
Return type: List[str]
get_bilou_lines_for_entity(file_id: str, entity: str)¶
Gets the BILOU lines for the entity type.
Parameters: - file_id (str) – File id of the annotation file
- entity (str) – The entity for which the BILOU lines are returned
Returns: The list of BILOU lines for the entity
Return type: List[str]
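As an illustration of the format described above (a word followed by its tag repeated three times; the text here is hypothetical), BILOU lines for the Task entity might look like:
semantic B-Task B-Task B-Task
role I-Task I-Task I-Task
labelling L-Task L-Task L-Task
of O O O
text O O O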
get_file_ids() → List[str]¶
Gets all the file ids from the folder.
Returns: A list of file ids in the folder
Return type: List[str]
get_sentence_wise_bilou_lines(file_id: str, entity_type: str) → List[List[str]]¶
Gets BILOU lines sentence-wise.
Parameters: - file_id (str) – File id from the ScienceIE dataset
- entity_type (str) – One of ['Task', 'Process', 'Material']
Returns: A list of sentences, where every sentence is a list of BILOU lines
Return type: List[List[str]]
get_sents(text: str) → List[span.Span]¶
Returns all the sentences in the text.
Parameters: text (str) – The text to be split into sentences
Returns: All the sentences in the text as spacy spans. A spacy span encodes more information within.
Return type: List[span.Span]
get_text_from_fileid(file_id: str) → str¶
Given a file id, returns the text from the file.
Parameters: file_id (str) – A ScienceIE data file id
Returns: Text read from the file
Return type: str
merge_files(task_filename: pathlib.Path, process_filename: pathlib.Path, material_filename: pathlib.Path, out_filename: pathlib.Path)¶
Merges the different files into one conll file.
Parameters: - task_filename (pathlib.Path) – The CONLL style file having Task tags
- process_filename (pathlib.Path) – The CONLL style file having Process tags
- material_filename (pathlib.Path) – The CONLL style file having Material tags
- out_filename (pathlib.Path) – The output file where the different files will be merged, with every line consisting of word Task-tag Process-tag Material-tag
write_ann_file_from_conll_file(conll_filepath: pathlib.Path, ann_filepath: pathlib.Path, text: str)¶
write_bilou_lines(out_filename: pathlib.Path, is_sentence_wise: bool = False)¶
Writes BILOU lines in the out_filename for all the files in self.folderpath. The output file will contain every word on one line with its tag in BILOU format. You can also opt to write the text sentence-wise: the text, which possibly consists of multiple sentences, is broken down into sentences and then written into the output filename, with different sentences separated by an empty line.
Parameters: - out_filename (pathlib.Path) – The output filename where the conll lines are written
- is_sentence_wise (bool) – You can write the BILOU lines sentence-wise. The text in all the ScienceIE files will be broken into sentences, and the sentences will be tagged with BILOU tags
Sciwing TOML Runner¶
class sciwing.utils.sciwing_toml_runner.SciWingTOMLRunner(toml_filename: pathlib.Path, infer: bool = False)¶
Bases: object
_form_dag(section_name: str, section: Dict[KT, VT], parent: str)¶
Forms a DAG of the model section for execution.
The model can be a complex structure with various sub-components. One component depends on another, and the order of execution has to be decided; a DAG is a good abstract model to define the dependence between the different modules. This method builds the DAG given the section name and the TOML section being parsed, with a directed edge between the parent and the child.
Parameters: - section_name (str) – The name of the TOML section being parsed
- section (Dict) – The details of the actual section
- parent (str) – The node id of the parent graph
_instantiate_model_using_dag()¶
This is a key method that instantiates the DAG in topological order.
The DAG from the TOML model section should be instantiated with the submodules of a module instantiated before the parent module can be instantiated. This method does it using topological sort. A topological sort is an ordering of the nodes of a DAG such that if there is an edge from u to v, then u appears before v in the ordering.
We do exactly this for SciWING. We instantiate the children nodes that are used by parent nodes before we instantiate the root node of the DAG, which represents the entire module.
Returns: The instantiation of the root node
Return type: nn.Module
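A minimal, self-contained sketch of this bottom-up idea; the component names and factories are hypothetical, and SciWING's actual implementation works on the parsed TOML sections:
>>> # Hypothetical dependency graph: an edge points from a module to the
>>> # submodules it needs; children are instantiated before their parents.
>>> dependencies = {"model": ["encoder"], "encoder": ["embedder"], "embedder": []}
>>> factories = {
...     "embedder": lambda children: "Embedder()",
...     "encoder": lambda children: f"Encoder({', '.join(children)})",
...     "model": lambda children: f"Model({', '.join(children)})",
... }
>>> def instantiate(node):
...     children = [instantiate(child) for child in dependencies[node]]
...     return factories[node](children)
...
>>> instantiate("model")
'Model(Encoder(Embedder()))'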
_parse_toml_file()¶
Parses the toml file and returns the document.
Returns: The dictionary obtained by parsing the toml file
Return type: Dict[str, Any]
parse()¶
Parses the dataset, model and engine sections of the toml file.
parse_dataset_section()¶
Parses the dataset section of the toml file and instantiates the dataset.
Returns: The dataset manager for the experiment
Return type: DatasetManager
parse_engine_section()¶
Parses the engine section of the TOML file.
Returns: Object of the Engine class
Return type: Engine
parse_model_section()¶
Parses the model section of the toml file.
Returns: A torch module representing the model
Return type: nn.Module
run()¶
Tensor Utils¶
sciwing.utils.tensor_utils.get_mask(batch_size: int, max_size: int, lengths: torch.LongTensor)¶
Returns a mask given the lengths tensor. A convenience method.
Given a lengths tensor as in
>>> torch.LongTensor([3, 1, 2])
which often indicates the original lengths of the tensors without padding, get_mask() returns a tensor with 1 in positions where there is no padding and 0 where there is padding.
Parameters: - batch_size (int) – Batch size of the tensors
- max_size (int) – Maximum size, often the maximum number of time steps
- lengths (torch.LongTensor) – The original length of the tensors in the batch without padding
Returns: Mask having 1 where there are no paddings and 0 where there are paddings
Return type: torch.LongTensor
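For instance, with the lengths tensor above (the exact layout and dtype of the returned mask are an assumption based on the description):
>>> import torch
>>> from sciwing.utils.tensor_utils import get_mask
>>> get_mask(batch_size=3, max_size=3, lengths=torch.LongTensor([3, 1, 2]))
tensor([[1, 1, 1],
        [1, 0, 0],
        [1, 1, 0]])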
sciwing.utils.tensor_utils.has_tensor(obj) → bool¶
Given a possibly complex data structure, checks if it has any torch.Tensors in it. From allennlp.nn.util.
sciwing.utils.tensor_utils.move_to_device(obj, cuda_device: torch.device)¶
Given a structure (possibly) containing Tensors on the CPU, moves all the Tensors to the specified GPU (or does nothing, if they should be on the CPU). From allennlp.nn.util.
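A usage sketch; per the signature above, cuda_device is taken to be a torch.device:
>>> import torch
>>> from sciwing.utils.tensor_utils import move_to_device
>>> batch = {"tokens": torch.zeros(2, 3), "lengths": [3, 2]}
>>> device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
>>> batch = move_to_device(batch, device)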
NER Terminal Visualizer¶
Bases: object
Visualizes sequence tagging in the terminal.
Parameters: - colors (List[str]) – The set of colors that will be used for tagging
- colors_palette (str) – The color palette that should be used. For more information on color palettes you can refer to the documentation of the python package colorful
- tags (List[str]) – The set of all labels that can be tagged. If this is not given, then the tags will be inferred from the labels during tagging
Visualizes the tags from json.
Parameters: - json_annotation (str) – You can send a json that has the following format: {'text': str, 'tags': [{'start': int, 'end': int, 'tag': str}]}
- show_only_entities (List[str]) – You can filter to show only these entities.
Visualizes sequentially tagged data, where the string is represented as a set of words and every word has a corresponding label. This can be extended to different tagging schemes at a later point in time.
Parameters: - text (List[str]) – String to be tagged, represented as a list of words
- labels (List[str]) – The labels corresponding to each word in the string
Return type: None