sciwing.modules.embedders

bert_embedder

class sciwing.modules.embedders.bert_embedder.BertEmbedder(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, dropout_value: float = 0.0, aggregation_type: str = 'sum', bert_type: str = 'bert-base-uncased', word_tokens_namespace='tokens', device: Union[torch.device, str] = torch.device('cpu'))

Bases: torch.nn.Module, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, dropout_value: float = 0.0, aggregation_type: str = 'sum', bert_type: str = 'bert-base-uncased', word_tokens_namespace='tokens', device: Union[torch.device, str] = torch.device('cpu'))

BERT embedder that maps the given instances to BERT embeddings

Parameters:
  • dropout_value (float) – The amount of dropout to be added after the embedding
  • aggregation_type (str) –

    The strategy for aggregating the representations that BERT produces from its different layers. One of:

    sum
    Sum the representations from all the layers
    average
    Average the representations from all the layers
  • bert_type (str) –

    The kind of BERT embedding to be used

    bert-base-uncased
    12 layer transformer trained on lowercased vocab
    bert-large-uncased
    24 layer transformer trained on lowercased vocab
    bert-base-cased
    12 layer transformer trained on cased vocab
    bert-large-cased
    24 layer transformer trained on cased vocab
    scibert-base-cased
    12 layer transformer trained on scientific documents with a cased normal vocab
    scibert-sci-cased
    12 layer transformer trained on scientific documents with a cased scientific vocab
    scibert-base-uncased
    12 layer transformer trained on scientific documents with an uncased normal vocab
    scibert-sci-uncased
    12 layer transformer trained on scientific documents with an uncased scientific vocab
  • word_tokens_namespace (str) – The namespace in the lines where the tokens are stored
  • device (Union[torch.device, str]) – The device on which the model is run.
forward(lines: List[sciwing.data.line.Line]) → torch.Tensor
Parameters:lines (List[Line]) – A list of lines
Returns:The BERT embeddings for all the words in the instances. The size of the returned embedding is [batch_size, max_len_word_tokens, emb_dim]
Return type:torch.Tensor
get_embedding_dimension() → int
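
A minimal usage sketch (not part of the generated API reference): the lines variable is assumed to be a pre-built List[Line], for example obtained from a DatasetsManager.

    from sciwing.modules.embedders.bert_embedder import BertEmbedder

    lines = ...  # assumed: a List[Line] prepared elsewhere (e.g. via a DatasetsManager)

    # Average the representations from all BERT layers; "bert-base-uncased"
    # is one of the bert_type values listed above.
    embedder = BertEmbedder(
        bert_type="bert-base-uncased",
        aggregation_type="average",
        dropout_value=0.1,
        device="cpu",
    )
    embedding = embedder(lines)  # [batch_size, max_len_word_tokens, emb_dim]
    emb_dim = embedder.get_embedding_dimension()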

bow_elmo_embedder

class sciwing.modules.embedders.bow_elmo_embedder.BowElmoEmbedder(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, layer_aggregation: str = 'sum', device: Union[str, torch.device] = torch.device('cpu'), word_tokens_namespace='tokens')

Bases: torch.nn.Module, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, layer_aggregation: str = 'sum', device: Union[str, torch.device] = torch.device('cpu'), word_tokens_namespace='tokens')

Bag-of-words ELMo embedder that aggregates the ELMo embeddings for every token

Parameters:
  • layer_aggregation (str) –

    You can choose one of [sum, average, last, first], which decides how to aggregate the different layers of ELMo. ELMo produces three layers of representations

    sum
    Representations from different layers are summed
    average
    Representations from different layers are averaged
    last
    Only the representation from the last layer is considered
    first
    Only the representation from the first layer is considered
  • device (Union[str, torch.device]) – The device on which the model is run
  • word_tokens_namespace (str) – Namespace where all the word tokens are stored
forward(lines: List[sciwing.data.line.Line]) → torch.Tensor
Parameters:lines (List[Line]) – A list of lines
Returns:The representation for every token in the instance, of size [batch_size, max_num_words, emb_dim]. In the case of ELMo, emb_dim is 1024
Return type:torch.Tensor
get_embedding_dimension() → int
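
A short sketch; as in the BertEmbedder sketch above, lines is assumed to be a prepared List[Line].

    from sciwing.modules.embedders.bow_elmo_embedder import BowElmoEmbedder

    # Keep only the last ELMo layer for every token.
    elmo_embedder = BowElmoEmbedder(layer_aggregation="last", device="cpu")
    token_repr = elmo_embedder(lines)  # [batch_size, max_num_words, 1024]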

char_embedder

class sciwing.modules.embedders.char_embedder.CharEmbedder(char_embedding_dimension: int, hidden_dimension: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', char_tokens_namespace: str = 'char_tokens', device: Union[str, torch.device] = torch.device('cpu'))

Bases: torch.nn.Module, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(char_embedding_dimension: int, hidden_dimension: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', char_tokens_namespace: str = 'char_tokens', device: Union[str, torch.device] = torch.device('cpu'))

This is a character embedder that takes in lines and collates the character embeddings for all the tokens in the lines.

Parameters:
  • char_embedding_dimension (int) – The dimension of the character embedding
  • word_tokens_namespace (str) – The namespace where the word tokens are stored
  • char_tokens_namespace (str) – The namespace where the character tokens are saved
  • datasets_manager (DatasetsManager) – The dataset manager that handles all the datasets
  • hidden_dimension (int) – The hidden dimension of the LSTM which will be used to get character embeddings
forward(lines: List[sciwing.data.line.Line])
get_embedding_dimension() → int
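
A usage sketch; data_manager and lines are assumed to exist and to be a DatasetsManager and a List[Line] respectively.

    from sciwing.modules.embedders.char_embedder import CharEmbedder

    # 25-dimensional character embeddings composed by an LSTM with
    # hidden dimension 50.
    char_embedder = CharEmbedder(
        char_embedding_dimension=25,
        hidden_dimension=50,
        datasets_manager=data_manager,
        device="cpu",
    )
    char_repr = char_embedder(lines)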

concat_embedders

class sciwing.modules.embedders.concat_embedders.ConcatEmbedders(embedders: List[torch.nn.Module], datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None)

Bases: torch.nn.Module, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(embedders: List[torch.nn.Module], datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None)

Concatenates a set of embedders into a single embedder.

Parameters:embedders (List[nn.Module]) – A list of embedders that can be concatenated
forward(lines: List[sciwing.data.line.Line])
Parameters:lines (List[Line]) – A list of Lines.
Returns:The concatenated embedding of size [batch_size, time_steps, embedding_dimension], where embedding_dimension is the total dimension after concatenation
Return type:torch.FloatTensor
get_embedding_dimension()
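
A sketch of the common pattern of concatenating a word embedder with a character embedder; word_embedder, char_embedder and lines are assumed to be built as in the word_embedder and char_embedder sections of this page.

    from sciwing.modules.embedders.concat_embedders import ConcatEmbedders

    embedder = ConcatEmbedders([word_embedder, char_embedder])
    embedding = embedder(lines)  # [batch_size, time_steps, word_dim + char_dim]
    total_dim = embedder.get_embedding_dimension()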

elmo_embedder

class sciwing.modules.embedders.elmo_embedder.ElmoEmbedder(dropout_value: float = 0.5, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', device: torch.device = torch.device('cpu'), fine_tune: bool = False)

Bases: torch.nn.Module, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

forward(lines: List[sciwing.data.line.Line])
get_embedding_dimension()

flair_embedder

class sciwing.modules.embedders.flair_embedder.FlairEmbedder(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[str, torch.device] = 'cpu', word_tokens_namespace: str = 'tokens')

Bases: torch.nn.Module, sciwing.utils.class_nursery.ClassNursery, sciwing.modules.embedders.base_embedders.BaseEmbedder

__init__(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[str, torch.device] = 'cpu', word_tokens_namespace: str = 'tokens')

Flair embeddings. These are commonly used for Named Entity Recognition. Note: this only works if your tokens are produced by splitting on whitespace

Parameters:
  • embedding_type (str) – The type of Flair embedding to be used
  • datasets_manager (DatasetsManager) – The datasets manager which is running your experiments
  • device (Union[str, torch.device]) – The device on which the embedder is run
  • word_tokens_namespace (str) – The namespace where the word tokens are stored
forward(lines: List[sciwing.data.line.Line])
get_embedding_dimension()
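
A brief sketch; the embedding_type value "news" is an assumption that may differ in your installation, lines is assumed to be a prepared List[Line], and its tokens must come from whitespace splitting.

    from sciwing.modules.embedders.flair_embedder import FlairEmbedder

    flair_embedder = FlairEmbedder(embedding_type="news", device="cpu")
    flair_repr = flair_embedder(lines)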

trainable_word_embedder

class sciwing.modules.embedders.trainable_word_embedder.TrainableWordEmbedder(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', device: torch.device = torch.device('cpu'))

Bases: torch.nn.Module, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', device: torch.device = torch.device('cpu'))

This represents trainable word embeddings, which are trained along with the parameters of the network. The embeddings in the class WordEmbedder are not trainable; they are static.

Parameters:
  • embedding_type (str) – The type of embedding that you would want
  • datasets_manager (DatasetsManager) – The datasets manager which is running your experiments
  • word_tokens_namespace (str) – The namespace where the word tokens are stored in your data
  • device (Union[torch.device, str]) – The device on which this embedder is run
forward(lines: List[sciwing.data.line.Line]) → torch.Tensor
get_embedding_dimension() → int
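
A sketch that mirrors WordEmbedder but with trainable parameters; the embedding_type string is illustrative, and data_manager and lines are assumed to be a DatasetsManager and a List[Line].

    from sciwing.modules.embedders.trainable_word_embedder import TrainableWordEmbedder

    trainable_embedder = TrainableWordEmbedder(
        embedding_type="glove_6B_50",
        datasets_manager=data_manager,
    )
    embedding = trainable_embedder(lines)  # gradients flow into these embeddings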

word_embedder

class sciwing.modules.embedders.word_embedder.WordEmbedder(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace='tokens', device: Union[torch.device, str] = torch.device('cpu'))

Bases: torch.nn.Module, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace='tokens', device: Union[torch.device, str] = torch.device('cpu'))

Word Embedder embeds the tokens using the desired embeddings. These are static embeddings.

Parameters:
  • embedding_type (str) – The type of embedding that you would want
  • datasets_manager (DatasetsManager) – The datasets manager which is running your experiments
  • word_tokens_namespace (str) – The namespace where the word tokens are stored in your data
  • device (Union[torch.device, str]) – The device on which this embedder is run
forward(lines: List[sciwing.data.line.Line]) → torch.FloatTensor

This considers only the “tokens” present in the line. The namespace for the tokens is set at class instantiation

Parameters:lines (List[Line]) – A list of lines
Returns:The embedding of size [batch_size, max_num_timesteps, embedding_dimension]
Return type:torch.FloatTensor
get_embedding_dimension() → int
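
A usage sketch; "glove_6B_100" is an illustrative embedding_type, the exact supported names depend on your sciwing installation, and lines is assumed to be a prepared List[Line].

    from sciwing.modules.embedders.word_embedder import WordEmbedder

    word_embedder = WordEmbedder(embedding_type="glove_6B_100")
    embedding = word_embedder(lines)  # [batch_size, max_num_timesteps, embedding_dimension]
    emb_dim = word_embedder.get_embedding_dimension()  # 100 for a 100-d GloVe choice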