sciwing.modules.embedders

bert_embedder

class sciwing.modules.embedders.bert_embedder.BertEmbedder(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, dropout_value: float = 0.0, aggregation_type: str = 'sum', bert_type: str = 'bert-base-uncased', word_tokens_namespace='tokens', device: Union[torch.device, str] = torch.device('cpu'))

Bases: torch.nn.Module, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, dropout_value: float = 0.0, aggregation_type: str = 'sum', bert_type: str = 'bert-base-uncased', word_tokens_namespace='tokens', device: Union[torch.device, str] = torch.device('cpu'))

BERT embedder that embeds the given instance using BERT embeddings

Parameters:
  • dropout_value (float) – The amount of dropout to be added after the embedding
  • aggregation_type (str) –

    The kind of aggregation over different layers. BERT produces representations from multiple layers, and this specifies the strategy for aggregating them. One of:

    sum
    Sum the representations from all the layers
    average
    Average the representations from all the layers
  • bert_type (str) –

    The kind of BERT embedding to be used. One of:

    bert-base-uncased
    12-layer transformer trained on a lowercased vocab
    bert-large-uncased
    24-layer transformer trained on a lowercased vocab
    bert-base-cased
    12-layer transformer trained on a cased vocab
    bert-large-cased
    24-layer transformer trained on a cased vocab
    scibert-base-cased
    12-layer transformer trained on scientific documents with a cased normal vocab
    scibert-sci-cased
    12-layer transformer trained on scientific documents with a cased scientific vocab
    scibert-base-uncased
    12-layer transformer trained on scientific documents with an uncased normal vocab
    scibert-sci-uncased
    12-layer transformer trained on scientific documents with an uncased scientific vocab
  • word_tokens_namespace (str) – The namespace in the lines where the tokens are stored
  • device (Union[torch.device, str]) – The device on which the model is run.
forward(lines: List[sciwing.data.line.Line]) → torch.Tensor
Parameters:lines (List[Line]) – A list of lines
Returns:The BERT embeddings for all the words in the instances. The size of the returned embedding is [batch_size, max_len_word_tokens, emb_dim]
Return type:torch.Tensor
get_embedding_dimension() → int
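The sum and average aggregation strategies can be sketched in plain Python. This is a conceptual illustration only, not SciWING's implementation; the layer values are made up:

```python
# Conceptual sketch of the two aggregation strategies. Each inner list
# stands in for one layer's representation of a single token; the
# values are illustrative, not real BERT outputs.
layers = [
    [1.0, 2.0],  # layer 1
    [3.0, 4.0],  # layer 2
    [5.0, 6.0],  # layer 3
]

def aggregate(layers, strategy="sum"):
    """Combine per-layer token representations into a single vector."""
    summed = [sum(dims) for dims in zip(*layers)]
    if strategy == "sum":
        return summed
    if strategy == "average":
        return [s / len(layers) for s in summed]
    raise ValueError(f"unknown aggregation_type: {strategy}")

print(aggregate(layers, "sum"))      # [9.0, 12.0]
print(aggregate(layers, "average"))  # [3.0, 4.0]
```

In the embedder itself the same choice is made per token over BERT's layer outputs, with emb_dim-sized vectors instead of these 2-dim stand-ins.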

bow_elmo_embedder

class sciwing.modules.embedders.bow_elmo_embedder.BowElmoEmbedder(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, layer_aggregation: str = 'sum', device: Union[str, torch.device] = torch.device('cpu'), word_tokens_namespace='tokens')

Bases: torch.nn.Module, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, layer_aggregation: str = 'sum', device: Union[str, torch.device] = torch.device('cpu'), word_tokens_namespace='tokens')

Bag-of-words ELMo embedder that aggregates the ELMo embedding for every token

Parameters:
  • layer_aggregation (str) –

    You can choose one of [sum, average, last, first], which decides how to aggregate the different layers of ELMo. ELMo produces three layers of representations

    sum
    Representations from the different layers are summed
    average
    Representations from the different layers are averaged
    last
    The representation from the last layer is used
    first
    The representation from the first layer is used
  • device (Union[str, torch.device]) – The device on which the model is run
  • word_tokens_namespace (str) – The namespace where all the word tokens are stored
forward(lines: List[sciwing.data.line.Line]) → torch.Tensor
Parameters:lines (List[Line]) – Just a list of lines
Returns:The representation for every token in the instance, of size [batch_size, max_num_words, emb_dim]. In the case of ELMo, emb_dim is 1024
Return type:torch.Tensor
get_embedding_dimension() → int
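The four layer_aggregation choices can be sketched the same way in plain Python. The values below are illustrative; real ELMo produces three 1024-dimensional layers per token:

```python
# Conceptual sketch of the four layer_aggregation choices over ELMo's
# three layers of representations (values illustrative, not real ELMo).
elmo_layers = [
    [1.0, 2.0],  # first layer
    [3.0, 4.0],  # middle layer
    [5.0, 6.0],  # last layer
]

def aggregate_layers(layers, strategy):
    """Reduce per-layer vectors for one token to a single vector."""
    if strategy == "sum":
        return [sum(dims) for dims in zip(*layers)]
    if strategy == "average":
        return [sum(dims) / len(layers) for dims in zip(*layers)]
    if strategy == "last":
        return layers[-1]
    if strategy == "first":
        return layers[0]
    raise ValueError(f"unknown layer_aggregation: {strategy}")

print(aggregate_layers(elmo_layers, "last"))   # [5.0, 6.0]
print(aggregate_layers(elmo_layers, "first"))  # [1.0, 2.0]
```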

char_embedder

class sciwing.modules.embedders.char_embedder.CharEmbedder(char_embedding_dimension: int, hidden_dimension: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', char_tokens_namespace: str = 'char_tokens', device: Union[str, torch.device] = torch.device('cpu'))

Bases: torch.nn.Module, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(char_embedding_dimension: int, hidden_dimension: int, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', char_tokens_namespace: str = 'char_tokens', device: Union[str, torch.device] = torch.device('cpu'))

This is a character embedder that takes in lines and collates the character embeddings for all the tokens in the lines.

Parameters:
  • char_embedding_dimension (int) – The dimension of the character embedding
  • word_tokens_namespace (str) – The namespace where the words are stored
  • char_tokens_namespace (str) – The namespace where the character tokens are saved
  • datasets_manager (DatasetsManager) – The dataset manager that handles all the datasets
  • hidden_dimension (int) – The hidden dimension of the LSTM which will be used to get character embeddings
forward(lines: List[sciwing.data.line.Line])
get_embedding_dimension() → int
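The first step of character embedding — breaking each token into a sequence of character ids before the LSTM runs over them — can be sketched as follows. The vocabulary and ids here are hypothetical; SciWING builds its own mapping from the dataset under char_tokens_namespace:

```python
# Conceptual sketch: map each token to a sequence of character ids.
# The character vocabulary here is hypothetical, for illustration only.
char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}

def char_ids(token):
    """Return the id sequence a character LSTM would consume."""
    return [char_vocab[c] for c in token.lower() if c in char_vocab]

print(char_ids("cat"))  # [2, 0, 19]
```

An LSTM with the given hidden_dimension then runs over each id sequence, and its final hidden state serves as that token's character embedding.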

concat_embedders

class sciwing.modules.embedders.concat_embedders.ConcatEmbedders(embedders: List[torch.nn.Module], datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None)

Bases: torch.nn.Module, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(embedders: List[torch.nn.Module], datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None)

Concatenates a set of embedders into a single embedder.

Parameters:embedders (List[nn.Module]) – A list of embedders that can be concatenated
forward(lines: List[sciwing.data.line.Line])
Parameters:lines (List[Line]) – A list of Lines.
Returns:Returns the concatenated embedding of size [batch_size, time_steps, embedding_dimension], where embedding_dimension is the sum of the individual embedders' dimensions
Return type:torch.FloatTensor
get_embedding_dimension()
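Concatenation happens along the embedding dimension, so the final dimension is the sum of the individual dimensions. A plain-Python sketch for a single token (illustrative values, not SciWING code):

```python
# Conceptual sketch: concatenate two embedders' outputs for one token.
word_emb = [0.5, 0.5]       # e.g. a 2-dim word embedding
char_emb = [0.1, 0.2, 0.3]  # e.g. a 3-dim character embedding

concatenated = word_emb + char_emb
print(len(concatenated))  # 5 = 2 + 3
```

This mirrors a common pattern of combining a WordEmbedder with a CharEmbedder so that each token carries both word-level and character-level information.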

elmo_embedder

class sciwing.modules.embedders.elmo_embedder.ElmoEmbedder(dropout_value: float = 0.5, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', device: torch.device = torch.device('cpu'), fine_tune: bool = False)

Bases: torch.nn.Module, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

forward(lines: List[sciwing.data.line.Line])
get_embedding_dimension()

flair_embedder

class sciwing.modules.embedders.flair_embedder.FlairEmbedder(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[str, torch.device] = 'cpu', word_tokens_namespace: str = 'tokens')

Bases: torch.nn.Module, sciwing.utils.class_nursery.ClassNursery, sciwing.modules.embedders.base_embedders.BaseEmbedder

__init__(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, device: Union[str, torch.device] = 'cpu', word_tokens_namespace: str = 'tokens')

Flair embeddings, commonly used for Named Entity Recognition. Note: this only works if your tokens are produced by splitting on whitespace

Parameters:
  • embedding_type (str) – The type of Flair embedding to be used
  • datasets_manager (DatasetsManager) – The datasets manager which is running your experiments
  • device (Union[str, torch.device]) – The device on which the model is run
  • word_tokens_namespace (str) – The namespace where the word tokens are stored
forward(lines: List[sciwing.data.line.Line])
get_embedding_dimension()

trainable_word_embedder

class sciwing.modules.embedders.trainable_word_embedder.TrainableWordEmbedder(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', device: torch.device = torch.device('cpu'))

Bases: torch.nn.Module, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace: str = 'tokens', device: torch.device = torch.device('cpu'))

This represents trainable word embeddings, which are trained along with the parameters of the network. The embeddings in the class WordEmbedder are not trainable; they are static

Parameters:
  • embedding_type (str) – The type of embedding that you would want
  • datasets_manager (DatasetsManager) – The datasets manager which is running your experiments
  • word_tokens_namespace (str) – The namespace where the word tokens are stored in your data
  • device (Union[torch.device, str]) – The device on which this embedder is run
forward(lines: List[sciwing.data.line.Line]) → torch.Tensor
get_embedding_dimension() → int

word_embedder

class sciwing.modules.embedders.word_embedder.WordEmbedder(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace='tokens', device: Union[torch.device, str] = torch.device('cpu'))

Bases: torch.nn.Module, sciwing.modules.embedders.base_embedders.BaseEmbedder, sciwing.utils.class_nursery.ClassNursery

__init__(embedding_type: str, datasets_manager: sciwing.data.datasets_manager.DatasetsManager = None, word_tokens_namespace='tokens', device: Union[torch.device, str] = torch.device('cpu'))

Word Embedder embeds the tokens using the desired embeddings. These are static embeddings.

Parameters:
  • embedding_type (str) – The type of embedding that you would want
  • datasets_manager (DatasetsManager) – The datasets manager which is running your experiments
  • word_tokens_namespace (str) – The namespace where the word tokens are stored in your data
  • device (Union[torch.device, str]) – The device on which this embedder is run
forward(lines: List[sciwing.data.line.Line]) → torch.Tensor

This will only consider the “tokens” present in the line. The namespace for the tokens is set at class instantiation

Parameters:lines (List[Line]) – A list of lines
Returns:It returns the embedding of the size [batch_size, max_num_timesteps, embedding_dimension]
Return type:torch.FloatTensor
get_embedding_dimension() → int
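The [batch_size, max_num_timesteps, embedding_dimension] shape comes from padding shorter lines in the batch up to the longest one. A conceptual plain-Python sketch follows; zero-padding is assumed here for illustration and may differ from SciWING's actual padding scheme:

```python
# Conceptual sketch: pad a batch of lines of unequal length to the
# shape [batch_size, max_num_timesteps, embedding_dimension].
# Zero-padding is an assumption for illustration.
emb_dim = 2
batch = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],  # line with 3 tokens
    [[4.0, 4.0]],                          # line with 1 token
]
max_len = max(len(line) for line in batch)
padded = [line + [[0.0] * emb_dim] * (max_len - len(line)) for line in batch]
print(len(padded), len(padded[0]), len(padded[0][0]))  # 2 3 2
```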