sciwing.numericalizers

numericalizer

class sciwing.numericalizers.numericalizer.Numericalizer(vocabulary: sciwing.vocab.vocab.Vocab = None)

Bases: sciwing.numericalizers.base_numericalizer.BaseNumericalizer

__init__(vocabulary: sciwing.vocab.vocab.Vocab = None)

Numericalizer converts string tokens to integer indices using a vocabulary.

Parameters: vocabulary (Vocab) – A vocabulary object that is built using a set of tokenized strings
get_mask_for_batch_instances(instances: List[List[int]])
get_mask_for_instance(instance: List[int])
numericalize_batch_instances(instances: List[List[str]]) → List[List[int]]

Numericalizes a batch of instances

Parameters: instances (List[List[str]]) – A list of tokenized sentences
Returns: A list of numericalized instances
Return type: List[List[int]]
numericalize_instance(instance: List[str]) → List[int]

Numericalizes a single instance

Parameters: instance (List[str]) – An instance is a list of tokens
Returns: Numericalized instance
Return type: List[int]
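
The token-to-index mapping described above can be illustrated with a minimal sketch. It does not use the sciwing API: the plain dictionary `TOKEN2IDX`, the special-token strings, and the fallback to an unknown-token index are all assumptions made for illustration, standing in for what the Vocab object would provide.

```python
from typing import Dict, List

# Hypothetical toy vocabulary. In sciwing, a Vocab object would supply this mapping.
TOKEN2IDX: Dict[str, int] = {
    "<PAD>": 0,
    "<UNK>": 1,
    "<SOS>": 2,
    "<EOS>": 3,
    "neural": 4,
    "networks": 5,
}


def numericalize_instance(
    instance: List[str], token2idx: Dict[str, int], unk_token: str = "<UNK>"
) -> List[int]:
    """Map each token to its index; unseen tokens map to the unknown-token index (assumed behavior)."""
    unk_idx = token2idx[unk_token]
    return [token2idx.get(token, unk_idx) for token in instance]


def numericalize_batch_instances(
    instances: List[List[str]], token2idx: Dict[str, int]
) -> List[List[int]]:
    """Numericalize every instance in the batch independently."""
    return [numericalize_instance(instance, token2idx) for instance in instances]


batch = [["neural", "networks"], ["neural", "parsers"]]
numericalized = numericalize_batch_instances(batch, TOKEN2IDX)
```

In this sketch the out-of-vocabulary token "parsers" falls back to the `<UNK>` index, so the two instances keep their original lengths.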
pad_batch_instances(instances: List[List[int]], max_length: int, add_start_end_token: bool = True) → List[List[int]]

Pads a batch of instances according to the vocab object

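Parameters:
  • instances (List[List[int]]) – A batch of numericalized instances
  • max_length (int) – The maximum length to pad to
  • add_start_end_token (bool) – If true, start and end tokens will be added to each instance
Returns:

Padded instances

Return type:

List[List[int]]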

pad_instance(numericalized_text: List[int], max_length: int, add_start_end_token: bool = True) → List[int]

Pads the instance according to the vocab object

Parameters:
  • numericalized_text (List[int]) – The numericalized instance to pad
  • max_length (int) – The maximum length to pad to
  • add_start_end_token (bool) – If true, start and end tokens will be added to the numericalized text
Returns:

Padded instance

Return type:

List[int]
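
A minimal sketch of padding a numericalized instance follows. It is not the library's implementation: the index values for the pad, start, and end tokens are hypothetical, and the choices to truncate long instances and to count the start and end tokens toward `max_length` are assumptions made for illustration.

```python
from typing import List


def pad_instance(
    numericalized_text: List[int],
    max_length: int,
    add_start_end_token: bool = True,
    pad_idx: int = 0,    # hypothetical index of the padding token
    start_idx: int = 2,  # hypothetical index of the start token
    end_idx: int = 3,    # hypothetical index of the end token
) -> List[int]:
    """Truncate or pad an instance so that it has exactly max_length entries."""
    if add_start_end_token:
        # Assumption: start and end tokens count toward max_length,
        # so at least two slots are reserved for them.
        max_length = max(max_length, 2)
        body = numericalized_text[: max_length - 2]
        padded = [start_idx] + body + [end_idx]
    else:
        padded = numericalized_text[:max_length]
    # Fill the remaining positions with the padding index.
    return padded + [pad_idx] * (max_length - len(padded))


short = pad_instance([4, 5], max_length=6)
long = pad_instance([4, 5, 6, 7, 8], max_length=4)
plain = pad_instance([4, 5], max_length=4, add_start_end_token=False)
```

Padding a whole batch under the same assumptions amounts to applying this function to each instance with a shared `max_length`.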

vocabulary

transformer_numericalizer

class sciwing.numericalizers.transformer_numericalizer.NumericalizerForTransformer(vocab: sciwing.vocab.vocab.Vocab = None, tokenizer: sciwing.tokenizers.bert_tokenizer.TokenizerForBert = None)

Bases: sciwing.numericalizers.base_numericalizer.BaseNumericalizer

get_mask_for_batch_instances(instances: List[List[int]])
get_mask_for_instance(instance: List[int])
numericalize_batch_instances(instances: List[List[str]]) → List[int]
numericalize_instance(instance: Union[List[str], List[sciwing.data.token.Token]]) → List[int]
pad_batch_instances(instances: List[List[int]], max_length: int, add_start_end_token: bool = True)

Pads a batch of instances according to the vocab object

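Parameters:
  • instances (List[List[int]]) – A batch of numericalized instances
  • max_length (int) – The maximum length to pad to
  • add_start_end_token (bool) – If true, start and end tokens will be added to each instance
Returns:

Padded instances

Return type:

List[List[int]]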

pad_instance(numericalized_text: List[int], max_length: int, add_start_end_token: bool = True) → List[int]

Pads the instance according to the vocab object

Parameters:
  • numericalized_text (List[int]) – The numericalized instance to pad
  • max_length (int) – The maximum length to pad to
  • add_start_end_token (bool) – If true, start and end tokens will be added to the numericalized text
Returns:

Padded instance

Return type:

List[int]