sciwing.tokenizers

BaseTokenizer

bert_tokenizer
class sciwing.tokenizers.bert_tokenizer.TokenizerForBert(bert_type: str, do_basic_tokenize=True)
    Bases: sciwing.tokenizers.BaseTokenizer.BaseTokenizer

    tokenize(text: str) → List[str]

    tokenize_batch(texts: List[str]) → List[List[str]]
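The BaseTokenizer parent class is not reproduced in this page; assuming it only declares the tokenize/tokenize_batch contract shown in the signatures above, a minimal sketch of the shared interface might look like this (this is an illustration, not the actual sciwing source):

```python
from abc import ABC, abstractmethod
from typing import List


class BaseTokenizer(ABC):
    """Sketch of the interface the concrete tokenizers share (assumed)."""

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        """Tokenize a single piece of text into a list of tokens."""

    def tokenize_batch(self, texts: List[str]) -> List[List[str]]:
        """Tokenize every text in the batch by calling tokenize() on each."""
        return [self.tokenize(text) for text in texts]
```

A subclass then only needs to implement tokenize; tokenize_batch comes for free from the base class.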
character_tokenizer

class sciwing.tokenizers.character_tokenizer.CharacterTokenizer
    Bases: sciwing.tokenizers.BaseTokenizer.BaseTokenizer

    tokenize(text: str) → List[str]

    tokenize_batch(texts: List[str]) → List[List[str]]
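Character tokenization simply splits a string into its individual characters. A dependency-free sketch of that behaviour (an illustration of the idea, not the sciwing implementation itself):

```python
from typing import List


class CharacterTokenizer:
    """Sketch: each character of the text becomes one token."""

    def tokenize(self, text: str) -> List[str]:
        # list() over a string yields its characters, including spaces
        return list(text)

    def tokenize_batch(self, texts: List[str]) -> List[List[str]]:
        return [self.tokenize(text) for text in texts]
```

For example, tokenizing "word" yields ["w", "o", "r", "d"].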
word_tokenizer

class sciwing.tokenizers.word_tokenizer.WordTokenizer(tokenizer: str = 'spacy')
    Bases: sciwing.tokenizers.BaseTokenizer.BaseTokenizer

    __init__(tokenizer: str = 'spacy')
        WordTokenizers split the text into word tokens.

        Parameters: tokenizer (str) – The type of tokenizer.
            - spacy – Tokenizer from spaCy
            - nltk – NLTK-based tokenizer
            - vanilla – Splits the text on whitespace
            - spacy-whitespace – Same as vanilla, but implemented using a custom whitespace tokenizer from spaCy
    tokenize(text: str) → List[str]
        Tokenize the text into a list of tokens.

        Parameters: text (str) – A single instance that is tokenized into a list of tokens.
        Returns: A list of tokens.
        Return type: List[str]

    tokenize_batch(texts: List[str]) → List[List[str]]
        Tokenize a batch of sentences.

        Parameters: texts (List[str]) – A batch of sentences, each of which is tokenized.
        Returns: One list of tokens per input sentence.
        Return type: List[List[str]]
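The actual WordTokenizer dispatches to spaCy or NLTK depending on the tokenizer argument; those backends are not sketched here. The vanilla option, however, is plain whitespace splitting, which can be illustrated without any dependencies (hypothetical class name, for illustration only):

```python
from typing import List


class VanillaWordTokenizer:
    """Sketch of the 'vanilla' behaviour: split the text on whitespace."""

    def tokenize(self, text: str) -> List[str]:
        # str.split() with no argument splits on runs of whitespace
        # and drops leading/trailing whitespace
        return text.split()

    def tokenize_batch(self, texts: List[str]) -> List[List[str]]:
        return [self.tokenize(text) for text in texts]
```

So tokenizing "parsing scientific text" with the vanilla option yields ["parsing", "scientific", "text"], whereas the spacy and nltk options additionally handle punctuation and other language-aware splitting.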