sciwing.tokenizers

BaseTokenizer

class sciwing.tokenizers.BaseTokenizer.BaseTokenizer

Bases: object

tokenize(text: str) → List[str]
tokenize_batch(texts: List[str]) → List[List[str]]
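
BaseTokenizer defines the interface that every tokenizer in sciwing implements. Below is a minimal sketch of a custom tokenizer, assuming only the two methods above need to be overridden; the example class and its comma-splitting rule are illustrative and not part of the library.

from typing import List

from sciwing.tokenizers.BaseTokenizer import BaseTokenizer


class CommaTokenizer(BaseTokenizer):
    """Illustrative tokenizer that splits text on commas."""

    def tokenize(self, text: str) -> List[str]:
        # Split one piece of text into tokens
        return [tok.strip() for tok in text.split(",")]

    def tokenize_batch(self, texts: List[str]) -> List[List[str]]:
        # Tokenize every text in the batch
        return [self.tokenize(text) for text in texts]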

bert_tokenizer

class sciwing.tokenizers.bert_tokenizer.TokenizerForBert(bert_type: str, do_basic_tokenize=True)

Bases: sciwing.tokenizers.BaseTokenizer.BaseTokenizer

tokenize(text: str) → List[str]
tokenize_batch(texts: List[str]) → List[List[str]]
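
A usage sketch for the BERT tokenizer, assuming bert_type accepts a pretrained BERT model name such as "bert-base-uncased"; the accepted model names and the resulting word-piece tokens are assumptions, not verified output.

>>> from sciwing.tokenizers.bert_tokenizer import TokenizerForBert
>>> tokenizer = TokenizerForBert(bert_type="bert-base-uncased")
>>> tokenizer.tokenize("SciWING tokenizes scientific text")
>>> tokenizer.tokenize_batch(["first sentence", "second sentence"])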

character_tokenizer

class sciwing.tokenizers.character_tokenizer.CharacterTokenizer

Bases: sciwing.tokenizers.BaseTokenizer.BaseTokenizer

tokenize(text: str) → List[str]
tokenize_batch(texts: List[str]) → List[List[str]]
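
CharacterTokenizer splits text into individual characters. A usage sketch; the output shown is what character-level splitting would produce and is illustrative.

>>> from sciwing.tokenizers.character_tokenizer import CharacterTokenizer
>>> tokenizer = CharacterTokenizer()
>>> tokenizer.tokenize("word")
['w', 'o', 'r', 'd']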

word_tokenizer

class sciwing.tokenizers.word_tokenizer.WordTokenizer(tokenizer: str = 'spacy')

Bases: sciwing.tokenizers.BaseTokenizer.BaseTokenizer

__init__(tokenizer: str = 'spacy')

WordTokenizer splits text into word-level tokens

Parameters:tokenizer (str) –

The type of tokenizer.

spacy
Tokenizer from spaCy
nltk
NLTK-based tokenizer
vanilla
Splits words on whitespace
spacy-whitespace
Same as vanilla, but implemented using spaCy's whitespace tokenizer
tokenize(text: str) → List[str]

Tokenize text into a list of tokens

Parameters:text (str) – A single instance of text that is tokenized into a list of tokens
Returns:A list of tokens
Return type:List[str]
tokenize_batch(texts: List[str]) → List[List[str]]

Tokenize a batch of sentences

Parameters:texts (List[str]) – A batch of texts, each of which is tokenized
Returns:One list of tokens per input text
Return type:List[List[str]]
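
A usage sketch showing two of the tokenizer types listed above; the spaCy token boundaries depend on the backing tokenizer and are illustrative, while the vanilla output follows directly from splitting on whitespace.

>>> from sciwing.tokenizers.word_tokenizer import WordTokenizer
>>> tokenizer = WordTokenizer(tokenizer="spacy")
>>> tokenizer.tokenize("SciWING builds NLP pipelines.")
>>> tokenizer.tokenize_batch(["first sentence", "second sentence"])
>>> whitespace_tokenizer = WordTokenizer(tokenizer="vanilla")
>>> whitespace_tokenizer.tokenize("split on spaces")
['split', 'on', 'spaces']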