CrazyTokenizer

CrazyTokenizer: spaCy-based tokenizer with Twitter- and Reddit-specific features

Splitting hashtags is based on the idea from https://stackoverflow.com/questions/11576779/how-to-extract-literal-words-from-a-consecutive-string-efficiently

Author: Evgenii Nikitin <e.nikitin@nyu.edu>

Part of https://github.com/crazyfrogspb/RedditScore project

Copyright (c) 2018 Evgenii Nikitin. All rights reserved. This work is licensed under the terms of the MIT license.

class tokenizer.CrazyTokenizer(lowercase=True, keepcaps=False, normalize=3, ignore_quotes=False, ignore_reddit_quotes=False, ignore_stopwords=False, stem=False, remove_punct=True, remove_breaks=True, decontract=False, twitter_handles=False, urls=False, hashtags=False, numbers=False, subreddits=False, reddit_usernames=False, emails=False, extra_patterns=None, keep_untokenized=None, whitespaces_to_underscores=True, remove_nonunicode=False, pos_emojis=None, neg_emojis=None, neutral_emojis=None, print_url_warnings=False, latin_chars_fix=False, ngrams=1)[source]

Tokenizer with Reddit- and Twitter-specific options

Parameters:
  • lowercase (bool, optional) – If True, lowercase all tokens. Defaults to True.
  • keepcaps (bool, optional) – If True, keep ALL CAPS WORDS uppercased. Defaults to False.
  • normalize (int or bool, optional) – If not False, normalize repeated characters ("awesoooooome" -> "awesooome"). The value of the parameter determines the maximum number of consecutive occurrences to keep. Defaults to 3.
  • ignore_quotes (bool, optional) – If True, ignore tokens contained within double quotes. Defaults to False.
  • ignore_reddit_quotes (bool, optional) – If True, remove quoted text from Reddit comments. Defaults to False.
  • ignore_stopwords (str, list, or boolean, optional) –

    Whether to ignore stopwords

    • str: name of the language whose stopword list should be loaded from the NLTK package
    • list: list of stopwords to remove
    • True: use the built-in list of English stopwords
    • False: keep all tokens

    Defaults to False

  • stem ({False, 'stem', 'lemm'}, optional) –

    Whether to perform word stemming or lemmatization

    • False: do not perform word stemming
    • ’stem’: use PorterStemmer from NLTK package
    • ’lemm’: use WordNetLemmatizer from NLTK package
  • remove_punct (bool, optional) – If True, remove punctuation tokens. Defaults to True.
  • remove_breaks (bool, optional) – If True, remove linebreak tokens. Defaults to True.
  • decontract (bool, optional) – If True, attempt to expand certain contractions (e.g., "'ll" -> " will"). Defaults to False.
  • numbers, subreddits, reddit_usernames, emails (False or str, optional) –

    Replacement of the different types of tokens

    • False: leave these tokens intact
    • str: replacement token
    • '': remove all occurrences of these tokens

    Defaults to False
  • twitter_handles (False, 'realname', 'split' or str, optional) –

    Processing of twitter handles

    • False: do nothing
    • str: replacement token
    • ’realname’: replace with the real screen name of Twitter account
    • 'split': split handles using Viterbi algorithm

  • hashtags (False or str, optional) –

    Processing of hashtags

    • False: do nothing
    • str: replacement token
    • 'split': split hashtags using Viterbi algorithm

    Example: "#vladimirputinisthebest" -> "vladimir putin is the best"
  • urls (False, str, or dict, optional) –

    Replacement of parsed URLs

    • False: leave URLs intact
    • str: replacement token
    • dict: replace all URLs stored in the keys with the corresponding values
    • '': remove all occurrences of these tokens
    • 'domain': extract the domain ("http://cnn.com" -> "cnn")
    • 'domain_unwrap_fast': extract the domain after unwrapping links for a list of common URL shorteners (goo.gl, t.co, bit.ly, tinyurl.com)
    • 'domain_unwrap': extract the domain after unwrapping all links
    • 'title': extract and tokenize the title of each link after unwrapping it

    Defaults to False.

  • extra_patterns (None or list of tuples, optional) –

    Replacement of any user-supplied extra patterns. Tuples must have the following form: (name, re_pattern, replacement_token):

    • name (str): name of the pattern
    • re_pattern (_sre.SRE_Pattern): compiled re pattern
    • replacement_token (str): replacement token

    Defaults to None

  • keep_untokenized (None or list, optional) –

    List of expressions to keep untokenized

    Example: [“New York”, “Los Angeles”, “San Francisco”]

  • whitespaces_to_underscores (boolean, optional) – If True, replace all whitespace characters with underscores in the final tokens. Defaults to True.
  • remove_nonunicode (boolean, optional) – If True, remove all non-unicode characters. Defaults to False.
  • pos_emojis, neg_emojis, neutral_emojis (None, True, or list, optional) –

    Replace positive, negative, and neutral emojis with special tokens

    • None: do not perform replacement
    • True: replace emojis from the built-in default lists
    • list: list of emojis to replace

    Defaults to None.
  • print_url_warnings (bool, optional) – If True, print URL-related warnings. Defaults to False.
  • latin_chars_fix (bool, optional) – Try applying this fix if you have a lot of \xe2\x80\x99-like or U+1F601-like strings in your data. Defaults to False.
  • ngrams (int, optional) – If greater than 1, add n-grams of tokens after tokenizing. Defaults to 1.
tokenize(text)[source]

Tokenize document

Parameters: text (str) – Document to tokenize
Returns: List of tokens
Return type: list

Examples

>>> from redditscore.tokenizer import CrazyTokenizer
>>> tokenizer = CrazyTokenizer(hashtags='split')
>>> tokenizer.tokenize("#makeamericagreatagain")
['make', 'america', 'great', 'again']