CrazyTokenizer
CrazyTokenizer: spaCy-based tokenizer with Twitter- and Reddit-specific features
Splitting hashtags is based on the idea from https://stackoverflow.com/questions/11576779/how-to-extract-literal-words-from-a-consecutive-string-efficiently
Author: Evgenii Nikitin <e.nikitin@nyu.edu>
Part of https://github.com/crazyfrogspb/RedditScore project
Copyright (c) 2018 Evgenii Nikitin. All rights reserved. This work is licensed under the terms of the MIT license.
class tokenizer.CrazyTokenizer(lowercase=True, keepcaps=False, normalize=3, ignore_quotes=False, ignore_reddit_quotes=False, ignore_stopwords=False, stem=False, remove_punct=True, remove_breaks=True, decontract=False, twitter_handles=False, urls=False, hashtags=False, numbers=False, subreddits=False, reddit_usernames=False, emails=False, extra_patterns=None, keep_untokenized=None, whitespaces_to_underscores=True, remove_nonunicode=False, pos_emojis=None, neg_emojis=None, neutral_emojis=None, print_url_warnings=False, latin_chars_fix=False, ngrams=1)

Tokenizer with Reddit- and Twitter-specific options.
Parameters:
- lowercase (bool, optional) – If True, lowercase all tokens. Defaults to True.
- keepcaps (bool, optional) – If True, keep ALL CAPS WORDS uppercased. Defaults to False.
- normalize (int or bool, optional) – If not False, normalize repeated characters ("awesoooooome" -> "awesooome"). The value of the parameter determines the number of occurrences to keep. Defaults to 3.
- ignore_quotes (bool, optional) – If True, ignore tokens contained within double quotes. Defaults to False.
- ignore_reddit_quotes (bool, optional) – If True, remove quoted text from Reddit comments. Defaults to False.
- ignore_stopwords (str, list, or bool, optional) –
Whether to ignore stopwords
- str: language whose stopword list to fetch from the NLTK package
- list: list of stopwords to remove
- True: use the built-in list of English stopwords
- False: keep all tokens
Defaults to False.
- stem ({False, 'stem', 'lemm'}, optional) –
Whether to perform word stemming
- False: do not perform word stemming
- 'stem': use PorterStemmer from the NLTK package
- 'lemm': use WordNetLemmatizer from the NLTK package
Defaults to False.
- remove_punct (bool, optional) – If True, remove punctuation tokens. Defaults to True.
- remove_breaks (bool, optional) – If True, remove linebreak tokens. Defaults to True.
- decontract (bool, optional) – If True, attempt to expand certain contractions ("'ll" -> " will"). Defaults to False.
- numbers, subreddits, reddit_usernames, emails (False or str, optional) –
Replacement of the different types of tokens
- False: leave these tokens intact
- str: replacement token
- '': remove all occurrences of these tokens
- twitter_handles (False, 'realname', 'split', or str, optional) –
Processing of twitter handles
- False: do nothing
- str: replacement token
- 'realname': replace with the real screen name of the Twitter account
- 'split': split handles using the Viterbi algorithm
- hashtags (False or str, optional) –
Processing of hashtags
- False: do nothing
- str: replacement token
- 'split': split hashtags using the Viterbi algorithm
Example: "#vladimirputinisthebest" -> "vladimir putin is the best"
- urls (False, str, or dict, optional) –
Replacement of parsed URLs
- False: leave URLs intact
- str: replacement token
- dict: replace all URLs stored in keys with the corresponding values
- '': remove all occurrences of these tokens
- 'domain': extract domain ("http://cnn.com" -> "cnn")
- 'domain_unwrap_fast': extract domain after unwrapping links for a list of URL shorteners (goo.gl, t.co, bit.ly, tinyurl.com)
- 'domain_unwrap': extract domain after unwrapping all links
- 'title': extract and tokenize the title of each link after unwrapping it
Defaults to False.
- extra_patterns (None or list of tuples, optional) –
Replacement of any user-supplied extra patterns. Tuples must have the following form: (name, re_pattern, replacement_token):
- name (str): name of the pattern
- re_pattern (_sre.SRE_Pattern): compiled re pattern
- replacement_token (str): replacement token
Defaults to None. See the usage sketch after this parameter list.
- keep_untokenized (None or list, optional) –
List of expressions to keep untokenized. Defaults to None.
Example: ["New York", "Los Angeles", "San Francisco"]
- whitespaces_to_underscores (bool, optional) – If True, replace all whitespace characters with underscores in the final tokens. Defaults to True.
- remove_nonunicode (bool, optional) – If True, remove all non-unicode characters. Defaults to False.
- pos_emojis, neg_emojis, neutral_emojis (None, True, or list, optional) –
Replace positive, negative, and neutral emojis with special tokens
- None: do not perform replacement
- True: replace emojis from the default lists
- list: list of emojis to replace
Defaults to None.
- print_url_warnings (bool, optional) – If True, print URL-related warnings. Defaults to False.
- latin_chars_fix (bool, optional) – Try applying this fix if you have a lot of \xe2\x80\x99-like or U+1F601-like strings in your data. Defaults to False.
- ngrams (int, optional) – Add n-grams of tokens after tokenizing. Defaults to 1 (unigrams only).
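The following sketch is a hypothetical usage example, not taken from the project's own documentation: it assumes redditscore and its NLTK data are installed, and the pattern name 'dollar_amount' and the replacement tokens 'USER' and 'MONEY' are arbitrary illustrative choices. It shows how several of the options above combine, including the (name, re_pattern, replacement_token) tuple form expected by extra_patterns:

>>> import re
>>> from redditscore.tokenizer import CrazyTokenizer
>>> # (name, compiled pattern, replacement token) tuple for extra_patterns
>>> money = ('dollar_amount', re.compile(r'\$\d+(\.\d+)?'), 'MONEY')
>>> tokenizer = CrazyTokenizer(ignore_stopwords='english', decontract=True,
...                            twitter_handles='USER', urls='domain',
...                            hashtags='split', extra_patterns=[money])
>>> tokenizer.tokenize("@some_user I'll read http://cnn.com for $10 #breakingnews")

Output is omitted because the exact tokens depend on the installed spaCy model and NLTK data; per the descriptions above, the handle, the URL, and the dollar amount should surface as 'USER', 'cnn', and 'MONEY', and the hashtag should be split into separate words.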
tokenize(text)

Tokenize document.

Parameters: text (str) – Document to tokenize
Returns: List of tokens
Return type: list

Examples
>>> from redditscore.tokenizer import CrazyTokenizer
>>> tokenizer = CrazyTokenizer(hashtags='split')
>>> tokenizer.tokenize("#makeamericagreatagain")
['make', 'america', 'great', 'again']
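A second sketch, grounded in the normalize description in the parameter list: the repeated-character example "awesoooooome" -> "awesooome" is the documentation's own, while the full output shown here is an expectation under the default lowercase=True and has not been verified against a live install:

>>> tokenizer = CrazyTokenizer(normalize=3)
>>> tokenizer.tokenize("This is awesoooooome")
['this', 'is', 'awesooome']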