CrazyTokenizer

CrazyTokenizer: spaCy-based tokenizer with Twitter- and Reddit-specific features

Splitting hashtags is based on the idea from https://stackoverflow.com/questions/11576779/how-to-extract-literal-words-from-a-consecutive-string-efficiently

Author: Evgenii Nikitin <e.nikitin@nyu.edu>

Part of https://github.com/crazyfrogspb/RedditScore project

Copyright (c) 2018 Evgenii Nikitin. All rights reserved. This work is licensed under the terms of the MIT license.

class tokenizer.CrazyTokenizer(lowercase=True, keepcaps=False, normalize=3, ignore_quotes=False, ignore_reddit_quotes=False, ignore_stopwords=False, stem=False, remove_punct=True, remove_breaks=True, decontract=False, twitter_handles=False, urls=False, hashtags=False, numbers=False, subreddits=False, reddit_usernames=False, emails=False, extra_patterns=None, keep_untokenized=None, whitespaces_to_underscores=True, remove_nonunicode=False, pos_emojis=None, neg_emojis=None, neutral_emojis=None, print_url_warnings=False, latin_chars_fix=False, ngrams=1)[source]

Tokenizer with Reddit- and Twitter-specific options

Parameters:
  • lowercase (bool, optional) – If True, lowercase all tokens. Defaults to True.
  • keepcaps (bool, optional) – If True, keep ALL CAPS WORDS uppercased. Defaults to False.
  • normalize (int or bool, optional) – If not False, normalize repeated characters ("awesoooooome" -> "awesooome"). The value of the parameter determines the maximum number of consecutive occurrences to keep. Defaults to 3.
  • ignore_quotes (bool, optional) – If True, ignore tokens contained within double quotes. Defaults to False.
  • ignore_reddit_quotes (bool, optional) – If True, remove quoted text from Reddit comments. Defaults to False.
  • ignore_stopwords (str, list, or boolean, optional) –

    Whether to ignore stopwords

    • str: name of the language whose stopword list should be loaded from the NLTK package
    • list: list of stopwords to remove
    • True: use the built-in list of English stopwords
    • False: keep all tokens

    Defaults to False

  • stem ({False, 'stem', 'lemm'}, optional) –

    Whether to perform word stemming or lemmatization

    • False: do not perform word stemming
    • ’stem’: use PorterStemmer from NLTK package
    • ’lemm’: use WordNetLemmatizer from NLTK package
  • remove_punct (bool, optional) – If True, remove punctuation tokens. Defaults to True.
  • remove_breaks (bool, optional) – If True, remove linebreak tokens. Defaults to True.
  • decontract (bool, optional) – If True, attempt to expand certain contractions (e.g., "'ll" -> " will"). Defaults to False.
  • numbers, subreddits, reddit_usernames, emails (False or str, optional) –

    Replacement of the different types of tokens

    • False: leave these tokens intact
    • str: replacement token
    • '': remove all occurrences of these tokens

    Defaults to False
  • twitter_handles (False, 'realname', 'split' or str, optional) –

    Processing of twitter handles

    • False: do nothing
    • str: replacement token
    • ’realname’: replace with the real screen name of Twitter account
    • 'split': split handles using Viterbi algorithm

  • hashtags (False or str, optional) –

    Processing of hashtags

    • False: do nothing
    • str: replacement token
    • 'split': split hashtags using Viterbi algorithm

    Example: "#vladimirputinisthebest" -> "vladimir putin is the best"
  • urls (False, str, or dict, optional) –

    Replacement of parsed URLs

    • False: leave URLs intact
    • str: replacement token
    • dict: replace all URLs stored in the keys with the corresponding values
    • '': remove all occurrences of these tokens
    • 'domain': extract the domain ("http://cnn.com" -> "cnn")
    • 'domain_unwrap_fast': extract the domain after unwrapping links for a list of common URL shorteners (goo.gl, t.co, bit.ly, tinyurl.com)
    • 'domain_unwrap': extract the domain after unwrapping all links
    • 'title': extract and tokenize the title of each link after unwrapping it

    Defaults to False.

  • extra_patterns (None or list of tuples, optional) –

    Replacement of any user-supplied extra patterns. Tuples must have the following form: (name, re_pattern, replacement_token):

    • name (str): name of the pattern
    • re_pattern (_sre.SRE_Pattern): compiled re pattern
    • replacement_token (str): replacement token

    Defaults to None

  • keep_untokenized (None or list, optional) –

    List of expressions to keep untokenized

    Example: [“New York”, “Los Angeles”, “San Francisco”]

  • whitespaces_to_underscores (boolean, optional) – If True, replace all whitespace characters with underscores in the final tokens. Defaults to True.
  • remove_nonunicode (boolean, optional) – If True, remove all non-unicode characters. Defaults to False.
  • pos_emojis, neg_emojis, neutral_emojis (None, True, or list, optional) –

    Replace positive, negative, and neutral emojis with special tokens

    • None: do not perform replacement
    • True: replace emojis from the built-in default lists
    • list: list of emojis to replace

    Defaults to None.
  • print_url_warnings (bool, optional) – If True, print URL-related warnings. Defaults to False.
  • latin_chars_fix (bool, optional) – Try applying this fix if you have a lot of \xe2\x80\x99-like or U+1F601-like strings in your data. Defaults to False.
  • ngrams (int, optional) – If greater than 1, add n-grams of tokens after tokenizing. Defaults to 1.
tokenize(text)[source]

Tokenize document

Parameters: text (str) – Document to tokenize
Returns: List of tokens
Return type: list

Examples

>>> from redditscore.tokenizer import CrazyTokenizer
>>> tokenizer = CrazyTokenizer(hashtags='split')
>>> tokenizer.tokenize("#makeamericagreatagain")
['make', 'america', 'great', 'again']