redditscore package

Submodules

redditscore.get_reddit_data module

redditscore.get_reddit_data.add_months(sourcedate, months)[source]
redditscore.get_reddit_data.check_input(subreddits, usernames)[source]
redditscore.get_reddit_data.construct_query(subreddits, usernames, month, score_limit=None)[source]
redditscore.get_reddit_data.construct_sample_query(subreddits, usernames, month, sample_size, score_limit=None)[source]
redditscore.get_reddit_data.construct_sample_score_query(subreddits, usernames, month, sample_size, score_limit=None)[source]
redditscore.get_reddit_data.diff_month(d1, d2)[source]
redditscore.get_reddit_data.get_comments(timerange, project_id, private_key, subreddits=None, usernames=None, score_limit=None, comments_per_month=None, top_scores=False, csv_directory=None, verbose=False, configuration=None)[source]

Obtain Reddit comments using Google BigQuery

Parameters:
  • timerange (iterable, shape (2,)) – Start and end dates in the ‘%Y_%m’ format. Example: (‘2016_08’, ‘2017_02’)
  • project_id (str) – Google BigQuery Account project ID
  • private_key (str) – File path to JSON file with service account private key https://cloud.google.com/bigquery/docs/reference/libraries
  • subreddits (list, optional) – List of subreddit names
  • usernames (list, optional) – List of usernames
  • score_limit (int, optional) – Score limit for comment retrieval. If None, retrieve all comments.
  • comments_per_month (int, optional) – Number of comments to sample from each subreddit per month. If None, retrieve all comments.
  • top_scores (bool, optional) – If True, sample top-scoring comments in each subreddit instead of random sampling.
  • csv_directory (str, optional) – CSV directory to save retrieved data. If None, return a DataFrame with all comments.
  • verbose (bool, optional) – If True, print the name of the table being queried.
  • configuration (dict, optional) – Query config parameters for job processing.
Returns: dfs – List of pd.DataFrames with comments
Return type: list
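
A minimal usage sketch (not part of the original docstring); the project ID, key file path, and subreddit names are placeholders, and running it requires valid Google BigQuery credentials:

>>> from redditscore.get_reddit_data import get_comments
>>> dfs = get_comments(timerange=('2016_08', '2017_02'),
...                    project_id='my-bigquery-project',
...                    private_key='path/to/service_account.json',
...                    subreddits=['politics', 'news'],
...                    comments_per_month=1000,
...                    verbose=True)
>>> # dfs is a list of pd.DataFrames with the retrieved comments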

redditscore.get_twitter_data module

redditscore.tokenizer module

CrazyTokenizer: spaCy-based tokenizer with Twitter- and Reddit-specific features

Splitting hashtags is based on the idea from https://stackoverflow.com/questions/11576779/how-to-extract-literal-words-from-a-consecutive-string-efficiently

Author: Evgenii Nikitin <e.nikitin@nyu.edu>

Part of https://github.com/crazyfrogspb/RedditScore project

Copyright (c) 2018 Evgenii Nikitin. All rights reserved. This work is licensed under the terms of the MIT license.

class redditscore.tokenizer.CrazyTokenizer(lowercase=True, keepcaps=False, normalize=3, ignore_quotes=False, ignore_reddit_quotes=False, ignore_stopwords=False, stem=False, remove_punct=True, remove_breaks=True, decontract=False, twitter_handles=False, urls=False, hashtags=False, numbers=False, subreddits=False, reddit_usernames=False, emails=False, extra_patterns=None, keep_untokenized=None, whitespaces_to_underscores=True, remove_nonunicode=False, pos_emojis=None, neg_emojis=None, neutral_emojis=None, print_url_warnings=False, latin_chars_fix=False, ngrams=1)[source]

Bases: object

Tokenizer with Reddit- and Twitter-specific options

Parameters:
  • lowercase (bool, optional) – If True, lowercase all tokens. Defaults to True.
  • keepcaps (bool, optional) – If True, keep ALL CAPS WORDS uppercased. Defaults to False.
  • normalize (int or bool, optional) – If not False, normalize repeated characters (“awesoooooome” -> “awesooome”). The value of the parameter determines the number of occurrences to keep. Defaults to 3.
  • ignore_quotes (bool, optional) – If True, ignore tokens contained within double quotes. Defaults to False.
  • ignore_reddit_quotes (bool, optional) – If True, remove quotes from the Reddit comments. Defaults to False.
  • ignore_stopwords (str, list, or boolean, optional) –

    Whether to ignore stopwords

    • str: name of the language whose stopword list to take from the NLTK package
    • list: list of stopwords to remove
    • True: use the built-in list of English stop words
    • False: keep all tokens

    Defaults to False

  • stem ({False, 'stem', 'lemm'}, optional) –

    Whether to perform word stemming or lemmatization

    • False: do not perform word stemming
    • ’stem’: use PorterStemmer from NLTK package
    • ’lemm’: use WordNetLemmatizer from NLTK package
  • remove_punct (bool, optional) – If True, remove punctuation tokens. Defaults to True.
  • remove_breaks (bool, optional) – If True, remove linebreak tokens. Defaults to True.
  • decontract (bool, optional) – If True, attempt to expand certain contractions. Example: “‘ll” -> ” will”. Defaults to False.
  • numbers, subreddits, reddit_usernames, emails (False or str, optional) –

    Replacement of the different types of tokens

    • False: leaves these tokens intact
    • str: replacement token
    • '': removes all occurrences of these tokens
  • twitter_handles (False, 'realname' or str, optional) –

    Processing of twitter handles

    • False: do nothing
    • str: replacement token
    • ’realname’: replace with the real name of the Twitter account
    • ’split’: split handles using Viterbi algorithm

    Example: “#vladimirputinisthebest” -> “vladimir putin is the best”

  • hashtags (False or str, optional) –

    Processing of hashtags

    • False: do nothing
    • str: replacement token
    • ’split’: split hashtags using the Viterbi algorithm
  • urls (False or str, optional) –

    Replacement of parsed URLs

    • False: leave URLs intact
    • str: replacement token
    • dict: replace all URLs stored in keys with the corresponding values
    • '': removes all occurrences of these tokens
    • ’domain’: extract domain (“http://cnn.com” -> “cnn”)
    • ’domain_unwrap_fast’: extract domain after unwrapping links for a list of URL shorteners (goo.gl, t.co, bit.ly, tinyurl.com)
    • ’domain_unwrap’: extract domain after unwrapping all links
    • ’title’: extract and tokenize the title of each link after unwrapping it

    Defaults to False.

  • extra_patterns (None or list of tuples, optional) –

    Replacement of any user-supplied extra patterns. Tuples must have the following form: (name, re_pattern, replacement_token):

    • name (str): name of the pattern
    • re_pattern (_sre.SRE_Pattern): compiled re pattern
    • replacement_token (str): replacement token

    Defaults to None

  • keep_untokenized (None or list, optional) –

    List of expressions to keep untokenized

    Example: [“New York”, “Los Angeles”, “San Francisco”]

  • whitespaces_to_underscores (boolean, optional) – If True, replace all whitespace characters with underscores in the final tokens. Defaults to True.
  • remove_nonunicode (boolean, optional) – If True, remove all non-unicode characters. Defaults to False.
  • pos_emojis, neg_emojis, neutral_emojis (None, True, or list, optional) –

    Replace positive, negative, and neutral emojis with the special tokens

    • None: do not perform replacement
    • True: perform replacement using the default lists of emojis
    • list: list of emojis to replace
  • print_url_warnings (bool, optional) – If True, print URL-related warnings. Defaults to False.
  • latin_chars_fix (bool, optional) – Try applying this fix if you have a lot of \xe2\x80\x99-like or U+1F601-like strings in your data. Defaults to False.
  • ngrams (int, optional) – Add n-grams of tokens after tokenizing. Defaults to 1.
tokenize(text)[source]

Tokenize document

Parameters: text (str) – Document to tokenize
Returns: List of tokens
Return type: list

Examples

>>> from redditscore.tokenizer import CrazyTokenizer
>>> tokenizer = CrazyTokenizer(hashtags='split')
>>> tokenizer.tokenize("#makeamericagreatagain")
['make', 'america', 'great', 'again']
redditscore.tokenizer.alpha_digits_check(text)[source]
redditscore.tokenizer.batch(iterable, n=1)[source]
redditscore.tokenizer.get_twitter_realname(twitter_handle)[source]
redditscore.tokenizer.get_url_title(url, verbose=False)[source]
redditscore.tokenizer.hashtag_check(text)[source]
redditscore.tokenizer.retokenize_check(text)[source]
redditscore.tokenizer.twitter_handle_check(text)[source]
redditscore.tokenizer.unshorten_url(url, url_shorteners=None, verbose=False)[source]
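
A hedged sketch of unshorten_url usage; the shortened link is a placeholder, and the call is assumed to follow redirects over the network and return the expanded URL as a string:

>>> from redditscore.tokenizer import unshorten_url
>>> unshorten_url('https://t.co/example', verbose=True)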

Module contents