redditscore package

Submodules

redditscore.get_reddit_data module

redditscore.get_reddit_data.add_months(sourcedate, months)[source]
redditscore.get_reddit_data.check_input(subreddits, usernames)[source]
redditscore.get_reddit_data.construct_query(subreddits, usernames, month, score_limit=None)[source]
redditscore.get_reddit_data.construct_sample_query(subreddits, usernames, month, sample_size, score_limit=None)[source]
redditscore.get_reddit_data.construct_sample_score_query(subreddits, usernames, month, sample_size, score_limit=None)[source]
redditscore.get_reddit_data.diff_month(d1, d2)[source]
redditscore.get_reddit_data.get_comments(timerange, project_id, private_key, subreddits=None, usernames=None, score_limit=None, comments_per_month=None, top_scores=False, csv_directory=None, verbose=False, configuration=None)[source]

Obtain Reddit comments using Google BigQuery

Parameters:
  • timerange (iterable, shape (2,)) – Start and end dates in the ‘%Y_%m’ format. Example: (‘2016_08’, ‘2017_02’)
  • project_id (str) – Google BigQuery Account project ID
  • private_key (str) – File path to JSON file with service account private key https://cloud.google.com/bigquery/docs/reference/libraries
  • subreddits (list, optional) – List of subreddit names
  • usernames (list, optional) – List of usernames
  • score_limit (int, optional) – Score limit for comment retrieval. If None, retrieve all comments.
  • comments_per_month (int, optional) – Number of comments to sample from each subreddit per month. If None, retrieve all comments.
  • top_scores (bool, optional) – If True, sample top-scoring comments in each subreddit instead of random sampling.
  • csv_directory (str, optional) – CSV directory to save retrieved data. If None, return a DataFrame with all comments.
  • verbose (bool, optional) – If True, print the name of the table being queried.
  • configuration (dict, optional) – Query config parameters for job processing.
Returns: dfs – List of pd.DataFrames with comments
Return type: list
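
A minimal usage sketch (not part of the original docstring); the project ID, key file path, and subreddit names are placeholders, and running it requires valid Google BigQuery credentials:

>>> from redditscore.get_reddit_data import get_comments
>>> dfs = get_comments(timerange=('2016_08', '2017_02'),
...                    project_id='my-bigquery-project',
...                    private_key='path/to/service_account.json',
...                    subreddits=['politics', 'news'],
...                    comments_per_month=1000,
...                    verbose=True)
>>> # dfs is a list of pd.DataFrames with the retrieved comments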

redditscore.get_twitter_data module

redditscore.tokenizer module

CrazyTokenizer: spaCy-based tokenizer with Twitter- and Reddit-specific features

Splitting hashtags is based on the idea from https://stackoverflow.com/questions/11576779/how-to-extract-literal-words-from-a-consecutive-string-efficiently

Author: Evgenii Nikitin <e.nikitin@nyu.edu>

Part of https://github.com/crazyfrogspb/RedditScore project

Copyright (c) 2018 Evgenii Nikitin. All rights reserved. This work is licensed under the terms of the MIT license.

class redditscore.tokenizer.CrazyTokenizer(lowercase=True, keepcaps=False, normalize=3, ignore_quotes=False, ignore_reddit_quotes=False, ignore_stopwords=False, stem=False, remove_punct=True, remove_breaks=True, decontract=False, twitter_handles=False, urls=False, hashtags=False, numbers=False, subreddits=False, reddit_usernames=False, emails=False, extra_patterns=None, keep_untokenized=None, whitespaces_to_underscores=True, remove_nonunicode=False, pos_emojis=None, neg_emojis=None, neutral_emojis=None, print_url_warnings=False, latin_chars_fix=False, ngrams=1)[source]

Bases: object

Tokenizer with Reddit- and Twitter-specific options

Parameters:
  • lowercase (bool, optional) – If True, lowercase all tokens. Defaults to True.
  • keepcaps (bool, optional) – If True, keep ALL CAPS WORDS uppercased. Defaults to False.
  • normalize (int or bool, optional) – If not False, normalize repeated characters (“awesoooooome” -> “awesooome”). The value of the parameter determines the number of occurrences to keep. Defaults to 3.
  • ignore_quotes (bool, optional) – If True, ignore tokens contained within double quotes. Defaults to False.
  • ignore_reddit_quotes (bool, optional) – If True, remove quotes from the Reddit comments. Defaults to False.
  • ignore_stopwords (str, list, or boolean, optional) –

    Whether to ignore stopwords

    • str: name of the language whose stopword list to take from the NLTK package
    • list: list of stopwords to remove
    • True: use the built-in list of English stop words
    • False: keep all tokens

    Defaults to False

  • stem ({False, 'stem', 'lemm'}, optional) –

    Whether to perform word stemming or lemmatization

    • False: do not perform word stemming
    • ’stem’: use PorterStemmer from NLTK package
    • ’lemm’: use WordNetLemmatizer from NLTK package
  • remove_punct (bool, optional) – If True, remove punctuation tokens. Defaults to True.
  • remove_breaks (bool, optional) – If True, remove linebreak tokens. Defaults to True.
  • decontract (bool, optional) – If True, attempt to expand certain contractions. Example: “‘ll” -> ” will”. Defaults to False.
  • numbers, subreddits, reddit_usernames, emails (False or str, optional) –

    Replacement of the different types of tokens

    • False: leaves these tokens intact
    • str: replacement token
    • '': removes all occurrences of these tokens
  • twitter_handles (False, 'realname' or str, optional) –

    Processing of twitter handles

    • False: do nothing
    • str: replacement token
    • ’realname’: replace with the real name of the Twitter account
    • ’split’: split handles using Viterbi algorithm

    Example: “#vladimirputinisthebest” -> “vladimir putin is the best”

  • hashtags (False or str, optional) –

    Processing of hashtags

    • False: do nothing
    • str: replacement token
    • ’split’: split hashtags using the Viterbi algorithm
  • urls (False or str, optional) –

    Replacement of parsed URLs

    • False: leave URLs intact
    • str: replacement token
    • dict: replace all URLs stored in keys with the corresponding values
    • '': removes all occurrences of these tokens
    • ’domain’: extract domain (“http://cnn.com” -> “cnn”)
    • ’domain_unwrap_fast’: extract domain after unwrapping links for a list of URL shorteners (goo.gl, t.co, bit.ly, tinyurl.com)
    • ’domain_unwrap’: extract domain after unwrapping all links
    • ’title’: extract and tokenize the title of each link after unwrapping it

    Defaults to False.

  • extra_patterns (None or list of tuples, optional) –

    Replacement of any user-supplied extra patterns. Tuples must have the following form: (name, re_pattern, replacement_token):

    • name (str): name of the pattern
    • re_pattern (_sre.SRE_Pattern): compiled re pattern
    • replacement_token (str): replacement token

    Defaults to None

  • keep_untokenized (None or list, optional) –

    List of expressions to keep untokenized

    Example: [“New York”, “Los Angeles”, “San Francisco”]

  • whitespaces_to_underscores (boolean, optional) – If True, replace all whitespace characters with underscores in the final tokens. Defaults to True.
  • remove_nonunicode (boolean, optional) – If True, remove all non-unicode characters. Defaults to False.
  • pos_emojis, neg_emojis, neutral_emojis (None, True, or list, optional) –

    Replace positive, negative, and neutral emojis with the special tokens

    • None: do not perform replacement
    • True: perform replacement using the default lists of emojis
    • list: list of emojis to replace
  • print_url_warnings (bool, optional) – If True, print URL-related warnings. Defaults to False.
  • latin_chars_fix (bool, optional) – Try applying this fix if you have a lot of \xe2\x80\x99-like or U+1F601-like strings in your data. Defaults to False.
  • ngrams (int, optional) – Add n-grams of tokens after tokenizing. Defaults to 1.
tokenize(text)[source]

Tokenize document

Parameters: text (str) – Document to tokenize
Returns: List of tokens
Return type: list

Examples

>>> from redditscore.tokenizer import CrazyTokenizer
>>> tokenizer = CrazyTokenizer(hashtags='split')
>>> tokenizer.tokenize("#makeamericagreatagain")
['make', 'america', 'great', 'again']
redditscore.tokenizer.alpha_digits_check(text)[source]
redditscore.tokenizer.batch(iterable, n=1)[source]
redditscore.tokenizer.get_twitter_realname(twitter_handle)[source]
redditscore.tokenizer.get_url_title(url, verbose=False)[source]
redditscore.tokenizer.hashtag_check(text)[source]
redditscore.tokenizer.retokenize_check(text)[source]
redditscore.tokenizer.twitter_handle_check(text)[source]
redditscore.tokenizer.unshorten_url(url, url_shorteners=None, verbose=False)[source]
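
A hedged sketch of unshorten_url usage; the shortened link is a placeholder, and the call is assumed to follow redirects over the network and return the expanded URL as a string:

>>> from redditscore.tokenizer import unshorten_url
>>> unshorten_url('https://t.co/example', verbose=True)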

Module contents