Welcome to RedditScore’s documentation!
RedditScore Overview
RedditScore is a library that contains tools for building Reddit-based text classification models.
RedditScore includes:
- A document tokenizer with a wide range of options, including Reddit- and Twitter-specific ones
- Tools to build and tune the most popular text classification models with minimal effort
- Instruments to help you build more efficient Reddit-based models and to obtain RedditScores [Nikitin2018]
Usage example:
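import os

import pandas as pd

# Assumed import paths, inferred from the names used below; adjust to your installed version
from redditscore.tokenizer import CrazyTokenizer
from redditscore.models import fasttext, sklearn

# Load the sample of Reddit comments
df = pd.read_csv(os.path.join('redditscore', 'reddit_small_sample.csv'))

# Tokenize each comment body with the URL and hashtag options enabled
tokenizer = CrazyTokenizer(urls='domain', splithashtags=True)
df['tokens'] = df['body'].apply(tokenizer.tokenize)
X = df['tokens']
y = df['subreddit']

# Multinomial model on unigram features without TF-IDF weighting
multi_model = sklearn.SklearnModel(
    model_type='multinomial', alpha=0.1, random_state=24, tfidf=False, ngram_range=(1, 1))
fasttext_model = fasttext.FastTextModel(minCount=5)

# Tune the multinomial model with 5-fold cross-validation, then fit the fastText model
multi_model.tune_params(X, y, cv=5, scoring='neg_log_loss')
fasttext_model.fit(X, y)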
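To make the `alpha=0.1` and `scoring='neg_log_loss'` settings in the example above more concrete, here is a minimal, library-free sketch of a multinomial Naive Bayes classifier over token lists. It assumes that `model_type='multinomial'` refers to multinomial Naive Bayes with additive smoothing, which this page does not state. The class `TinyMultinomialNB`, the function `neg_log_loss`, and the toy data are hypothetical illustrations, not part of RedditScore.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Toy multinomial Naive Bayes over token lists (hypothetical, not RedditScore code)."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # additive smoothing strength

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.vocab_ = {tok for doc in X for tok in doc}
        self.class_counts_ = Counter(y)
        self.token_counts_ = defaultdict(Counter)
        for doc, label in zip(X, y):
            self.token_counts_[label].update(doc)
        return self

    def predict_proba(self, doc):
        n_docs = sum(self.class_counts_.values())
        vocab_size = len(self.vocab_)
        log_scores = {}
        for c in self.classes_:
            total = sum(self.token_counts_[c].values())
            score = math.log(self.class_counts_[c] / n_docs)  # log prior
            for tok in doc:
                if tok not in self.vocab_:
                    continue  # skip tokens never seen during training
                count = self.token_counts_[c][tok]
                score += math.log((count + self.alpha) / (total + self.alpha * vocab_size))
            log_scores[c] = score
        # Normalize in a numerically stable way (log-sum-exp)
        top = max(log_scores.values())
        norm = sum(math.exp(s - top) for s in log_scores.values())
        return {c: math.exp(s - top) / norm for c, s in log_scores.items()}


def neg_log_loss(model, X, y, eps=1e-15):
    """Mean log-probability of the true labels; values closer to zero are better."""
    total = 0.0
    for doc, label in zip(X, y):
        p = max(model.predict_proba(doc)[label], eps)  # clip to avoid log(0)
        total += math.log(p)
    return total / len(y)


# Toy token lists standing in for tokenized comments, with subreddit-like labels
X = [["election", "vote", "senate"], ["vote", "poll"],
     ["goal", "match", "league"], ["match", "score"]]
y = ["politics", "politics", "soccer", "soccer"]

model = TinyMultinomialNB(alpha=0.1).fit(X, y)
probs = model.predict_proba(["vote", "match", "senate"])
print(max(probs, key=probs.get))  # prints politics
print(neg_log_loss(model, X, y))
```

A smaller `alpha` trusts the observed token counts more, while a larger one pulls every token probability toward uniform. Scoring by negative log loss rewards well-calibrated class probabilities instead of only counting correct labels.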
References:
[Nikitin2018] Nikitin, Evgenii. Identifying Political Trends on Social Media Using Reddit Data. In progress.
Contents: