Modelling¶
You have collected your training data and converted it to lists of tokens; now you can start training models!
RedditScore currently supports training the following types of models:
- Bag-of-Words models
- fastText model
- MLP, LSTM, CNN neural networks (Keras implementation) - TODO
Fitting models¶
All model wrappers have a very similar interface:
import os

import pandas as pd
from sklearn.naive_bayes import MultinomialNB

from redditscore.tokenizer import CrazyTokenizer
from redditscore.models import fasttext_mod, bow_mod

# reading and tokenizing data
df = pd.read_csv(os.path.join('redditscore', 'reddit_small_sample.csv'))
df = df.sample(frac=1.0, random_state=24)  # shuffling data

tokenizer = CrazyTokenizer(splithashtags=True)  # initializing tokenizer object
X = df['body'].apply(tokenizer.tokenize)  # tokenizing Reddit comments
y = df['subreddit']

# initializing and fitting the models
fasttext_model = fasttext_mod.FastTextModel(epochs=5, minCount=5)
multi_model = bow_mod.BoWModel(estimator=MultinomialNB(alpha=0.1), ngrams=2, tfidf=False)

fasttext_model.fit(X, y)
multi_model.fit(X, y)
Model persistence¶
To save a model, use the save_model method:
>>> fasttext_model.save_model('models/fasttext')
>>> multi_model.save_model('models/multi.pkl')
Each module has its own load_model function:
>>> fasttext_model = fasttext_mod.load_model('models/fasttext')
>>> multi_model = bow_mod.load_model('models/multi.pkl')
Note: fastText and Keras models are saved into two files with '.pkl' and '.bin' extensions.
Predictions and similarity scores¶
Using models for prediction is very straightforward:
>>> pred_labels = fasttext_model.predict(X)
>>> pred_probs = fasttext_model.predict_proba(X)
Both predict and predict_proba return pandas DataFrames with class labels as column names.
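Since the output is an ordinary pandas DataFrame, standard pandas operations apply. For example, the most probable class for each document can be read off the probabilities (plain pandas, not a RedditScore method):
>>> pred_probs.idxmax(axis=1)  # column (class) with the highest probability for each row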
Model tuning and validation¶
Each model class has cv_score and tune_params methods. You can use these methods to assess the quality of your model and to tune its hyperparameters.
The cv_score method has two optional arguments: cv and scoring. The cv argument can be:
- float: the score will be calculated on a randomly sampled holdout set containing the corresponding proportion of X
- None: use 3-fold cross-validation
- integer: use n-fold cross-validation
- an object to be used as a cross-validation generator: for example, KFold from sklearn
- an iterable yielding train and test splits: for example, a list of tuples with train and test indices
The scoring argument can be either a string or a scorer callable object.
Examples:
>>> fasttext_model.cv_score(X, y, cv=0.2, scoring='accuracy')
>>> fasttext_model.cv_score(X, y, cv=5, scoring='neg_log_loss')
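The remaining cv and scoring options listed above are used the same way. A short sketch, assuming cv_score follows the scikit-learn conventions for cross-validation generators and scorer callables described above:
>>> from sklearn.model_selection import KFold
>>> from sklearn.metrics import f1_score, make_scorer
>>> kf = KFold(n_splits=5, shuffle=True, random_state=24)  # cross-validation generator
>>> fasttext_model.cv_score(X, y, cv=kf, scoring='accuracy')
>>> fasttext_model.cv_score(X, y, cv=5, scoring=make_scorer(f1_score, average='macro'))  # scorer callable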
tune_params can help you perform a grid search over hyperparameters. In addition to the cv and scoring parameters, it has three more optional parameters: param_grid, verbose, and refit.
param_grid can have a few different structures:
- None: use the default grid for the corresponding model
- a dictionary with parameter names as keys and lists of parameter settings as values
- a dictionary with enumerated steps of the grid search, where each step is a dictionary with parameter names as keys and lists of parameter settings as values
Examples:
>>> fasttext_model.tune_params(X, y)
>>> param_grid = {'epochs': [1,5,10], 'dim': [50,100,300]}
>>> fasttext_model.tune_params(X, y, param_grid=param_grid)
>>> param_grid = {'step0': param_grid, 'step1': {'t': [1e-5, 1e-4, 1e-3]}}
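The stepped grid defined above is then passed to tune_params in the same way as a flat grid:
>>> fasttext_model.tune_params(X, y, param_grid=param_grid)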
If verbose is True, messages with grid search results will be printed.
If refit is True, the model will be refit on the full data after the grid search is over.
tune_params returns a tuple: a dictionary with the best parameters found and the best value of the chosen metric. In addition, the original model parameters will be replaced in place with the best parameters found.
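For example, the returned tuple can be unpacked directly (a short illustration of the behaviour just described):
>>> best_params, best_score = fasttext_model.tune_params(X, y, param_grid=param_grid, verbose=True, refit=True)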
Visualization of the class embeddings¶
For fastText and neural network models, you can visualize the resulting class embeddings. This might be useful for a couple of reasons:
- It can be used as an informal way to confirm that the model was able to learn meaningful semantic differences between classes. In particular, classes that one expects to be more semantically similar should have similar class embeddings.
- It can help you group different classes together. This is particularly useful for building Reddit-based models and calculating RedditScores. There are many different subreddits, but many of them are quite similar to each other (say, /r/Conservative and /r/republicans). Visualizations can help you identify similar subreddits, which can be grouped together for improved predictive performance.

Dendrogram for class embeddings

t-SNE visualization
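The two figures above show a dendrogram and a t-SNE projection of the class embeddings. As a rough sketch, similar plots can be built by hand with scipy and scikit-learn, assuming you have already extracted a class-embedding matrix class_embeddings (one row per class; this name is hypothetical, not a RedditScore attribute) and a matching list class_labels:
>>> from scipy.cluster.hierarchy import dendrogram, linkage
>>> from sklearn.manifold import TSNE
>>> Z = linkage(class_embeddings, method='ward')  # hierarchical clustering of the class embeddings
>>> dendrogram(Z, labels=class_labels)  # dendrogram like the first figure
>>> coords = TSNE(n_components=2, perplexity=5, random_state=24).fit_transform(class_embeddings)  # 2-D t-SNE projection like the second figure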