redditscore.models package¶

Submodules¶

redditscore.models.doc2vec module¶

class redditscore.models.doc2vec.Doc2VecModel(random_state=24, dm=0, vector_size=100, window=5, negative=5, hs=0, min_count=5, sample=1e-05, epochs=10, dbow_words=0, workers=8, steps=1000, alpha=0.025)[source]¶

Bases: redditscore.models.redditmodel.RedditModel

fit(X, y)[source]¶

Fit model

Parameters:

X (iterable, shape (n_samples, )) – Sequence of tokenized documents
y (iterable, shape (n_samples, )) – Sequence of labels
Returns:	Fitted model object
Return type:	RedditModel

predict(X)[source]¶

Predict the most likely label

Parameters:	X (iterable, shape (n_samples, )) – Sequence of tokenized documents
Returns:	Predicted class labels
Return type:	array, shape (n_samples, )

predict_proba(X)[source]¶

Predict class probabilities

Parameters:	X (iterable, shape (n_samples, )) – Sequence of tokenized documents
Returns:	Predicted class probabilities
Return type:	array, shape (n_samples, num_classes)

redditscore.models.fasttext module¶

FastTextModel: A wrapper for Facebook fastText model

Author: Evgenii Nikitin <e.nikitin@nyu.edu>

Part of https://github.com/crazyfrogspb/RedditScore project

class redditscore.models.fasttext_mod.FastTextClassifier(lr=0.1, dim=100, ws=5, epoch=5, minCount=1, minCountLabel=0, minn=0, maxn=0, neg=5, wordNgrams=1, loss='softmax', bucket=2000000, thread=12, lrUpdateRate=100, t=0.0001, label='__label__', verbose=2)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

fit(X, y)[source]¶

predict(X)[source]¶

predict_proba(X)[source]¶

class redditscore.models.fasttext_mod.FastTextModel(random_state=24, **kwargs)[source]¶

Bases: redditscore.models.redditmodel.RedditModel

Facebook fastText classifier

Parameters:

random_state (int, optional) – Random seed (the default is 24).
**kwargs – Other parameters for the fastText model. A full description can be found here: https://github.com/facebookresearch/fastText

fit(X, y)[source]¶

Fit model

Parameters:

X (iterable, shape (n_samples, )) – Sequence of tokenized documents
y (iterable, shape (n_samples, )) – Sequence of labels
Returns:	Fitted model object
Return type:	FastTextModel

save_model(filepath)[source]¶

Save model to disk.

Parameters:	filepath (str) – Path to the file where the model will be saved. NOTE: The model is saved as two files, one with the ‘.pkl’ and one with the ‘bin’ file extension.

redditscore.models.fasttext_mod.check_multilabel(y)[source]¶

redditscore.models.fasttext_mod.chunking_dot(big_matrix, small_matrix, chunk_size=50000)[source]¶

redditscore.models.fasttext_mod.load_model(filepath)[source]¶

Load pickled model.

Parameters:	filepath (str) – Path to the saved model file. NOTE: The directory must contain two files with the provided name, one with the ‘.pkl’ and one with the ‘bin’ file extension.
Returns:	Unpickled model object.
Return type:	FastTextModel

redditscore.models.neuralnet module¶

redditscore.models.redditmodel module¶

Generic RedditModel class for specific models to inherit

Author: Evgenii Nikitin <e.nikitin@nyu.edu>

Part of https://github.com/crazyfrogspb/RedditScore project

class redditscore.models.redditmodel.RedditModel(random_state=24)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Sklearn-style wrapper for the different architectures

Parameters:	random_state (int, optional) – Random seed (the default is 24).

model_type¶: str – Model type name

model¶: model object – Model object that is being fitted

params¶: dict – Dictionary with model parameters

_classes¶: list – List of class labels

fitted¶: bool – Indicates whether model was fitted

class_embeddings¶: np.array, shape (num_classes, vector_size) – Matrix with class embeddings

random_state¶: int – Random seed used for validation splits and for models

cv_score(X, y, cv=0.2, scoring='accuracy', k=3)[source]¶

Calculate validation score

Parameters:

X (iterable, shape (n_samples, )) – Sequence of tokenized documents
y (iterable, shape (n_samples, )) – Sequence of labels
cv (float, int, cross-validation generator or an iterable, optional) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- float, to use holdout set of this size
- None, to use the default 3-fold cross validation,
- integer, to specify the number of folds in a StratifiedKFold,
- An object to be used as a cross-validation generator.
- An iterable yielding train, test splits.
scoring (string, callable or None, optional) – A string (see sklearn model evaluation documentation) or a scorer callable object or ‘top_k_accuracy’
k (int, optional) – k parameter for ‘top_k_accuracy’ scoring
Returns:	Average value of the validation metrics
Return type:	float
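
The cv options documented above can be read as a simple dispatch on the argument's type. The following is a minimal sketch of that reading, not code taken from RedditScore: the helper name describe_cv and the returned strategy strings are hypothetical, and the check for a split method is an assumption about how a cross-validation generator would be recognized.

```python
def describe_cv(cv=0.2):
    """Return a (strategy, value) pair for a `cv` argument.

    Hypothetical helper that only mirrors the documented options.
    """
    if cv is None:
        # Documented default: 3-fold cross validation.
        return ("k_fold", 3)
    if isinstance(cv, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise TypeError("cv must not be a bool")
    if isinstance(cv, float):
        # A float is the size of the holdout set.
        return ("holdout", cv)
    if isinstance(cv, int):
        # An integer is the number of folds in a StratifiedKFold.
        return ("stratified_k_fold", cv)
    if hasattr(cv, "split"):
        # Assumption: generator objects expose a split() method.
        return ("cv_generator", cv)
    # Anything else is treated as an iterable of (train, test) splits.
    return ("iterable_of_splits", cv)


print(describe_cv(0.2))   # ('holdout', 0.2)
print(describe_cv(None))  # ('k_fold', 3)
print(describe_cv(5))     # ('stratified_k_fold', 5)
```

Checking for float before int keeps a value such as 0.2 from ever being mistaken for a fold count.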

fit(X, y)[source]¶

Fit model

Parameters:

X (iterable, shape (n_samples, )) – Sequence of tokenized documents
y (iterable, shape (n_samples, )) – Sequence of labels
Returns:	Fitted model object
Return type:	RedditModel

get_params(deep=None)[source]¶

Get parameters of the model

Returns:	Dictionary with model parameters
Return type:	dict

plot_analytics(classes=None, fig_sizes=((20, 15), (20, 20)), linkage_pars=None, dendrogram_pars=None, clustering_pars=None, tsne_pars=None, legend_pars=None, label_font_size=17)[source]¶

Plot a hierarchical clustering dendrogram and a T-SNE visualization based on the learned class embeddings

Parameters:

classes (iter, optional) – Iterable of class labels to include in the plots. If None, use all classes
fig_sizes (tuple of tuples, optional) – Figure sizes for plots
linkage_pars (dict, optional) – Dictionary of parameters for hierarchical clustering. (scipy.cluster.hierarchy.linkage)
dendrogram_pars (dict, optional) – Dictionary of parameters for plotting dendrogram. (scipy.cluster.hierarchy.dendrogram)
clustering_pars (dict, optional) – Dictionary of parameters for producing flat clusters. (scipy.cluster.hierarchy.fcluster)
tsne_pars (dict, optional) – Dictionary of parameters for T-SNE. (sklearn.manifold.TSNE)
legend_pars (dict, optional) – Dictionary of parameters for legend plotting (matplotlib.pyplot.legend)
label_font_size (int, optional) – Font size for the labels on T-SNE plot

predict(X)[source]¶

Predict the most likely label

Parameters:	X (iterable, shape (n_samples, )) – Sequence of tokenized documents
Returns:	Predicted class labels
Return type:	array, shape (n_samples, )

predict_proba(X)[source]¶

Predict class probabilities

Parameters:	X (iterable, shape (n_samples, )) – Sequence of tokenized documents
Returns:	Predicted class probabilities
Return type:	array, shape (n_samples, num_classes)
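
One common way the outputs of predict_proba and predict relate is that the predicted label is the class with the highest probability in each row. The sketch below illustrates that relationship in plain Python. It is an assumption about the usual convention, not a description of how RedditModel computes its predictions, and the function name and sample labels are hypothetical.

```python
def labels_from_probabilities(probabilities, classes):
    """Pick the most probable class label for every row.

    probabilities: list of rows, each of length num_classes
    classes: list of class labels, in column order
    """
    labels = []
    for row in probabilities:
        # Index of the largest probability; ties go to the first column.
        best_index = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best_index])
    return labels


# Hypothetical output of shape (n_samples, num_classes) = (2, 3).
probabilities = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
classes = ["aww", "politics", "science"]

predicted = labels_from_probabilities(probabilities, classes)
print(predicted)  # ['politics', 'aww']
```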

set_params(**params)[source]¶

Set parameters of the model

Parameters:	**params – Model parameters to update

tune_params(X, y, param_grid=None, verbose=False, cv=0.2, scoring='accuracy', k=3, refit=False)[source]¶

Find the best values of hyperparameters using chosen validation scheme

Parameters:

X (iterable, shape (n_samples, )) – Sequence of tokenized documents
y (iterable, shape (n_samples, )) – Sequence of labels
param_grid (dict, optional) – Dictionary with parameter names as keys and lists of parameter settings as values. If None, loads default values from a JSON file
verbose (bool, optional) – If True, print scores after fitting each model
cv (float, int, cross-validation generator or an iterable, optional) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- float, to use holdout set of this size
- None, to use the default 3-fold cross validation,
- integer, to specify the number of folds in a StratifiedKFold,
- An object to be used as a cross-validation generator.
- An iterable yielding train, test splits.
scoring (string, callable or None, optional) – A string (see sklearn model evaluation documentation) or a scorer callable object or ‘top_k_accuracy’
k (int, optional) – k parameter for ‘top_k_accuracy’ scoring
refit (boolean, optional) – If True, refit model with the best found parameters

Returns:

best_pars (dict) – Dictionary with the best combination of parameters
best_value (float) – Best value of the chosen metric
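
A param_grid of the shape described above (parameter names mapped to lists of settings) can be expanded into every combination and scored one combination at a time. The sketch below assumes an exhaustive grid search in which a higher score is better. The helper names and the toy scoring function are hypothetical stand-ins for the validation step and are not part of RedditScore.

```python
from itertools import product


def expand_param_grid(param_grid):
    """Yield one dict per combination of the listed parameter settings."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def pick_best(param_grid, score_fn):
    """Return (best_pars, best_value), assuming higher scores are better."""
    best_pars, best_value = None, float("-inf")
    for pars in expand_param_grid(param_grid):
        value = score_fn(pars)
        if value > best_value:
            best_pars, best_value = pars, value
    return best_pars, best_value


grid = {"ngrams": [1, 2], "tfidf": [True, False]}
combos = list(expand_param_grid(grid))
print(len(combos))  # 4


# Toy stand-in for a validation score; a real run would fit and score a model.
def toy_score(pars):
    return pars["ngrams"] + (0.5 if pars["tfidf"] else 0.0)


best_pars, best_value = pick_best(grid, toy_score)
print(best_pars, best_value)  # {'ngrams': 2, 'tfidf': True} 2.5
```

The number of combinations is the product of the list lengths, so the cost of a full search grows quickly as parameters are added to the grid.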

redditscore.models.redditmodel.fancy_dendrogram(z, labels, **kwargs)[source]¶

redditscore.models.redditmodel.flatten(l)[source]¶

redditscore.models.redditmodel.top_k_accuracy_score(y_true, y_pred, k=3, normalize=True)[source]¶
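
Top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes. The sketch below shows the idea in plain Python. It assumes that the true labels are integer class indices and that the predictions are rows of per-class scores; the source does not state the input format of top_k_accuracy_score, so the function name and argument layout here are hypothetical.

```python
def top_k_accuracy(y_true, y_score, k=3, normalize=True):
    """Share (or count) of samples whose true class is in the top k scores.

    y_true: list of integer class indices
    y_score: list of rows with one score per class
    """
    hits = 0
    for true_index, scores in zip(y_true, y_score):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if true_index in ranked[:k]:
            hits += 1
    return hits / len(y_true) if normalize else hits


y_true = [0, 1, 2, 1]
y_score = [
    [0.5, 0.3, 0.2],  # true class ranked 1st
    [0.6, 0.3, 0.1],  # true class ranked 2nd
    [0.6, 0.3, 0.1],  # true class ranked 3rd
    [0.1, 0.8, 0.1],  # true class ranked 1st
]

print(top_k_accuracy(y_true, y_score, k=1))  # 0.5
print(top_k_accuracy(y_true, y_score, k=2))  # 0.75
print(top_k_accuracy(y_true, y_score, k=2, normalize=False))  # 3
```

With k=1 the measure reduces to ordinary accuracy, and with k equal to the number of classes it is always 1.0.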

redditscore.models.redditmodel.word_ngrams(tokens, ngram_range, separator=' ')[source]¶
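
Word n-grams are built by joining every run of n consecutive tokens with a separator. The sketch below assumes that ngram_range is an inclusive (min_n, max_n) pair, as in scikit-learn's vectorizers. The source does not confirm this for word_ngrams, so the function here is a hypothetical stand-in and not the library's implementation.

```python
def word_ngrams_sketch(tokens, ngram_range, separator=" "):
    """Return all n-grams of the tokens for n in [min_n, max_n]."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the document.
        for start in range(len(tokens) - n + 1):
            ngrams.append(separator.join(tokens[start:start + n]))
    return ngrams


tokens = ["new", "york", "city"]
print(word_ngrams_sketch(tokens, (1, 2)))
# ['new', 'york', 'city', 'new york', 'york city']
print(word_ngrams_sketch(tokens, (2, 2), separator="_"))
# ['new_york', 'york_city']
```

A value of n larger than the number of tokens simply contributes no n-grams.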

redditscore.models.bow_mod module¶

bow_mod: A wrapper for Bag-of-Words models

Author: Evgenii Nikitin <e.nikitin@nyu.edu>

Part of https://github.com/crazyfrogspb/RedditScore project

class redditscore.models.bow_mod.BoWModel(estimator, ngrams=1, tfidf=True, random_state=24)[source]¶

Bases: redditscore.models.redditmodel.RedditModel

A wrapper for Bag-of-Words models with or without tf-idf re-weighting

Parameters:

estimator (scikit-learn model) – Estimator object (classifier or regressor)
ngrams (int, optional) – The upper boundary of the range of n-values for different n-grams to be extracted
tfidf (bool, optional) – If true, use tf-idf re-weighting
random_state (integer, optional) – Random seed
**kwargs – Parameters of the multinomial model. For details check scikit-learn documentation.

params¶: dict – Dictionary with model parameters
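
Tf-idf re-weighting scales each term count by how rare the term is across documents. The sketch below uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which are the defaults of scikit-learn's TfidfTransformer. Whether BoWModel uses exactly these settings is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency.
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalize so that every non-empty document has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [["cat", "sat"], ["cat", "ran"]]
vectors = tfidf_vectors(docs)
# "cat" occurs in both documents, so it gets less weight than "sat".
print(vectors[0]["sat"] > vectors[0]["cat"])  # True
```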

save_model(filepath)[source]¶

Save model to disk.

Parameters:	filepath (str) – Path to the file where the model will be saved.

set_params(**params)[source]¶

Set the parameters of the model.

Parameters:	**params – {'tfidf', 'ngrams', 'random_state'} or parameters of the corresponding models

redditscore.models.bow_mod.load_model(filepath)[source]¶

Load pickled instance of SklearnModel.

Parameters:	filepath (str) – Path to the pickled model file.
Returns:	Unpickled model.
Return type:	SklearnModel
