redditscore.models package
Submodules
redditscore.models.doc2vec module
-
class redditscore.models.doc2vec.Doc2VecModel(random_state=24, dm=0, vector_size=100, window=5, negative=5, hs=0, min_count=5, sample=1e-05, epochs=10, dbow_words=0, workers=8, steps=1000, alpha=0.025)
Bases: redditscore.models.redditmodel.RedditModel
-
fit(X, y)
Fit model
Parameters: - X (iterable, shape (n_samples, )) – Sequence of tokenized documents
- y (iterable, shape (n_samples, )) – Sequence of labels
Returns: Fitted model object
Return type:
-
redditscore.models.fasttext_mod module
FastTextModel: A wrapper for Facebook fastText model
Author: Evgenii Nikitin <e.nikitin@nyu.edu>
Part of https://github.com/crazyfrogspb/RedditScore project
Copyright (c) 2018 Evgenii Nikitin. All rights reserved. This work is licensed under the terms of the MIT license.
-
class redditscore.models.fasttext_mod.FastTextClassifier(lr=0.1, dim=100, ws=5, epoch=5, minCount=1, minCountLabel=0, minn=0, maxn=0, neg=5, wordNgrams=1, loss='softmax', bucket=2000000, thread=12, lrUpdateRate=100, t=0.0001, label='__label__', verbose=2)
Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
-
class redditscore.models.fasttext_mod.FastTextModel(random_state=24, **kwargs)
Bases: redditscore.models.redditmodel.RedditModel
Facebook fastText classifier
Parameters: - random_state (int, optional) – Random seed (the default is 24).
- **kwargs – Other parameters for fastText model. Full description can be found here: https://github.com/facebookresearch/fastText
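The default label='__label__' in the FastTextClassifier signature matches the label prefix that fastText uses in supervised training files, where each line holds the prefixed label followed by the document's tokens. The following is a minimal sketch of that line format only. The helper name to_fasttext_line is hypothetical, and how RedditScore prepares its training data internally is not documented on this page.

```python
def to_fasttext_line(tokens, label, label_prefix="__label__"):
    """Format one tokenized document as a fastText supervised-training line.

    Hypothetical helper for illustration; not part of RedditScore.
    """
    # The prefixed label comes first, then the space-separated tokens.
    return f"{label_prefix}{label} " + " ".join(tokens)


# Tokenized documents and their labels, in the shape fit(X, y) expects.
docs = [["great", "game", "last", "night"], ["new", "gpu", "benchmarks"]]
labels = ["sports", "hardware"]

lines = [to_fasttext_line(doc, lab) for doc, lab in zip(docs, labels)]
for line in lines:
    print(line)
```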
-
redditscore.models.fasttext_mod.load_model(filepath)
Load pickled model.
Parameters: filepath (str) – Path to the file from which the model will be loaded. NOTE: the directory has to contain two files with the provided name: one with the ‘.pkl’ and one with the ‘.bin’ file extension.
Returns: Unpickled model object.
Return type: FastTextModel
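Because loading needs both companion files, it can help to check for them before calling load_model. The sketch below is an illustration under assumptions: the helper names are hypothetical, and it assumes the two files share the base name of filepath and differ only in their final extension, which this page does not spell out.

```python
from pathlib import Path


def companion_files(filepath):
    """Return the assumed '.pkl' and '.bin' paths for a saved model.

    Hypothetical helper: Path.with_suffix replaces the final extension
    if there is one and appends the new one otherwise.
    """
    base = Path(filepath)
    return base.with_suffix(".pkl"), base.with_suffix(".bin")


def missing_companions(filepath):
    """List the companion files that do not exist on disk."""
    return [path for path in companion_files(filepath) if not path.exists()]


missing = missing_companions("models/ft_model")
if missing:
    print("Cannot load model, missing:", [str(path) for path in missing])
```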
redditscore.models.neuralnet module
redditscore.models.redditmodel module
Generic RedditModel class for specific models to inherit
Author: Evgenii Nikitin <e.nikitin@nyu.edu>
Part of https://github.com/crazyfrogspb/RedditScore project
Copyright (c) 2018 Evgenii Nikitin. All rights reserved. This work is licensed under the terms of the MIT license.
-
class redditscore.models.redditmodel.RedditModel(random_state=24)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Sklearn-style wrapper for the different architectures
Parameters: random_state (int, optional) – Random seed (the default is 24).
-
model_type
str – Model type name
-
model
model object – Model object that is being fitted
-
params
dict – Dictionary with model parameters
-
_classes
list – List of class labels
-
fitted
bool – Indicates whether model was fitted
-
class_embeddings
np.array, shape (num_classes, vector_size) – Matrix with class embeddings
-
random_state
int – Random seed used for validation splits and for models
-
cv_score(X, y, cv=0.2, scoring='accuracy', k=3)
Calculate validation score
Parameters: - X (iterable, shape (n_samples, )) – Sequence of tokenized documents
- y (iterable, shape (n_samples, )) – Sequence of labels
- cv (float, int, cross-validation generator or an iterable, optional) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- float, to use holdout set of this size
- None, to use the default 3-fold cross validation,
- integer, to specify the number of folds in a StratifiedKFold,
- An object to be used as a cross-validation generator.
- An iterable yielding train, test splits.
- scoring (string, callable or None, optional) – A string (see sklearn model evaluation documentation) or a scorer callable object or ‘top_k_accuracy’
- k (int, optional) – k parameter for ‘top_k_accuracy’ scoring
Returns: Average value of the validation metrics
Return type: float
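The ‘top_k_accuracy’ scoring option and its k parameter can be illustrated with a small sketch. This page does not define the metric, so the code assumes the common definition: a prediction counts as correct when the true label is among the k classes with the highest predicted probability. The function below is illustrative and is not RedditScore's implementation.

```python
def top_k_accuracy(y_true, probas, classes, k=3):
    """Share of samples whose true label is among the k most probable classes.

    Assumes each row of `probas` is ordered like `classes`.
    """
    hits = 0
    for true_label, row in zip(y_true, probas):
        # Rank classes by predicted probability, highest first.
        ranked = sorted(zip(row, classes), key=lambda pair: pair[0], reverse=True)
        top_labels = [label for _, label in ranked[:k]]
        hits += true_label in top_labels
    return hits / len(y_true)


classes = ["aww", "news", "science"]
probas = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
y_true = ["aww", "science", "science"]

top1 = top_k_accuracy(y_true, probas, classes, k=1)
top2 = top_k_accuracy(y_true, probas, classes, k=2)
top3 = top_k_accuracy(y_true, probas, classes, k=3)
print(top1, top2, top3)
```

With k equal to the number of classes the score is always 1.0, so k is only informative when it is smaller than the class count.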
-
fit(X, y)
Fit model
Parameters: - X (iterable, shape (n_samples, )) – Sequence of tokenized documents
- y (iterable, shape (n_samples, )) – Sequence of labels
Returns: Fitted model object
Return type:
-
get_params(deep=None)
Get parameters of the model
Returns: Dictionary with model parameters
Return type: dict
-
plot_analytics(classes=None, fig_sizes=((20, 15), (20, 20)), linkage_pars=None, dendrogram_pars=None, clustering_pars=None, tsne_pars=None, legend_pars=None, label_font_size=17)
Plot a hierarchical clustering dendrogram and a t-SNE visualization based on the learned class embeddings
Parameters: - classes (iter, optional) – Iterable of class labels to include in the plots. If None, use all classes
- fig_sizes (tuple of tuples, optional) – Figure sizes for plots
- linkage_pars (dict, optional) – Dictionary of parameters for hierarchical clustering. (scipy.cluster.hierarchy.linkage)
- dendrogram_pars (dict, optional) – Dictionary of parameters for plotting dendrogram. (scipy.cluster.hierarchy.dendrogram)
- clustering_pars (dict, optional) – Dictionary of parameters for producing flat clusters. (scipy.cluster.hierarchy.fcluster)
- tsne_pars (dict, optional) – Dictionary of parameters for t-SNE. (sklearn.manifold.TSNE)
- legend_pars (dict, optional) – Dictionary of parameters for legend plotting (matplotlib.pyplot.legend)
- label_font_size (int, optional) – Font size for the labels on the t-SNE plot
-
predict(X)
Predict the most likely label
Parameters: X (iterable, shape (n_samples, )) – Sequence of tokenized documents
Returns: Predicted class labels
Return type: array, shape (n_samples, )
-
predict_proba(X)
Predict class probabilities
Parameters: X (iterable, shape (n_samples, )) – Sequence of tokenized documents
Returns: Predicted class probabilities
Return type: array, shape (n_samples, num_classes)
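The two return shapes, (n_samples, ) for predict and (n_samples, num_classes) for predict_proba, are related in the usual way: the most likely label is the class with the highest probability in each row. The sketch below shows that relationship. It assumes the probability columns follow the order of the class list, which this page does not confirm, and the function name is hypothetical.

```python
def labels_from_probas(probas, classes):
    """Pick the most probable class label for each row of probabilities.

    Assumes column j of `probas` corresponds to classes[j].
    """
    labels = []
    for row in probas:
        # Index of the largest probability in this row.
        best_index = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best_index])
    return labels


classes = ["aww", "news", "science"]
probas = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
]
predicted = labels_from_probas(probas, classes)
print(predicted)
```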
-
set_params(**params)
Set parameters of the model
Parameters: **params – Model parameters to update
-
tune_params(X, y, param_grid=None, verbose=False, cv=0.2, scoring='accuracy', k=3, refit=False)
Find the best values of hyperparameters using the chosen validation scheme
Parameters: - X (iterable, shape (n_samples, )) – Sequence of tokenized documents
- y (iterable, shape (n_samples, )) – Sequence of labels
- param_grid (dict, optional) – Dictionary with parameter names as keys and lists of parameter settings as values. If None, loads default values from JSON file
- verbose (bool, optional) – If True, print scores after fitting each model
- cv (float, int, cross-validation generator or an iterable, optional) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- float, to use holdout set of this size
- None, to use the default 3-fold cross validation,
- integer, to specify the number of folds in a StratifiedKFold,
- An object to be used as a cross-validation generator.
- An iterable yielding train, test splits.
- scoring (string, callable or None, optional) – A string (see sklearn model evaluation documentation) or a scorer callable object or ‘top_k_accuracy’
- k (int, optional) – k parameter for ‘top_k_accuracy’ scoring
- refit (boolean, optional) – If True, refit model with the best found parameters
Returns: - best_pars (dict) – Dictionary with the best combination of parameters
- best_value (float) – Best value of the chosen metric
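The shape of param_grid and of the returned (best_pars, best_value) pair can be sketched with a plain exhaustive search. This is an illustration under assumptions, not the library's code: it assumes every combination of the listed settings is tried and that a higher score is better, as with accuracy. The names iter_param_grid, grid_search and toy_score are hypothetical, and toy_score stands in for a real validation run.

```python
from itertools import product


def iter_param_grid(param_grid):
    """Yield one dict per combination of the listed parameter settings."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(param_grid, score_fn):
    """Return (best_pars, best_value), assuming higher scores are better."""
    best_pars, best_value = None, float("-inf")
    for pars in iter_param_grid(param_grid):
        value = score_fn(pars)
        if value > best_value:
            best_pars, best_value = pars, value
    return best_pars, best_value


def toy_score(pars):
    """Stand-in for a validation score; peaks at lr=0.1, epoch=10."""
    return 1.0 - abs(pars["lr"] - 0.1) - 0.01 * abs(pars["epoch"] - 10)


grid = {"lr": [0.05, 0.1, 0.5], "epoch": [5, 10]}
best_pars, best_value = grid_search(grid, toy_score)
print(best_pars, best_value)
```

The number of fitted models grows multiplicatively with each added parameter list, so a holdout split such as cv=0.2 keeps a large grid cheaper than k-fold validation.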
-
redditscore.models.bow_mod module
bow_mod: A wrapper for Bag-of-Words models
Author: Evgenii Nikitin <e.nikitin@nyu.edu>
Part of https://github.com/crazyfrogspb/RedditScore project
Copyright (c) 2018 Evgenii Nikitin. All rights reserved. This work is licensed under the terms of the MIT license.
-
class redditscore.models.bow_mod.BoWModel(estimator, ngrams=1, tfidf=True, random_state=24)
Bases: redditscore.models.redditmodel.RedditModel
A wrapper for Bag-of-Words models with or without tf-idf re-weighting
Parameters: - estimator (scikit-learn model) – Estimator object (classifier or regressor)
- ngrams (int, optional) – The upper boundary of the range of n-values for different n-grams to be extracted
- tfidf (bool, optional) – If True, use tf-idf re-weighting
- random_state (integer, optional) – Random seed
- **kwargs – Parameters of the multinomial model. For details check scikit-learn documentation.
-
params
dict – Dictionary with model parameters
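The ngrams parameter of BoWModel is described as the upper boundary of the n-gram range. The sketch below shows what that means for a tokenized document. It assumes the lower boundary is 1, so ngrams=2 yields unigrams and bigrams. The function name extract_ngrams is hypothetical, and the actual feature extraction in RedditScore may differ.

```python
def extract_ngrams(tokens, max_n=1):
    """Return all n-grams of a tokenized document for n = 1..max_n.

    Assumes the n-gram range starts at 1; n-grams are joined with spaces.
    """
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of width n over the token sequence.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["new", "gpu", "benchmarks"]
unigrams = extract_ngrams(tokens, max_n=1)
up_to_bigrams = extract_ngrams(tokens, max_n=2)
print(up_to_bigrams)
```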