FastTextModel

FastTextModel: A wrapper for Facebook fastText model

Author: Evgenii Nikitin <e.nikitin@nyu.edu>

Part of https://github.com/crazyfrogspb/RedditScore project

Copyright (c) 2018 Evgenii Nikitin. All rights reserved. This work is licensed under the terms of the MIT license.

class models.fasttext_mod.FastTextClassifier(lr=0.1, dim=100, ws=5, epoch=5, minCount=1, minCountLabel=0, minn=0, maxn=0, neg=5, wordNgrams=1, loss='softmax', bucket=2000000, thread=12, lrUpdateRate=100, t=0.0001, label='__label__', verbose=2)
get_params(deep=True)

Get parameters for this estimator.

Parameters:deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
score(X, y, sample_weight=None)

Returns the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
  • X (array-like, shape = (n_samples, n_features)) – Test samples.
  • y (array-like, shape = (n_samples) or (n_samples, n_outputs)) – True labels for X.
  • sample_weight (array-like, shape = [n_samples], optional) – Sample weights.
Returns:

score – Mean accuracy of self.predict(X) wrt. y.

Return type:

float
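The subset accuracy mentioned above can be illustrated with a minimal, library-independent sketch. The helper name subset_accuracy and the sample data are hypothetical and are not part of this module; the sketch only shows why a partially correct label set counts as a miss.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label set equals the true set exactly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    exact_matches = sum(set(t) == set(p) for t, p in zip(y_true, y_pred))
    return exact_matches / len(y_true)


# Hypothetical multi-label data: each sample carries a set of labels.
y_true = [{"news", "politics"}, {"sports"}, {"tech", "science"}]
y_pred = [{"news", "politics"}, {"sports"}, {"tech"}]

# The third prediction is only partially right, so it counts as a miss.
accuracy = subset_accuracy(y_true, y_pred)
```

Two of the three label sets match exactly, so the score here is 2/3 even though the third prediction got one of its two labels right.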

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:self
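The <component>__<parameter> naming convention can be sketched as follows. This is only an illustration of how such keys separate into a component name and a parameter name; split_nested_params and the component name "clf" are hypothetical, and the estimator's actual implementation may differ.

```python
def split_nested_params(params):
    """Separate plain parameters from '<component>__<parameter>' ones."""
    own, nested = {}, {}
    for key, value in params.items():
        # partition splits on the first '__' only, so deeper nesting
        # (e.g. 'a__b__c') stays intact in the remainder.
        component, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


# 'clf' is a hypothetical step name inside a pipeline-like object.
own, nested = split_nested_params({"lr": 0.05, "clf__epoch": 10, "clf__dim": 200})
```

Here lr would be applied to the outer object, while epoch and dim would be forwarded to the component named clf.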
class models.fasttext_mod.FastTextModel(random_state=24, **kwargs)

Facebook fastText classifier

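Parameters:
  • random_state (int, optional) – Random seed (the default is 24)
  • **kwargs – Additional keyword arguments for the model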
cv_score(X, y, cv=0.2, scoring='accuracy', k=3)

Calculate validation score

Parameters:
  • X (iterable, shape (n_samples, )) – Sequence of tokenized documents
  • y (iterable, shape (n_samples, )) – Sequence of labels
  • cv (float, int, cross-validation generator or an iterable, optional) –

    Determines the cross-validation splitting strategy. Possible inputs for cv are:

    • float, to use a holdout set of this size,
    • None, to use the default 3-fold cross-validation,
    • integer, to specify the number of folds in a StratifiedKFold,
    • an object to be used as a cross-validation generator,
    • an iterable yielding train, test splits.
  • scoring (string, callable or None, optional) – A string (see sklearn model evaluation documentation), a scorer callable object, or ‘top_k_accuracy’
  • k (int, optional) – k parameter for ‘top_k_accuracy’ scoring
Returns:

Average value of the validation metrics

Return type:

float
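The ‘top_k_accuracy’ option is not defined in detail here. Under the usual definition (a sample counts as correct when its true label is among the k most probable classes), it can be sketched as below. The helper top_k_accuracy and the sample data are hypothetical and not taken from this module.

```python
def top_k_accuracy(y_true, probabilities, classes, k=3):
    """Share of samples whose true label is among the k most probable classes."""
    if len(y_true) != len(probabilities):
        raise ValueError("y_true and probabilities must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    hits = 0
    for label, row in zip(y_true, probabilities):
        # Rank classes for this sample from most to least probable.
        ranked = sorted(zip(row, classes), key=lambda pair: pair[0], reverse=True)
        top_labels = {cls for _, cls in ranked[:k]}
        hits += label in top_labels
    return hits / len(y_true)


# Hypothetical class order and per-sample probabilities,
# shaped (n_samples, num_classes) like the output of predict_proba.
classes = ["news", "sports", "tech"]
probabilities = [
    [0.6, 0.3, 0.1],
    [0.2, 0.3, 0.5],
    [0.5, 0.4, 0.1],
]
y_true = ["news", "sports", "tech"]

top1 = top_k_accuracy(y_true, probabilities, classes, k=1)
top2 = top_k_accuracy(y_true, probabilities, classes, k=2)
```

Based on the signature above, a call on the model itself would look like model.cv_score(X, y, cv=0.2, scoring='top_k_accuracy', k=3).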

fit(X, y)

Fit model

Parameters:
  • X (iterable, shape (n_samples, )) – Sequence of tokenized documents
  • y (iterable, shape (n_samples, )) – Sequence of labels
Returns:

Fitted model object

Return type:

FastTextModel
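To make the expected input shapes concrete, the sketch below pairs tokenized documents with labels and renders them in fastText's supervised text format, where each line starts with the label prefix (‘__label__’ by default, matching the label parameter of FastTextClassifier). Whether this wrapper writes exactly such lines internally is an assumption; to_fasttext_lines is a hypothetical helper.

```python
def to_fasttext_lines(X, y, label_prefix="__label__"):
    """Render tokenized documents and labels as fastText supervised lines."""
    documents, labels = list(X), list(y)
    if len(documents) != len(labels):
        raise ValueError("X and y must contain the same number of samples")
    return [
        f"{label_prefix}{label} {' '.join(tokens)}"
        for tokens, label in zip(documents, labels)
    ]


# X is a sequence of tokenized documents, y a sequence of labels.
X = [["great", "game", "tonight"], ["new", "phone", "released"]]
y = ["sports", "tech"]

lines = to_fasttext_lines(X, y)
```

Each element of X is already split into tokens, so no further tokenization happens in this sketch.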

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (numpy array of shape [n_samples, n_features]) – Training set.
  • y (numpy array of shape [n_samples]) – Target values.
Returns:

X_new – Transformed array.

Return type:

numpy array of shape [n_samples, n_features_new]

get_params(deep=None)

Get parameters of the model

Returns:Dictionary with model parameters
Return type:dict
plot_analytics(classes=None, fig_sizes=((20, 15), (20, 20)), linkage_pars=None, dendrogram_pars=None, clustering_pars=None, tsne_pars=None, legend_pars=None, label_font_size=17)

Plot a hierarchical clustering dendrogram and a t-SNE visualization based on the learned class embeddings

Parameters:
  • classes (iter, optional) – Iterable of class labels to include in the plots. If None, use all classes
  • fig_sizes (tuple of tuples, optional) – Figure sizes for plots
  • linkage_pars (dict, optional) – Dictionary of parameters for hierarchical clustering. (scipy.cluster.hierarchy.linkage)
  • dendrogram_pars (dict, optional) – Dictionary of parameters for plotting dendrogram. (scipy.cluster.hierarchy.dendrogram)
  • clustering_pars (dict, optional) – Dictionary of parameters for producing flat clusters. (scipy.cluster.hierarchy.fcluster)
  • tsne_pars (dict, optional) – Dictionary of parameters for t-SNE. (sklearn.manifold.TSNE)
  • legend_pars (dict, optional) – Dictionary of parameters for legend plotting (matplotlib.pyplot.legend)
  • label_font_size (int, optional) – Font size for the labels on the t-SNE plot
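The dictionaries above are forwarded to the named SciPy, scikit-learn and matplotlib functions, so their keys should be keyword arguments of those functions. The values below are illustrative only. Which keys the method already sets itself is not documented here, so treat this as a sketch rather than a recommended configuration.

```python
# Keyword arguments for scipy.cluster.hierarchy.linkage
linkage_pars = {"method": "average", "metric": "cosine"}

# Keyword arguments for scipy.cluster.hierarchy.dendrogram
dendrogram_pars = {"orientation": "left", "leaf_font_size": 12}

# Keyword arguments for scipy.cluster.hierarchy.fcluster
# (with criterion='maxclust', t is the maximum number of flat clusters)
clustering_pars = {"t": 5, "criterion": "maxclust"}

# Keyword arguments for sklearn.manifold.TSNE
tsne_pars = {"perplexity": 15.0}

# Keyword arguments for matplotlib.pyplot.legend
legend_pars = {"loc": "best", "fontsize": 12}

# Collected under the parameter names from the plot_analytics signature.
plot_kwargs = {
    "linkage_pars": linkage_pars,
    "dendrogram_pars": dendrogram_pars,
    "clustering_pars": clustering_pars,
    "tsne_pars": tsne_pars,
    "legend_pars": legend_pars,
}
```

With a fitted model, these would presumably be passed as model.plot_analytics(**plot_kwargs).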
predict(X)

Predict the most likely label

Parameters:X (iterable, shape (n_samples, )) – Sequence of tokenized documents
Returns:

Predicted class labels

Return type:

array, shape (n_samples, )

predict_proba(X)

Predict class probabilities

Parameters:X (iterable, shape (n_samples, )) – Sequence of tokenized documents
Returns:

Predicted class probabilities

Return type:

array, shape (n_samples, num_classes)

save_model(filepath)

Save model to disk.

Parameters:filepath (str) – Path to the file where the model will be saved. NOTE: The model will be saved in two files, one with the ‘.pkl’ and one with the ‘.bin’ file extension.
set_params(**params)

Set parameters of the model

Parameters:**params – Model parameters to update
tune_params(X, y, param_grid=None, verbose=False, cv=0.2, scoring='accuracy', k=3, refit=False)

Find the best values of hyperparameters using chosen validation scheme

Parameters:
  • X (iterable, shape (n_samples, )) – Sequence of tokenized documents
  • y (iterable, shape (n_samples, )) – Sequence of labels
  • param_grid (dict, optional) – Dictionary with parameter names as keys and lists of parameter settings as values. If None, loads default values from a JSON file
  • verbose (bool, optional) – If True, print scores after fitting each model
  • cv (float, int, cross-validation generator or an iterable, optional) –

    Determines the cross-validation splitting strategy. Possible inputs for cv are:

    • float, to use a holdout set of this size,
    • None, to use the default 3-fold cross-validation,
    • integer, to specify the number of folds in a StratifiedKFold,
    • an object to be used as a cross-validation generator,
    • an iterable yielding train, test splits.
  • scoring (string, callable or None, optional) – A string (see sklearn model evaluation documentation), a scorer callable object, or ‘top_k_accuracy’
  • k (int, optional) – k parameter for ‘top_k_accuracy’ scoring
  • refit (boolean, optional) – If True, refit model with the best found parameters
Returns:

  • best_pars (dict) – Dictionary with the best combination of parameters
  • best_value (float) – Best value of the chosen metric
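A param_grid maps parameter names to lists of candidate values. Assuming an exhaustive grid search (the search strategy is not stated here), the candidate combinations can be enumerated as in the sketch below. expand_grid is a hypothetical helper, and the parameter names are taken from the FastTextClassifier signature.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of the listed parameter values."""
    names = sorted(param_grid)  # fixed order keeps the enumeration reproducible
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


# Illustrative grid; the values are examples, not recommended settings.
param_grid = {"lr": [0.05, 0.1], "epoch": [5, 10], "wordNgrams": [1, 2]}

candidates = list(expand_grid(param_grid))
```

The number of candidates is the product of the list lengths (2 × 2 × 2 = 8 here), so each added parameter multiplies the number of models to fit.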

models.fasttext_mod.load_model(filepath)

Load pickled model.

Parameters:filepath (str) – Path to the file from which the model will be loaded. NOTE: the directory has to contain two files with the provided name, one with the ‘.pkl’ and one with the ‘.bin’ file extension.
Returns:Unpickled model object.
Return type:FastTextModel
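Because loading depends on two files, it can help to verify that both exist before calling load_model. The sketch below assumes the two files share the base name of filepath and differ only in extension. That naming rule, and the helpers companion_files and missing_model_files, are assumptions for illustration and not part of this module.

```python
import tempfile
from pathlib import Path


def companion_files(filepath):
    """Return the assumed '.pkl' and '.bin' paths for a model file path."""
    base = Path(filepath)
    # with_suffix replaces an existing extension or appends one if absent.
    return base.with_suffix(".pkl"), base.with_suffix(".bin")


def missing_model_files(filepath):
    """List the companion files that are not present on disk."""
    return [path for path in companion_files(filepath) if not path.is_file()]


# Demonstration in a temporary directory: only the '.pkl' file exists.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "model"
    target.with_suffix(".pkl").touch()
    missing = [path.name for path in missing_model_files(target)]
```

An empty result from missing_model_files would indicate that both files are in place under this naming assumption.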