classification

cleanlab package for multiclass, multi-label learning with noisy labels for any dataset and model.

The LearningWithNoisyLabels class wraps around an instance of a classifier class. Your classifier must adhere to the sklearn template, meaning it must define four functions:

  • clf.fit(X, y, sample_weight = None)

  • clf.predict_proba(X)

  • clf.predict(X)

  • clf.score(X, y, sample_weight = None)

where X (of length n) contains the data/examples, y (of length n) contains the targets formatted as 0, 1, 2, ..., K-1, and sample_weight (of length n) re-weights examples in the loss function while training.

Note

There are two new notions of confidence in this package:

1. Confident examples – examples we are confident are labeled correctly. We prune everything else. Computationally, this means keeping the examples with a high probability of belonging to their provided label class.

2. Confident errors – examples we are confident are labeled erroneously. We prune these. Computationally, this means pruning the examples with a high probability of belonging to a different class.
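The two notions above can be sketched with NumPy. This is only an illustration, not cleanlab's implementation: the fixed `cutoff` value and all variable names are assumptions made for the example, and cleanlab chooses its own thresholds.

```python
import numpy as np

# Toy out-of-sample predicted probabilities for 5 examples and 3 classes.
pred_probs = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
    [0.40, 0.35, 0.25],
    [0.20, 0.10, 0.70],
])
labels = np.array([0, 1, 0, 0, 2])  # given (possibly noisy) labels

cutoff = 0.6  # hypothetical cutoff, chosen only for this illustration
n = len(labels)

# Probability that each example belongs to its provided label class.
self_confidence = pred_probs[np.arange(n), labels]

# Highest probability assigned to any class other than the given label.
masked = pred_probs.copy()
masked[np.arange(n), labels] = -np.inf
other_confidence = masked.max(axis=1)

confident_examples = self_confidence >= cutoff
confident_errors = other_confidence >= cutoff

print(confident_examples.tolist())  # [True, True, False, False, True]
print(confident_errors.tolist())    # [False, False, True, False, False]
```

Example 3 falls into neither group: the model is not confident about any class for it.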

Examples

>>> from cleanlab.classification import LearningWithNoisyLabels
>>> from sklearn.linear_model import LogisticRegression as LogReg
>>> rp = LearningWithNoisyLabels(clf=LogReg()) # Pass in any classifier.
>>> rp.fit(X_train, labels_maybe_with_errors)
>>> # Estimate the predictions as if you had trained without label issues.
>>> pred = rp.predict(X_test)

The easiest way to use any model (Tensorflow, caffe2, PyTorch, etc.) with cleanlab is to wrap it in a class that inherits from sklearn.base.BaseEstimator:

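from sklearn.base import BaseEstimator

class YourModel(BaseEstimator):  # Inherits the sklearn base estimator
    def __init__(self):
        pass  # Store hyperparameters as attributes here

    def fit(self, X, y, sample_weight=None):
        pass  # Train the model on (X, y), optionally weighting examples

    def predict(self, X):
        pass  # Return predicted class labels

    def predict_proba(self, X):
        pass  # Return predicted probabilities of shape (n, K)

    def score(self, X, y, sample_weight=None):
        pass  # Return a score (e.g. accuracy) on (X, y)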
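As a concrete instance of the template above, the following is a minimal sketch of a wrapper that implements all four methods. `CentroidClassifier` and its `temperature` parameter are hypothetical names invented for this example; they are not part of cleanlab or scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator


class CentroidClassifier(BaseEstimator):
    """Hypothetical model: softmax over negative distances to class centroids."""

    def __init__(self, temperature=1.0):
        self.temperature = temperature

    def fit(self, X, y, sample_weight=None):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        if sample_weight is None:
            sample_weight = np.ones(len(y))
        sample_weight = np.asarray(sample_weight, dtype=float)
        self.classes_ = np.unique(y)
        # Weighted mean of the examples in each class.
        self.centroids_ = np.stack([
            np.average(X[y == k], axis=0, weights=sample_weight[y == k])
            for k in self.classes_
        ])
        return self

    def predict_proba(self, X):
        X = np.asarray(X, dtype=float)
        # Distance from every example to every centroid, shape (n, K).
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        logits = -dists / self.temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        exp = np.exp(logits)
        return exp / exp.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

    def score(self, X, y, sample_weight=None):
        # Mean accuracy, optionally weighted per example.
        return float(np.average(self.predict(X) == np.asarray(y), weights=sample_weight))


X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y = np.array([0, 0, 1, 1])
model = CentroidClassifier().fit(X, y)
print(model.predict(np.array([[0.1, 0.0], [2.9, 3.2]])).tolist())  # [0, 1]
```

Storing each constructor argument under an attribute of the same name is what lets the inherited get_params and set_params work.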

Note

  • labels - The given (maybe noisy) labels in the original dataset, which may have errors.

  • Class labels (K classes) must be formatted as natural numbers: 0, 1, …, K-1
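If your raw labels are strings or arbitrary integers, they need to be mapped to 0, 1, …, K-1 first. A minimal sketch in plain Python, with made-up label names:

```python
raw_labels = ["cat", "dog", "bird", "dog", "cat"]

# Assign each distinct class a consecutive integer, in sorted order.
classes = sorted(set(raw_labels))
class_to_index = {name: index for index, name in enumerate(classes)}
labels = [class_to_index[name] for name in raw_labels]

print(classes)  # ['bird', 'cat', 'dog']
print(labels)   # [1, 2, 0, 2, 1]
```

Keeping `classes` around lets you translate predictions back to the original names.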

Note

Confident Learning is the state-of-the-art (Northcutt et al., 2021) for weak supervision, finding label issues in datasets, learning with noisy labels, uncertainty estimation, and more. It works with ANY classifier, including deep neural networks. See clf parameter.

Confident learning is a subfield of the theory and algorithms of machine learning with noisy labels. Among open-source implementations of confident learning, cleanlab achieves state-of-the-art performance across a variety of tasks such as multi-class classification, multi-label classification, and PU learning.

Given any classifier having the predict_proba() method, an input feature matrix, X, and a discrete vector of noisy labels, labels, Confident Learning estimates the classifications that would be obtained if the true_labels had instead been provided to the classifier during training. The name labels denotes the noisy labels, in place of tilde(y) (the notation used in the confident learning paper), for ASCII encoding reasons.

Classes:

LearningWithNoisyLabels([clf, seed, ...])

Automated and robust learning with noisy labels using any dataset and any model.

class cleanlab.classification.LearningWithNoisyLabels(clf=None, *, seed=None, cv_n_folds=5, filter_by='prune_by_noise_rate', converge_latent_estimates=False, pulearning=None, n_jobs=None)

Bases: sklearn.base.BaseEstimator

Automated and robust learning with noisy labels using any dataset and any model. This class trains a model clf with error-prone, noisy labels as if the model had instead been trained on a dataset with perfect labels. It achieves this by removing the detected label errors and training on the cleaned data.

Parameters
  • clf (sklearn.classifier compliant class (e.g. skorch wraps around PyTorch)) – Stores the classifier used in Confident Learning. See cleanlab.example_models for examples of sklearn wrappers around, e.g., PyTorch. The clf object must have the following three functions defined:

    1. clf.predict_proba(X) # Predicted probabilities

    2. clf.predict(X) # Predict labels

    3. clf.fit(X, y, sample_weight) # Train classifier

    The default classifier is logistic regression.

  • seed (int, default: None) – Set the default state of the random number generator used to split the cross-validated folds. If None, uses np.random current random state.

  • cv_n_folds (int) – This class needs holdout predicted probabilities for every data example and if not provided, uses cross-validation to compute them. cv_n_folds sets the number of cross-validation folds used to compute out-of-sample probabilities for each example in X.

  • filter_by (str, default: 'prune_by_noise_rate') –

    Possible values: {‘prune_by_class’, ‘prune_by_noise_rate’, ‘both’, ‘confident_learning’, ‘predicted_neq_given’}

    Determines the method used to filter label issues.

    1. filter_by=prune_by_noise_rate: works by removing examples with a high probability of being mislabeled for every off-diagonal entry of the confident joint (see prune_counts_matrix in filter.py). These are the examples where (with high confidence) the given label is unlikely to match the predicted label.

    2. filter_by=prune_by_class: works by removing the examples with smallest probability of belonging to their given class label for every class.

    3. filter_by=both: Finds the examples flagged by both (1) AND (2) and removes only those, i.e. their set intersection.

    4. filter_by=confident_learning: Find examples that are confidently labeled as a different class from their given label while computing the confident joint.

    5. filter_by=predicted_neq_given: Find examples where the argmax prediction does not match the given label.

  • converge_latent_estimates (bool (Default: False)) – If true, forces numerical consistency of latent estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively enforce consistency.

  • pulearning (int (0 or 1, default: None)) – Only works for 2 class datasets. Set to the integer of the class that is perfectly labeled (certain no errors in that class).

  • n_jobs (int) – Number of processing threads used by multiprocessing. The default, None, uses the number of processing threads on your CPU. Set this to 1 to disable parallel processing (if it’s causing issues). Windows users may see a speed-up with n_jobs = 1.
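To make two of the filter_by options more concrete, here is a simplified NumPy sketch. It is not cleanlab's implementation: `num_to_flag` is a hypothetical fixed count used only for illustration, and the toy `pred_probs` are made up. The last line shows how two boolean masks combine by set intersection, which is the operation the `both` option applies to its own two methods.

```python
import numpy as np

pred_probs = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.30, 0.70],
    [0.20, 0.80],
    [0.80, 0.20],
    [0.55, 0.45],
])
labels = np.array([0, 0, 0, 1, 1, 1])
n = len(labels)

# predicted_neq_given: the argmax prediction disagrees with the given label.
predicted_neq_given = pred_probs.argmax(axis=1) != labels

# prune_by_class-style ranking: within each class, order examples by the
# probability of their given label and flag the lowest ones.
self_confidence = pred_probs[np.arange(n), labels]
num_to_flag = 1  # hypothetical per-class count, fixed here only for illustration
lowest_per_class = np.zeros(n, dtype=bool)
for k in np.unique(labels):
    idx = np.flatnonzero(labels == k)
    worst = idx[np.argsort(self_confidence[idx])[:num_to_flag]]
    lowest_per_class[worst] = True

# Set intersection of two masks: flagged by both criteria.
intersection = predicted_neq_given & lowest_per_class

print(predicted_neq_given.tolist())  # [False, False, True, False, True, True]
print(lowest_per_class.tolist())     # [False, False, True, False, True, False]
print(intersection.tolist())         # [False, False, True, False, True, False]
```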

Methods:

fit(X, labels, *[, pred_probs, thresholds, ...])

This method trains the model self.clf with error-prone, noisy labels as if the model had been instead trained on a dataset with perfect labels.

get_label_issues()

Accessor.

get_params([deep])

Get parameters for this estimator.

predict(*args, **kwargs)

Returns a vector of predictions.

predict_proba(*args, **kwargs)

Returns a vector of probabilities P(true_label=k) for each example in X.

score(X, y[, sample_weight])

Returns the clf's score on a test set X with labels y.

set_params(**params)

Set the parameters of this estimator.

fit(X, labels, *, pred_probs=None, thresholds=None, noise_matrix=None, inverse_noise_matrix=None)

This method trains the model self.clf with error-prone, noisy labels as if the model had instead been trained on a dataset with perfect labels. It achieves this by removing the detected label errors and training on the cleaned data.

Parameters
  • X (np.array) – Input feature matrix (N, D), 2D numpy array

  • labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for dataset with K classes, labels must be in {0,1,…,K-1}.

  • pred_probs (np.array (shape (N, K))) – P(label=k|x) is a matrix with K model-predicted probabilities. Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class. The columns must be ordered such that these probabilities correspond to class 0, 1, 2, ... The pred_probs should have been computed using 3 (or higher) fold cross-validation.

    Note

    If you are not sure, leave pred_probs = None (default) and it will be computed for you using cross-validation with your model.

  • thresholds (iterable (list or np.array) of shape (K, 1) or (K,)) – P(label^=k|label=k). List of probabilities used to determine the cutoff predicted probability necessary to consider an example as a given class label. Default is None. These are computed for you automatically. If an example has a predicted probability “greater” than this threshold, it is counted as having true_label = k. This is not used for pruning/filtering, only for estimating the noise rates using confident counts. Values in list should be between 0 and 1.

  • noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

  • inverse_noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(true_label=k_y|label=k_s). Contains the estimated fraction of observed examples in each class k_s that are mislabeled examples from every other class k_y. If None, the inverse_noise_matrix will be computed from pred_probs and labels. Assumes columns of inverse_noise_matrix sum to 1.

Returns

(label_issues_mask, sample_weight)

Return type

tuple

get_label_issues()

Accessor. Returns self.label_issues_mask computed via filter.find_label_issues()
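A mask of this kind is typically used to inspect or drop the flagged examples. The sketch below uses a hand-written stand-in for the mask and assumes it is a boolean array of length n in which True marks a suspected label issue. Check the mask actually returned by your cleanlab version before relying on that convention.

```python
import numpy as np

X = np.array([[0.0], [0.1], [2.0], [2.1], [0.05]])
labels = np.array([0, 0, 1, 1, 1])

# Stand-in for the mask returned by get_label_issues(); assumed to be boolean,
# with True marking a suspected label issue.
label_issues_mask = np.array([False, False, False, False, True])

# Keep only the examples that were not flagged.
X_clean = X[~label_issues_mask]
labels_clean = labels[~label_issues_mask]

# Indices of flagged examples, e.g. for manual review.
issue_indices = np.flatnonzero(label_issues_mask)

print(issue_indices.tolist())  # [4]
print(labels_clean.tolist())   # [0, 0, 1, 1]
```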

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

predict(*args, **kwargs)

Returns a vector of predictions.

Parameters

X (np.array of shape (n, m)) – The test data as a feature matrix.

predict_proba(*args, **kwargs)

Returns a vector of probabilities P(true_label=k) for each example in X.

Parameters

X (np.array of shape (n, m)) – The test data as a feature matrix.

score(X, y, sample_weight=None)

Returns the clf’s score on a test set X with labels y. Uses the model/clf’s default scoring function.

Parameters
  • X (np.array of shape (n, m)) – The test data as a feature matrix.

  • y (np.array of shape (n,) or (n, 1)) – The test classification labels as an array.

  • sample_weight (np.array of shape (n,) or (n, 1)) – Weights each example when computing the score / accuracy.
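For scikit-learn classifiers the default scoring function is mean accuracy, so sample_weight changes how much each example counts toward it. A small sketch with made-up predictions, assuming an accuracy-style score (your clf may define score differently):

```python
import numpy as np

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
sample_weight = np.array([1.0, 1.0, 2.0, 1.0])

correct = y_pred == y_true

# Unweighted accuracy: 3 of 4 predictions are correct.
accuracy = correct.mean()  # 0.75

# Weighted accuracy: the misclassified example carries weight 2 out of 5.
weighted_accuracy = np.average(correct, weights=sample_weight)  # 0.6
```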

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance
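The <component>__<parameter> form is inherited from scikit-learn, so it can be illustrated with a plain scikit-learn Pipeline. The step names "scale" and "clf" below are arbitrary choices for this example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Update a parameter of the nested "clf" step via <component>__<parameter>.
pipe.set_params(clf__C=0.5)

print(pipe.get_params(deep=True)["clf__C"])  # 0.5
print(pipe.named_steps["clf"].C)             # 0.5
```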