Classification#

The cleanlab package for multiclass learning with noisy labels, usable with any model.

The LearningWithNoisyLabels class wraps around an instance of a classifier class. Your classifier must adhere to the sklearn template, meaning it must define four functions:

  • clf.fit(X, y, sample_weight = None)

  • clf.predict_proba(X)

  • clf.predict(X)

  • clf.score(X, y, sample_weight = None)

where X (of length n) contains the data/examples, y (of length n) contains the targets formatted as 0, 1, 2, ..., K-1, and sample_weight (of length n) re-weights examples in the loss function while training.

Note

There are two new notions of confidence in this package:

1. Confident examples – examples we are confident are labeled correctly. We prune everything else. Computationally, this means keeping the examples with high probability of belonging to their provided label class.

2. Confident errors – examples we are confident are labeled erroneously. We prune these. Computationally, this means pruning the examples with high probability of belonging to a different class.

Examples

>>> from cleanlab.classification import LearningWithNoisyLabels
>>> from sklearn.linear_model import LogisticRegression as LogReg
>>> rp = LearningWithNoisyLabels(clf=LogReg()) # Pass in any classifier.
>>> rp.fit(X_train, y_may_have_label_errors)
>>> # Estimate the predictions as if you had trained without label errors.
>>> pred = rp.predict(X_test)

The easiest way to use any model (Tensorflow, caffe2, PyTorch, etc.) with cleanlab is to wrap it in a class that inherits the sklearn.base.BaseEstimator:

from sklearn.base import BaseEstimator
class YourModel(BaseEstimator): # Inherits sklearn base classifier
    def __init__(self):
        pass
    def fit(self, X, y, sample_weight=None):
        # Train your model on data X with (possibly noisy) labels y.
        pass
    def predict(self, X):
        # Return a vector of predicted labels for X.
        pass
    def predict_proba(self, X):
        # Return an (n, K) matrix of predicted class probabilities for X.
        pass
    def score(self, X, y, sample_weight=None):
        # Return a scalar score (e.g. accuracy) of the model on X, y.
        pass

Note

  • s denotes the noisy labels. These are just the dataset labels, which may contain errors.

  • Class labels (K classes) must be formatted as natural numbers: 0, 1, ..., K-1

Classes:

LearningWithNoisyLabels([clf, seed, ...])

Automated learning with noisy labels using any model.

class cleanlab.classification.LearningWithNoisyLabels(clf=None, seed=None, cv_n_folds=5, prune_method='prune_by_noise_rate', converge_latent_estimates=False, pulearning=None, n_jobs=None)[source]#

Bases: sklearn.base.BaseEstimator

Automated learning with noisy labels using any model.

Confident Learning is the state-of-the-art (Northcutt et al., 2021) for weak supervision, finding label errors in datasets, learning with noisy labels, uncertainty estimation, and more. It works with ANY classifier, including deep neural networks. See clf parameter.

This subfield of machine learning is referred to as Confident Learning. Confident Learning also achieves state-of-the-art performance for binary classification with noisy labels and positive-unlabeled learning (PU learning) where a subset of positive examples is given and all other examples are unlabeled and assumed to be negative examples. Confident Learning works by “learning from confident examples.” Confident examples are identified as examples with high predicted probability for their training label.

Given any classifier having the predict_proba() method, an input feature matrix, X, and a discrete vector of labels, s, which may contain mislabeling, Confident Learning estimates the classifications that would be obtained if the hidden, true labels, y, had instead been provided to the classifier during training. “s” denotes the noisy label instead of tilde(y), for ASCII encoding reasons.

Parameters
  • clf (sklearn.classifier compliant class (e.g. skorch wraps around PyTorch)) – See cleanlab.models for examples of sklearn wrappers around, e.g., PyTorch. The clf object must have the following three functions defined: 1. clf.predict_proba(X) (predicted probabilities), 2. clf.predict(X) (predict labels), 3. clf.fit(X, y, sample_weight) (train the classifier). Stores the classifier used in Confident Learning. The default classifier is logistic regression.

  • seed (int, default: None) – Set the default state of the random number generator used to split the cross-validated folds. If None, uses np.random current random state.

  • cv_n_folds (int) – This class needs holdout predicted probabilities for every data example and if not provided, uses cross-validation to compute them. cv_n_folds sets the number of cross-validation folds used to compute out-of-sample probabilities for each example in X.

  • prune_method (str, default: prune_by_noise_rate) –

    Available options: ‘prune_by_class’, ‘prune_by_noise_rate’, or ‘both’. This str determines the method used for pruning.

    Note

    1. prune_method=prune_by_noise_rate: works by removing examples with high probability of being mislabeled for every non-diagonal in the prune_counts_matrix (see pruning.py).

    2. prune_method=prune_by_class: works by removing the examples with smallest probability of belonging to their given class label for every class.

    3. prune_method=both: Finds the examples satisfying (1) AND (2) and removes their set conjunction.

  • converge_latent_estimates (bool (Default: False)) – If true, forces numerical consistency of latent estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively enforce consistency.

  • pulearning (int (0 or 1, default: None)) – Only works for 2 class datasets. Set to the integer of the class that is perfectly labeled (certain no errors in that class).

  • n_jobs (int (Windows users may see a speed-up with n_jobs = 1)) – Number of processing threads used by multiprocessing. Default None sets to the number of processing threads on your CPU. Set this to 1 to REMOVE parallel processing (if it's causing issues).
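
For instance, a minimal sketch of configuring the class with these options (the parameter values below are illustrative, not recommendations):

from cleanlab.classification import LearningWithNoisyLabels
from sklearn.linear_model import LogisticRegression

# Illustrative settings only; any sklearn-compatible classifier works.
lnl = LearningWithNoisyLabels(
    clf=LogisticRegression(),
    seed=0,                              # reproducible cross-validation splits
    cv_n_folds=5,                        # folds used to compute out-of-sample psx
    prune_method='prune_by_noise_rate',  # one of the three pruning options above
    n_jobs=1,                            # disable multiprocessing if it causes issues
)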

Methods:

fit(X, s[, psx, thresholds, noise_matrix, ...])

This method implements confident learning.

get_params([deep])

Get parameters for this estimator.

predict(*args, **kwargs)

Returns a vector of predicted labels.

predict_proba(*args, **kwargs)

Returns a vector of probabilities P(y=k) for each example in X.

score(X, y[, sample_weight])

Returns the clf's score on a test set X with labels y.

set_params(**params)

Set the parameters of this estimator.

fit(X, s, psx=None, thresholds=None, noise_matrix=None, inverse_noise_matrix=None)[source]#

This method implements confident learning. It counts examples that are likely labeled correctly and incorrectly and uses their ratio to create a predicted confusion matrix. This function fits the classifier (self.clf) to (X, s) accounting for the noise in both the positive and negative sets.

Parameters
  • X (np.array) – Input feature matrix (N, D), 2D numpy array

  • s (np.array) – A discrete vector of labels, s, which may contain mislabeling.

  • psx (np.array (shape (N, K))) – P(s=k|x) is a matrix with K (noisy) probabilities for each of the N examples x. This is the probability distribution over all K classes, for each example, regarding whether the example has label s==k P(s=k|x). psx should have been computed using 3 (or higher) fold cross-validation. If you are not sure, leave psx = None (default) and it will be computed for you using cross-validation.

  • thresholds (iterable (list or np.array) of shape (K, 1) or (K,)) – P(s^=k|s=k). List of probabilities used to determine the cutoff predicted probability necessary to consider an example as a given class label. Default is None. These are computed for you automatically. If an example has a predicted probability “greater” than this threshold, it is counted as having hidden label y = k. This is not used for pruning, only for estimating the noise rates using confident counts. Values in list should be between 0 and 1.

  • noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(s=k_s|y=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

  • inverse_noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(y=k_y|s=k_s). Contains the estimated fraction of observed examples in each class k_s that are mislabeled examples from every other class k_y. If None, the inverse_noise_matrix will be computed from psx and s. Assumes columns of inverse_noise_matrix sum to 1.

Returns

(noise_mask, sample_weight)

Return type

tuple
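
A hedged usage sketch (X_train, s_train, X_test, and psx are placeholders; psx would come from cross-validation, e.g. estimate_cv_predicted_probabilities in Latent Estimation below):

from cleanlab.classification import LearningWithNoisyLabels
from sklearn.linear_model import LogisticRegression

lnl = LearningWithNoisyLabels(clf=LogisticRegression())
# Passing precomputed out-of-sample probabilities skips the internal cross-validation.
lnl.fit(X_train, s_train, psx=psx)
pred = lnl.predict(X_test)  # predictions as if the model had been trained on clean labels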

get_params(deep=True)#

Get parameters for this estimator.

Parameters

deep (bool, default True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

predict(*args, **kwargs)[source]#

Returns a vector of predicted labels.

Parameters

X (np.array of shape (n, m)) – The test data as a feature matrix.

predict_proba(*args, **kwargs)[source]#

Returns a vector of probabilities P(y=k) for each example in X.

Parameters

X (np.array of shape (n, m)) – The test data as a feature matrix.

score(X, y, sample_weight=None)[source]#

Returns the clf’s score on a test set X with labels y. Uses the model’s default scoring function.

Parameters
  • X (np.array of shape (n, m)) – The test data as a feature matrix.

  • y (np.array of shape (n,) or (n, 1)) – The test classification labels as an array.

  • sample_weight (np.array of shape (n,) or (n, 1)) – Weights each example when computing the score / accuracy.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

Latent Estimation#

Functions:

calibrate_confident_joint(confident_joint, s)

Calibrates any confident joint estimate P(s=i, y=j) such that np.sum(cj) == len(s) and np.sum(cj, axis = 1) == np.bincount(s).

compute_confident_joint(s, psx[, K, ...])

Estimates P(s,y), the confident counts of the latent joint distribution of true and noisy labels using observed s and predicted probabilities psx.

converge_estimates(ps, py, noise_matrix, ...)

Computes py := P(y=k) and both noise_matrix and inverse_noise_matrix, by numerically converging ps := P(s=k), py, and the noise matrices.

estimate_confident_joint_and_cv_pred_proba(X, s)

Estimates P(s,y), the confident counts of the latent joint distribution of true and noisy labels using observed s and predicted probabilities psx.

estimate_confident_joint_from_probabilities(s, psx)

DEPRECATED AS OF VERSION 0.0.8.

estimate_cv_predicted_probabilities(X, labels)

This function computes the out-of-sample predicted probability [P(s=k|x)] for every example in X using cross validation.

estimate_joint(s[, psx, confident_joint, ...])

Estimates the joint distribution of label noise P(s=i, y=j), guaranteed to sum to 1 and to have rows matching p(s).

estimate_latent(confident_joint, s[, ...])

Computes the latent prior p(y), the noise matrix P(s|y) and the inverse noise matrix P(y|s) from the confident_joint count(s, y).

estimate_noise_matrices(X, s[, clf, ...])

Estimates the noise_matrix of shape (K, K).

estimate_py_and_noise_matrices_from_probabilities(s, psx)

Computes the confident counts estimate of latent variables py and the noise rates using observed s and predicted probabilities psx.

estimate_py_noise_matrices_and_cv_pred_proba(X, s)

This function computes the out-of-sample predicted probability P(s=k|x) for every example x in X using cross validation while also computing the confident counts noise rates within each cross-validated subset and returning the average noise rate across all examples.

num_label_errors(labels, psx[, confident_joint])

Estimates the number of label errors in labels.

cleanlab.latent_estimation.calibrate_confident_joint(confident_joint, s, multi_label=False)[source]#

Calibrates any confident joint estimate P(s=i, y=j) such that np.sum(cj) == len(s) and np.sum(cj, axis = 1) == np.bincount(s).

In other words, this function forces the confident joint to have the true noisy prior p(s) (summed over columns for each row) and also forces the confident joint to add up to the total number of examples.

This method makes the confident joint a valid counts estimate of the actual joint of noisy and true labels.

Parameters
  • confident_joint (np.array (shape (K, K))) – A K,K integer matrix of count(s=k, y=k). Estimates a confident subset of the joint distribution of the noisy and true labels P_{s,y}. Each entry in the matrix contains the number of examples confidently counted into every pair (s=j, y=k) classes.

  • s (np.array) – A discrete vector of labels, s, which may contain mislabeling. “s” denotes the noisy label instead of tilde(y), for ASCII reasons.

  • multi_label (bool) – If true, s should be an iterable (e.g. list) of iterables, containing a list of labels for each example, instead of just a single label. The MAJOR DIFFERENCE in how this is calibrated versus single_label is that the total number of errors considered is based on the number of labels, not the number of examples. So, the calibrated confident_joint will sum to the total number of labels.

Returns

An np.array of shape (K, K) of type float representing a valid estimate of the joint COUNTS of noisy and true labels.
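
A small hedged example of what calibration enforces (the counts below are made up for illustration):

import numpy as np
from cleanlab.latent_estimation import calibrate_confident_joint

confident_joint = np.array([[4, 1],
                            [1, 3]])     # toy 2-class confident counts
s = np.array([0] * 6 + [1] * 4)          # noisy labels: 6 zeros, 4 ones
cj = calibrate_confident_joint(confident_joint, s)
# After calibration: cj.sum() == len(s) and cj.sum(axis=1) == np.bincount(s).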

cleanlab.latent_estimation.compute_confident_joint(s, psx, K=None, thresholds=None, calibrate=True, multi_label=False, return_indices_of_off_diagonals=False)[source]#

Estimates P(s,y), the confident counts of the latent joint distribution of true and noisy labels using observed s and predicted probabilities psx.

This estimate is called the confident joint.

When calibrate = True, this method returns an estimate of the latent true joint counts of noisy and true labels.

Important! This function assumes that psx are out-of-sample holdout probabilities. This can be done with cross validation. If the probabilities are not computed out-of-sample, overfitting may occur.

This function estimates the joint of shape (K, K). This is the confident counts of examples in every class, labeled as every other class.

Under certain conditions, estimates are exact, and in most conditions, the estimate is within 1 percent of the truth.

Parameters
  • s (np.array) – A discrete vector of labels, s, which may contain mislabeling. “s” denotes the noisy label instead of tilde(y), for ASCII reasons.

  • psx (np.array (shape (N, K))) – P(label=k|x) is a matrix with K (noisy) probabilities for each of the N examples x. This is the probability distribution over all K classes, for each example, regarding whether the example has label s==k P(s=k|x). psx should have been computed using 3 (or higher) fold cross-validation.

  • K (int (default: None)) – Number of unique classes. Calculated as len(np.unique(s)) when K == None

  • thresholds (iterable (list or np.array) of shape (K, 1) or (K,)) – P(s^=k|s=k). If an example has a predicted probability “greater” than this threshold, it is counted as having hidden label y = k. This is not used for pruning, only for estimating the noise rates using confident counts. This value should be between 0 and 1. Default is None.

  • calibrate (bool (default: True)) – Calibrates confident joint estimate P(s=i, y=j) such that np.sum(cj) == len(s) and np.sum(cj, axis = 1) == np.bincount(s).

  • multi_label (bool) – If true, s should be an iterable (e.g. list) of iterables, containing a list of labels for each example, instead of just a single label.

  • return_indices_of_off_diagonals (bool) – If true, returns indices of examples that were counted in off-diagonals of the confident joint as a baseline proxy for the label errors. This sometimes works as well as pruning.get_noise_indices(confident_joint).

Examples

We provide a for-loop based simplification of the confident joint below. This implementation is not efficient, not used in practice, and not complete, but covers the gist of how the confident joint is computed:

# Confident examples are those that we are confident have label y = k
# Estimate (K, K) matrix of confident examples with s = k_s and y = k_y
cj_ish = np.zeros((K, K))
for k_s in range(K): # k_s is the class value k of noisy label s
    for k_y in range(K): # k_y is the (guessed) class k of true label y
        cj_ish[k_s][k_y] = sum((psx[:,k_y] >= (thresholds[k_y] - 1e-8)) & (s == k_s))

The following is a vectorized (but non-parallelized) implementation of the confident joint, again slow and written with for-loops, simplified for understanding. This implementation is 100% accurate, it's just not optimized for speed.

confident_joint = np.zeros((K, K), dtype = int)
for i, row in enumerate(psx):
    s_label = s[i]
    confident_bins = row >= thresholds - 1e-6
    num_confident_bins = sum(confident_bins)
    if num_confident_bins == 1:
        confident_joint[s_label][np.argmax(confident_bins)] += 1
    elif num_confident_bins > 1:
        confident_joint[s_label][np.argmax(row)] += 1
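
A hedged usage sketch of the function itself (s and psx are placeholders; psx must be out-of-sample predicted probabilities):

from cleanlab.latent_estimation import compute_confident_joint

confident_joint = compute_confident_joint(s, psx)
# Assumption: with return_indices_of_off_diagonals=True the function also returns
# the off-diagonal indices as a baseline proxy for label errors.
confident_joint, error_idx = compute_confident_joint(
    s, psx, return_indices_of_off_diagonals=True)
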
cleanlab.latent_estimation.converge_estimates(ps, py, noise_matrix, inverse_noise_matrix, inv_noise_matrix_iterations=5, noise_matrix_iterations=3)[source]#

Computes py := P(y=k) and both noise_matrix and inverse_noise_matrix, by numerically converging ps := P(s=k), py, and the noise matrices.

Forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.

py := P(y=k) and the inverse noise matrix P(y=k_y|s=k_s) specify one another, meaning one can be computed from the other and vice versa. When numerical discrepancy exists due to poor estimation, they can be made to agree by repeatedly computing one from the other, for a certain number of iterations (3-10 works fine).

Do not set iterations too high or performance will decrease as small deviations will get perturbed over and over and potentially magnified.

Note that we have to first converge the inverse_noise_matrix and py, then we can update the noise_matrix, then repeat. This is because the inverse noise matrix depends on py (which is unknown/latent), but the noise matrix depends on ps (which is known), so there will be no change in the noise matrix if we recompute it when py and inverse_noise_matrix change.

Parameters
  • ps (np.array (shape (K, ) or (1, K))) – The fraction (prior probability) of each observed, NOISY class P(s = k).

  • py (np.array (shape (K, ) or (1, K))) – The estimated fraction (prior probability) of each TRUE class P(y = k).

  • noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(s=k_s|y=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

  • inverse_noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(y=k_y|s=k_s) representing the estimated fraction observed examples in each class k_s, that are mislabeled examples from every other class k_y. If None, the inverse_noise_matrix will be computed from psx and s. Assumes columns of inverse_noise_matrix sum to 1.

  • inv_noise_matrix_iterations (int (Default: 5)) – Number of times to converge inverse noise matrix with py and noise mat.

  • noise_matrix_iterations (int (Default: 3)) – Number of times to converge noise matrix with py and inverse noise mat.

Returns

Three np.arrays of the form (py, noise_matrix, inverse_noise_matrix), all having numerical agreement in terms of their mathematical relations.

cleanlab.latent_estimation.estimate_confident_joint_and_cv_pred_proba(X, s, clf=LogisticRegression(), cv_n_folds=5, thresholds=None, seed=None, calibrate=True)[source]#

Estimates P(s,y), the confident counts of the latent joint distribution of true and noisy labels using observed s and predicted probabilities psx.

The output of this function is a numpy array of shape (K, K).

Under certain conditions, estimates are exact, and in many conditions, estimates are within one percent of actual.

Notes: There are two ways to compute the confident joint, each with pros and cons:

1. For each holdout set, compute the confident joint, then sum them up.

2. Compute pred_proba for each fold, combine them, then compute the confident joint.

(1) is more accurate because it correctly computes thresholds for each fold. (2) is more accurate when you have only a little data because it computes the confident joint using all of the probabilities. For example, if you had 100 examples, with 5-fold cross-validation and uniform p(y), you would only have 20 examples to compute each confident joint for (1). Such small amounts of data are bound to result in estimation errors. For this reason, we implement (2), but we keep (1) as a commented-out function at the end of this file.

Parameters
  • X (np.array) – Input feature matrix (N, D), 2D numpy array

  • s (np.array) – A discrete vector of labels, s, which may contain mislabeling. “s” denotes the noisy label instead of tilde(y), for ASCII reasons.

  • clf (sklearn.classifier or equivalent) – Default classifier used is logistic regression. Assumes clf has predict_proba() and fit() defined.

  • cv_n_folds (int) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.

  • thresholds (iterable (list or np.array) of shape (K, 1) or (K,)) – P(s^=k|s=k). If an example has a predicted probability “greater” than this threshold, it is counted as having hidden label y = k. This is not used for pruning, only for estimating the noise rates using confident counts. This value should be between 0 and 1. Default is None.

  • seed (int (default = None)) – Set the default state of the random number generator used to split the cross-validated folds. If None, uses np.random current random state.

  • calibrate (bool (default: True)) – Calibrates confident joint estimate P(s=i, y=j) such that np.sum(cj) == len(s) and np.sum(cj, axis = 1) == np.bincount(s).

Returns

Returns a tuple of two numpy array matrices in the form (joint counts matrix, predicted probability matrix).
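
A hedged usage sketch (X and s are placeholders):

from sklearn.linear_model import LogisticRegression
from cleanlab.latent_estimation import estimate_confident_joint_and_cv_pred_proba

confident_joint, psx = estimate_confident_joint_and_cv_pred_proba(
    X, s, clf=LogisticRegression(), cv_n_folds=5, seed=0)
# confident_joint has shape (K, K); psx has shape (N, K).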

cleanlab.latent_estimation.estimate_confident_joint_from_probabilities(s, psx, thresholds=None, force_ps=False, return_list_of_converging_cj_matrices=False)[source]#

DEPRECATED AS OF VERSION 0.0.8. REMOVED AS OF VERSION 0.0.10.

Estimates P(s,y), the confident counts of the latent joint distribution of true and noisy labels using observed s and predicted probabilities psx.

UNLIKE compute_confident_joint, this function calibrates the confident joint estimate P(s=i, y=j) such that np.sum(cj) == len(s) and np.sum(cj, axis = 1) == np.bincount(s).

Parameters
  • s (np.array) – A discrete vector of labels, s, which may contain mislabeling. “s” denotes the noisy label instead of tilde(y), for ASCII reasons.

  • psx (np.array (shape (N, K))) – P(label=k|x) is a matrix with K (noisy) probabilities for each of the N examples x. This is the probability distribution over all K classes, for each example, regarding whether the example has label s==k P(s=k|x). psx should have been computed using 3 (or higher) fold cross-validation.

  • thresholds (iterable (list or np.array) of shape (K, 1) or (K,)) – P(s^=k|s=k). If an example has a predicted probability “greater” than this threshold, it is counted as having hidden label y = k. This is not used for pruning, only for estimating the noise rates using confident counts. This value should be between 0 and 1. Default is None.

  • force_ps (bool or int) – If true, forces the output confident_joint matrix to have p(s) closer to the true p(s). The method used is SGD with a learning rate of eta = 0.5. If force_ps is an integer, it represents the number of epochs. Setting this to True is not always good. To make p(s) match, fewer confident examples are used to estimate the confident_joint, resulting in poorer estimation of the overall matrix even if p(s) is more accurate.

  • return_list_of_converging_cj_matrices (bool (default = False)) – When force_ps is true, it converges the joint count matrix that is returned. Setting this to true will return the list of the converged matrices. The first item in the list is the original and the last item is the final result.

Returns

confident_joint matrix count(s, y) (np.array (shape (K, K))) where np.sum(confident_joint) ~ len(s) and rows sum to np.bincount(s).

cleanlab.latent_estimation.estimate_cv_predicted_probabilities(X, labels, clf=LogisticRegression(), cv_n_folds=5, seed=None)[source]#

This function computes the out-of-sample predicted probability [P(s=k|x)] for every example in X using cross validation. Output is a np.array of shape (N, K) where N is the number of training examples and K is the number of classes.

Parameters
  • X (np.array) – Input feature matrix (N, D), 2D numpy array

  • labels (np.array or list of ints from [0,1,..,K-1]) – A discrete vector of class labels which may or may not contain mislabeling

  • clf (sklearn.classifier or equivalent) – Default classifier used is logistic regression. Assumes clf has predict_proba() and fit() defined.

  • cv_n_folds (int) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.

  • seed (int (default = None)) – Set the default state of the random number generator used to split the cross-validated folds. If None, uses np.random current random state.

Returns

psx – P(label=k|x) is a matrix with K (noisy) probabilities for each of the N examples x. This is the probability distribution over all K classes, for each example, regarding whether the example has label s==k P(s=k|x). psx should have been computed using 3 (or higher) fold cross-validation.

Return type

np.array (shape (N, K))
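
A hedged usage sketch (X and labels are placeholders):

from cleanlab.latent_estimation import estimate_cv_predicted_probabilities

# Out-of-sample predicted probabilities for every training example; psx.shape == (N, K).
psx = estimate_cv_predicted_probabilities(X, labels, cv_n_folds=5, seed=0)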

cleanlab.latent_estimation.estimate_joint(s, psx=None, confident_joint=None, multi_label=False)[source]#
Estimates the joint distribution of label noise P(s=i, y=j) guaranteed to
  • sum to 1

  • np.sum(joint_estimate, axis = 1) == p(s)

Parameters

See the cleanlab.latent_estimation.calibrate_confident_joint docstring.

Returns

An np.array of shape (K, K) of type float representing a valid estimate of the true joint of noisy and true labels.

cleanlab.latent_estimation.estimate_latent(confident_joint, s, py_method='cnt', converge_latent_estimates=False)[source]#

Computes the latent prior p(y), the noise matrix P(s|y) and the inverse noise matrix P(y|s) from the confident_joint count(s, y). The confident_joint is estimated by compute_confident_joint, which counts confident examples.

Parameters
  • s (np.array) – A discrete vector of labels, s, which may contain mislabeling. “s” denotes the noisy label instead of tilde(y), for ASCII reasons.

  • confident_joint (np.array (shape (K, K), type int)) – A K,K integer matrix of count(s=k, y=k). Estimates a confident subset of the joint distribution of the noisy and true labels P_{s,y}. Each entry in the matrix contains the number of examples confidently counted into every pair (s=j, y=k) classes.

  • py_method (str (Options: ["cnt", "eqn", "marginal", "marginal_ps"])) – How to compute the latent prior p(y=k). Default is "cnt" as it often works well even when the noise matrices are estimated poorly by using the matrix diagonals instead of all the probabilities.

  • converge_latent_estimates (bool) – If true, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.

Return type

A tuple containing (py, noise_matrix, inv_noise_matrix).
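
A hedged sketch of the typical two-step pipeline (s and psx are placeholders):

from cleanlab.latent_estimation import compute_confident_joint, estimate_latent

confident_joint = compute_confident_joint(s, psx)
py, noise_matrix, inv_noise_matrix = estimate_latent(confident_joint, s)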

cleanlab.latent_estimation.estimate_noise_matrices(X, s, clf=LogisticRegression(), cv_n_folds=5, thresholds=None, converge_latent_estimates=True, seed=None)[source]#

Estimates the noise_matrix of shape (K, K). This is the fraction of examples in every class, labeled as every other class. The noise_matrix is a conditional probability matrix for P(s=k_s|y=k_y).

Under certain conditions, estimates are exact, and in most conditions, estimates are within one percent of the actual noise rates.

Parameters
  • X (np.array) – Input feature matrix (N, D), 2D numpy array

  • s (np.array) – A discrete vector of labels, s, which may contain mislabeling. “s” denotes the noisy label instead of tilde(y), for ASCII reasons.

  • clf (sklearn.classifier or equivalent) – Default classifier used is logistic regression. Assumes clf has predict_proba() and fit() defined.

  • cv_n_folds (int) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.

  • thresholds (iterable (list or np.array) of shape (K, 1) or (K,)) – P(s^=k|s=k). If an example has a predicted probability “greater” than this threshold, it is counted as having hidden label y = k. This is not used for pruning, only for estimating the noise rates using confident counts. This value should be between 0 and 1. Default is None.

  • converge_latent_estimates (bool) – If true, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.

  • seed (int (default = None)) – Set the default state of the random number generator used to split the cross-validated folds. If None, uses np.random current random state.

Return type

A two-item tuple containing (noise_matrix, inv_noise_matrix).

cleanlab.latent_estimation.estimate_py_and_noise_matrices_from_probabilities(s, psx, thresholds=None, converge_latent_estimates=True, py_method='cnt', calibrate=True)[source]#

Computes the confident counts estimate of latent variables py and the noise rates using observed s and predicted probabilities psx.

Important! This function assumes that psx are out-of-sample holdout probabilities. This can be done with cross validation. If the probabilities are not computed out-of-sample, overfitting may occur.

This function estimates the noise_matrix of shape (K, K). This is the fraction of examples in every class, labeled as every other class. The noise_matrix is a conditional probability matrix for P(s=k_s|y=k_y).

Under certain conditions, estimates are exact, and in most conditions, estimates are within one percent of the actual noise rates.

Parameters
  • s (np.array) – A discrete vector of labels, s, which may contain mislabeling. “s” denotes the noisy label instead of tilde(y), for ASCII reasons.

  • psx (np.array (shape (N, K))) – P(label=k|x) is a matrix with K (noisy) probabilities for each of the N examples x. This is the probability distribution over all K classes, for each example, regarding whether the example has label s==k P(s=k|x). psx should have been computed using 3 (or higher) fold cross-validation.

  • thresholds (iterable (list or np.array) of shape (K, 1) or (K,)) – P(s^=k|s=k). If an example has a predicted probability “greater” than this threshold, it is counted as having hidden label y = k. This is not used for pruning, only for estimating the noise rates using confident counts. This value should be between 0 and 1. Default is None.

  • converge_latent_estimates (bool) – If true, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.

  • py_method (str (Options: ["cnt", "eqn", "marginal", "marginal_ps"])) – How to compute the latent prior p(y=k). Default is "cnt" as it often works well even when the noise matrices are estimated poorly by using the matrix diagonals instead of all the probabilities.

  • calibrate (bool (default: True)) – Calibrates confident joint estimate P(s=i, y=j) such that np.sum(cj) == len(s) and np.sum(cj, axis = 1) == np.bincount(s).

Return type

py, noise_matrix, inverse_noise_matrix

cleanlab.latent_estimation.estimate_py_noise_matrices_and_cv_pred_proba(X, s, clf=LogisticRegression(), cv_n_folds=5, thresholds=None, converge_latent_estimates=False, py_method='cnt', seed=None)[source]#

This function computes the out-of-sample predicted probability P(s=k|x) for every example x in X using cross validation while also computing the confident counts noise rates within each cross-validated subset and returning the average noise rate across all examples.

This function estimates the noise_matrix of shape (K, K). This is the fraction of examples in every class, labeled as every other class. The noise_matrix is a conditional probability matrix for P(s=k_s|y=k_y).

Under certain conditions, estimates are exact, and in most conditions, estimates are within one percent of the actual noise rates.

Parameters
  • X (np.array) – Input feature matrix (N, D), 2D numpy array

  • s (np.array) – A discrete vector of labels, s, which may contain mislabeling. “s” denotes the noisy label instead of tilde(y), for ASCII reasons.

  • clf (sklearn.classifier or equivalent) – Default classifier used is logistic regression. Assumes clf has predict_proba() and fit() defined.

  • cv_n_folds (int) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.

  • thresholds (iterable (list or np.array) of shape (K, 1) or (K,)) – P(s^=k|s=k). If an example has a predicted probability “greater” than this threshold, it is counted as having hidden label y = k. This is not used for pruning, only for estimating the noise rates using confident counts. This value should be between 0 and 1. Default is None.

  • converge_latent_estimates (bool) – If true, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.

  • py_method (str (Options: ["cnt", "eqn", "marginal", "marginal_ps"])) – How to compute the latent prior p(y=k). Default is "cnt" as it often works well even when the noise matrices are estimated poorly by using the matrix diagonals instead of all the probabilities.

  • seed (int (default = None)) – Set the default state of the random number generator used to split the cross-validated folds. If None, uses np.random current random state.

Returns

Returns a tuple of five numpy array matrices in the form (py, noise_matrix, inverse_noise_matrix, joint count matrix i.e. confident joint, predicted probability matrix).

cleanlab.latent_estimation.num_label_errors(labels, psx, confident_joint=None)[source]#

Estimates the number of label errors in labels.

Parameters
  • labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous.

  • psx (np.array (shape (N, K))) – P(label=k|x) is a matrix with K (noisy) probabilities for each of the N examples x. This is the probability distribution over all K classes, for each example, regarding whether the example has label s==k P(s=k|x). psx should have been computed using 3 (or higher) fold cross-validation.

  • confident_joint (np.array (shape (K, K), type int)) – A K,K integer matrix of count(s=k, y=k). Estimates a confident subset of the joint distribution of the noisy and true labels P_{s,y}. Each entry in the matrix contains the number of examples confidently counted into every pair (s=j, y=k) classes.

Return type

An integer estimating the number of label errors.
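
A hedged usage sketch (labels and psx are placeholders):

from cleanlab.latent_estimation import num_label_errors

n_errors = num_label_errors(labels, psx)
print('Estimated number of label errors:', n_errors)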

Noise Generation#

Functions:

generate_n_rand_probabilities_that_sum_to_m(n, m)

When min_prob=0 and max_prob = 1.0, this method is deprecated.

generate_noise_matrix(K[, max_noise_rate, ...])

DEPRECATED - Use generate_noise_matrix_from_trace()

generate_noise_matrix_from_trace(K, trace[, ...])

Generates a K x K noise matrix P(s=k_s|y=k_y) with trace as the np.mean(np.diagonal(noise_matrix)).

generate_noisy_labels(y, noise_matrix[, verbose])

Generates noisy labels s (shape (N, 1)) from perfect labels y, 'exactly' yielding the provided noise_matrix between s and y.

noise_matrix_is_valid(noise_matrix, py[, ...])

Given a prior py = p(y=k), returns true if the given noise_matrix is a learnable matrix.

randomly_distribute_N_balls_into_K_bins(N, K)

Returns a uniformly random numpy integer array of length K that sums to N.

cleanlab.noise_generation.generate_n_rand_probabilities_that_sum_to_m(n, m, max_prob=1.0, min_prob=0.0)[source]#

When min_prob=0 and max_prob = 1.0, this method is deprecated. Instead use np.random.dirichlet(np.ones(n))*m

Generates ‘n’ random probabilities that sum to ‘m’.

Parameters
  • n (int) – Length of np.array of random probabilities to be returned.

  • m (float) – Sum of np.array of random probabilities that is returned.

  • max_prob (float (0.0, 1.0] | Default value is 1.0) – Maximum probability of any entry in the returned np.array.

  • min_prob (float [0.0, 1.0) | Default value is 0.0) – Minimum probability of any entry in the returned np.array.

cleanlab.noise_generation.generate_noise_matrix(K, max_noise_rate=1.0, frac_zero_noise_rates=0.0, verbose=False)[source]#

DEPRECATED - Use generate_noise_matrix_from_trace()

Generates a noise matrix by randomly assigning noise rates up to max_noise_rate, then setting noise rates to zero until P(s!=k|s=k) < 1 is satisfied. Additionally, frac_zero_noise_rates are set to zero.

Parameters
  • K (int) – Creates a noise matrix of shape (K, K). Implies there are K classes for learning with noisy labels.

  • max_noise_rate (float) – Smaller --> easier learning problem (less noise)

  • frac_zero_noise_rates (float) – Make the problem more tractable by making a fraction of noise rates zero. Larger --> easier learning problem

  • verbose (bool) – Print debugging output if set to True.

cleanlab.noise_generation.generate_noise_matrix_from_trace(K, trace, max_trace_prob=1.0, min_trace_prob=1e-05, max_noise_rate=0.99999, min_noise_rate=0.0, valid_noise_matrix=True, py=None, frac_zero_noise_rates=0.0, seed=0, max_iter=10000)[source]#

Generates a K x K noise matrix P(s=k_s|y=k_y) with trace as the np.mean(np.diagonal(noise_matrix)).

Parameters
  • K (int) – Creates a noise matrix of shape (K, K). Implies there are K classes for learning with noisy labels.

  • trace (float (0.0, 1.0]) – Sum of diagonal entries of np.array of random probabilities returned.

  • max_trace_prob (float (0.0, 1.0]) – Maximum probability of any entry in the trace of the return matrix.

  • min_trace_prob (float [0.0, 1.0)) – Minimum probability of any entry in the trace of the return matrix.

  • max_noise_rate (float (0.0, 1.0]) – Maximum noise_rate (non-diagonal entry) in the returned np.array.

  • min_noise_rate (float [0.0, 1.0)) – Minimum noise_rate (non-diagonal entry) in the returned np.array.

  • valid_noise_matrix (bool) – If True, returns a matrix having all necessary conditions for learning with noisy labels. In particular, p(y=k)p(s=k) < p(y=k,s=k) is satisfied. This requires that Trace > 1.

  • py (np.array (shape (K, 1))) – Fraction (prior probability) of each true/hidden class label, P(y = k). REQUIRED when valid_noise_matrix == True.

  • frac_zero_noise_rates (float) – The fraction of the n*(n-1) noise rates that will be set to 0. Note that if you set a high trace, it may be impossible to also have a low fraction of zero noise rates without forcing all non-“1” diagonal values. Instead, when this happens we only guarantee to produce a noise matrix with frac_zero_noise_rates or higher. The opposite occurs with a small trace.

  • seed (int) – Seeds the random number generator for numpy.

  • max_iter (int (default: 10000)) – The max number of tries to produce a valid matrix before returning False.

Returns

noise matrix P(s=k_s|y=k_y) with trace as the np.sum(np.diagonal(noise_matrix)). This a conditional probability matrix and a left stochastic matrix.

Return type

np.array (shape (K, K))
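
A hedged usage sketch, assuming trace here is the sum of the diagonal (consistent with the Returns description and the valid_noise_matrix note that Trace > 1 is required):

import numpy as np
from cleanlab.noise_generation import generate_noise_matrix_from_trace

K = 3
py = np.array([0.3, 0.3, 0.4])   # prior of the true classes, required when valid_noise_matrix=True
noise_matrix = generate_noise_matrix_from_trace(
    K, trace=2.0, py=py, valid_noise_matrix=True, seed=0)
# noise_matrix is a (K, K) left stochastic matrix P(s=k_s|y=k_y); with trace=2.0
# and K=3, roughly two thirds of labels stay correct on average.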

cleanlab.noise_generation.generate_noisy_labels(y, noise_matrix, verbose=False)[source]#

Generates noisy labels s (shape (N, 1)) from perfect labels y, ‘exactly’ yielding the provided noise_matrix between s and y.

Below we provide a for loop implementation of what this function does. We do not use this implementation as it is not a fast algorithm, but it explains as Python pseudocode what is happening in this function.

Parameters
  • y (np.array (shape (N, 1))) – Perfect labels, without any noise. Contains K distinct natural number classes, e.g. 0, 1,…, K-1

  • noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probablity matrix of the form P(s=k_s|y=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

Examples

# Generate s
count_joint = (noise_matrix * py * len(y)).round().astype(int)
s = np.array(y)
for k_s in range(K):
    for k_y in range(K):
        if k_s != k_y:
            idx_flip = np.where((s==k_y)&(y==k_y))[0]
            if len(idx_flip): # pragma: no cover
                s[np.random.choice(
                    idx_flip,
                    count_joint[k_s][k_y],
                    replace=False,
                )] = k_s
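
A hedged usage sketch of the function itself (y and noise_matrix are placeholders, e.g. produced by generate_noise_matrix_from_trace above):

from cleanlab.noise_generation import generate_noisy_labels

s = generate_noisy_labels(y, noise_matrix)  # noisy labels flipped according to P(s=k_s|y=k_y)
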
cleanlab.noise_generation.noise_matrix_is_valid(noise_matrix, py, verbose=False)[source]#

Given a prior py = p(y=k), returns true if the given noise_matrix is a learnable matrix. Learnability means that it is possible to achieve better than random performance, on average, for the amount of noise in noise_matrix.

cleanlab.noise_generation.randomly_distribute_N_balls_into_K_bins(N, K, max_balls_per_bin=None, min_balls_per_bin=None)[source]#

Returns a uniformly random numpy integer array of length K that sums to N.

Parameters
  • N (int) –

  • K (int) –

  • max_balls_per_bin (int) –

  • min_balls_per_bin (int) –

Baseline Methods#

Functions:

baseline_argmax(psx, s)

This is the simplest baseline approach.

baseline_argmax_calibrated_confusion_matrix(psx, s)

The docstring is the same as baseline_argmax_confusion_matrix, except in this method we calibrate the confident joint created using the confusion matrix before using cleanlab to find the label errors.

baseline_argmax_confusion_matrix(psx, s[, ...])

This is a baseline approach.

cleanlab.baseline_methods.baseline_argmax(psx, s)[source]#

This is the simplest baseline approach. Just consider anywhere argmax != s as a label error.

Parameters
  • s (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous.

  • psx (np.array (shape (N, K))) – P(label=k|x) is a matrix with K (noisy) probabilities for each of the N examples x. This is the probability distribution over all K classes, for each example, regarding whether the example has label s==k P(s=k|x). psx should have been computed using 3 (or higher) fold cross-validation.

Returns

A boolean mask that is true if the example at that index is a label error.
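
This baseline reduces to a one-line numpy check; a hedged equivalent sketch (s and psx are placeholders):

import numpy as np

# An example is flagged whenever its most likely predicted class disagrees
# with its given label -- the same rule baseline_argmax applies.
label_error_mask = np.argmax(psx, axis=1) != s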

cleanlab.baseline_methods.baseline_argmax_calibrated_confusion_matrix(psx, s, prune_method='prune_by_noise_rate')[source]#

The docstring is the same as baseline_argmax_confusion_matrix, except in this method we calibrate the confident joint created using the confusion matrix before using cleanlab to find the label errors.

cleanlab.baseline_methods.baseline_argmax_confusion_matrix(psx, s, calibrate=False, prune_method='prune_by_noise_rate')[source]#

This is a baseline approach that uses the confusion matrix of argmax(psx) and s as the confident joint, and then uses cleanlab (confident learning) to find the label errors using this matrix.

Parameters
  • s (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous.

  • psx (np.array (shape (N, K))) – P(label=k|x) is a matrix with K (noisy) probabilities for each of the N examples x. This is the probability distribution over all K classes, for each example, regarding whether the example has label s==k P(s=k|x). psx should have been computed using 3 (or higher) fold cross-validation.

Returns

A boolean mask that is true if the example at that index is a label error.

Co-Teaching#

Functions:

adjust_learning_rate(optimizer, epoch, ...)

Scheduler to adjust learning rate and betas for Adam Optimizer

evaluate(test_loader, model1, model2)

forget_rate_scheduler(epochs, forget_rate, ...)

Tells Co-Teaching what fraction of examples to forget at each epoch.

initialize_lr_scheduler([lr, epochs, ...])

Scheduler to adjust learning rate and betas for Adam Optimizer

loss_coteaching(y_1, y_2, t, forget_rate[, ...])

Co-Teaching Loss function.

train(train_loader, epoch, model1, ...)

PyTorch training function.

cleanlab.coteaching.adjust_learning_rate(optimizer, epoch, alpha_plan, beta1_plan)[source]#

Scheduler to adjust learning rate and betas for Adam Optimizer

cleanlab.coteaching.evaluate(test_loader, model1, model2)[source]#
cleanlab.coteaching.forget_rate_scheduler(epochs, forget_rate, num_gradual, exponent)[source]#

Tells Co-Teaching what fraction of examples to forget at each epoch.
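
The exact schedule is not spelled out here; below is a hedged sketch of the common Co-Teaching-style ramp (the helper name and formula are illustrative assumptions, not necessarily this module's implementation):

import numpy as np

def example_forget_rate_schedule(epochs, forget_rate, num_gradual, exponent):
    # Illustrative assumption: ramp the forget rate from 0 up to forget_rate
    # over the first num_gradual epochs, then hold it constant.
    schedule = np.ones(epochs) * forget_rate
    schedule[:num_gradual] = np.linspace(0, forget_rate ** exponent, num_gradual)
    return schedule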

cleanlab.coteaching.initialize_lr_scheduler(lr=0.001, epochs=250, epoch_decay_start=80)[source]#

Scheduler to adjust learning rate and betas for Adam Optimizer

cleanlab.coteaching.loss_coteaching(y_1, y_2, t, forget_rate, class_weights=None)[source]#

Co-Teaching Loss function.

Parameters
  • y_1 (Tensor array) – Output logits from model 1

  • y_2 (Tensor array) – Output logits from model 2

  • t (np.array) – List of Noisy Labels (t means targets)

  • forget_rate (float) – Decimal between 0 and 1 for how quickly the models forget what they learn. Just use rate_schedule[epoch] for this value

  • class_weights (Tensor array, shape (number of classes x 1), Default: None) – A torch.Tensor of length equal to the number of classes, containing per-class weights.

cleanlab.coteaching.train(train_loader, epoch, model1, optimizer1, model2, optimizer2, args, forget_rate_schedule, class_weights, accuracy)[source]#

PyTorch training function.

Parameters
  • train_loader (torch.utils.data.DataLoader) –

  • epoch (int) –

  • model1 (PyTorch class inheriting nn.Module) – Must define __init__ and forward(self, x,)

  • optimizer1 (PyTorch torch.optim.Adam) –

  • model2 (PyTorch class inheriting nn.Module) – Must define __init__ and forward(self, x,)

  • optimizer2 (PyTorch torch.optim.Adam) –

  • args (parser.parse_args() object) – Must contain num_iter_per_epoch, print_freq, and epochs

  • forget_rate_schedule (np.array of length number of epochs) – Tells Co-Teaching loss what fraction of examples to forget about.

  • class_weights (Tensor array, shape (number of classes x 1), Default: None) – A torch.Tensor of length equal to the number of classes, containing per-class weights.

  • accuracy (function) – A function of the form accuracy(output, target, topk=(1,)) for computing top1 and top5 accuracy given output and true targets.

Latent Algebra#

Functions:

compute_inv_noise_matrix(py, noise_matrix[, ps])

Compute the inverse noise matrix if py := P(y=k) is given.

compute_noise_matrix_from_inverse(ps, ...[, py])

Compute the noise matrix P(s=k_s|y=k_y).

compute_ps_py_inv_noise_matrix(s, noise_matrix)

Compute ps := P(s=k), py := P(y=k), and the inverse noise matrix.

compute_py(ps, noise_matrix, ...[, ...])

Compute py := P(y=k) from ps := P(s=k), noise_matrix, and the inverse noise matrix.

compute_py_inv_noise_matrix(ps, noise_matrix)

Compute py := P(y=k), and the inverse noise matrix.

compute_pyx(psx, noise_matrix, ...)

Compute pyx := P(y=k|x) from psx := P(s=k|x), and the noise_matrix and inverse noise matrix.

cleanlab.latent_algebra.compute_inv_noise_matrix(py, noise_matrix, ps=None)[source]#

Compute the inverse noise matrix if py := P(y=k) is given.

Parameters
  • py (np.array (shape (K, 1))) – The fraction (prior probability) of each TRUE class label, P(y = k)

  • noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(s=k_s|y=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

  • ps (np.array (shape (K, 1))) – The fraction (prior probability) of each NOISY given label, P(s = k). ps is easily computable from py and should only be provided if it has already been precomputed, to increase code efficiency.

Examples

For loop based implementation:

# Number of classes
K = len(py)

# 'ps' is p(s=k) = noise_matrix * p(y=k)
# because in *vector computation*: P(s=k|y=k) * p(y=k) = P(s=k)
if ps is None:
    ps = noise_matrix.dot(py)

# Estimate the (K, K) inverse noise matrix P(y = k_y | s = k_s)
inverse_noise_matrix = np.empty(shape=(K,K))
# k_s is the class value k of noisy label s
for k_s in range(K):
    # k_y is the (guessed) class value k of true label y
    for k_y in range(K):
        # P(y|s) = P(s|y) * P(y) / P(s)
        inverse_noise_matrix[k_y][k_s] = noise_matrix[k_s][k_y] * py[k_y] / ps[k_s]
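
The same computation can be vectorized with numpy broadcasting; a hedged sketch, continuing with the py, noise_matrix, and ps from the loop above and assuming they are 1-D/2-D numpy arrays with py and ps of shape (K,):

import numpy as np

# inverse_noise_matrix[k_y, k_s] = noise_matrix[k_s, k_y] * py[k_y] / ps[k_s]
if ps is None:
    ps = noise_matrix.dot(py)
inverse_noise_matrix = noise_matrix.T * py[:, None] / ps[None, :]
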
cleanlab.latent_algebra.compute_noise_matrix_from_inverse(ps, inverse_noise_matrix, py=None)[source]#

Compute the noise matrix P(s=k_s|y=k_y).

Parameters
  • py (np.array (shape (K, 1))) – The fraction (prior probability) of each TRUE class label, P(y = k)

  • inverse_noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(y=k_y|s=k_s) representing the estimated fraction observed examples in each class k_s, that are mislabeled examples from every other class k_y. If None, the inverse_noise_matrix will be computed from psx and s. Assumes columns of inverse_noise_matrix sum to 1.

  • ps (np.array (shape (K, 1))) – The fraction (prior probability) of each observed NOISY label, P(s = k). ps is easily computable from py and should only be provided if it has already been precomputed, to increase code efficiency.

Returns

noise_matrix – A conditional probability matrix of the form P(s=k_s|y=k_y) containing the fraction of examples in every class, labeled as every other class. Columns of noise_matrix sum to 1.

Return type

np.array of shape (K, K), K = number of classes

Examples

For loop based implementation:

# Number of classes s
K = len(ps)

# 'py' is p(y=k) = inverse_noise_matrix * p(s=k)
# because in *vector computation*: P(y=k|s=k) * p(s=k) = P(y=k)
if py is None:
    py = inverse_noise_matrix.dot(ps)

# Estimate the (K, K) noise matrix P(s = k_s | y = k_y)
noise_matrix = np.empty(shape=(K,K))
# k_s is the class value k of noisy label s
for k_s in range(K):
    # k_y is the (guessed) class value k of true label y
    for k_y in range(K):
        # P(s|y) = P(y|s) * P(s) / P(y)
        noise_matrix[k_s][k_y] = inverse_noise_matrix[k_y][k_s] * ps[k_s] / py[k_y]
cleanlab.latent_algebra.compute_ps_py_inv_noise_matrix(s, noise_matrix)[source]#

Compute ps := P(s=k), py := P(y=k), and the inverse noise matrix.

Parameters
  • s (np.array) – A discrete vector of labels, s, which may contain mislabeling. “s” denotes the noisy label instead of tilde(y), for ASCII reasons.

  • noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(s=k_s|y=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

cleanlab.latent_algebra.compute_py(ps, noise_matrix, inverse_noise_matrix, py_method='cnt', y_count=None)[source]#

Compute py := P(y=k) from ps := P(s=k), noise_matrix, and the inverse noise matrix.

This method is ROBUST when py_method = 'cnt'. It may work well even when the noise matrices are estimated poorly by using the diagonals of the matrices instead of all the probabilities in the entire matrix.

Parameters
  • ps (np.array (shape (K, ) or (1, K))) – The fraction (prior probability) of each observed, noisy label, P(s = k)

  • noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(s=k_s|y=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

  • inverse_noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(y=k_y|s=k_s) representing the estimated fraction observed examples in each class k_s, that are mislabeled examples from every other class k_y. If None, the inverse_noise_matrix will be computed from psx and s. Assumes columns of inverse_noise_matrix sum to 1.

  • py_method (str (Options: ["cnt", "eqn", "marginal", "marginal_ps"])) – How to compute the latent prior p(y=k). Default is "cnt" as it often works well even when the noise matrices are estimated poorly by using the matrix diagonals instead of all the probabilities.

  • y_count (np.array (shape (K, ) or (1, K))) – The marginal counts of the confident joint (like cj.sum(axis = 0))

Returns

py – The fraction (prior probability) of each TRUE class label, P(y = k).

Return type

np.array (shape (K, ) or (1, K))

cleanlab.latent_algebra.compute_py_inv_noise_matrix(ps, noise_matrix)[source]#

Compute py := P(y=k), and the inverse noise matrix.

Parameters
  • ps (np.array (shape (K, ) or (1, K))) – The fraction (prior probability) of each observed, NOISY class P(s = k).

  • noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(s=k_s|y=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

cleanlab.latent_algebra.compute_pyx(psx, noise_matrix, inverse_noise_matrix)[source]#

Compute pyx := P(y=k|x) from psx := P(s=k|x), and the noise_matrix and inverse noise matrix.

This method is ROBUST - meaning it works well even when the noise matrices are estimated poorly by only using the diagonals of the matrices which tend to be easy to estimate correctly.

Parameters
  • psx (np.array (shape (N, K))) – P(label=k|x) is a matrix with K (noisy) probabilities for each of the N examples x. This is the probability distribution over all K classes, for each example, regarding whether the example has label s==k P(s=k|x). psx should have been computed using 3 (or higher) fold cross-validation.

  • noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(s=k_s|y=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

  • inverse_noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(y=k_y|s=k_s) representing the estimated fraction observed examples in each class k_s, that are mislabeled examples from every other class k_y. If None, the inverse_noise_matrix will be computed from psx and s. Assumes columns of inverse_noise_matrix sum to 1.

Returns

pyx – P(y=k|x) is a matrix with K probabilities for all N examples x.

Return type

np.array (shape (N, K))

Pruning#

Functions:

get_noise_indices(s, psx[, ...])

Returns the indices of most likely (confident) label errors in s.

keep_at_least_n_per_class(prune_count_matrix, n)

Make sure every class has at least n examples after removing noise.

multiclass_crossval_predict(pyx, labels)

Returns a numpy 2D array of one-hot encoded multiclass predictions.

order_label_errors(label_errors_bool, psx, ...)

Sorts label errors by normalized margin.

reduce_prune_counts(prune_count_matrix[, ...])

Reduce (multiply) all prune counts (non-diagonal) by frac_noise and increase diagonal by the total amount reduced in each column to preserve column counts.

cleanlab.pruning.get_noise_indices(s, psx, inverse_noise_matrix=None, confident_joint=None, frac_noise=1.0, num_to_remove_per_class=None, prune_method='prune_by_noise_rate', sorted_index_method=None, multi_label=False, n_jobs=None, verbose=0)[source]#

Returns the indices of most likely (confident) label errors in s. The number of indices returned is specified by frac_noise. When frac_noise = 1.0, all “confident” estimated noise indices are returned. * If you encounter the error ‘psx is not defined’, try setting n_jobs = 1.

Parameters
  • s (np.array) – A discrete vector of labels, s, which may contain mislabeling. “s” denotes the noisy label instead of tilde(y), for ASCII encoding reasons.

  • psx (np.array (shape (N, K))) – P(s=k|x) is a matrix with K (noisy) probabilities for each of the N examples x. This is the probability distribution over all K classes, for each example, regarding whether the example has label s==k P(s=k|x). psx should have been computed using 3+ fold cross-validation.

  • inverse_noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(y=k_y|s=k_s) representing the estimated fraction observed examples in each class k_s, that are mislabeled examples from every other class k_y. If None, the inverse_noise_matrix will be computed from psx and s. Assumes columns of inverse_noise_matrix sum to 1.

  • confident_joint (np.array (shape (K, K), type int) (default: None)) – A K,K integer matrix of count(s=k, y=k). Estimates a confident subset of the joint distribution of the noisy and true labels, P_{s,y}. Each entry in the matrix contains the number of examples confidently counted into every pair of (s=j, y=k) classes.

  • frac_noise (float) – When frac_noise = 1.0, return all “confident” estimated noise indices. Value in range (0, 1] that determines the fraction of noisy example indices to return, based on the following formula for each class k: frac_noise * number_of_mislabeled_examples_in_class_k, or equivalently frac_noise * inverse_noise_rate_class_k * num_examples_with_s_equal_k

  • num_to_remove_per_class (list of int of length K (# of classes)) –

    e.g. if K = 3, num_to_remove_per_class = [5, 0, 1] would return the indices of the 5 most likely mislabeled examples in class s = 0, and the single most likely mislabeled example in class s = 2.

    Note

    Only set this parameter if prune_method == 'prune_by_class'. You may also use it with prune_method == 'prune_by_noise_rate', but in that case, if num_to_remove_per_class == k, then either k-1, k, or k+1 examples may be removed for any class. This is because noise rates are floats, and rounding may cause an off-by-one. If you need exactly k examples removed from every class, use 'prune_by_class'.

  • prune_method (str (default: 'prune_by_noise_rate')) – Possible values: ‘prune_by_class’, ‘prune_by_noise_rate’, or ‘both’. Method used for pruning. 1. ‘prune_by_noise_rate’: removes examples with a high probability of being mislabeled for every non-diagonal entry in the prune_counts_matrix (see pruning.py). 2. ‘prune_by_class’: removes the examples with the smallest probability of belonging to their given class label, for every class. 3. ‘both’: finds the examples satisfying both (1) and (2) and removes their intersection.

  • sorted_index_method ({None, 'prob_given_label', 'normalized_margin'}) – If None, returns a boolean mask (True if the example at that index is a label error). If not None, returns an array of the label error indices (instead of a boolean mask), ordered by either: 'normalized_margin' := normalized margin (p(s = k) - max(p(s != k))), or 'prob_given_label' := [psx[i][labels[i]] for i in label_errors_idx].

  • multi_label (bool) – If true, s should be an iterable (e.g. list) of iterables, containing a list of labels for each example, instead of just a single label.

  • n_jobs (int (Windows users may see a speed-up with n_jobs = 1)) – Number of processing threads used by multiprocessing. The default, None, uses the number of processing threads on your CPU. Set this to 1 to disable parallel processing (if it is causing issues).

  • verbose (int) – If 0, no print statements. If 1, prints when multiprocessing happens.
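A minimal usage sketch; X_train and s are hypothetical placeholders for your features and possibly-noisy labels, and psx must be computed out of sample (e.g. via cross-validation):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from cleanlab.pruning import get_noise_indices

    # X_train and s are hypothetical: features and noisy labels 0, 1, ..., K-1.
    # Out-of-sample predicted probabilities P(s=k|x) via 5-fold cross-validation.
    psx = cross_val_predict(LogisticRegression(), X_train, s,
                            cv=5, method='predict_proba')

    # Boolean mask: True where an example is a likely label error.
    label_errors_mask = get_noise_indices(s=s, psx=psx,
                                          prune_method='prune_by_noise_rate')

    # Alternatively, get error indices ordered by normalized margin.
    ordered_error_idx = get_noise_indices(s=s, psx=psx,
                                          sorted_index_method='normalized_margin')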

cleanlab.pruning.keep_at_least_n_per_class(prune_count_matrix, n, frac_noise=1.0)[source]#

Make sure every class has at least n examples after removing noise. Functionally, for each column, this increases the diagonal term #(y=k,s=k) of prune_count_matrix until it is at least n, distributing the amount of the increase by subtracting uniformly from the rest of the terms in the column. When frac_noise = 1.0, return all “confidently” estimated noise indices; otherwise return the frac_noise fraction of all the noise counts, with diagonal terms adjusted to ensure column totals are preserved.

Parameters
  • prune_count_matrix (np.array of shape (K, K), K = number of classes) – Counts of mislabeled examples in every class. NOTE: for this function, prune_count_matrix is transposed relative to confident_joint.

  • n (int) – Number of examples to make sure are left in each class.

  • frac_noise (float) – When frac_of_noise = 1.0, return all estimated noise indices. Value in range (0, 1] that determines the fraction of noisy example indices to return based on the following formula for example class k. frac_of_noise * number_of_mislabeled_examples_in_class_k, or frac_of_noise * inverse_noise_rate_class_k * num_examples_s_equal_k

Returns

prune_count_matrix – Number of examples to remove from each class, for every other class.

Return type

np.array of shape (K, K), K = number of classes
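A small sketch of the intended effect; the 3-class prune_count_matrix below is made up:

    import numpy as np
    from cleanlab.pruning import keep_at_least_n_per_class

    # Hypothetical 3-class prune count matrix (transposed relative to confident_joint).
    prune_count_matrix = np.array([[60, 10,  5],
                                   [ 8, 50,  7],
                                   [ 6,  9, 55]])

    # Raise any diagonal entry below 55 up to 55, subtracting the difference
    # uniformly from the off-diagonal entries in that column, so at least 55
    # examples remain in every class after pruning.
    adjusted = keep_at_least_n_per_class(prune_count_matrix, n=55)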

cleanlab.pruning.multiclass_crossval_predict(pyx, labels)[source]#

Returns a numpy 2D array of one-hot encoded multiclass predictions. Each row in the array provides the predictions for a particular example. The boundary condition used to threshold predictions is computed by maximizing the F1 ROC curve.

Parameters
  • pyx (np.array (shape (N, K))) – P(label=k|x) is an N x K matrix with K probabilities for each of the N examples. This is the probability distribution over all K classes for each example. pyx should have been computed out of sample (holdout or cross-validation).

  • labels (list of lists (length N)) – These are multiclass labels. Each inner list contains all the labels for that example.

cleanlab.pruning.order_label_errors(label_errors_bool, psx, labels, sorted_index_method='normalized_margin')[source]#

Sorts label errors by normalized margin. See https://arxiv.org/pdf/1810.05369.pdf (eqn 2.2), e.g. normalized_margin = prob_label - max_prob_not_label

Parameters
  • label_errors_bool (np.array (bool)) – Contains True if the index of labels is an error, otherwise False.

  • psx (np.array (shape (N, K))) – P(s=k|x) is a matrix with K probabilities for each of the N examples x. This is the probability distribution over all K classes, for each example, regarding whether the example has label s==k, i.e. P(s=k|x). psx should be computed using 3 (or higher) fold cross-validation.

  • labels (np.array) – A vector of labels, which may contain label errors.

  • sorted_index_method (str, {'normalized_margin', 'prob_given_label'}) –

    Method to order label error indices (instead of a bool mask), either:

    ’normalized_margin’ := normalized margin (p(s = k) - max(p(s != k))) ‘prob_given_label’ := [psx[i][labels[i]] for i in label_errors_idx]

Returns

label_errors_idx – Return the index integers of the label errors, ordered by the normalized margin.

Return type

np.array (int)
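A minimal usage sketch; the arrays below are hypothetical, and in practice label_errors_bool comes from get_noise_indices and psx from cross-validation:

    import numpy as np
    from cleanlab.pruning import order_label_errors

    psx = np.array([[0.1, 0.9],
                    [0.8, 0.2],
                    [0.6, 0.4],
                    [0.3, 0.7]])                      # out-of-sample P(s=k|x)
    labels = np.array([0, 0, 1, 1])                   # given (possibly noisy) labels
    label_errors_bool = np.array([True, False, True, False])

    # Integer indices of the flagged errors, ordered by normalized margin.
    err_idx = order_label_errors(label_errors_bool, psx, labels,
                                 sorted_index_method='normalized_margin')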

cleanlab.pruning.reduce_prune_counts(prune_count_matrix, frac_noise=1.0)[source]#

Reduce (multiply) all prune counts (non-diagonal) by frac_noise and increase diagonal by the total amount reduced in each column to preserve column counts.

Parameters
  • prune_count_matrix (np.array of shape (K, K), K = number of classes) – Counts of mislabeled examples in every class. For this function, it does not matter what the rows or columns are, but the diagonal terms reflect the number of correctly labeled examples.

  • frac_noise (float) – When frac_of_noise = 1.0, return all estimated noise indices. Value in range (0, 1] that determines the fraction of noisy example indices to return based on the following formula for example class k. frac_of_noise * number_of_mislabeled_examples_in_class_k, or frac_of_noise * inverse_noise_rate_class_k * num_examples_s_equal_k.
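A short sketch of the column-preserving arithmetic; the counts below are made up:

    import numpy as np
    from cleanlab.pruning import reduce_prune_counts

    prune_count_matrix = np.array([[40, 10,  4],
                                   [ 6, 30,  8],
                                   [ 4,  6, 48]])

    # Halve all off-diagonal prune counts; each diagonal entry grows by the
    # amount removed from its column, so the column totals are unchanged.
    reduced = reduce_prune_counts(prune_count_matrix, frac_noise=0.5)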

Utilities#

Classes:

VersionWarning(warning_str, ...)

Functor that calls _python_version_is_compatible and manages the state of the bool variable warning_already_issued to make sure the same warning is never displayed multiple times.

Functions:

assert_inputs_are_valid(X, s[, psx])

Checks that X, s, and psx are correctly formatted

clip_noise_rates(noise_matrix)

Clip all noise rates to proper range [0,1), but do not modify the diagonal terms because they are not noise rates.

clip_values(x[, low, high, new_sum])

Clip all values in p to range [low,high].

confusion_matrix(true, pred)

Implements a confusion matrix for true labels and predicted labels.

estimate_pu_f1(s, prob_s_eq_1)

Computes Claesen's estimate of f1 in the pulearning setting.

int2onehot(labels)

Convert list of lists to a onehot matrix for multi-labels

onehot2int(onehot_matrix)

Convert a onehot matrix for multi-labels to a list of lists of ints

print_inverse_noise_matrix(inverse_noise_matrix)

Pretty prints the inverse noise matrix.

print_joint_matrix(joint_matrix[, round_places])

Pretty prints the joint label noise matrix.

print_noise_matrix(noise_matrix[, round_places])

Pretty prints the noise matrix.

print_square_matrix(matrix[, left_name, ...])

Pretty prints a matrix.

remove_noise_from_class(noise_matrix, ...)

A helper function in the setting of PU learning.

round_preserving_row_totals(confident_joint)

Rounds confident_joint cj to type int while preserving the totals of each row.

round_preserving_sum(iterable)

Rounds an iterable of floats while retaining the original summed value.

value_counts(x)

Returns an np.array of shape (K, 1), with the value counts for every unique item in the labels list/array, where K is the number of unique entries in labels.

class cleanlab.util.VersionWarning(warning_str, list_of_compatible_versions)[source]#

Bases: object

Functor that calls _python_version_is_compatible and manages the state of the bool variable warning_already_issued to make sure the same warning is never displayed multiple times.

Methods:

is_compatible()[source]#
cleanlab.util.assert_inputs_are_valid(X, s, psx=None)[source]#

Checks that X, s, and psx are correctly formatted

cleanlab.util.clip_noise_rates(noise_matrix)[source]#

Clip all noise rates to proper range [0,1), but do not modify the diagonal terms because they are not noise rates.

ASSUMES noise_matrix columns sum to 1.

Parameters

noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix containing the fraction of examples in every class, labeled as every other class. Diagonal terms are not noise rates; they are the consistency rates P(s=k|y=k). Assumes columns of noise_matrix sum to 1.

cleanlab.util.clip_values(x, low=0.0, high=1.0, new_sum=None)[source]#

Clip all values in x to the range [low, high]. Preserves the sum of x.

Parameters
  • x (np.array) – An array / list of values to be clipped.

  • low (float) – values in x less than ‘low’ are clipped to this value

  • high (float) – values in x greater than ‘high’ are clipped to this value

  • new_sum (float) – normalizes x after clipping to sum to new_sum

Returns

x – A list of clipped values, summing to the same sum as x (or to new_sum, if new_sum is provided).

Return type

np.array
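A minimal usage sketch:

    import numpy as np
    from cleanlab.util import clip_values

    x = np.array([-0.2, 0.3, 0.7, 1.4])
    # Clip every value into [0, 1], then rescale so the result sums to 1.
    clipped = clip_values(x, low=0.0, high=1.0, new_sum=1.0)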

cleanlab.util.confusion_matrix(true, pred)[source]#

Implements a confusion matrix for true labels and predicted labels. true and pred MUST BE the same length and have the same distinct set of class labels represented.

Results are identical (and similar computation time) to:

“sklearn.metrics.confusion_matrix”

However, this function avoids the dependency on sklearn.

Parameters
  • true (np.array 1d) – Contains the true labels. Assumes true and pred contain the same distinct set of labels.

  • pred (np.array 1d) – Contains the predicted labels. Assumes true and pred contain the same distinct set of labels.

Returns

confusion_matrix – matrix of confusion counts with true on rows and pred on columns.

Return type

np.array (2D)
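A minimal usage sketch:

    import numpy as np
    from cleanlab.util import confusion_matrix

    true = np.array([0, 0, 1, 1, 2, 2])
    pred = np.array([0, 1, 1, 1, 2, 0])

    # 3 x 3 matrix of counts, with true labels on rows and predictions on columns.
    cm = confusion_matrix(true, pred)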

cleanlab.util.estimate_pu_f1(s, prob_s_eq_1)[source]#

Computes Claesen’s estimate of f1 in the pulearning setting.

Parameters
  • s (iterable (list or np.array)) – Binary label (whether each element is labeled or not) in pu learning.

  • prob_s_eq_1 (iterable (list or np.array)) – The probability, for each example, whether it is s==1 P(s==1|x)

Returns

Claesen’s estimate of f1 in the pulearning setting.

Return type

float

cleanlab.util.int2onehot(labels)[source]#

Convert list of lists to a onehot matrix for multi-labels

Parameters

labels (list of lists of integers) – e.g. [[0,1], [3], [1,2,3], [1], [2]] All integers from 0,1,…,K-1 must be represented.

cleanlab.util.onehot2int(onehot_matrix)[source]#

Convert a onehot matrix for multi-labels to a list of lists of ints

Parameters

onehot_matrix (2D np.array of 0s and 1s) – A one hot encoded matrix representation of multi-labels.

Returns

labels – e.g. [[0,1], [3], [1,2,3], [1], [2]] All integers from 0,1,…,K-1 must be represented.

Return type

list of lists of integers
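A minimal round-trip sketch for the two conversion helpers:

    from cleanlab.util import int2onehot, onehot2int

    multi_labels = [[0, 1], [3], [1, 2, 3], [1], [2]]  # every class 0..3 appears
    onehot = int2onehot(multi_labels)                  # (5, 4) matrix of 0s and 1s
    recovered = onehot2int(onehot)                     # back to a list of lists of ints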

cleanlab.util.print_inverse_noise_matrix(inverse_noise_matrix, round_places=2)[source]#

Pretty prints the inverse noise matrix.

cleanlab.util.print_joint_matrix(joint_matrix, round_places=2)[source]#

Pretty prints the joint label noise matrix.

cleanlab.util.print_noise_matrix(noise_matrix, round_places=2)[source]#

Pretty prints the noise matrix.

cleanlab.util.print_square_matrix(matrix, left_name='s', top_name='y', title=' A square matrix', short_title='s,y', round_places=2)[source]#

Pretty prints a matrix.

Parameters
  • matrix (np.array) – the matrix to be printed

  • left_name (str) – the name of the variable on the left of the matrix

  • top_name (str) – the name of the variable on the top of the matrix

  • title (str) – Prints this string above the printed square matrix.

  • short_title (str) – A short title (6 characters or less) like P(s|y) or P(s,y).

  • round_places (int) – Number of decimals to show for each matrix value.

cleanlab.util.remove_noise_from_class(noise_matrix, class_without_noise)[source]#

A helper function in the setting of PU learning. Sets all P(s=class_without_noise|y=any_other_class) = 0 in noise_matrix for the pulearning setting, where we have generalized the positive class in PU learning to be any class of your choosing, denoted by class_without_noise.

Parameters
  • noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(s=k_s|y=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

  • class_without_noise (int) – Integer value of the class that has no noise. Traditionally, this is 1 (positive) for PU learning.
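A minimal usage sketch; the 2-class noise matrix below is made up:

    import numpy as np
    from cleanlab.util import remove_noise_from_class

    noise_matrix = np.array([[0.9, 0.2],
                             [0.1, 0.8]])  # columns sum to 1: P(s=k_s | y=k_y)

    # Treat class 1 as the noise-free (positive) class, as in PU learning.
    adjusted = remove_noise_from_class(noise_matrix, class_without_noise=1)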

cleanlab.util.round_preserving_row_totals(confident_joint)[source]#

Rounds confident_joint cj to type int while preserving the totals of each row. Assumes that cj is a 2D np.array of type float.

Parameters

confident_joint (2D np.array of shape (K, K)) – See compute_confident_joint docstring for details.

Returns

confident_joint – Rounded to int while preserving row totals.

Return type

2D np.array of shape (K,K)

cleanlab.util.round_preserving_sum(iterable)[source]#

Rounds an iterable of floats while retaining the original summed value.

The while loop in this code was adapted from: https://github.com/cgdeboer/iteround

Parameters

iterable (list or np.array) – An iterable of floats

Returns

The iterable rounded to int, preserving sum.

Return type

list or np.array
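A minimal usage sketch:

    from cleanlab.util import round_preserving_sum

    floats = [1.25, 1.25, 1.25, 1.25]   # sums to 5.0
    # Naive element-wise rounding gives [1, 1, 1, 1] (sum 4); the returned
    # integers instead sum back to 5.
    ints = round_preserving_sum(floats)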

cleanlab.util.value_counts(x)[source]#

Returns an np.array of shape (K, 1), with the value counts for every unique item in the labels list/array, where K is the number of unique entries in labels.

Why does this matter? Here is a quick timing comparison:

x = [np.random.randint(0,100) for i in range(100000)]

%timeit np.bincount(x)

Result: 100 loops, best of 3: 3.9 ms per loop

%timeit np.unique(x, return_counts=True)[1]

Result: 100 loops, best of 3: 7.47 ms per loop

Parameters

x (list or np.array (one dimensional)) – A list of discrete objects, like lists or strings, for example, class labels ‘y’ when training a classifier. e.g. [“dog”,”dog”,”cat”] or [1,2,0,1,1,0,2]
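A minimal usage sketch:

    from cleanlab.util import value_counts

    value_counts([1, 2, 0, 1, 1, 0, 2])      # counts for each of the K = 3 labels
    value_counts(["dog", "dog", "cat"])      # also works for non-integer labels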

Polyplex#

Functions:

joint_bounds(py)

Computes three lists: noise_matrix_trace, joint_trace_min, and joint_trace_max. When plotted, the noise_matrix_trace values are the x-values, and the joint_trace_min and joint_trace_max values give the minimum and maximum y-value ranges.

joint_min_max(noise_matrix_trace, py)

Computes the min and max bounds on the trace(P_{s,y}), the trace of the joint distribution, given the trace of the noise matrix and p(y).

slope_intercept(point1, point2)

Returns the slope and intercept between point1 and point2.

cleanlab.polyplex.joint_bounds(py)[source]#

Computes three lists: noise_matrix_trace, joint_trace_min, and joint_trace_max. When plotted, the noise_matrix_trace values are the x-values, and the joint_trace_min and joint_trace_max values give the minimum and maximum y-value ranges. Together, these three lists fully characterize the polyplex.

Parameters

py (np.array (shape (K, 1))) – The fraction (prior probability) of each true, hidden class label, P(y = k)

Returns

A tuple of lists of floats (noise_matrix_trace, joint_trace_min, joint_trace_max), each of length K+1, where K = len(py). When plotted, the noise_matrix_trace values represent x-values and the joint_trace_min and joint_trace_max values represent the minimum and maximum y-value ranges. These three lists fully characterize the polyplex.
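A minimal usage sketch; the class prior py is made up, and passing a flat 1-d prior vector is an assumption here (the docstring states shape (K, 1)):

    import numpy as np
    from cleanlab.polyplex import joint_bounds

    py = np.array([0.1, 0.3, 0.6])  # hypothetical prior P(y = k), sums to 1

    # Three lists of length K + 1 = 4 that trace out the polyplex:
    # x-values and the min / max y-values of trace(P_{s,y}).
    noise_matrix_trace, joint_trace_min, joint_trace_max = joint_bounds(py)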

cleanlab.polyplex.joint_min_max(noise_matrix_trace, py)[source]#

Computes the min and max bounds on the trace(P_{s,y}), the trace of the joint distribution, given the trace of the noise matrix and p(y).

Parameters
  • noise_matrix_trace (float) – The sum of the diagonals of the noise matrix P(s = k’ | y = k)

  • py (np.array (shape (K, 1))) – The fraction (prior probability) of each true, hidden class label, P(y = k)

Return type

A tuple of two floats (y_min, y_max) representing the bounds on the trace of the joint.

cleanlab.polyplex.slope_intercept(point1, point2)[source]#

Returns the slope and intercept between point1 and point2.

Parameters
  • point1 (tuple) – e.g. (1.3, 4)

  • point2 (tuple) –

Return type

A tuple (slope, intercept)