count

Functions:

calibrate_confident_joint(confident_joint, ...)

Calibrates any confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis = 1) == np.bincount(labels).

compute_confident_joint(labels, pred_probs, *)

Estimates P(labels,y), the confident counts of the latent joint distribution of true and noisy labels using observed labels and predicted probabilities pred_probs.

converge_estimates(ps, py, noise_matrix, ...)

Computes py := P(true_label=k) and both noise_matrix and inverse_noise_matrix, by numerically converging ps := P(labels=k), py, and the noise matrices.

estimate_confident_joint_and_cv_pred_proba(X, ...)

Estimates P(labels,y), the confident counts of the latent joint distribution of true and noisy labels using observed labels and predicted probabilities pred_probs.

estimate_cv_predicted_probabilities(X, labels)

This function computes the out-of-sample predicted probability [P(label=k|x)] for every example in X using cross validation.

estimate_joint(labels[, pred_probs, ...])

Estimates the joint distribution of label noise P(label=i, true_label=j), guaranteed to sum to 1 and to satisfy np.sum(joint_estimate, axis = 1) == p(labels).

estimate_latent(confident_joint, labels, *)

Computes the latent prior p(y), the noise matrix P(labels|y) and the inverse noise matrix P(y|labels) from the confident_joint count(labels, y).

estimate_noise_matrices(X, labels[, clf, ...])

Estimates the noise_matrix of shape (K, K).

estimate_py_and_noise_matrices_from_probabilities(...)

Computes the confident counts estimate of latent variables py and the noise rates using observed labels and predicted probabilities, pred_probs.

estimate_py_noise_matrices_and_cv_pred_proba(X, ...)

This function computes the out-of-sample predicted probability P(label=k|x) for every example x in X using cross validation while also computing the confident counts noise rates within each cross-validated subset and returning the average noise rate across all examples.

num_label_issues(labels, pred_probs[, ...])

Estimates the number of label issues in labels.

cleanlab.count.calibrate_confident_joint(confident_joint, labels, *, multi_label=False)

Calibrates any confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis = 1) == np.bincount(labels).

In other words, this function forces the confident joint to have the true noisy prior p(labels) (summed over columns for each row) and also forces the confident joint to add up to the total number of examples.

This method makes the confident joint a valid counts estimate of the actual joint of noisy and true labels.

Parameters
  • confident_joint (np.array (shape (K, K))) – A (K, K) integer matrix of counts, count(label=j, true_label=k). Estimates a confident subset of the joint distribution of the noisy and true labels P_{labels,y}. Each entry holds the number of examples confidently counted into the corresponding (label=j, true_label=k) pair of classes.

  • labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for dataset with K classes, labels must be in {0,1,…,K-1}.

  • multi_label (bool) – If true, labels should be an iterable (e.g. list) of iterables, containing a list of labels for each example, instead of just a single label. The major difference from single-label calibration is that the total number of errors considered is based on the number of labels, not the number of examples, so the calibrated confident_joint will sum to the total number of labels. The multi-label setting supports classification tasks where an example has 1 or more labels. Example of a multi-labeled labels input: [[0,1], [1], [0,2], [0,1,2], [0], [1], …]

Returns

An np.array of shape (K, K) of type float representing a valid estimate of the joint COUNTS of noisy and true labels.
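The two constraints described above can be illustrated with a small NumPy sketch. This is not cleanlab's implementation: the helper name `calibrate_counts_sketch` is hypothetical, it skips any rounding back to integers, and it assumes every row of the confident joint has a nonzero sum.

```python
import numpy as np

def calibrate_counts_sketch(confident_joint, labels):
    """Hypothetical helper: rescale counts to match the noisy label counts."""
    cj = np.asarray(confident_joint, dtype=float)
    K = cj.shape[0]
    label_counts = np.bincount(labels, minlength=K)
    # Step 1: rescale each row so it sums to the number of examples with that noisy label.
    row_sums = cj.sum(axis=1, keepdims=True)  # assumed nonzero for every row
    calibrated = cj / row_sums * label_counts[:, None]
    # Step 2: rescale the whole matrix so it sums to the number of examples.
    calibrated = calibrated / calibrated.sum() * len(labels)
    return calibrated

labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
cj = np.array([[2, 0, 0],
               [0, 1, 1],
               [1, 0, 2]])
calibrated = calibrate_counts_sketch(cj, labels)
```

After the call, each row of `calibrated` sums to the matching entry of `np.bincount(labels)`, and the whole matrix sums to `len(labels)`.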

cleanlab.count.compute_confident_joint(labels, pred_probs, *, thresholds=None, calibrate=True, multi_label=False, return_indices_of_off_diagonals=False)

Estimates P(labels,y), the confident counts of the latent joint distribution of true and noisy labels using observed labels and predicted probabilities pred_probs.

This estimate is called the confident joint.

When calibrate = True, this method returns an estimate of the latent true joint counts of noisy and true labels.

Important! This function assumes that pred_probs are out-of-sample holdout probabilities. This can be done with cross validation. If the probabilities are not computed out-of-sample, overfitting may occur.

This function estimates the joint of shape (K, K). This is the confident counts of examples in every class, labeled as every other class.

Under certain conditions, estimates are exact, and in most conditions, the estimate is within 1 percent of the truth.

Parameters
  • labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for dataset with K classes, labels must be in {0,1,…,K-1}.

  • pred_probs (np.array (shape (N, K))) – P(label=k|x) is a matrix with K model-predicted probabilities. Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class. The columns must be ordered such that these probabilities correspond to class 0,1,2,… pred_probs should have been computed using 3 (or higher) fold cross-validation.

  • K (int (default: None)) – Number of unique classes. Calculated as len(np.unique(labels)) when K is None.

  • thresholds (iterable (list or np.array) of shape (K, 1) or (K,)) – P(label^=k|label=k). If an example has a predicted probability “greater” than this threshold, it is counted as having true_label = k. This is not used for filtering/pruning, only for estimating the noise rates using confident counts. This value should be between 0 and 1. Default is None.

  • calibrate (bool (default: True)) – Calibrates confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis = 1) == np.bincount(labels).

  • multi_label (bool) – If true, labels should be an iterable (e.g. list) of iterables, containing a list of labels for each example, instead of just a single label. The multi-label setting supports classification tasks where an example has 1 or more labels. Example of a multi-labeled labels input: [[0,1], [1], [0,2], [0,1,2], [0], [1], …]

  • return_indices_of_off_diagonals (bool) – If true, returns the indices of examples that were counted in the off-diagonals of the confident joint, as a baseline proxy for the label issues. This sometimes works as well as filter.find_label_issues(confident_joint).

Examples

We provide a for-loop based simplification of the confident joint below. This implementation is not efficient, not used in practice, and not complete, but covers the gist of how the confident joint is computed:

# Confident examples are those that we are confident have true_label = k
# Estimate (K, K) matrix of confident examples with label = k_s and true_label = k_y
cj_ish = np.zeros((K, K))
for k_s in range(K): # k_s is the class value k of noisy labels `s`
    for k_y in range(K): # k_y is the (guessed) class k of true_label k_y
        cj_ish[k_s][k_y] = sum((pred_probs[:,k_y] >= (thresholds[k_y] - 1e-8)) & (labels == k_s))

The following implementation loops over examples (vectorized only within each row, not parallelized) and is again simplified for understanding. It is accurate, just not optimized for speed.

confident_joint = np.zeros((K, K), dtype = int)
for i, row in enumerate(pred_probs):
    s_label = labels[i]
    confident_bins = row >= thresholds - 1e-6
    num_confident_bins = sum(confident_bins)
    if num_confident_bins == 1:
        confident_joint[s_label][np.argmax(confident_bins)] += 1
    elif num_confident_bins > 1:
        confident_joint[s_label][np.argmax(row)] += 1
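
The loop above leaves K, labels, pred_probs and thresholds undefined. The following self-contained sketch wraps the same logic in a function and runs it on toy data. The helper names are hypothetical, and computing each threshold as the mean predicted probability of class k among examples labeled k is an assumption made here for illustration (one reading of P(label^=k|label=k)), not a statement of what the library does when thresholds is None.

```python
import numpy as np

def per_class_thresholds(labels, pred_probs):
    """Assumed threshold rule: mean P(label=k|x) over examples whose given label is k."""
    K = pred_probs.shape[1]
    return np.array([pred_probs[labels == k, k].mean() for k in range(K)])

def confident_joint_sketch(labels, pred_probs, thresholds):
    """Uncalibrated confident counts, following the row-by-row loop shown above."""
    K = pred_probs.shape[1]
    cj = np.zeros((K, K), dtype=int)
    for given_label, row in zip(labels, pred_probs):
        confident_bins = row >= thresholds - 1e-6
        num_confident_bins = confident_bins.sum()
        if num_confident_bins == 1:
            # Exactly one class clears its threshold: count the example there.
            cj[given_label, np.argmax(confident_bins)] += 1
        elif num_confident_bins > 1:
            # Several classes clear their thresholds: use the most probable class.
            cj[given_label, np.argmax(row)] += 1
        # If no class clears its threshold, the example is not counted.
    return cj

labels = np.array([0, 0, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.5, 0.5],
])
thresholds = per_class_thresholds(labels, pred_probs)  # [0.55, 0.7]
cj = confident_joint_sketch(labels, pred_probs, thresholds)
```

On this toy input the second example (given label 0, confidently class 1) lands in the off-diagonal entry, and the fourth example is not counted because it clears no threshold.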

cleanlab.count.converge_estimates(ps, py, noise_matrix, inverse_noise_matrix, *, inv_noise_matrix_iterations=5, noise_matrix_iterations=3)

Computes py := P(true_label=k) and both noise_matrix and inverse_noise_matrix, by numerically converging ps := P(labels=k), py, and the noise matrices.

Forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.

py := P(true_label=k) and the inverse noise matrix P(true_label=k_y|label=k_s) specify one another, meaning one can be computed from the other and vice versa. When a numerical discrepancy exists due to poor estimation, they can be made to agree by repeatedly computing one from the other for a certain number of iterations (3-10 works fine).

Do not set the iteration counts too high: small deviations are perturbed on every pass and may be magnified, which degrades the estimates.

Note that we have to first converge the inverse_noise_matrix and py, then we can update the noise_matrix, then repeat. This is because the inverse noise matrix depends on py (which is unknown/latent), but the noise matrix depends on ps (which is known), so there will be no change in the noise matrix if we recompute it when py and inverse_noise_matrix change.

Parameters
  • ps (np.array (shape (K, ) or (1, K))) – The fraction (prior probability) of each observed, NOISY class P(labels = k).

  • py (np.array (shape (K, ) or (1, K))) – The estimated fraction (prior probability) of each TRUE class P(true_label = k).

  • noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

  • inverse_noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(true_label=k_y|labels=k_s) representing the estimated fraction of observed examples in each class k_s that are mislabeled examples from every other class k_y. If None, the inverse_noise_matrix will be computed from pred_probs and labels. Assumes columns of inverse_noise_matrix sum to 1.

  • inv_noise_matrix_iterations (int (Default: 5)) – Number of times to converge the inverse noise matrix with py and the noise matrix.

  • noise_matrix_iterations (int (Default: 3)) – Number of times to converge the noise matrix with py and the inverse noise matrix.

Returns

Three np.arrays of the form (py, noise_matrix, inverse_noise_matrix), all having numerical agreement in terms of their mathematical relations.
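The closed-form relations that tie these quantities together follow from Bayes' rule. The sketch below does not reproduce the iterative procedure. It only checks the identities on made-up numbers, using the column conventions stated in the parameter descriptions (columns of both matrices sum to 1).

```python
import numpy as np

# noise_matrix[k_s, k_y] = P(label=k_s | true_label=k_y); columns sum to 1.
noise_matrix = np.array([[0.9, 0.2],
                         [0.1, 0.8]])
py = np.array([0.4, 0.6])  # P(true_label=k), made-up values

# Marginalizing the true label gives the noisy prior: ps = noise_matrix @ py.
ps = noise_matrix @ py

# Bayes' rule: P(true_label=k_y | label=k_s)
#   = P(label=k_s | true_label=k_y) * P(true_label=k_y) / P(label=k_s)
inverse_noise_matrix = (noise_matrix * py[None, :]).T / ps[None, :]

# py can be recovered from the inverse noise matrix and ps.
py_recovered = inverse_noise_matrix @ ps
```

When the three estimates come from separate, imperfect estimators, identities like `py == inverse_noise_matrix @ ps` hold only approximately, which is the discrepancy the iterations are meant to reduce.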

cleanlab.count.estimate_confident_joint_and_cv_pred_proba(X, labels, clf=LogisticRegression(), *, cv_n_folds=5, thresholds=None, seed=None, calibrate=True)

Estimates P(labels,y), the confident counts of the latent joint distribution of true and noisy labels using observed labels and predicted probabilities pred_probs.

The output of this function is a numpy array of shape (K, K).

Under certain conditions, estimates are exact, and in many conditions, estimates are within one percent of actual.

Notes: There are two ways to compute the confident joint, each with pros and cons:

  1. For each holdout set, compute the confident joint, then sum them up.

  2. Compute pred_proba for each fold, combine them, then compute the confident joint.

Approach (1) is more accurate because it correctly computes thresholds for each fold. Approach (2) is more accurate when you have only a little data because it computes the confident joint using all the probabilities. For example, with 100 examples, 5-fold cross validation and uniform p(y), approach (1) would have only 20 examples from which to compute each confident joint. Such a small amount of data is bound to result in estimation errors. For this reason, we implement (2), but we implement (1) as a commented-out function at the end of this file.

Parameters
  • X (np.array) – Input feature matrix (N, D), 2D numpy array

  • labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for dataset with K classes, labels must be in {0,1,…,K-1}.

  • clf (sklearn.classifier or equivalent) – Default classifier used is logistic regression. Assumes clf has predict_proba() and fit() defined.

  • cv_n_folds (int) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.

  • thresholds (iterable (list or np.array) of shape (K, 1) or (K,)) – P(label^=k|label=k). If an example has a predicted probability “greater” than this threshold, it is counted as having true_label = k. This is not used for filtering/pruning, only for estimating the noise rates using confident counts. This value should be between 0 and 1. Default is None.

  • seed (int (default = None)) – Set the default state of the random number generator used to split the cross-validated folds. If None, uses np.random current random state.

  • calibrate (bool (default: True)) – Calibrates confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis = 1) == np.bincount(labels).

Returns

A tuple of two numpy arrays of the form (joint counts matrix, predicted probability matrix).

cleanlab.count.estimate_cv_predicted_probabilities(X, labels, clf=LogisticRegression(), *, cv_n_folds=5, seed=None)

This function computes the out-of-sample predicted probability [P(label=k|x)] for every example in X using cross validation. Output is a np.array of shape (N, K) where N is the number of training examples and K is the number of classes.

Parameters
  • X (np.array) – Input feature matrix (N, D), 2D numpy array

  • labels (np.array or list of ints from [0,1,..,K-1]) – A discrete vector of class labels which may or may not contain mislabeling. Format requirements: for dataset with K classes, labels must be in {0,1,…,K-1}.

  • clf (sklearn.classifier or equivalent) – Default classifier used is logistic regression. Assumes clf has predict_proba() and fit() defined.

  • cv_n_folds (int) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.

  • seed (int (default = None)) – Set the default state of the random number generator used to split the cross-validated folds. If None, uses np.random current random state.

Returns

pred_probs – P(label=k|x) is a matrix with K model-predicted probabilities. Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class. The columns are ordered such that these probabilities correspond to class 0,1,2,… The probabilities are computed out-of-sample using cv_n_folds-fold cross validation.

Return type

np.array (shape (N, K))
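
The idea of out-of-sample predicted probabilities can be sketched with a manual K-fold loop. This is an illustration under stated assumptions, not the library's code: `CentroidClassifier` is a hypothetical stand-in for any classifier exposing fit() and predict_proba(), `cv_pred_probs_sketch` is a hypothetical helper, and the folds are plain shuffled splits rather than stratified ones.

```python
import numpy as np

class CentroidClassifier:
    """Hypothetical stand-in for any classifier with fit() and predict_proba()."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == k].mean(axis=0) for k in self.classes_])
        return self

    def predict_proba(self, X):
        # Softmax over negative squared distances to each class centroid.
        d2 = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        logits = -d2
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)

def cv_pred_probs_sketch(X, labels, clf, cv_n_folds=5, seed=None):
    """Each example's probabilities come from a model that never saw that example."""
    rng = np.random.default_rng(seed)
    N, K = len(labels), len(np.unique(labels))
    pred_probs = np.zeros((N, K))
    for holdout_idx in np.array_split(rng.permutation(N), cv_n_folds):
        train_mask = np.ones(N, dtype=bool)
        train_mask[holdout_idx] = False
        clf.fit(X[train_mask], labels[train_mask])
        pred_probs[holdout_idx] = clf.predict_proba(X[holdout_idx])
    return pred_probs

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)
pred_probs = cv_pred_probs_sketch(X, labels, CentroidClassifier(), cv_n_folds=5, seed=0)
```

The result has one row per example and one column per class, which is the (N, K) layout the other functions on this page expect for pred_probs.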

cleanlab.count.estimate_joint(labels, pred_probs=None, *, confident_joint=None, multi_label=False)

Estimates the joint distribution of label noise P(label=i, true_label=j), guaranteed to satisfy:
  • sum to 1

  • np.sum(joint_estimate, axis = 1) == p(labels)

Parameters

See the cleanlab.count.calibrate_confident_joint docstring.

Returns

An np.array of shape (K, K) of type float representing a valid estimate of the true joint of noisy and true labels.
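The two guarantees can be checked on a toy example. The sketch assumes the joint is obtained by dividing an already calibrated confident joint (counts whose rows sum to the noisy label counts) by its total. The numbers are made up for illustration.

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])

# A calibrated confident joint: row i sums to the number of examples labeled i.
calibrated_cj = np.array([[3, 0, 0],
                          [0, 1, 1],
                          [1, 0, 4]], dtype=float)

# Normalizing the counts gives a joint distribution over (label, true_label).
joint = calibrated_cj / calibrated_cj.sum()

# The noisy prior p(labels), estimated directly from the observed labels.
p_labels = np.bincount(labels) / len(labels)
```

Here `joint.sum()` is 1 and `joint.sum(axis=1)` matches `p_labels`, which are the two properties listed above.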

cleanlab.count.estimate_latent(confident_joint, labels, *, py_method='cnt', converge_latent_estimates=False)

Computes the latent prior p(y), the noise matrix P(labels|y) and the inverse noise matrix P(y|labels) from the confident_joint count(labels, y). The confident_joint is estimated by compute_confident_joint by counting confident examples.

Parameters
  • labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for dataset with K classes, labels must be in {0,1,…,K-1}.

  • confident_joint (np.array (shape (K, K), type int)) – A (K, K) integer matrix of counts, count(label=j, true_label=k). Estimates a confident subset of the joint distribution of the noisy and true labels P_{labels,y}. Each entry holds the number of examples confidently counted into the corresponding (label=j, true_label=k) pair of classes.

  • py_method (str (Options: ["cnt", "eqn", "marginal", "marginal_ps"])) – py is shorthand for the class proportions (a.k.a. prior) of the true labels. This method defines how to compute the latent prior p(true_label=k). Default is "cnt", which works well even when the noise matrices are estimated poorly because it uses the matrix diagonals instead of all the probabilities.

  • converge_latent_estimates (bool) – If true, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.

Return type

A tuple containing (py, noise_matrix, inv_noise_matrix).
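
A minimal sketch of how the three outputs relate to a confident joint of counts is given below. It is an illustration under assumptions, not the library's implementation: the helper name is hypothetical, the diagonal-based formula for py is one plausible reading of the "cnt" description above, the final renormalization is added here as a guard, and all diagonal entries are assumed nonzero.

```python
import numpy as np

def latent_from_counts_sketch(confident_joint, labels):
    """Hypothetical helper: derive (py, noise_matrix, inverse_noise_matrix) from counts."""
    cj = np.asarray(confident_joint, dtype=float)
    K = cj.shape[0]
    ps = np.bincount(labels, minlength=K) / len(labels)  # P(label=k)
    # noise_matrix[k_s, k_y] = P(label=k_s | true_label=k_y): normalize each column.
    noise_matrix = cj / cj.sum(axis=0, keepdims=True)
    # inverse_noise_matrix[k_y, k_s] = P(true_label=k_y | label=k_s):
    # normalize each row of the counts, then transpose so columns sum to 1.
    inverse_noise_matrix = (cj / cj.sum(axis=1, keepdims=True)).T
    # Assumed diagonal-based prior: P(y=k) = P(y=k|s=k) * P(s=k) / P(s=k|y=k).
    py = inverse_noise_matrix.diagonal() / noise_matrix.diagonal() * ps
    py = py / py.sum()  # renormalize (guard added for this sketch)
    return py, noise_matrix, inverse_noise_matrix

labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
cj = np.array([[3, 0, 0],
               [0, 1, 1],
               [1, 0, 4]])
py, noise_matrix, inverse_noise_matrix = latent_from_counts_sketch(cj, labels)
```

For this consistent toy input, py equals the column sums of the counts divided by the number of examples, i.e. [0.4, 0.1, 0.5].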

cleanlab.count.estimate_noise_matrices(X, labels, clf=LogisticRegression(), *, cv_n_folds=5, thresholds=None, converge_latent_estimates=True, seed=None)

Estimates the noise_matrix of shape (K, K). This is the fraction of examples in every class, labeled as every other class. The noise_matrix is a conditional probability matrix for P(label=k_s|true_label=k_y).

Under certain conditions, estimates are exact, and in most conditions, estimates are within one percent of the actual noise rates.

Parameters
  • X (np.array) – Input feature matrix (N, D), 2D numpy array

  • labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for dataset with K classes, labels must be in {0,1,…,K-1}.

  • clf (sklearn.classifier or equivalent) – Default classifier used is logistic regression. Assumes clf has predict_proba() and fit() defined.

  • cv_n_folds (int) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.

  • thresholds (iterable (list or np.array) of shape (K, 1) or (K,)) – P(label^=k|label=k). If an example has a predicted probability “greater” than this threshold, it is counted as having true_label = k. This is not used for filtering/pruning, only for estimating the noise rates using confident counts. This value should be between 0 and 1. Default is None.

  • converge_latent_estimates (bool) – If true, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.

  • seed (int (default = None)) – Set the default state of the random number generator used to split the cross-validated folds. If None, uses np.random current random state.

Return type

A two-item tuple containing (noise_matrix, inv_noise_matrix).

cleanlab.count.estimate_py_and_noise_matrices_from_probabilities(labels, pred_probs, *, thresholds=None, converge_latent_estimates=True, py_method='cnt', calibrate=True)

Computes the confident counts estimate of latent variables py and the noise rates using observed labels and predicted probabilities, pred_probs.

Important! This function assumes that pred_probs are out-of-sample holdout probabilities. This can be done with cross validation. If the probabilities are not computed out-of-sample, overfitting may occur.

This function estimates the noise_matrix of shape (K, K). This is the fraction of examples in every class, labeled as every other class. The noise_matrix is a conditional probability matrix for P(label=k_s|true_label=k_y).

Under certain conditions, estimates are exact, and in most conditions, estimates are within one percent of the actual noise rates.

Parameters
  • labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for dataset with K classes, labels must be in {0,1,…,K-1}.

  • pred_probs (np.array (shape (N, K))) – P(label=k|x) is a matrix with K model-predicted probabilities. Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class. The columns must be ordered such that these probabilities correspond to class 0,1,2,… pred_probs should have been computed using 3 (or higher) fold cross-validation.

  • thresholds (iterable (list or np.array) of shape (K, 1) or (K,)) – P(label^=k|label=k). If an example has a predicted probability “greater” than this threshold, it is counted as having true_label = k. This is not used for filtering/pruning, only for estimating the noise rates using confident counts. This value should be between 0 and 1. Default is None.

  • converge_latent_estimates (bool) – If true, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.

  • py_method (str (Options: ["cnt", "eqn", "marginal", "marginal_ps"])) – How to compute the latent prior p(true_label=k). Default is "cnt", as it often works well even when the noise matrices are estimated poorly, because it uses the matrix diagonals instead of all the probabilities.

  • calibrate (bool (default: True)) – Calibrates confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis = 1) == np.bincount(labels).

Return type

A tuple of the form (py, noise_matrix, inverse_noise_matrix).

cleanlab.count.estimate_py_noise_matrices_and_cv_pred_proba(X, labels, clf=LogisticRegression(), *, cv_n_folds=5, thresholds=None, converge_latent_estimates=False, py_method='cnt', seed=None)

This function computes the out-of-sample predicted probability P(label=k|x) for every example x in X using cross validation while also computing the confident counts noise rates within each cross-validated subset and returning the average noise rate across all examples.

This function estimates the noise_matrix of shape (K, K). This is the fraction of examples in every class, labeled as every other class. The noise_matrix is a conditional probability matrix for P(label=k_s|true_label=k_y).

Under certain conditions, estimates are exact, and in most conditions, estimates are within one percent of the actual noise rates.

Parameters
  • X (np.array) – Input feature matrix (N, D), 2D numpy array

  • labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for dataset with K classes, labels must be in {0,1,…,K-1}.

  • clf (sklearn.classifier or equivalent) – Default classifier used is logistic regression. Assumes clf has predict_proba() and fit() defined.

  • cv_n_folds (int) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.

  • thresholds (iterable (list or np.array) of shape (K, 1) or (K,)) – P(label^=k|label=k). If an example has a predicted probability “greater” than this threshold, it is counted as having true_label = k. This is not used for filtering/pruning, only for estimating the noise rates using confident counts. This value should be between 0 and 1. Default is None.

  • converge_latent_estimates (bool) – If true, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.

  • py_method (str (Options: ["cnt", "eqn", "marginal", "marginal_ps"])) – How to compute the latent prior p(true_label=k). Default is "cnt", as it often works well even when the noise matrices are estimated poorly, because it uses the matrix diagonals instead of all the probabilities.

  • seed (int (default = None)) – Set the default state of the random number generator used to split the cross-validated folds. If None, uses np.random current random state.

Returns

A tuple of five numpy arrays of the form (py, noise_matrix, inverse_noise_matrix, joint count matrix i.e. confident joint, predicted probability matrix).

cleanlab.count.num_label_issues(labels, pred_probs, confident_joint=None)

Estimates the number of label issues in labels.

This method is more accurate than sum(find_label_issues()) because it is computed using only the Trace(joint), ignoring all off-diagonals (which are used by find_label_issues and are harder to estimate). Here we sum over only the diagonal elements of the joint (which have more data, are more constrained, and are therefore easier to compute).

tl;dr - Use this method to get the most accurate estimate of number of label issues when you don’t need the indices of the label issues. This is the canonical way to find errors simply by combining a ranking/scoring function from rank.py with num_label_issues().

Parameters
  • labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for dataset with K classes, labels must be in {0,1,…,K-1}.

  • pred_probs (np.array (shape (N, K))) – P(label=k|x) is a matrix with K model-predicted probabilities. Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class. The columns must be ordered such that these probabilities correspond to class 0,1,2,… pred_probs should have been computed using 3 (or higher) fold cross-validation.

  • confident_joint (np.array (shape (K, K), type int)) – A K,K integer matrix of count(label=k, true_label=k). Estimates a confident subset of the joint distribution of the noisy and true labels P_{labels,y}. Each entry in the matrix contains the number of examples confidently counted into every pair (label=j, true_label=k) classes.

Return type

An integer estimating the number of label issues.
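
The trace-based estimate described above can be sketched as follows. This is an illustration on made-up numbers and not the library's exact code. It assumes the joint is a normalized (K, K) matrix whose diagonal holds the estimated probability mass of correctly labeled examples.

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])

# Toy calibrated counts, normalized into a joint over (label, true_label).
calibrated_cj = np.array([[3, 0, 0],
                          [0, 1, 1],
                          [1, 0, 4]], dtype=float)
joint = calibrated_cj / calibrated_cj.sum()

# The diagonal mass is the estimated fraction of correctly labeled examples,
# so the remaining mass, scaled by the dataset size, estimates the issue count.
num_issues = int(round(len(labels) * (1.0 - np.trace(joint))))
```

Only the diagonal of the joint enters the computation. The off-diagonal entries, which are harder to estimate, are never read individually.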