filter#
Filter (previously in Cleanlab 1.0, this module was called Pruning)
Contains methods for estimating the latent indices of all label issues. This code uses advanced multiprocessing to speed up computation. See: https://research.wmz.ninja/articles/2018/03/on-sharing-large-arrays-when-using-pythons-multiprocessing.html
Functions:

- find_label_issues: By default, returns a boolean mask for the entire dataset where True represents a label issue and False represents an example that is confidently/accurately labeled.
- find_label_issues_using_argmax_confusion_matrix: A baseline approach that uses the confusion matrix of argmax(pred_probs) and labels as the confident joint.
- find_predicted_neq_given: The simplest baseline approach: flags every example where argmax(pred_probs) != labels.
- keep_at_least_n_per_class: Makes sure every class has at least n examples after removing noise.
- multiclass_crossval_predict: Returns a numpy 2D array of one-hot encoded multiclass predictions.
- reduce_prune_counts: Reduces (multiplies) all prune counts (non-diagonal) by frac_noise and increases the diagonal by the total amount reduced in each column to preserve column counts.
- cleanlab.filter.find_label_issues(labels, pred_probs, *, confident_joint=None, filter_by='prune_by_noise_rate', return_indices_ranked_by=None, multi_label=False, frac_noise=1.0, num_to_remove_per_class=None, n_jobs=None, verbose=0)[source]#
By default, this method returns a boolean mask for the entire dataset where True represents a label issue and False represents an example that is confidently/accurately labeled.

You can return ONLY the indices of the label issues in your dataset by setting return_indices_ranked_by = {'self_confidence', 'normalized_margin'}.

- The number of indices returned is controlled by frac_noise. When frac_noise = 1.0, all "confident" estimated noise indices are returned.
- If you encounter the error 'pred_probs is not defined', try setting n_jobs = 1.

WARNING! The frac_noise and num_to_remove_per_class parameters are only supported when filter_by is either 'prune_by_noise_rate', 'prune_by_class', or 'both'. They are not supported for the 'confident_learning' or 'predicted_neq_given' methods.
- Parameters

labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for a dataset with K classes, labels must be in {0, 1, …, K-1}.

pred_probs (np.array, shape (N, K)) – P(label=k|x) is a matrix with K model-predicted probabilities. Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class. The columns must be ordered such that these probabilities correspond to class 0, 1, 2, … pred_probs should have been computed using 3-fold (or higher) cross-validation.

confident_joint (np.array, shape (K, K), type int, default: None) – A K x K integer matrix of count(label=k, true_label=k). Estimates a confident subset of the joint distribution of the noisy and true labels, P_{labels,y}. Each entry in the matrix contains the number of examples confidently counted into each pair of (label=j, true_label=k) classes. The confident joint can be computed using count.compute_confident_joint.
filter_by (str, default: 'prune_by_noise_rate') – Possible values: {'prune_by_class', 'prune_by_noise_rate', 'both', 'confident_learning', 'predicted_neq_given'}.

Method used for filtering/pruning out the label issues:
1. 'prune_by_noise_rate': removes examples with a high probability of being mislabeled for every non-diagonal entry in the confident joint (see prune_counts_matrix in filter.py). These are the examples where (with high confidence) the given label is unlikely to match the predicted label.
2. 'prune_by_class': removes the examples with the smallest probability of belonging to their given class label, for every class.
3. 'both': removes only the examples that would be filtered by both (1) AND (2).
4. 'confident_learning': returns the examples in the off-diagonals of the confident joint. These are the examples that are confidently predicted to be a different label than their given label.
5. 'predicted_neq_given': finds examples where the predicted class (i.e. argmax of the predicted probabilities) does not match the given label.
return_indices_ranked_by ({None, 'self_confidence', 'normalized_margin'}) – If None, returns a boolean mask (True if the example at that index is a label error). If not None, returns an array of the label error indices (instead of a boolean mask), where the error indices are ordered by either:

'normalized_margin' := normalized margin (p(label = k) - max(p(label != k)))
'self_confidence' := [pred_probs[i][labels[i]] for i in label_issues_idx]
multi_label (bool) – If True, labels should be an iterable (e.g. list) of iterables, containing a list of labels for each example, instead of just a single label. The multi-label setting supports classification tasks where an example has 1 or more labels. Example of a multi-labeled labels input: [[0,1], [1], [0,2], [0,1,2], [0], [1], …]

frac_noise (float) – When frac_noise = 1.0, return all "confident" estimated noise indices. Value in range (0, 1] that determines the fraction of noisy example indices to return, based on the following formula for each class k: frac_noise * number_of_mislabeled_examples_in_class_k, or equivalently frac_noise * inverse_noise_rate_class_k * num_examples_with_s_equal_k.

num_to_remove_per_class (list of int, of length K (# of classes)) – e.g. if K = 3, num_to_remove_per_class = [5, 0, 1] would return the indices of the 5 most likely mislabeled examples in class labels = 0, and the single most likely mislabeled example in class labels = 2.
Note
Only set this parameter if filter_by == 'prune_by_class'. You may use it with filter_by == 'prune_by_noise_rate', but if num_to_remove_per_class == k, then either k-1, k, or k+1 examples may be removed for any class. This is because noise rates are floats, and rounding may cause an off-by-one. If you need exactly 'k' examples removed from every class, you should use 'prune_by_class'.
n_jobs (int, default: None) – Number of processing threads used by multiprocessing. Default None sets it to the number of processing threads on your CPU. Set this to 1 to disable parallel processing (if it's causing issues). Windows users may see a speed-up with n_jobs = 1.

verbose (int) – If 0, no print statements. If 1, prints when multiprocessing happens.
- Returns
label_issues_mask – This method returns a boolean mask for the entire dataset where True represents a label issue and False represents an example that is confidently/accurately labeled.
Note
You can also return ONLY the indices of the label issues in your dataset, by setting return_indices_ranked_by = {self_confidence, normalized_margin}.
- Return type
np.array
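The two ranking scores accepted by return_indices_ranked_by can be illustrated directly with numpy. This is a minimal sketch of the scoring logic only, not cleanlab's implementation; the toy labels and pred_probs below are made up for illustration:

```python
import numpy as np

# toy inputs (hypothetical): 3 examples, 2 classes
labels = np.array([0, 1, 1])
pred_probs = np.array([
    [0.8, 0.2],
    [0.3, 0.7],
    [0.9, 0.1],  # given label 1, but the model strongly predicts class 0
])

# 'self_confidence': the model-predicted probability of each example's given label
self_confidence = pred_probs[np.arange(len(labels)), labels]

# 'normalized_margin': p(given label) - max probability over the other classes
masked = pred_probs.copy()
masked[np.arange(len(labels)), labels] = -np.inf
normalized_margin = self_confidence - masked.max(axis=1)

# lower score = more likely to be a label issue, so rank ascending
ranked_indices = np.argsort(self_confidence)  # example 2 ranks first
```

A lower score under either metric means the model disagrees more strongly with the given label, which is why the returned indices are ordered ascending by score.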
- cleanlab.filter.find_label_issues_using_argmax_confusion_matrix(labels, pred_probs, *, calibrate=True, filter_by='prune_by_noise_rate')[source]#
This is a baseline approach that uses the confusion matrix of argmax(pred_probs) and labels as the confident joint, and then uses cleanlab (confident learning) to find the label issues using this matrix.
The only difference between this and find_label_issues is that it uses the confusion matrix based on the argmax and given label instead of using the confident joint from count.compute_confident_joint.
This method does not support multi-label labels.
- Parameters

labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for a dataset with K classes, labels must be in {0, 1, …, K-1}.

pred_probs (np.array, shape (N, K)) – P(label=k|x) is a matrix with K model-predicted probabilities. Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class. The columns must be ordered such that these probabilities correspond to class 0, 1, 2, … pred_probs should have been computed using 3-fold (or higher) cross-validation.

calibrate (bool) – Set to True to calibrate the confusion matrix created by pred != given labels. This calibration adjusts the confusion matrix / confident joint so that the prior (given noisy labels) is correct based on the original labels.

filter_by (str, default: 'prune_by_noise_rate') – Possible values: {'prune_by_class', 'prune_by_noise_rate', 'both'}. Method used for pruning/filtering out the label issues:
1. 'prune_by_noise_rate': removes examples with a high probability of being mislabeled for every non-diagonal entry in the confident joint (see prune_counts_matrix in filter.py). These are the examples where (with high confidence) the given label is unlikely to match the predicted label.
2. 'prune_by_class': removes the examples with the smallest probability of belonging to their given class label, for every class.
3. 'both': finds the examples satisfying (1) AND (2) and removes their set conjunction.
- Returns
A boolean mask, true if the example at that index has a label issue.
- Return type
np.array
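The core idea, building a confusion matrix from the argmax predictions and the given labels and treating it as the confident joint, can be sketched with numpy. The toy labels and predictions here are made up for illustration, not cleanlab's actual code:

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 2, 2])  # given (noisy) labels
preds = np.array([0, 1, 1, 1, 2, 0])   # argmax of pred_probs
K = 3

# confusion matrix used in place of the confident joint:
# rows = given label, columns = predicted label
cm = np.zeros((K, K), dtype=int)
for given, pred in zip(labels, preds):
    cm[given, pred] += 1
```

The off-diagonal entries of this matrix (e.g. cm[0, 1] above) count examples whose prediction disagrees with their given label, which is exactly what the pruning step then consumes.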
- cleanlab.filter.find_predicted_neq_given(labels, pred_probs, *, multi_label=False)[source]#
This is the simplest baseline approach. Just consider anywhere argmax != labels as a label error.
- Parameters

labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for a dataset with K classes, labels must be in {0, 1, …, K-1}.

pred_probs (np.array, shape (N, K)) – P(label=k|x) is a matrix with K model-predicted probabilities. Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class. The columns must be ordered such that these probabilities correspond to class 0, 1, 2, … pred_probs should have been computed using 3-fold (or higher) cross-validation.

multi_label (bool) – Set to True if labels is multi-label (list of lists, or np.array of np.array). The multi-label setting supports classification tasks where an example has 1 or more labels. Example of a multi-labeled labels input: [[0,1], [1], [0,2], [0,1,2], [0], [1], …]
- Returns
A boolean mask that is true if the example at that index has a label error.
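This baseline is a one-liner in numpy. The sketch below (with made-up inputs) shows the single-label case:

```python
import numpy as np

labels = np.array([0, 1, 1, 2])
pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.1, 0.8,  0.1 ],
    [0.6, 0.3,  0.1 ],  # predicted class 0, given label 1 -> flagged
    [0.2, 0.1,  0.7 ],
])

# flag every example whose argmax prediction disagrees with its given label
issue_mask = np.argmax(pred_probs, axis=1) != labels
```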
- cleanlab.filter.keep_at_least_n_per_class(prune_count_matrix, n, *, frac_noise=1.0)[source]#
Make sure every class has at least n examples after removing noise. Functionally, for each column, this increases the diagonal term #(true_label=k, label=k) of prune_count_matrix until it is at least n, distributing the amount of the increase by subtracting uniformly from the rest of the terms in the column. When frac_noise = 1.0, return all "confidently" estimated noise indices; otherwise this returns a frac_noise fraction of all the noise counts, with diagonal terms adjusted to ensure column totals are preserved.
- Parameters

prune_count_matrix (np.array of shape (K, K), K = number of classes) – Counts of mislabeled examples in every class. NOTE: prune_count_matrix is transposed relative to confident_joint.

n (int) – Number of examples to make sure are left in each class.

frac_noise (float) – When frac_noise = 1.0, return all estimated noise indices. Value in range (0, 1] that determines the fraction of noisy example indices to return, based on the following formula for each class k: frac_noise * number_of_mislabeled_examples_in_class_k, or frac_noise * inverse_noise_rate_class_k * num_examples_s_equal_k.
- Returns
prune_count_matrix – Number of examples to remove from each class, for every other class.
- Return type
np.array of shape (K, K), K = number of classes
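The column adjustment described above can be sketched as follows. keep_at_least_n_sketch is a hypothetical, simplified stand-in for the real function: it ignores the frac_noise handling and does not clip negative counts, but it shows the column-preserving diagonal increase:

```python
import numpy as np

def keep_at_least_n_sketch(prune_count_matrix, n):
    # raise each diagonal term to at least n, subtracting the increase
    # uniformly from the off-diagonal terms of the same column so the
    # column total (examples per class) is preserved
    m = prune_count_matrix.astype(float).copy()
    K = m.shape[0]
    for k in range(K):
        deficit = n - m[k, k]
        if deficit > 0:
            m[k, k] = n
            off_rows = [i for i in range(K) if i != k]
            m[off_rows, k] -= deficit / (K - 1)
    return m

# column 0 keeps its total of 6 while its diagonal is raised to 3
adjusted = keep_at_least_n_sketch(np.array([[1, 0], [5, 8]]), n=3)
```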
- cleanlab.filter.multiclass_crossval_predict(labels, pyx)[source]#
Returns a numpy 2D array of one-hot encoded multiclass predictions. Each row in the array provides the predictions for a particular example. The boundary condition used to threshold predictions is computed by maximizing the F1 ROC curve.
- Parameters

labels (list of lists, length N) – These are multiclass labels. Each list in the list contains all the labels for that example.

pyx (np.array, shape (N, K)) – P(label=k|x) is a matrix with K model-predicted probabilities. Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class. The columns must be ordered such that these probabilities correspond to class 0, 1, 2, … pyx should have been computed using 3-fold (or higher) cross-validation.
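The one-hot thresholding step can be illustrated with a fixed cutoff. Note that the real method tunes the decision boundary by maximizing the F1 ROC curve; this sketch simply uses 0.5, and the pyx values are made up for illustration:

```python
import numpy as np

# hypothetical multiclass probabilities for 2 examples and 3 classes
pyx = np.array([
    [0.9, 0.6, 0.1],
    [0.2, 0.8, 0.7],
])

threshold = 0.5  # illustrative fixed cutoff (the real boundary is tuned)
one_hot = (pyx >= threshold).astype(int)  # each row may have multiple 1s
```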
- cleanlab.filter.reduce_prune_counts(prune_count_matrix, frac_noise=1.0)[source]#
Reduce (multiply) all prune counts (non-diagonal) by frac_noise and increase diagonal by the total amount reduced in each column to preserve column counts.
- Parameters

prune_count_matrix (np.array of shape (K, K), K = number of classes) – Counts of mislabeled examples in every class. For this function, it does not matter what the rows or columns are, but the diagonal terms reflect the number of correctly labeled examples.

frac_noise (float) – When frac_noise = 1.0, return all estimated noise indices. Value in range (0, 1] that determines the fraction of noisy example indices to return, based on the following formula for each class k: frac_noise * number_of_mislabeled_examples_in_class_k, or frac_noise * inverse_noise_rate_class_k * num_examples_s_equal_k.
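This operation can be sketched in a few lines of numpy. reduce_prune_counts_sketch is a hypothetical, simplified version showing the column-preserving rescaling:

```python
import numpy as np

def reduce_prune_counts_sketch(prune_count_matrix, frac_noise=1.0):
    m = prune_count_matrix.astype(float)
    # scale only the off-diagonal (noise) counts by frac_noise
    scaled = m * frac_noise
    np.fill_diagonal(scaled, np.diag(m))
    # add the amount removed from each column back onto that column's
    # diagonal, so column totals are preserved
    reduction = m.sum(axis=0) - scaled.sum(axis=0)
    scaled[np.diag_indices_from(scaled)] += reduction
    return scaled

# off-diagonals are halved; both column totals remain 12
reduced = reduce_prune_counts_sketch(np.array([[10, 4], [2, 8]]), frac_noise=0.5)
```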