noise_generation

Functions:

`generate_n_rand_probabilities_that_sum_to_m`(n, m, *)	When min_prob=0 and max_prob = 1.0, this method is deprecated.
`generate_noise_matrix`(K, *[, ...])	DEPRECATED - Use generate_noise_matrix_from_trace()
`generate_noise_matrix_from_trace`(K, trace, *)	Generates a K x K noise matrix P(label=k_s|true_label=k_y) with trace equal to np.sum(np.diagonal(noise_matrix)).
`generate_noisy_labels`(true_labels, noise_matrix)	Generates noisy labels (shape (N, 1)) from perfect labels y, 'exactly' yielding the provided noise_matrix between labels and y.
`noise_matrix_is_valid`(noise_matrix, py, *[, ...])	Given a prior py = p(true_label=k), returns true if the given noise_matrix is a learnable matrix.
`randomly_distribute_N_balls_into_K_bins`(N, K, *)	Returns a uniformly random numpy integer array of length K that sums to N.

cleanlab.noise_generation.generate_n_rand_probabilities_that_sum_to_m(n, m, *, max_prob=1.0, min_prob=0.0)

When min_prob=0.0 and max_prob=1.0, this method is deprecated. Use np.random.dirichlet(np.ones(n)) * m instead.

Generates ‘n’ random probabilities that sum to ‘m’.

Parameters

n (int) – Length of np.array of random probabilities to be returned.
m (float) – Sum of np.array of random probabilities that is returned.
max_prob (float (0.0, 1.0] | Default value is 1.0) – Maximum probability of any entry in the returned np.array.
min_prob (float [0.0, 1.0) | Default value is 0.0) – Minimum probability of any entry in the returned np.array.
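
The unconstrained case (min_prob=0.0, max_prob=1.0) amounts to drawing from a flat Dirichlet distribution and scaling by m. The sketch below illustrates that idea with the standard library only, using the fact that normalized i.i.d. Exponential(1) draws follow a flat Dirichlet distribution. The helper name `flat_dirichlet_scaled` is hypothetical and is not part of cleanlab.

```python
import random


def flat_dirichlet_scaled(n, m, rng=random):
    """Hypothetical helper: draw n non-negative floats that sum to m.

    Normalizing i.i.d. Exponential(1) draws gives a flat Dirichlet sample,
    which is then scaled by m.
    """
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [m * d / total for d in draws]


random.seed(0)  # fixed seed so repeated runs give the same draw
probs = flat_dirichlet_scaled(n=4, m=1.0)
print(probs, sum(probs))
```

This sketch does not enforce max_prob or min_prob bounds; with m greater than 1, individual entries may exceed 1.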

cleanlab.noise_generation.generate_noise_matrix(K, *, max_noise_rate=1.0, frac_zero_noise_rates=0.0, verbose=False)

DEPRECATED - Use generate_noise_matrix_from_trace()

Generates a noise matrix by randomly assigning noise rates up to max_noise_rate, then setting noise rates to zero until P(label!=k|true_label=k) < 1 is satisfied. Additionally, a fraction frac_zero_noise_rates of the noise rates is set to zero.

Parameters

K (int) – Creates a noise matrix of shape (K, K). Implies there are K classes for learning with noisy labels.
max_noise_rate (float) – Upper bound on each noise rate. Smaller values yield an easier learning problem (less noise).
frac_zero_noise_rates (float) – Fraction of noise rates set to zero, which makes the problem more tractable. Larger values yield an easier learning problem.
verbose (bool) – Print debugging output if set to True.

cleanlab.noise_generation.generate_noise_matrix_from_trace(K, trace, *, max_trace_prob=1.0, min_trace_prob=1e-05, max_noise_rate=0.99999, min_noise_rate=0.0, valid_noise_matrix=True, py=None, frac_zero_noise_rates=0.0, seed=0, max_iter=10000)

Generates a K x K noise matrix P(label=k_s|true_label=k_y) with trace equal to np.sum(np.diagonal(noise_matrix)).

Parameters

K (int) – Creates a noise matrix of shape (K, K). Implies there are K classes for learning with noisy labels.
trace (float (0.0, K]) – Sum of diagonal entries of np.array of random probabilities returned.
max_trace_prob (float (0.0, 1.0]) – Maximum probability of any entry in the trace of the return matrix.
min_trace_prob (float [0.0, 1.0)) – Minimum probability of any entry in the trace of the return matrix.
max_noise_rate (float (0.0, 1.0]) – Maximum noise_rate (non-diagonal entry) in the returned np.array.
min_noise_rate (float [0.0, 1.0)) – Minimum noise_rate (non-diagonal entry) in the returned np.array.
valid_noise_matrix (bool) – If True, returns a matrix satisfying the necessary conditions for learning with noisy labels. In particular, p(true_label=k)p(label=k) < p(true_label=k,label=k) is satisfied. This requires that trace > 1.
py (np.array (shape (K, 1))) – Fraction (prior probability) of each true class label, P(true_label = k). REQUIRED when valid_noise_matrix == True.
frac_zero_noise_rates (float) – The fraction of the K*(K-1) noise rates that will be set to 0. Note that with a high trace, it may be impossible to also have a low fraction of zero noise rates without forcing all diagonal values other than 1. When this happens, we only guarantee to produce a noise matrix with a fraction of zero noise rates equal to frac_zero_noise_rates or higher. The opposite occurs with a small trace.
seed (int) – Seeds the random number generator for numpy.
max_iter (int (default: 10000)) – The max number of tries to produce a valid matrix before returning False.

Returns

Noise matrix P(label=k_s|true_label=k_y) with trace equal to np.sum(np.diagonal(noise_matrix)). This is a conditional probability matrix and a left stochastic matrix (each column sums to 1).

Return type

np.array (shape (K, K))
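
To make the two stated properties concrete (columns sum to 1, and the diagonal sums to trace), here is a minimal sketch that builds one such matrix by hand and checks both. The helper `uniform_noise_matrix` is hypothetical and much simpler than the randomized construction this function performs: it gives every diagonal entry the value trace / K and splits the remaining column mass evenly.

```python
def uniform_noise_matrix(K, trace):
    """Hypothetical helper: K x K column-stochastic matrix with equal diagonal.

    Entry [k_s][k_y] stands for P(label=k_s | true_label=k_y). Each diagonal
    entry is trace / K; the remaining mass of every column is split evenly
    across the K - 1 off-diagonal entries. Requires K > 1.
    """
    diag = trace / K
    off = (1.0 - diag) / (K - 1)
    return [[diag if k_s == k_y else off for k_y in range(K)] for k_s in range(K)]


K = 3
noise_matrix = uniform_noise_matrix(K=K, trace=2.1)

# Column k_y collects P(label=k_s | true_label=k_y) over all k_s, so it sums to 1.
column_sums = [sum(noise_matrix[k_s][k_y] for k_s in range(K)) for k_y in range(K)]
# The trace is the sum of the diagonal entries.
matrix_trace = sum(noise_matrix[k][k] for k in range(K))
print(column_sums, matrix_trace)
```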

cleanlab.noise_generation.generate_noisy_labels(true_labels, noise_matrix)

Generates noisy labels (shape (N, 1)) from perfect labels y, ‘exactly’ yielding the provided noise_matrix between labels and y.

Below we provide a for loop implementation of what this function does. We do not use this implementation as it is not a fast algorithm, but it explains as Python pseudocode what is happening in this function.

Parameters

true_labels (np.array (shape (N, 1))) – Perfect labels, without any noise. Contains K distinct natural number classes, e.g. 0, 1,…, K-1
noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

Examples

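# Generate labels (true_labels is assumed to be a 1-D integer array)
import numpy as np

K = len(noise_matrix)  # number of classes
# Prior P(true_label=k), here estimated from the class frequencies in true_labels
py = np.bincount(true_labels, minlength=K) / len(true_labels)
# count_joint[k_s][k_y]: number of examples of true class k_y to label as k_s
count_joint = (noise_matrix * py * len(true_labels)).round().astype(int)
labels = np.array(true_labels)
for k_s in range(K):
    for k_y in range(K):
        if k_s != k_y:
            # Examples of true class k_y that have not been flipped yet
            idx_flip = np.where((labels == k_y) & (true_labels == k_y))[0]
            if len(idx_flip):
                labels[np.random.choice(
                    idx_flip,
                    count_joint[k_s][k_y],
                    replace=False,
                )] = k_s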
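
One way to see what "yielding the provided noise_matrix" means is to estimate the noise matrix back from a pair of label arrays. The sketch below does this with column-normalized counts on a tiny hand-flipped example. The helper `empirical_noise_matrix` is hypothetical and not part of cleanlab, and plain lists stand in for np.array inputs.

```python
def empirical_noise_matrix(true_labels, labels, K):
    """Hypothetical helper: estimate P(label=k_s | true_label=k_y) from counts."""
    counts = [[0] * K for _ in range(K)]
    for y, s in zip(true_labels, labels):
        counts[s][y] += 1  # row = noisy label, column = true label
    # Number of examples whose true class is k_y (column totals)
    class_sizes = [sum(counts[k_s][k_y] for k_s in range(K)) for k_y in range(K)]
    return [
        [counts[k_s][k_y] / class_sizes[k_y] if class_sizes[k_y] else 0.0
         for k_y in range(K)]
        for k_s in range(K)
    ]


true_labels = [0] * 10 + [1] * 10
labels = list(true_labels)
labels[0] = labels[1] = 1  # flip 2 of the 10 class-0 examples to class 1
labels[10] = 0             # flip 1 of the 10 class-1 examples to class 0

estimated = empirical_noise_matrix(true_labels, labels, K=2)
print(estimated)
```

Here 20% of class 0 is labeled as class 1 and 10% of class 1 is labeled as class 0, so each column of the estimate sums to 1.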

cleanlab.noise_generation.noise_matrix_is_valid(noise_matrix, py, *, verbose=False)

Given a prior py = p(true_label=k), returns true if the given noise_matrix is a learnable matrix. Learnability means that it is possible to achieve better than random performance, on average, for the amount of noise in noise_matrix.
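
The valid_noise_matrix parameter of generate_noise_matrix_from_trace states the condition p(true_label=k)p(label=k) < p(true_label=k,label=k). The sketch below checks that inequality for every class k. It is an illustration of the stated condition only: whether noise_matrix_is_valid performs exactly this test is an assumption, and the helper name `satisfies_stated_condition` is hypothetical.

```python
def satisfies_stated_condition(noise_matrix, py):
    """Hypothetical check of p(true_label=k) * p(label=k) < p(true_label=k, label=k)."""
    K = len(py)
    # joint[k_s][k_y] = P(label=k_s | true_label=k_y) * P(true_label=k_y)
    joint = [[noise_matrix[k_s][k_y] * py[k_y] for k_y in range(K)] for k_s in range(K)]
    # Marginal P(label=k_s): sum the joint over all true classes
    ps = [sum(joint[k_s]) for k_s in range(K)]
    return all(py[k] * ps[k] < joint[k][k] for k in range(K))


py = [0.5, 0.5]
mostly_clean = [[0.8, 0.1], [0.2, 0.9]]   # labels carry information about the true class
uninformative = [[0.5, 0.5], [0.5, 0.5]]  # labels are independent of the true class

result_clean = satisfies_stated_condition(mostly_clean, py)
result_uninformative = satisfies_stated_condition(uninformative, py)
print(result_clean, result_uninformative)
```

With the uninformative matrix, the joint probability equals the product of the marginals, so the strict inequality fails.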

cleanlab.noise_generation.randomly_distribute_N_balls_into_K_bins(N, K, *, max_balls_per_bin=None, min_balls_per_bin=None)

Returns a uniformly random numpy integer array of length K that sums to N.

Parameters

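N (int) – Number of balls to distribute.
K (int) – Number of bins.
max_balls_per_bin (int) – Maximum number of balls allowed in any bin.
min_balls_per_bin (int) – Minimum number of balls required in any bin.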
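
Placing N balls into K bins produces K non-negative counts that total N. As an illustration of uniform sampling over such counts, the sketch below uses the classic stars-and-bars construction with the standard library. This is a swapped-in technique rather than the library's method, the helper name `stars_and_bars` is hypothetical, and it ignores the max_balls_per_bin and min_balls_per_bin constraints.

```python
import random


def stars_and_bars(N, K, rng=random):
    """Hypothetical helper: uniform sample of K non-negative ints summing to N."""
    # Choose K - 1 bar positions among N + K - 1 slots; the remaining N slots
    # are balls, and the gaps between consecutive bars are the bin counts.
    bars = sorted(rng.sample(range(N + K - 1), K - 1))
    edges = [-1] + bars + [N + K - 1]
    return [edges[i + 1] - edges[i] - 1 for i in range(K)]


random.seed(0)  # fixed seed so repeated runs give the same draw
bins = stars_and_bars(N=10, K=4)
print(bins, sum(bins))
```

Every arrangement of bars corresponds to exactly one vector of counts, so choosing the bar positions uniformly gives a uniform draw over all valid count vectors.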