Audio Classification with SpeechBrain and Cleanlab#
In this quickstart tutorial, we will use Cleanlab to find label issues in the Spoken Digit dataset (it’s like MNIST for audio). The dataset contains 2,500 audio clips with English pronunciations of the digits 0 to 9 (these are the labels to predict from the audio).
Overview of what we’ll do in this tutorial:
Extract features from audio clips (.wav files) using a pre-trained PyTorch model from HuggingFace that was previously fit to the VoxCeleb speech dataset.
Train a cross-validated linear model using the extracted features and generate out-of-sample predicted probabilities.
Use cleanlab to identify a list of audio clips with potential label errors.
1. Install dependencies and import them#
You can use pip to install all packages required for this tutorial as follows:
!pip install speechbrain tensorflow_io tensorflow scikit-learn
!pip install cleanlab
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
# !pip install git+https://github.com/cleanlab/cleanlab.git
Let’s import some of the packages needed throughout this tutorial.
[2]:
import os
import pandas as pd
import numpy as np
import random
import tensorflow as tf
import torch
SEED = 456
def set_seed(seed=0):
"""Ensure reproducibility."""
np.random.seed(seed)
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.cuda.manual_seed_all(seed)
set_seed(SEED)
pd.options.display.max_colwidth = 500
tf.get_logger().setLevel('ERROR') # suppress TF warnings
2. Load the data#
We must first fetch the dataset. To run the command below, you’ll need to have wget installed; alternatively, you can manually navigate to the link in your browser and download the file from there.
[3]:
%%capture
!wget https://github.com/Jakobovski/free-spoken-digit-dataset/archive/v1.0.9.tar.gz
!mkdir spoken_digits
!tar -xf v1.0.9.tar.gz -C spoken_digits
The audio data are .wav files in the recordings/ folder. Note that the label for each audio clip (i.e. the digit 0 to 9) is indicated in the prefix of its file name (e.g. 6_nicolas_32.wav has the label 6). If you are instead applying cleanlab to your own dataset, its classes should be represented as integer indices 0, 1, …, num_classes - 1, as shown in the sketch below.
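For example, if your dataset instead had string class labels, you could convert them to integer indices like this (a minimal sketch; the raw_labels values here are hypothetical, not part of this dataset):

# Hypothetical string labels for illustration
raw_labels = ["dog", "cat", "dog", "bird"]
class_names = sorted(set(raw_labels))  # fix a deterministic class ordering
label_to_index = {name: i for i, name in enumerate(class_names)}
labels = np.array([label_to_index[name] for name in raw_labels])  # array([2, 1, 2, 0])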
[4]:
DATA_PATH = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/"
# Get list of .wav file names
#
# os.listdir order is nondeterministic, so for reproducibility,
# we sort first and then do a deterministic shuffle
file_names = sorted(i for i in os.listdir(DATA_PATH) if i.endswith(".wav"))
random.Random(SEED).shuffle(file_names)
file_paths = [os.path.join(DATA_PATH, name) for name in file_names]
# Check out first 3 files
file_paths[:3]
[4]:
['spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_george_26.wav',
'spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_24.wav',
'spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_6.wav']
Let’s listen to some example audio clips from the dataset. We introduce a display_example function to process a .wav file so we can listen to it in this notebook (you can skip these implementation details).
[5]:
import tensorflow_io as tfio
from pathlib import Path
from IPython import display
# Utility function for loading audio files and making sure the sample rate is correct.
@tf.function
def load_wav_16k_mono(filename):
"""Load a WAV file, convert it to a float tensor, resample to 16 kHz single-channel audio."""
file_contents = tf.io.read_file(filename)
wav, sample_rate = tf.audio.decode_wav(file_contents, desired_channels=1)
wav = tf.squeeze(wav, axis=-1)
sample_rate = tf.cast(sample_rate, dtype=tf.int64)
wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000)
return wav
def display_example(wav_file_name, audio_rate=16000):
"""Allows us to listen to any wav file and displays its given label in the dataset."""
wav_file_example = load_wav_16k_mono(wav_file_name)
label = Path(wav_file_name).parts[-1].split("_")[0]
print(f"Given label for this example: {label}")
display.display(display.Audio(wav_file_example, rate=audio_rate))
Click the play button below to listen to this example .wav file. Feel free to change the wav_file_name_example variable to listen to other audio clips in the dataset.
[6]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_jackson_43.wav" # change this to hear other examples
display_example(wav_file_name_example)
Given label for this example: 7
3. Use pre-trained SpeechBrain model to featurize audio#
The SpeechBrain package offers many PyTorch neural networks that have been pretrained for speech recognition tasks. Here we instantiate an audio feature extractor using SpeechBrain’s EncoderClassifier. We’ll use the “spkrec-xvect-voxceleb” network, which has been pre-trained on the VoxCeleb speech dataset.
[7]:
%%capture
from speechbrain.pretrained import EncoderClassifier
feature_extractor = EncoderClassifier.from_hparams(
"speechbrain/spkrec-xvect-voxceleb",
# run_opts={"device":"cuda"} # Uncomment this to run on GPU if you have one (optional)
)
Next, we run the audio clips through the pre-trained model to extract vector features (aka embeddings).
[8]:
# Create dataframe with .wav file names
df = pd.DataFrame(file_paths, columns=["wav_audio_file_path"])
df["label"] = df.wav_audio_file_path.map(lambda x: int(Path(x).parts[-1].split("_")[0]))
# Note: Classes must be represented as integer indices 0, 1, ..., num_classes - 1
df.head(3)
[8]:
|   | wav_audio_file_path | label |
|---|---|---|
| 0 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_george_26.wav | 7 |
| 1 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_24.wav | 0 |
| 2 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_6.wav | 0 |
[9]:
import torchaudio
def extract_audio_embeddings(model, wav_audio_file_path: str) -> torch.Tensor:
    """Feature extractor that embeds an audio clip into a vector."""
    signal, fs = torchaudio.load(wav_audio_file_path)  # load audio signal as a tensor
    embeddings = model.encode_batch(signal)  # pass tensor through pretrained neural net to extract representation
    return embeddings
[10]:
# Extract audio embeddings
embeddings_list = []
for file_name in df.wav_audio_file_path:  # for each .wav file
    embeddings = extract_audio_embeddings(feature_extractor, file_name)
    embeddings_list.append(embeddings.cpu().numpy())

embeddings_array = np.squeeze(np.array(embeddings_list))
Now we have our features in a 2D numpy array, where each row corresponds to an audio clip. Each audio clip is represented as a 512-dimensional feature vector!
[11]:
print(embeddings_array)
print("Shape of array: ", embeddings_array.shape)
[[-14.196308 7.319454 12.47899 ... 2.289091 2.817013
-10.892642 ]
[-24.898056 5.2561927 12.559636 ... -3.5597174 9.6206665
-10.285249 ]
[-21.709625 7.5033684 7.913807 ... -6.819826 3.1831462
-17.208761 ]
...
[-16.08425 6.321053 12.005463 ... 1.216175 9.478231
-10.682177 ]
[-15.053815 5.2424726 1.091422 ... -0.78335106 9.039538
-23.569181 ]
[-19.76109 1.1258249 16.75323 ... 3.3508852 11.598273
-16.237118 ]]
Shape of array: (2500, 512)
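Since featurization is the slowest step of this tutorial, you may want to cache the extracted embeddings to disk so they can be reused in later sessions. A minimal sketch with numpy (the file name embeddings.npy is just an illustrative choice):

np.save("embeddings.npy", embeddings_array)  # cache embeddings to disk
# embeddings_array = np.load("embeddings.npy")  # reload in a later session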
4. Fit linear model and compute out-of-sample predicted probabilities#
A typical way to leverage a pretrained network for a new classification task is to add a linear output layer and fine-tune the network parameters on the new data. However, this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and train only the output layer, without having to rely on GPU(s). Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted network embeddings.
To identify label issues, cleanlab requires a probabilistic prediction from your model for every datapoint under consideration. However, these predictions will be overfit (and thus unreliable) for datapoints the model was previously trained on. cleanlab is intended to be used only with out-of-sample predicted probabilities, i.e. predictions for datapoints that were held out from the model during training.
K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset, by training K copies of our model on different data subsets and using each copy to predict on the subset of data it did not see during training. An additional benefit of cross-validation is that it provides more reliable evaluation of our model than a single training/validation split. We can obtain cross-validated out-of-sample predicted probabilities from any classifier via the cross_val_predict wrapper provided in scikit-learn.
[12]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
model = LogisticRegression(C=0.01, max_iter=1000, tol=1e-1, random_state=SEED)
num_crossval_folds = 5 # can decrease this value to reduce runtime, or increase it to get better results
cv_pred_probs = cross_val_predict(
estimator=model, X=embeddings_array, y=df.label.values, cv=num_crossval_folds, method="predict_proba"
)
For each audio clip, the corresponding predicted probabilities in cv_pred_probs are produced by a copy of our LogisticRegression model that was never trained on that audio clip, which is why we call these predictions out-of-sample.
[13]:
from sklearn.metrics import accuracy_score
predicted_labels = cv_pred_probs.argmax(axis=1)
cv_accuracy = accuracy_score(df.label.values, predicted_labels)
print(f"Cross-validated estimate of accuracy on held-out data: {cv_accuracy}")
Cross-validated estimate of accuracy on held-out data: 0.9772
5. Use cleanlab to find label issues#
Based on the given labels and out-of-sample predicted probabilities, cleanlab can quickly help us identify label issues in our dataset. Here we request that the indices of the identified label issues be sorted by cleanlab’s self-confidence score, which measures the quality of each given label via the predicted probability assigned to it by our model.
[14]:
import cleanlab
label_issues_indices = cleanlab.filter.find_label_issues(
labels=df.label.values,
pred_probs=cv_pred_probs,
return_indices_ranked_by="self_confidence", # ranks the label issues
)
print(label_issues_indices)
[1946 469 516 1871 1955 2132]
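To build intuition for this ranking: the self-confidence score of each example is simply the out-of-sample predicted probability that our model assigned to its given label, so the flagged examples are those whose labels the model most strongly disagrees with. A quick manual sanity check:

# Self-confidence = predicted probability of each example's given label
self_confidence = cv_pred_probs[np.arange(len(df)), df.label.values]
print(self_confidence[label_issues_indices])  # these values should all be low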
The examples flagged by cleanlab are those worth inspecting more closely.
[15]:
df.iloc[label_issues_indices]
[15]:
|   | wav_audio_file_path | label |
|---|---|---|
| 1946 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_14.wav | 6 |
| 469 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_35.wav | 6 |
| 516 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_36.wav | 6 |
| 1871 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_theo_27.wav | 6 |
| 1955 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/4_george_31.wav | 4 |
| 2132 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_nicolas_8.wav | 6 |
Let’s listen to some of the audio clips that were flagged in this list as potential label issues.
In this example, the given label is 6 but it sounds like 8.
[16]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_14.wav"
display_example(wav_file_name_example)
Given label for this example: 6
In the three examples below, the given label is 6 but they sound quite ambiguous.
[17]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_36.wav"
display_example(wav_file_name_example)
Given label for this example: 6
[18]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_35.wav"
display_example(wav_file_name_example)
Given label for this example: 6
[19]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_nicolas_8.wav"
display_example(wav_file_name_example)
Given label for this example: 6
You can see that even widely used datasets like Spoken Digit contain problematic labels. Never blindly trust your data! You should always check it for potential issues, many of which can be easily identified by cleanlab.
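As a next step, you could also generate an overall summary of dataset quality from the same inputs, e.g. via cleanlab.dataset.health_summary (a sketch; see the cleanlab documentation for details on the report it prints):

from cleanlab.dataset import health_summary

# Print an overall health report for the dataset based on the same
# labels and out-of-sample predicted probabilities used above
health_summary(df.label.values, cv_pred_probs)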