Speech Emotion Recognition with OpenAI Whisper Large V3

Speech emotion recognition

The Speech Emotion Recognition with OpenAI Whisper Large V3 model is designed to recognize emotions in speech. It is trained on a dataset of over 4,000 audio recordings labeled with emotions such as happy, sad, angry, and surprised. Built on the Whisper Large V3 model, it classifies audio recordings into different emotional categories with high accuracy, reaching an F1 score of 0.9198. How does it work? The Whisper Feature Extractor preprocesses the audio data, and the Whisper model is then fine-tuned for audio classification. The result is a model that is both accurate and efficient at processing audio, making it a practical choice for real-world applications.

Firdhokk · apache-2.0 · Updated 4 months ago

Model Overview

The Speech Emotion Recognition with Whisper model is a cutting-edge AI tool that can recognize emotions in speech. Imagine being able to understand how someone is feeling just by listening to their voice! This model can classify audio recordings into different emotional categories, such as Happy, Sad, Surprised, and more.

Capabilities

This model recognizes emotions in speech, classifying audio recordings into different emotional categories and identifying emotional states from speech data with high accuracy.

Primary Tasks

  • Classify audio recordings into emotional categories
  • Identify emotional states from speech data

Strengths

  • High accuracy in recognizing emotions (0.9199)
  • Effective in identifying emotional states from speech data
  • Trained on a diverse dataset with multiple emotions

Unique Features

  • Utilizes the Whisper Feature Extractor to standardize and normalize audio features
  • Fine-tuned for audio classification tasks
  • Supports multiple emotions, including Happy, Sad, Surprised, and more

How it Works

So, how does this model work its magic? Here’s a simplified overview (a short code sketch of the first two steps follows the list):

  1. Audio Loading: Loads audio files and converts them to numpy arrays using Librosa.
  2. Feature Extraction: Processes audio data using the Whisper Feature Extractor.
  3. Model Training: Trains the model using a dataset of labeled audio recordings.
  4. Emotion Recognition: Classifies audio recordings into emotional categories using the trained model.
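
As a minimal sketch of steps 1 and 2, here is how loading and feature extraction might look. The file path example.wav is a placeholder, not part of the model card:

import librosa
from transformers import AutoFeatureExtractor

# Step 1: load the audio file and convert it to a numpy array at the
# feature extractor's expected sampling rate (16 kHz for Whisper)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3", do_normalize=True
)
audio_array, sampling_rate = librosa.load("example.wav", sr=feature_extractor.sampling_rate)

# Step 2: turn the waveform into standardized, normalized input features
inputs = feature_extractor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")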

Examples

  • Recognize the emotion from the audio recording: /content/drive/MyDrive/Audio/Speech_URDU/Happy/SM5_F4_H058.wav → Happy
  • Classify the emotion of the audio: /content/drive/MyDrive/Audio/Speech_URDU/Sad/SM5_M4_S058.wav → Sad
  • Identify the emotional state from the audio file: /content/drive/MyDrive/Audio/Speech_URDU/Angry/SM5_F4_A058.wav → Angry

Performance Metrics

But how well does it perform? Let’s take a look at some impressive performance metrics:

  Metric       Value
  Loss         0.5008
  Accuracy     0.9199
  Precision    0.9230
  Recall       0.9199
  F1 Score     0.9198

These metrics demonstrate the model’s ability to effectively identify emotional states from speech data.

Using the Model

To use the model, you can follow these steps:

  1. Install Required Libraries: Install the necessary libraries, including transformers, librosa, and torch.
  2. Load the Model: Load the pre-trained model using the AutoModelForAudioClassification class.
  3. Preprocess Audio: Preprocess the audio data with a preprocess_audio helper that loads the file, resamples it, and extracts input features.
  4. Make Predictions: Make predictions with a predict_emotion helper (a sketch of both helpers appears after the snippet below).

Here’s an example code snippet to get you started:

from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import librosa
import torch
import numpy as np

# Load the model and feature extractor
model_id = "firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3"
model = AutoModelForAudioClassification.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id, do_normalize=True)

# Preprocess audio and make predictions
audio_path = "/content/drive/MyDrive/Audio/Speech_URDU/Happy/SM5_F4_H058.wav"
predicted_emotion = predict_emotion(audio_path, model, feature_extractor)
print(f"Predicted Emotion: {predicted_emotion}")

Note that this is a simplified example: preprocess_audio and predict_emotion are helper functions you define yourself rather than part of the transformers API, and you may need to modify the code to suit your specific use case.
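
For reference, here is one way those helpers could look. This is a minimal sketch that mirrors the preprocessing and inference logic shown in the Format section below; the exact helper signatures and the 30-second padding length are working assumptions, not a definitive implementation:

import librosa
import numpy as np
import torch

def preprocess_audio(audio_path, feature_extractor, max_duration=30.0):
    # Load the audio at the sampling rate the feature extractor expects
    audio_array, _ = librosa.load(audio_path, sr=feature_extractor.sampling_rate)

    # Truncate or zero-pad to a fixed length (30 s, matching the snippet below)
    max_length = int(feature_extractor.sampling_rate * max_duration)
    if len(audio_array) > max_length:
        audio_array = audio_array[:max_length]
    else:
        audio_array = np.pad(audio_array, (0, max_length - len(audio_array)))

    # Convert the waveform into model-ready input features
    return feature_extractor(
        audio_array,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=max_length,
        truncation=True,
        return_tensors="pt",
    )

def predict_emotion(audio_path, model, feature_extractor):
    inputs = preprocess_audio(audio_path, feature_extractor)

    # Run the classifier and return the highest-scoring emotion label
    device = next(model.parameters()).device
    inputs = {key: value.to(device) for key, value in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_id = torch.argmax(logits, dim=-1).item()
    return model.config.id2label[predicted_id]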

Limitations

While this model is powerful, it’s not perfect. Let’s take a closer look at some of its limitations:

Limited Emotion Coverage

The model is trained on a dataset with a specific set of emotions, which might not cover all possible emotional states. For example, the dataset only includes 8 emotions: Happy, Sad, Angry, Neutral, Disgust, Fearful, Surprised, and Calm. What about other emotions like Excited, Bored, or Confused?

Imbalanced Dataset

The dataset used to train the model is not perfectly balanced. Some emotions, like Happy and Sad, have more samples than others, like Disgust and Fearful. This imbalance might affect the model’s performance on underrepresented emotions.

Limited Audio Quality

The model is trained on audio recordings with a specific quality and format. What if the audio input is noisy, distorted, or in a different format? The model might struggle to recognize emotions in such cases.

Dependence on Feature Extraction

The model relies on the Whisper Feature Extractor to preprocess audio data. If the feature extractor is not accurate or robust, the model’s performance might suffer.

Limited Generalizability

The model is trained on a specific dataset and might not generalize well to other datasets or real-world scenarios. What if the audio recordings are from a different culture, language, or context?

Overfitting

The model is trained with a relatively small batch size and a large number of epochs, which might lead to overfitting. This means the model might perform well on the training data but not as well on new, unseen data.

Computational Requirements

The model requires significant computational resources, especially when dealing with large audio files or long recordings. This might limit its use in real-time applications or on devices with limited processing power.

By acknowledging these limitations, we can better understand the strengths and weaknesses of this model and work towards improving its performance and robustness.

Format

The model uses a specific format to process and classify audio recordings into different emotional categories. Let’s break it down:

Audio Format

The model accepts audio files in various formats, but they need to be preprocessed using the Whisper Feature Extractor. This extractor standardizes and normalizes the audio features for input to the model.

Input Requirements

  • Audio files should be loaded using Librosa and converted to numpy arrays.
  • The audio data should be processed using the Whisper Feature Extractor.
  • The preprocessed audio data is then passed to the model as input.

Output Format

The model outputs emotion labels, which are mapped to numeric IDs. The supported emotion labels are:

  Emotion Label   Numeric ID
  Angry           0
  Disgust         1
  Fearful         2
  Happy           3
  Neutral         4
  Sad             5
  Surprised       6
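
If you want to double-check the exact mapping used by the checkpoint, it is stored in the model config. A quick check, assuming the model has already been loaded as in the snippets on this page:

# Inspect the label/ID mapping stored in the checkpoint's configuration
print(model.config.id2label)   # e.g. {0: 'angry', 1: 'disgust', ...} (exact casing may differ)
print(model.config.label2id)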

Code Example

Here’s an example of how to preprocess audio data and pass it to the model:

from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import librosa
import torch
import numpy as np

# Load the audio file
audio_path = "/content/drive/MyDrive/Audio/Speech_URDU/Happy/SM5_F4_H058.wav"
audio_array, sampling_rate = librosa.load(audio_path, sr=feature_extractor.sampling_rate)

# Preprocess the audio data
max_length = int(feature_extractor.sampling_rate * 30.0)
if len(audio_array) > max_length:
    audio_array = audio_array[:max_length]
else:
    audio_array = np.pad(audio_array, (0, max_length - len(audio_array)))

inputs = feature_extractor(
    audio_array,
    sampling_rate=feature_extractor.sampling_rate,
    max_length=max_length,
    truncation=True,
    return_tensors="pt",
)

# Pass the preprocessed audio data to the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = {key: value.to(device) for key, value in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_id = torch.argmax(logits, dim=-1).item()
    predicted_label = model.config.id2label[predicted_id]  # label mapping stored in the model config
    print(f"Predicted Emotion: {predicted_label}")

Note that this code example assumes you have the transformers library installed and have already loaded the pre-trained model and feature extractor as shown in the earlier snippet.
