Use Case — Speech Emotion Recognition System

Apata oyinlade
7 min read · Jan 23, 2021

In my previous article, I explained the steps of the data science methodology, and as promised, we’ll be looking into a real use case today.

Let’s get to it!

You’re a data scientist — a very happy one at that — going to the cafeteria for lunch. When you get there, you notice Sarah, the head of the marketing department, looking really gloomy. You want to look away and mind your business, but she brings cakes to work every day for you, so you deem it fit to approach her.

You: “Hey Sarah, what’s wrong?”

Sarah: “Just finished a meeting with the boss. We keep on losing customers! Apparently, they are displeased about something, but I don’t know what it is! If only there were a way to tell how they feel about our services.”

Business understanding

You: “What if I could help with something?”

Sarah: (Wide-eyed) “Yes please! What exactly can be done?”

You: (Hands on your chin) “Do you have a section where you ask your customers for reviews?”

Sarah: “Mmm, yeah, but the kind of customers we have usually feel too lazy to type.”

You: “You need something like…mmm, haha! An SER!”

Sarah: “A what?”

Analytics approach

You: (thinking out loud and ignoring Sarah) “I could build an SER (Speech Emotion Recognition) system. For each of our services, we’d ask customers to say something. We’d build a classification model! That way, we’d be able to recognize the customers’ emotions and analyze them.”

In recognizing the customers’ emotions, you’d want to know whether they are ‘happy’, ‘sad’, ‘angry’, ‘disgusted’, and so on. That makes this a classification problem. Any problem where you have to determine which class or category something belongs to is a classification problem.

Data requirements and collection

Sarah: “Yeah, let’s do that! I’m kind of lost though. How do we build this classification model you speak of?”

You look at her with your hands still on your chin.

You: “What I’d need to train the model is recordings of a looooot of people showing different emotions.”

Sarah: “Oh” (frowns slightly) “The boss needs results soon. Would there be time to collect so many recordings?”

You: “Not to worry, Sarah. There are lots of datasets already available. I’d be using the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset.”

Click on ‘dataset’ to download it if you want to. The entire dataset is 24.9GB of recordings from 24 actors, but the link takes you to a version with a lowered sample rate on all the files.

I’d be sharing my code with you, and don’t worry if you don’t understand it all. Why?

  1. Beside many of the lines of code, I put comments to help you see what the code does.
  2. In coming articles, I’d be explaining some of the concepts better, okurrr, fam?

If you want to go ahead and try the code, here’s how you get started.

Install Anaconda here (choose your operating system). When you’re done with that, go to the Anaconda prompt and type:

```
conda install -c conda-forge librosa
```

Installed? Great. Now type your code in JupyterLab. If you need help finding your way around Anaconda, you’ll see a short tutorial video after installation.

Data understanding and preparation

Normally, in data preparation, you’re meant to clean the data, but the traditional methods for cleaning data usually apply to structured data. Our data is unstructured, i.e., it’s not in rows and columns. This isn’t the standard definition, but it’s what I’ve kept in my heart right from day 1.

A standard definition goes like this:

Structured data is data that is highly organized and formatted in a way that makes it easily searchable in relational databases and spreadsheets. Unstructured data is the opposite of that — it has no predefined format, which makes it difficult to collect, process, and analyze. Examples of unstructured data are videos and audio recordings.

Okay, that’s that.

Importing the necessary libraries

Librosa is a Python library that extracts features from speech. You’ll soon see some of the code for extracting these features. Because I’m a firm believer that if you understand exactly what a line of code does, you’ll become an expert in using it, let’s do a little background study.

The thing is this — your computer doesn’t ‘listen’ to sounds the way you do. Right now, as I type this, I’m playing a song I like, but I can also hear cars passing.

If a computer is trying to record my song, the sound of the cars passing is an unwanted signal; that is, it’s not the signal we’re tracking, so it’s noise.

The stage of removing noise is called preprocessing, and thankfully, RAVDESS has done this for us. What we have to do ourselves is the feature extraction.

There are so many features in speech, but we don’t need all of them, just a few: MFCC (Mel Frequency Cepstral Coefficients), Chroma, and Mel.

MFCCs describe the short-term power spectrum of a sound, and the calculation is more boring than the name.

Chroma refers to the 12 different pitch classes.

In my code, I also used ‘Mel,’ ‘Contrast’ and ‘Tonnetz.’ I know, I know, all these terms aren’t so exciting, but as a data scientist, you have to do research for every project you work on.

Next, we want to identify the emotions in the dataset, and we represent them using a dictionary. A dictionary is the code in the curly {} brackets.
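For RAVDESS, the emotion is encoded in the third field of each file name, so the dictionary maps those codes to emotion names (the codes below are part of the dataset’s published naming convention):

```python
# RAVDESS file names look like 03-01-05-01-02-01-12.wav;
# the third field ("05" here) is the emotion code.
emotions = {
    '01': 'neutral',
    '02': 'calm',
    '03': 'happy',
    '04': 'sad',
    '05': 'angry',
    '06': 'fearful',
    '07': 'disgust',
    '08': 'surprised',
}
```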

Next, we create a function ‘load_data’, create empty lists ‘X’ and ‘y’, and store the features in X and the emotions in y.

It’s just one of those things.

What this simply does is load the dataset, use 75% of the data to train the model, and the remaining 25% to test it.

Note! In the line “for line in glob.glob(“data/Actor_*/*.wav”)”, you must have a folder called ‘data’ in the same directory where your code is saved.

Next, we print out some details.

Modeling

In machine learning modeling, we need to use things called ‘parameters’ and ‘hyperparameters.’ I’d explain these better in the modeling article.

We set our hyperparameters (which were determined by a grid search), and then train the model.
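A sketch of that step: the hyperparameter values below are illustrative stand-ins for whatever a grid search might pick, and the random training data is only there so the snippet runs on its own — with the real dataset you’d fit on the 75% split from `load_data()`:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Illustrative hyperparameter values, not the article's grid-search result
model = MLPClassifier(
    alpha=0.01,                  # L2 regularization strength
    batch_size=256,              # samples per gradient update
    hidden_layer_sizes=(300,),   # one hidden layer of 300 neurons
    learning_rate='adaptive',    # lower the rate when the loss plateaus
    max_iter=500,                # cap on training epochs
)

# With the real data: model.fit(x_train, y_train)
# Random stand-in data (40 fake clips x 193 features) shows the call shape:
rng = np.random.default_rng(42)
fake_features = rng.normal(size=(40, 193))
fake_labels = ['happy'] * 20 + ['sad'] * 20
model.fit(fake_features, fake_labels)
```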

Evaluation

Great! We’ve trained our model! Let’s see how well it performs. (This is why we imported accuracy_score from sklearn.metrics.)
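A sketch of the evaluation step, with small, well-separated stand-in data in place of the real split so the snippet runs on its own; with the trained model you’d simply predict on `x_test` and compare against `y_test`:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Stand-ins for the trained model and the 25% test split from earlier
rng = np.random.default_rng(0)
x_train = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(5, 1, (20, 5))])
y_train = ['happy'] * 20 + ['sad'] * 20
x_test = np.vstack([rng.normal(0, 1, (5, 5)), rng.normal(5, 1, (5, 5))])
y_test = ['happy'] * 5 + ['sad'] * 5
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                      random_state=0).fit(x_train, y_train)

# The actual evaluation: predict the held-out clips, compare to the truth
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
```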

On the real test split, the accuracy comes out at about 69%, meaning the model is correct roughly 69% of the time.

Now, let’s save our model.
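Saving with pickle might look like this; the file name is my choice, and the unfitted classifier below just stands in for the trained `model`:

```python
import pickle
from sklearn.neural_network import MLPClassifier

model = MLPClassifier()  # stand-in for the trained classifier from above

# Save the model to disk...
with open("ser_model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and load it back later (e.g. when the Flask app starts up)
with open("ser_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
```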

Deployment

Now you’re done training your model! You’re so excited and start typing to Sarah on Slack that you’re done. You want to attach a file to show her what you’ve done so she can see for herself, but you pause and think.

I can’t possibly send her a Jupyter Notebook! She wouldn’t understand it.

Here comes Flask — the tool you’d use to deploy the model. Working with Flask deserves an entire article dedicated to it.
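Still, to give a feel for it, here is a minimal sketch. The route name, the file handling, and the stubbed-out prediction function are all my assumptions; the real app would load the pickled model and reuse the same feature extraction as training:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_emotion(wav_path):
    """Stub: the real version loads the pickled model, extracts the same
    features used in training, and returns model.predict(...)[0]."""
    return "happy"

@app.route("/predict", methods=["POST"])
def predict():
    audio = request.files["audio"]   # the customer's recorded clip
    audio.save("upload.wav")
    return jsonify({"emotion": predict_emotion("upload.wav")})
```

Run it with `flask run` and POST a .wav file to /predict; Sarah only ever sees a simple endpoint, never the notebook.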

Feedback

Sarah gets the model and tries it out with 4 of her friends. The 5 of them each display 4 different emotions (20 recordings altogether), and out of those 20, the model only got 10 right.

That’s not so good now, is it? You go back and retrain the model. Maybe add more samples to improve the accuracy? You do this over and over until you, Sarah, and the boss are satisfied.

You did great, even if you didn’t run the code. I hope you got a better understanding of the data science methodology, and maybe some extra knowledge of a few terms. Thanks for reading!
