One Thing You Can’t Skip — Data Science Methodology

Apata oyinlade
Published in The Startup
8 min read · Jan 15, 2021

Hadoop, JupyterLab, Tableau: pretty sweet names for amazing data science tools, and you just can’t wait to start doing the magic!

Amazing warrior spirit, but let’s probably keep all that energy for later.

You must know the data science methodology before you conquer the ‘inconquerables.’ You’ll see why.

Let’s get straight to it, and do pardon my rather petty case scenario (or in data science terms, use case).

Use case: A new girl, Lizzy, moves into your compound, and she’s so rich. You heard she gives out cheques to her friends every day, and you honestly want to be her friend, but it’s just not that simple.

(This is the first part of a 2-article series, so stay tuned for a more realistic use case)

The data science methodology

Business understanding

Has this ever happened to you before? In an examination hall, you spend so much time answering a particular question, only to discover that you answered the wrong question? It has happened to me a couple of times and I felt sad, brokenhearted, crestfallen and other synonyms of sad I can’t remember right now.

Business understanding is the stage where you want to make sure you’re not wasting energy and time solving the wrong problem.

Q — What problem are you trying to solve?

From our slightly petty (or a lot petty) case study, the answer is:

A — Trying to get Lizzy to be your friend

You have to ask your client, or yourself (if you’re just trying to do something cool), a looooooot of questions so as to really understand what the problem is.

Analytic approach

It’s like this: there are different kinds of people, so you can’t possibly use the same approach to get all of them to be friends with you.

Gifts? A song? Maybe even get down on one knee and ask her to choose this day to be thine friend?

It’s the same thing here: different kinds of problems, different kinds of methods to address them.

Soon enough, you’ll get used to them and know which one to use. For now, let’s just do a little background study (you’ll get it, I’m so sure).

You need to know these:

1. Descriptive analysis: Here, you’re trying to figure out ‘what has happened’ (or what has gone wrong). Say you go to a doctor because your head aches. You tell him, ‘Hey doc, my head aches.’ This is like the problem statement (remember the first question?), but he still has to use his tools to describe exactly what’s wrong. Maybe something’s wrong with the arteries or so. I’m clueless here.

2. Diagnostic analysis: ‘Hey doc, why did this happen?’

Here, the doctor (data scientist) uses his tools (statistical analysis) to investigate why it happened.

3. Predictive analysis: ‘You have a disastrous, funny-sounding disease, and if it isn’t treated now, it’s going to escalate to a more disastrous, funny-sounding disease,’ the doctor says.

This is what the data scientist does at this stage: answering the question ‘What will happen next if this isn’t fixed?’

4. Prescriptive analysis: ‘Get XYZ drug, and come back next week for brain surgery.’

This prescription is too intense 😂 but you know what I mean. At this stage, the data scientist decides what’s best for solving the problem.

I took my sweet time to give you the background because, just as the doctor needed to do a series of tests before the prescription, you’ll need to carry out several analyses before you can answer this:

Q — How can you use data to answer the question?

Or in other words, what medical tool (data science method) is best for treating the patient?

To get here, the doctor had to ask a lot of questions. So, as I’ve heard, a good data scientist is someone who is really inquisitive.

There are different methods in data science, like k-means clustering and logistic regression, and with time, we’ll figure out when to use each of them. For now, let’s focus on Lizzy and her money.

The most sensible questions to ask are:

How did the others get to be friends with her? Does she have preferences in gender? What time is the best time to meet her if you want to be friends with her?

Data requirements

So now we know we are to get information about what she likes, but from who?

That’s obvious — her present friends, and the analytic approach really helped in laying that part bare for us.

So, the analytic approach determines the data requirements. Or, better put, once you figure out your analytic approach, you’ll easily know what kind of data you need.

Q — What kind of data would you need to solve the problem?

A — A comprehensive list of the names of her friends, their gender and a detailed explanation of how they became friends with her
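
If you like seeing things written down, that answer can be turned into a tiny record type. This is only a sketch: the class name `FriendRecord`, its field names and the example row are all made up for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class FriendRecord:
    """One row of the data we decided we need (hypothetical schema)."""
    name: str
    gender: str
    how_they_became_friends: str


# Made-up example row, purely for illustration.
example = FriendRecord(
    name="Zoe",
    gender="female",
    how_they_became_friends="Cried during a movie Lizzy was also watching",
)

# The column names we would expect in the collected data.
required_columns = [f.name for f in fields(FriendRecord)]
print(required_columns)
```

Writing the requirement down like this makes it harder to come back from data collection with a column missing.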

Data collection

You now have to figure out how to get the data and the structure/form of the data.

Where do her friends stay? What language do they speak?

In data science it’s more like:

‘Where do I get the data from? Is it on the web? Am I to collect it manually?’ ‘Is the data structured or unstructured?’

Data understanding

You’ve put in a lot of effort — I see that! Maybe Lizzy isn’t even worth it at all, but she’s rich so you man up and continue.

It’s time to ‘figure out’ the data. First, you answer the question:

Q — Is this data going to solve the problem?

If the answer is yes, you need to carry out EDA (Exploratory Data Analysis), where you use graphs, histograms and other visualization tools to find patterns or relationships in the data.
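
Here is a rough idea of what a first pass at EDA could look like, using nothing but Python's standard library. The ten rows are invented for this example (real EDA would usually use a plotting library), and `text_histogram` is a hypothetical helper, not part of any package.

```python
from collections import Counter

# Made-up records: (name, gender, what the friend says Lizzy likes).
friends = [
    ("Zoe", "female", "crying"),
    ("Zander", "male", "crying"),
    ("Zayne", "male", "gifts"),
    ("Zara", "female", "crying"),
    ("Zack", "male", "crying"),
    ("Zelda", "female", "singing"),
    ("Zion", "male", "crying"),
    ("Zuri", "female", "crying"),
    ("Zeke", "male", "gifts"),
    ("Zadie", "female", "crying"),
]


def text_histogram(counts):
    """Return one line per category, with a bar of '#' as long as its count."""
    width = max(len(label) for label in counts)
    return [
        f"{label.ljust(width)} {'#' * n} {n}"
        for label, n in counts.most_common()
    ]


reasons = Counter(reason for _, _, reason in friends)
genders = Counter(gender for _, gender, _ in friends)

print("\n".join(text_histogram(reasons)))
print("\n".join(text_histogram(genders)))
```

Even a crude chart like this shows which answer dominates and whether gender looks balanced, which is the kind of pattern EDA is meant to surface.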

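You also need to ask yourself:

Do we need more data? If something useful is missing, maybe the age of her friends, you go back and collect it.

A closely related step is Feature Engineering. What you do in feature engineering is create new, more useful columns from the data you already have, so that it answers the question better.

Maybe the first letter of each friend’s name?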
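
One common form of feature engineering is deriving a new column from ones you already have. Here is a minimal sketch, assuming each friend is stored as a dictionary; the function `add_features` and the column names `first_letter` and `starts_with_z` are made up for this example.

```python
# Made-up rows for illustration.
friends = [
    {"name": "Zoe", "gender": "female"},
    {"name": "zander", "gender": "male"},
    {"name": "Ada", "gender": "female"},
]


def add_features(record):
    """Return a copy of the record with two derived columns added."""
    enriched = dict(record)
    enriched["first_letter"] = record["name"][0].upper()
    enriched["starts_with_z"] = enriched["first_letter"] == "Z"
    return enriched


enriched_friends = [add_features(friend) for friend in friends]
for row in enriched_friends:
    print(row)
```

Nothing new was collected here. The extra columns come entirely from the `name` column that was already in the data.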

Data preparation

Here, you’d do a lot of data cleaning, i.e. you check for missing values, remove duplicates, and check that everything is in the right format. Is there a number in the name column? Is there a name in the date column?

Why? Imagine that 5 out of 10 people put a number instead of their gender in the ‘gender’ column. Half of that column would be useless, and it would skew your answer at the end.

It’s quite boring and takes a huge chunk of time in the whole data science process (pretty much like cleaning our rooms, no?), but you just have to clean your data.
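
To make the cleaning checks above concrete, here is a small sketch in plain Python. The rows, the `VALID_GENDERS` set and the `clean` function are all assumptions made for this example, and dropping bad rows (rather than trying to repair them) is just one possible choice.

```python
# Made-up raw rows showing the kinds of problems described above.
raw_rows = [
    {"name": "Zoe", "gender": "female"},
    {"name": "Zoe", "gender": "female"},    # exact duplicate
    {"name": "Zander", "gender": "7"},      # a number in the gender column
    {"name": "", "gender": "male"},         # missing name
    {"name": " Zayne ", "gender": "Male"},  # stray spaces, inconsistent case
    {"name": "Z4ra", "gender": "female"},   # a number in the name column
]

VALID_GENDERS = {"female", "male", "other"}  # assumed set of accepted values


def clean(rows):
    """Normalise text, then drop rows that are invalid or duplicated."""
    cleaned, seen = [], set()
    for row in rows:
        name = (row.get("name") or "").strip()
        gender = (row.get("gender") or "").strip().lower()
        if not name or any(ch.isdigit() for ch in name):
            continue  # missing name, or a number where a name should be
        if gender not in VALID_GENDERS:
            continue  # wrong format in the gender column
        key = (name.lower(), gender)
        if key in seen:
            continue  # duplicate of a row we already kept
        seen.add(key)
        cleaned.append({"name": name, "gender": gender})
    return cleaned


cleaned_rows = clean(raw_rows)
print(cleaned_rows)
```

Only two of the six rows survive, which is a good reminder of how much raw data can quietly be unusable.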

Modeling

In another article, I’d talk about modeling, but let’s just scratch the surface for now.

In machine learning, a model is the thing you train on data so that it can make predictions. Your model learns from your training data set (in this case, the list of all Lizzy’s friends and their information), and so we can predict to a large extent what Lizzy wants in a friend.

Keep these in mind till we talk about them in detail.

Supervised machine learning: your training data contains both the inputs and the known outputs.

Unsupervised machine learning: you have no idea what the output is, so the machine has to figure out the patterns on its own.

Reinforcement learning: the machine works by trial and error, and learns from past experience.

Moving on: we’ve collected and analyzed the data, so what can we predict from it?

7 out of 10 of her friends say that what she likes in a friend is their ability to cry.

You probably wouldn’t have guessed that, and that’s what data science does. It provides you with insights — some you may never have thought of.

She also prefers people whose names start with Z. You noticed this from the data because all of them are Zoe, Zander, Zayne, Z….

Interesting.
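
For the curious, here is a toy version of supervised learning on this use case. It is a deliberately simple lookup-table "model", not a real machine learning library, and every example row is invented: each input is a pair (cries easily, name starts with Z) and each output says whether that person became Lizzy's friend.

```python
from collections import defaultdict

# Made-up labelled examples: (cries_easily, name_starts_with_z) -> became a friend?
training_data = [
    ((True, True), True),
    ((True, True), True),
    ((True, False), False),
    ((False, True), True),
    ((False, True), False),
    ((False, True), False),
    ((False, False), False),
    ((False, False), False),
]


def train(examples):
    """Learn, for each feature combination, the share of examples that were friends."""
    counts = defaultdict(lambda: [0, 0])  # features -> [friends, total]
    for features, is_friend in examples:
        counts[features][0] += int(is_friend)
        counts[features][1] += 1
    return {
        features: friends / total
        for features, (friends, total) in counts.items()
    }


def predict(model, features, threshold=0.5):
    """Predict friendship; combinations never seen in training default to False."""
    return model.get(features, 0.0) >= threshold


model = train(training_data)
print(predict(model, (True, True)))    # cries and is called Zebra
print(predict(model, (False, False)))  # neither
```

Real models generalise far better than a lookup table, but the shape is the same: learn from labelled examples, then predict for someone new.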

Evaluation

If, at the end of your modeling, your answer is her date of birth and not what she likes, you’ve answered the wrong question, just like in that examination hall.

Q — Does the model answer the initial question, or is there a need for adjustment?
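
One common way to make this check concrete is to test the model on examples it has not seen and count how often it is right. The sketch below assumes a hand-written rule standing in for a trained model, and the five test rows are made up.

```python
def predict(cries_easily, name_starts_with_z):
    """Stand-in for a trained model: a hypothetical hand-written rule."""
    return cries_easily and name_starts_with_z


# Made-up held-out examples: (cries_easily, name_starts_with_z) -> became a friend?
test_set = [
    ((True, True), True),
    ((True, False), False),
    ((False, True), True),  # the rule gets this one wrong
    ((False, False), False),
    ((True, True), True),
]


def accuracy(examples):
    """Fraction of examples where the prediction matches the known answer."""
    correct = sum(predict(*features) == label for features, label in examples)
    return correct / len(examples)


score = accuracy(test_set)
print(f"accuracy: {score:.0%}")  # prints "accuracy: 80%"
```

A good score only matters if the model is predicting the right thing in the first place, so this number supports the question above rather than replacing it.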

Deployment

You knock on Lizzy’s door, so confident that you’ll get it right. Isn’t it just to cry? You’ve got this.

Usually in deployment, you 'deploy' the model to a limited number of people so they can help you see if it’s good.

This is like going to meet 3 people who are not Lizzy but are like Lizzy, and asking them:

‘If I cry thrice a day and change my name to Zebra, would you be my friend?’

Not the best example, but you get the point.

Feedback

What did they say? Yes? No? Ignoring our use case, if at this stage you figure out that the model just doesn’t work, you go back, make some changes to the model, deploy again and wait for more feedback.

That’s the loop, as you can see in the image.

Now back to Lizzy’s door. After asking around, you are now so sure that crying will work, and you smile because you know changing your name to ‘Zebra’ will pay off after all.

Go ahead and knock, warrior.

'Who’s there?’ she asks.

'Zebra,’ you say, while sobbing.

She opens the door, chequebook in one hand, tissues in the other.

“Why, Zebra, you poor thing! Come into my house and take all my money.”

Thank me later with a cheque in return. I hope you learnt at least the basics of data science methodology — do use the comments section to tell me what you feel.

…and oh, don’t try the Lizzy thing. I shall deny you, because it was not my idea.
