Importance of Statistics and Probability in Data Science

Suraj Bhute
4 min read · Oct 26, 2020


Statistics and probability are essential for getting into data science. It is often said that you cannot learn data science without a working knowledge of statistics and probability, yet many people take little interest in these topics.

In this post, I will try to change that by introducing the basics of statistics and probability as they relate to data science.

“Data Scientist is a person who is better at statistics than any programmer and better at programming than any statistician.” — Josh Wills

In short, statistics is a set of principles used to extract information from data in order to make decisions. It unveils the patterns hidden in the data.

Probability and statistics underpin many of the predictive algorithms used in machine learning. Among other things, they help us judge how reliable our data is.

Common Terms Used in Statistics

Anyone practicing data science should be familiar with a few terms that come up constantly in statistics. Let us go through them -

Population — The entire set of items or individuals from which the data is collected.

Sample — A subset of the population, selected for analysis.

Variable — A data item, either a number or a characteristic, that can be measured or counted.

Statistical Parameter — A quantity that summarizes a probability distribution, such as the mean, median, or mode.
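To make these terms concrete, here is a small sketch using only the Python standard library. The ages are made-up numbers for illustration: the full list plays the role of the population, a random subset is the sample, and the two means show a population parameter next to the corresponding sample statistic.

```python
import random
import statistics

# Hypothetical population: ages of 1,000 people (values are made up).
random.seed(42)
population = [random.randint(18, 90) for _ in range(1000)]

# A sample is a subset of the population.
sample = random.sample(population, 50)

# The population mean is a statistical parameter; the sample mean is the
# corresponding statistic computed from the sample to estimate it.
population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.1f}")
print(f"Sample mean:     {sample_mean:.1f}")
```

The sample mean will generally be close to, but not exactly equal to, the population mean; that gap is what sampling theory studies.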

What is Statistical Analysis?

Statistical analysis is the science of exploring large datasets to find hidden patterns and trends. It is applied to every sort of data, from research to industry, to support decision-making. There are mainly two types of statistical analysis -

1. Quantitative Analysis: Collecting and interpreting data as numbers and graphs to search for underlying trends.

2. Qualitative Analysis: Drawing general insights from non-numerical information such as text and other forms of media.

Measures of Central Tendency

A measure of central tendency is a single value that aims to describe a set of data by identifying the central position within it. Such measures are also called measures of central location and are classed as summary statistics.

Mean — The sum of all the values in the dataset divided by the number of values.

Median — The middle value of the dataset when the values are arranged in order of magnitude. It is often preferred over the mean because it is less influenced by outliers and skewness in the data.

Mode — The most frequently occurring value in the dataset.
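The three measures above can be computed with Python's built-in `statistics` module. The dataset here is made up, with one deliberate outlier (44) to show how the mean gets pulled while the median stays put:

```python
import statistics

data = [2, 3, 3, 5, 7, 10, 44]  # 44 is an outlier

mean = statistics.mean(data)      # sum of values / number of values
median = statistics.median(data)  # middle value when sorted
mode = statistics.mode(data)      # most frequent value

# The outlier drags the mean well above the median.
print(f"mean={mean:.2f}, median={median}, mode={mode}")
```

Dropping the 44 would bring the mean down near the median, while the median and mode would barely move; that robustness is why the median is often reported for skewed data.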

What is Skewness?

Skewness is the asymmetry of a statistical distribution: a curve that is distorted towards the left or the right. It indicates whether the data is concentrated on one side and tells us about the shape of the distribution.

Skewness is divided into two parts -

Positive Skewness: It occurs when mean > median > mode. The tail is skewed to the right in this case.

Negative Skewness: It occurs when mean < median < mode. The tail is skewed to the left.
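A quick way to check the direction of skew is to compare the mean and the median, as the inequalities above suggest. The two small datasets below are invented so that each has one long tail:

```python
import statistics

# Right-skewed data: a tail of large values pulls the mean above the median.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

# Left-skewed data: the mirror image, so the mean falls below the median.
left_skewed = [-20, -5, -4, -4, -3, -3, -3, -2, -2, -1]

for name, data in [("positive skew", right_skewed), ("negative skew", left_skewed)]:
    mean, median = statistics.mean(data), statistics.median(data)
    print(f"{name}: mean={mean:.1f}, median={median:.1f}")
```

For the right-skewed data the mean exceeds the median, and for the left-skewed data it is the other way around, matching the definitions above.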

What is Probability?

Probability is the foundation and language of most of statistics. It quantifies how likely a particular outcome is, and it shows up constantly in daily life. One cannot solve data science problems without a knowledge of probability, and it is a key ingredient in predictive analytics.

Probability is the measure of the likelihood that an event will happen; it measures the certainty of the event. The formula for probability is:

P(E) = Number of Favourable Outcomes/Number of total outcomes
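As a worked example of this formula, consider the probability of rolling an even number with a fair six-sided die. Using `fractions.Fraction` keeps the answer exact:

```python
from fractions import Fraction

# P(E) = number of favourable outcomes / number of total outcomes
outcomes = [1, 2, 3, 4, 5, 6]
favourable = [o for o in outcomes if o % 2 == 0]  # {2, 4, 6}

p_even = Fraction(len(favourable), len(outcomes))
print(p_even)  # 1/2
```

Three favourable outcomes out of six total gives 3/6, which reduces to 1/2.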

Null Hypothesis: The hypothesis that there is no notable difference in the described population.

Alternative Hypothesis: The hypothesis that there is a notable difference.

In statistical hypothesis testing, the probability value, also known as the p-value, is the probability of obtaining results at least as extreme as the results actually observed, assuming that the null hypothesis is correct.

If the p-value <= 0.05, the null hypothesis is rejected.

If the p-value > 0.05, we fail to reject the null hypothesis.
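Here is a minimal sketch of this decision rule as a one-sample z-test, using only the standard library's `statistics.NormalDist`. The sample values, the hypothesized mean of 100, and the known standard deviation of 15 are all invented for illustration:

```python
from statistics import NormalDist

# Hypothetical z-test: does the sample mean differ from mu0 = 100,
# assuming the population standard deviation sigma = 15 is known?
sample = [112, 104, 99, 118, 96, 108, 110, 102, 105, 116]
mu0, sigma = 100, 15

n = len(sample)
sample_mean = sum(sample) / n
z = (sample_mean - mu0) / (sigma / n ** 0.5)

# Two-sided p-value: probability of a result at least this extreme under H0.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

decision = "reject H0" if p_value <= 0.05 else "fail to reject H0"
print(f"z = {z:.2f}, p = {p_value:.4f} -> {decision}")
```

For this particular sample the p-value comes out above 0.05, so we fail to reject the null hypothesis even though the sample mean is numerically above 100; in practice, small samples need large differences to reach significance.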

But why do we need to accept or reject a hypothesis?

If we fail to reject the null hypothesis, the independent features do not have a detectable influence on the prediction of the target variable. If the null hypothesis is rejected, it means the feature will help in predicting the target variable.

How to calculate p-value?

The p-value is computed from the summary of the linear relationship fitted between the target and the features, that is, between the dependent and independent variables.

Straight-line linear regression builds the relationship between these variables using the formula y = mx + b.

Features whose points lie close to the regression line have a strong relationship with the target; their coefficients tend to have p <= 0.05 and they are taken into consideration to predict y. Features whose points scatter far from the line have a weak relationship, with p-values above 0.05, and are not used to predict the target y.
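This can be sketched with `scipy.stats.linregress`, which fits the line y = mx + b and reports a p-value for the slope; the example assumes SciPy is installed, and the data points are made-up numbers lying close to y = 2x:

```python
from scipy import stats

# Hypothetical data: x is a feature, y is the target it should help predict.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]  # roughly y = 2x

result = stats.linregress(x, y)  # fits y = slope * x + intercept

print(f"slope = {result.slope:.2f}, intercept = {result.intercept:.2f}")
print(f"p-value for the slope: {result.pvalue:.2e}")

# A p-value <= 0.05 here means x is a statistically significant predictor of y.
```

Because the points sit almost exactly on a line, the slope's p-value is tiny and the feature would be kept; noisy, scattered data would push the p-value up and the feature might be dropped.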

Conclusion

Statistics and probability are the base of data science. One should know the fundamentals and concepts in order to solve data science problems: they tell you about the data, how it is distributed, the relationship between the independent and dependent variables, and more.

In this blog, I have tried to give you the basic idea about statistics and probability. Yes, there is much more to be explored when we talk about Statistics and probability in Data Science.

We have discussed the important basic terminologies in statistics, statistical analysis, measures of central tendency, and skewness. I have also given you an idea of hypothesis testing in probability and how we accept or reject a hypothesis on the basis of the p-value.
