Introduction to Statistics

AI/Data Science Digest
4 min read · Aug 9, 2021

Get inspired by numbers

Data tells us stories. In statistics, these stories are illustrated with charts, graphs and tables. They illustrate dominance; they reveal change; they help us organize; they educate us; they empower us. The more we know, the more curious we become.

Where to begin?

When we read books, we start from the beginning, but in statistics, we prefer to start from the middle. The middle value tells us a story of the data we have.

How do we measure the middle?

In statistics, we use three tools to determine the middle of the data. The mean is the average of all the numbers, the median is the middle number once we arrange them from smallest to largest, and the mode is the most common number in the data.

Finding Mean, Median and Mode

Calculating mean, median and mode. n is the number of data points we have.

A data set can have no mode (each value is unique), one mode, or more than one mode (when two or more values tie as the most common). If there are two such values, we say the data is bimodal.

When we have an even number of data points (say 10), there is no exact middle number. Instead, we take the two middle numbers (the 5th and 6th in this case) and use their mean as the median.
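The three measures can be tried out with Python's built-in statistics module. This is a minimal sketch; the data values are hypothetical and chosen only to show the odd, even, and multi-mode cases described above.

```python
import statistics

# Hypothetical data, chosen only for illustration.
odd_data = [3, 1, 4, 1, 5]        # 5 values: a single middle value exists
even_data = [2, 4, 4, 6, 8, 10]   # 6 values: no single middle value

mean_value = statistics.mean(odd_data)      # (3 + 1 + 4 + 1 + 5) / 5 = 2.8
median_odd = statistics.median(odd_data)    # sorted: 1, 1, 3, 4, 5 -> 3
median_even = statistics.median(even_data)  # mean of the 3rd and 4th values: (4 + 6) / 2 = 5.0

# multimode returns every value that ties for "most common".
one_mode = statistics.multimode(odd_data)          # [1]
two_modes = statistics.multimode([1, 1, 2, 2, 3])  # [1, 2] -> bimodal
all_unique = statistics.multimode([7, 8, 9])       # [7, 8, 9] -> every value ties, so no real mode

print(mean_value, median_odd, median_even)
print(one_mode, two_modes, all_unique)
```

Note that when every value is unique, multimode returns all of them, so it is up to us to interpret that result as "no mode".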

Variability of Data

Take these two datasets. They both have the same middle point, but they are very different.
These datasets have similar means, close to 30, but their distributions are very different. The data points are very close to each other in Data Set 1, whereas the data points are far apart in Data Set 3. This property, how spread out the data is, is referred to as variability. Two common tools used to measure variability are range and standard deviation.

Range

Range = largest value - smallest value

The range is computed from only the two most extreme data points.

Range can be misleading when we have extreme data points (outliers). This is because range looks at only two data points in the data set. A better measure of variability is standard deviation.
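To see how a single extreme value changes the range, here is a small sketch. The helper name value_range and all three data sets are hypothetical, made up for illustration.

```python
def value_range(data):
    """Return the largest value minus the smallest value."""
    return max(data) - min(data)

# Hypothetical data sets. `tight` and `wide` both have a mean of 30.
tight = [28, 29, 30, 31, 32]
wide = [10, 20, 30, 40, 50]
# Same as `tight`, except the last value is replaced by one extreme point.
with_outlier = [28, 29, 30, 31, 132]

print(value_range(tight))         # 4
print(value_range(wide))          # 40
print(value_range(with_outlier))  # 104
```

Four of the five values in with_outlier are identical to tight, yet the range jumps from 4 to 104 because only the two extremes are considered.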

Standard Deviation

Standard deviation measures the variability of data points with respect to the mean of the dataset. If the data points are dispersed, the dataset has a high standard deviation. If the data points are close to the mean, it has a low standard deviation.

Green points have a lower SD whereas blue points have a higher SD (Source: Khan Academy)

Standard Normal Distribution

When our data fit a bell-shaped curve, we say the data are normally distributed, that is, they follow a normal distribution. The normal distribution is one of the most common distributions we see in the world. For example, the heights of people are approximately normally distributed, and so are the grades of a large enough set of students (usually 30 or more).

When the mean = 0 and SD = 1, we say the normal distribution is a standard normal distribution (SND).

SND gives us a way to meaningfully compare two distributions.

Standard Normal Distribution. About 68% of data points are within one SD of the mean, about 95% are within two SDs of the mean, and about 99.7% are within three SDs of the mean.
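These percentages can be checked numerically. The sketch below uses NormalDist from Python's statistics module (available from Python 3.8); the helper name share_within is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean = 0, SD = 1.
snd = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k SDs of the mean."""
    return snd.cdf(k) - snd.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```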

Z-Score

A z-score tells us how many standard deviations a data point is from the mean. If two data points have z-scores of 1 and 3, we can immediately tell that the first data point is closer to the mean and the second is farther from it.
This is how you calculate the z-score: subtract the mean from the data point's value and divide by the SD.
This example tells us that the data point 231 is 2.24 standard deviations away from the mean in the positive direction.
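The z-score calculation can be written as a one-line function. This is a minimal sketch; the function name z_score and the numbers passed to it are hypothetical and are not the values behind the 231 example.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / sd

# Hypothetical distribution with mean 70 and SD 10.
above = z_score(85, mean=70, sd=10)    # (85 - 70) / 10 = 1.5  -> above the mean
below = z_score(55, mean=70, sd=10)    # (55 - 70) / 10 = -1.5 -> below the mean
at_mean = z_score(70, mean=70, sd=10)  # 0.0 -> exactly at the mean

print(above, below, at_mean)
```

The sign shows the direction (above or below the mean) and the magnitude shows the distance in SDs.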

Calculating Standard Deviation

SD is, roughly speaking, the square root of the average squared distance from the mean. (Notice that instead of dividing by N, we divide by N - 1. We will discuss this in a follow-up video, as you need to understand the concept of degrees of freedom first.)
Take the squared difference of each data point from the mean, add them up, and divide by 3 (N - 1 in this example); finally, take the square root. That's your SD.
Having the z-score helps us understand whether the data point is above (positive) or below (negative) the mean.
In the left figure, the data points are closer to the mean (tall and skinny), so it has a low SD. In the right figure, the points are spread away from the mean (short and chubby), so it has a high SD.
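The SD calculation described in this section can be sketched step by step. The function name sample_sd and the four data points are hypothetical; with N = 4 the divisor is N - 1 = 3. The result is compared against statistics.stdev, which also divides by N - 1.

```python
import math
import statistics

def sample_sd(data):
    """Sample standard deviation, dividing by N - 1."""
    n = len(data)
    mean = sum(data) / n
    # Squared distance of each point from the mean.
    squared_diffs = [(x - mean) ** 2 for x in data]
    # Divide the total by N - 1, then take the square root.
    variance = sum(squared_diffs) / (n - 1)
    return math.sqrt(variance)

# Hypothetical data: mean is 4, squared differences are 4, 0, 0, 4.
data = [2, 4, 4, 6]

print(sample_sd(data))         # sqrt(8 / 3), roughly 1.633
print(statistics.stdev(data))  # library result, for comparison
```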

Empirical Rule

This assumes that the data points follow a normal distribution.
Empirical Rule: Nearly all data points (about 99.7%) are within 3 SDs of the mean.
This data point is within 1 SD of the mean. It is one of the data points that fall within the central 68% of the data. So, it is likely to be a regular data point in the data set.
This data point is more than 3 SDs away from the mean. The chance of this happening is about 0.3%, which is very small. It could be an outlier.

Outliers

In this example, the normal heart rate is between 60 and 100 bpm. Since the patient on the right has a heart rate of 40 bpm, which is outside the normal range, we say their heart rate is an outlier.
Outliers are data points that lie far from the most commonly seen data points.

How do we measure outliers?

Should we mark the circled point as an outlier? It depends on how we define outliers. Usually, we use a distance measure (how far the point is from the rest of the data) or a density measure (how dense its neighborhood is). If the distance is high or the density is low, we mark the point as an outlier.
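One simple distance measure is the z-score: flag any point that lies more than a chosen number of SDs from the mean. The sketch below is one possible approach, not the only definition of an outlier. The function name find_outliers, the threshold values, and the heart-rate readings are all hypothetical.

```python
import statistics

def find_outliers(data, threshold):
    """Return the points whose distance from the mean exceeds `threshold` SDs."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample SD (divides by N - 1)
    return [x for x in data if abs(x - mean) / sd > threshold]

# Hypothetical heart-rate readings in bpm.
heart_rates = [72, 75, 68, 80, 77, 70, 74, 40]

print(find_outliers(heart_rates, threshold=2.0))  # [40]
print(find_outliers(heart_rates, threshold=3.0))  # []
```

The answer changes with the cutoff: 40 bpm is about 2.4 SDs below the mean of this small sample, so it is flagged at 2 SDs but not at 3. In a sample this small, the extreme value itself inflates the SD, which is one reason the definition of an outlier is a judgment call.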

What should we do with outliers?

  • We sometimes ignore them, as they are rare events and are unlikely to occur again.
  • We often use them as a learning opportunity. We study why it happened, how it happened, and what we can do to produce more outliers like it (if it is something good, e.g. an extreme performer).

In this lesson, we learned about the very basics of statistics. Now you have the understanding to conquer more advanced topics. Khan Academy’s statistics course is a great next step.

Reference:

  • LinkedIn Introduction to Statistics Course
