Stat Digest: The intuition behind Confidence Interval

AI/Data Science Digest
Published in Geek Culture
6 min read · Mar 6, 2023

What is the confidence interval? Why do we need that?

A practical use case of confidence interval (Image by Author)

In 2020, the Economist predicted that Joe Biden was likely to win the presidential election with 56.4% of the votes compared to 44.6% of the votes for Donald Trump, with a 95% confidence interval of ± 3.4 percentage points.

What does 95% mean here?

What does ± 3.4 tell you?

Fear not, we are going to master these concepts and become confidence interval (CI) pros today 😎

In this post, instead of simply dishing out the equations, I want to describe the intuition behind CI first and then move on to calculating it.

Let’s pretend that we are going to do our own survey to predict the likelihood of Joe Biden winning the presidential election.

There are 240 million eligible voters in the US. This is our population. (Note that the US population is different from the population we define here. In statistics, the population is all the entities that fall under our definition: in this case, the eligible voters.)

Population (Image by Author)

Ideally, we ask all 240 million people who they would vote for. Then, find out the proportion voting for Biden. That is our population parameter.

It is not realistic to ask each and every eligible voter though. It would take months and cost us a lot of money to do!

What can we do here?

This is where statistics comes to the rescue.

Instead of asking the whole population, we select a random sample of, say, 1000 people and compute the sample proportion. The sample proportion is a sample statistic. Then, we use this sample statistic to estimate the population parameter.

Let’s say that 530 said they’d vote for Biden. The sample proportion is 53%.

What if we survey another random sample of 1000 people? Will that sample proportion be 53% as well?

It is likely not.

It turned out to be 49%. Another sample was 56% and yet another was 50%.

Sample proportions (Image by Author)
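
To see this wobble for yourself, here is a minimal simulation sketch. It assumes a made-up "true" population proportion of 55% (in a real survey we would never know this number), and the names `sample_proportion` and `TRUE_P` are mine, not part of any library.

```python
import random

def sample_proportion(true_p, n, rng):
    """Survey n random voters and return the share who pick the candidate."""
    votes = sum(1 for _ in range(n) if rng.random() < true_p)
    return votes / n

rng = random.Random(42)  # fixed seed so the run is repeatable
TRUE_P = 0.55            # assumed population proportion (unknown in real life)
SAMPLE_SIZE = 1000

# Four independent surveys of 1000 people each
proportions = [sample_proportion(TRUE_P, SAMPLE_SIZE, rng) for _ in range(4)]
for i, p in enumerate(proportions, start=1):
    print(f"Sample {i}: {p:.1%}")
```

Each run of the loop is a fresh survey, so the printed proportions land near the assumed 55% but rarely on the same value.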

How do we make sense of these sample proportions to estimate the population proportion?

We could take many samples, measure the proportion in each one, and plot how often each proportion shows up.

Does this distribution sound familiar? Yes, it is the sampling distribution. (Note that each sample has way less than 10% of the population and therefore we can assume that voters in different samples are independent. This is a necessary condition for the sampling distribution.) [Take a detour on sampling distribution and the central limit theorem if you are not familiar with it]
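
If you would like to build such a distribution yourself, the sketch below draws 2000 simulated surveys and summarises them. The 55% population proportion is an assumption for the simulation, and the spread is compared against the standard textbook value sqrt(p(1 − p)/n).

```python
import random
import statistics

rng = random.Random(0)   # fixed seed for a repeatable run
TRUE_P = 0.55            # assumed population proportion
SAMPLE_SIZE = 1000
NUM_SAMPLES = 2000

def sample_proportion(true_p, n, rng):
    """Share of n simulated voters who pick the candidate."""
    return sum(rng.random() < true_p for _ in range(n)) / n

props = [sample_proportion(TRUE_P, SAMPLE_SIZE, rng) for _ in range(NUM_SAMPLES)]

center = statistics.mean(props)    # should sit close to TRUE_P
spread = statistics.stdev(props)   # spread of the sampling distribution
theory = (TRUE_P * (1 - TRUE_P) / SAMPLE_SIZE) ** 0.5

print(f"mean of sample proportions: {center:.4f}")
print(f"std of sample proportions:  {spread:.4f} (theory: {theory:.4f})")

# Crude text histogram, one row per whole percent
bins = {}
for p in props:
    key = round(p * 100)
    bins[key] = bins.get(key, 0) + 1
for key in sorted(bins):
    print(f"{key:3d}% {'#' * (bins[key] // 10)}")
```

The histogram should come out roughly bell-shaped and centred on the assumed population proportion.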

Let’s say we get the following sampling distribution. Notice that it is a normal distribution with a mean proportion of 55%.

Sampling Distribution with 95% probability area

Recall that these percentages are pretty standard in a normal distribution as shown below:

Probability of a sample mean falling within one, two, and three standard deviations away from the population mean
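
These percentages can be checked with Python's standard library. This is only a quick sanity check on the normal distribution; the dictionary name `within` is my own.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability of landing within k standard deviations of the mean
within = {}
for k in (1, 2, 3):
    within[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} SD: {within[k]:.1%}")
# prints 68.3%, 95.4%, and 99.7%

# The exact multiplier that captures 95% is a little under 2
z_95 = standard_normal.inv_cdf(0.975)
print(f"multiplier for exactly 95%: {z_95:.3f}")  # 1.960
```

So "two standard deviations" is a convenient rounding of 1.96.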

Let’s say, we have measured the sample proportion to be 53%. What is the probability that it falls within two standard deviations of the population proportion?

Pause for a moment and think from the above charts.

You guessed it right. Roughly 95% of the area under the curve lies within two standard deviations of the population proportion.

So, there is a 95% chance that the sample proportion falls within two standard deviations of the population proportion.

Let’s flip the question.

What is the probability that the population proportion falls within two standard deviations of the sample proportion?

It’s the same probability! Take a moment to convince yourself this is the case.

That’s the core thinking behind the confidence interval. For 95% confidence, we identify the interval p ± 2σ. That is the interval between p − 2σ and p + 2σ. In other words, two standard deviations on either side of the sample proportion.
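
If the flip still feels like a magic trick, it helps to write it down. In the sketch below, p is the sample proportion and σ the standard deviation of the sampling distribution, as above; π for the population proportion is my own notation, not the article's.

```latex
|p - \pi| \le 2\sigma
\;\iff\;
p - 2\sigma \le \pi \le p + 2\sigma
```

Both sides describe the same event: distance is symmetric, so "p is within 2σ of π" and "π is within 2σ of p" are true for exactly the same samples, and therefore have the same probability of roughly 95%.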

Depending on the confidence level we chose, the confidence interval changes. The higher the confidence level, the broader the confidence interval. (Image by Author)

Typically, we use a 95% confidence level.

What does the confidence interval really tell us?

The following diagram illustrates what CI means in practice.

The implication of CI in practice (Image by Author)

As you can see from the above figure, if we repeatedly draw samples and calculate a CI from each one, about 95% of those intervals include the population mean/proportion, whereas about 5% of them miss it.
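
This repeated-sampling view is easy to test with a simulation. The sketch below assumes a population proportion of 55% and builds a p ± 2σ interval from each of 1000 simulated surveys, estimating σ from each sample's own proportion (an assumption of this sketch).

```python
import random

rng = random.Random(1)   # fixed seed for a repeatable run
TRUE_P = 0.55            # assumed population proportion
N = 1000                 # people per survey
TRIALS = 1000            # number of simulated surveys

hits = 0
for _ in range(TRIALS):
    p_hat = sum(rng.random() < TRUE_P for _ in range(N)) / N
    sigma = (p_hat * (1 - p_hat) / N) ** 0.5   # estimated from the sample itself
    low, high = p_hat - 2 * sigma, p_hat + 2 * sigma
    if low <= TRUE_P <= high:
        hits += 1

coverage = hits / TRIALS
print(f"{coverage:.1%} of the intervals contain the population proportion")
```

The share of intervals that capture the assumed population proportion should hover around 95%. Any single interval either contains it or does not; the 95% describes the procedure over many repetitions.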

Another way to interpret CI is to think that we are identifying the interval between the 2.5th and 97.5th percentiles of our sampling distribution, which covers the most likely means/proportions we will get when we calculate the sample mean/proportion.

How do we find the standard deviation of the sampling distribution?

Estimating the standard deviation of the sampling distribution from the sample proportion (Image by Author)

As the left equation shows, we need the population proportion to compute this standard deviation (often called the standard error). But we don’t know that. Instead, as shown in the right equation, we estimate it using the sample proportion.
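
Since the equations live in the image, here they are written out in their standard textbook form. The symbols are my assumption about what the figure shows: π is the population proportion, p the sample proportion, and n the sample size.

```latex
\sigma_{p} = \sqrt{\frac{\pi (1 - \pi)}{n}}
\qquad \text{estimated by} \qquad
\hat{\sigma}_{p} = \sqrt{\frac{p (1 - p)}{n}}
```

The only change between the two is that the unknown π is swapped for the p we actually measured.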

In our example, it is the square root of (0.53 × (1 − 0.53)) / 1000, which is about 0.0158. With a 95% confidence level, our interval is 0.53 ± 0.032 (two standard deviations), or roughly 49.8% to 56.2%.
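
The arithmetic is worth checking by hand or with a few lines of Python. The helper `proportion_ci` below is a hypothetical name of mine; it uses the "two standard deviations" rule from this post rather than the exact 1.96 multiplier.

```python
import math

def proportion_ci(successes, n, multiplier=2.0):
    """Return (p, sigma, (low, high)) for a sample proportion.

    multiplier=2.0 follows the two-standard-deviation rule for ~95% confidence.
    """
    p = successes / n
    sigma = math.sqrt(p * (1 - p) / n)
    margin = multiplier * sigma
    return p, sigma, (p - margin, p + margin)

# 530 of 1000 surveyed people said they would vote for Biden
p, sigma, (low, high) = proportion_ci(530, 1000)
print(f"p = {p:.2f}, sigma = {sigma:.4f}")
print(f"95% CI: {low:.3f} to {high:.3f}")
```

Note that the interval straddles 50%, so a single survey of 1000 people with a 53% result would not be enough to call the race at this confidence level.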

How can we narrow the confidence interval?

Notice the term n (sample size) in the denominator of the standard deviation. So, if we increase the sample size, the sampling variation (i.e. the standard deviation) shrinks. This in turn narrows the confidence interval.

The larger the sample size, the narrower the confidence interval (at 95%) (Image by Author)
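
To put numbers on this, the sketch below holds the sample proportion at 53% (borrowed from our running example) and varies only the sample size. `margin_of_error` is my own helper name.

```python
import math

def margin_of_error(p, n, multiplier=2.0):
    """Half-width of the ~95% interval p ± 2σ for a sample of size n."""
    return multiplier * math.sqrt(p * (1 - p) / n)

P = 0.53  # sample proportion from the running example
margins = {n: margin_of_error(P, n) for n in (100, 1000, 10_000, 100_000)}

for n, m in margins.items():
    print(f"n = {n:>7,}: ± {m:.2%}")
```

Because n sits under a square root, the payoff diminishes: you need 100 times as many respondents to make the interval 10 times narrower.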

How do we compute the confidence interval for the mean?

For example, we want to find the mean height of people living in New York. We take a sample of people living in New York and then estimate the confidence interval for the mean.

Can we use the same equation we used above? No, in the above, we looked at proportions. Means are different.

Finding the confidence interval for sample mean (Image by Author)

Notice that the critical values (i.e. 2.5% and 97.5% percentile values for 95% confidence level) are drawn from a t-distribution.
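
For readers without the image, the standard textbook form of this interval is written out below. The notation is my assumption about what the figure shows: x̄ is the sample mean, s the standard deviation of the sample, n the sample size, and t* the critical value from a t-distribution with n − 1 degrees of freedom.

```latex
\bar{x} \;\pm\; t^{*}_{\,n-1} \cdot \frac{s}{\sqrt{n}}
```

The structure mirrors the proportion case: a point estimate, plus or minus a multiplier times a standard deviation that shrinks with the square root of n.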

t-distribution vs. z-distribution. Notice that the heavier tails of the t-distribution help capture the extra uncertainty when the sample size is small (typically under 30).

When your sample size is small (typically less than 30), it is prudent to use the t-distribution instead of the normal distribution as it can capture more uncertainty.

As the sample size increases, the t-distribution becomes more like a normal distribution

When your sample size is large (typically more than 30), using t or z (normal) values does not change the CI much. The reason is that as the sample size increases, the t-distribution looks more like a normal distribution.

t values for different confidence intervals and sample sizes. Notice that the t value increases when the confidence level is increased and the t value decreases when the sample size is increased
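
You do not need a printed table to get these values. The sketch below computes them with the standard library only, by numerically integrating the t density and searching for the cut-off. It is a rough illustration, not how a statistics package does it, and the function names are my own.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=1000):
    """P(T <= x) via Simpson's rule on [0, |x|] plus symmetry around 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(confidence, df):
    """Two-sided critical value: `confidence` of the area lies in [-t, t]."""
    target = 0.5 + confidence / 2
    low, high = 0.0, 100.0  # wide enough for the cases shown here
    for _ in range(50):     # bisection search
        mid = (low + high) / 2
        if t_cdf(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2

z_95 = NormalDist().inv_cdf(0.975)
t_values = {n: t_critical(0.95, df=n - 1) for n in (5, 10, 30, 100)}
for n, t_95 in t_values.items():
    print(f"n = {n:3d}: t = {t_95:.3f}  (z = {z_95:.3f})")
```

The t value starts well above the z value for tiny samples and drifts down toward it as n grows, which is the pattern in the table.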

How do you use CI in practice?

We already looked at how election polls use CI to account for the uncertainty in our measurements (i.e. point estimates). Let me explain it with some more examples.

Let’s say you run a chocolate factory. You want to make sure each chocolate has 10 grams of cocoa. You take a sample of chocolates and compute a 95% confidence interval for the mean amount of cocoa. The interval turns out to be 8 grams to 12 grams. Since the target of 10 grams lies within the confidence interval, you conclude that your process is still consistent with 10 grams of cocoa per chocolate.

In the event that the confidence interval you calculated does not contain the 10-gram target, you will have to recalibrate your chocolate-producing machines to add the right amount of cocoa.
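
Here is how that check might look in code. The ten cocoa measurements are made up purely for illustration, and the critical value 2.262 is the standard t-table entry for 95% confidence with 9 degrees of freedom.

```python
import math
import statistics

# Hypothetical cocoa measurements in grams (invented for this example)
cocoa_grams = [9.8, 10.4, 10.1, 9.6, 10.3, 9.9, 10.2, 9.7, 10.0, 10.5]
TARGET_GRAMS = 10.0
T_CRITICAL_95_DF9 = 2.262  # t-table value: 95% confidence, n - 1 = 9

n = len(cocoa_grams)
mean = statistics.mean(cocoa_grams)
std_error = statistics.stdev(cocoa_grams) / math.sqrt(n)  # s / sqrt(n)
margin = T_CRITICAL_95_DF9 * std_error
low, high = mean - margin, mean + margin

print(f"mean = {mean:.2f} g, 95% CI: {low:.2f} g to {high:.2f} g")
if low <= TARGET_GRAMS <= high:
    print("Target inside the CI: no evidence the machines need recalibrating")
else:
    print("Target outside the CI: time to recalibrate")
```

With a different sample size you would need a different t value, since the degrees of freedom change with n.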

Summary

  • CI allows us to capture the uncertainty around the point estimates we make (proportions, means, the slope of a linear regression)
  • Depending on how much uncertainty you want to encode into your finding, you can change the confidence level anywhere between 0% (no wiggle room) and 100% (no new information, as the interval always includes the population mean/proportion and is hence useless).
  • Typically we use a 95% confidence level.
