Confidence Interval Understanding
Suppose you want to know the value of some characteristic of a population, such as its mean. Data is first collected from a sample of that population, and the parameter is estimated from the sampled data. Because the whole population was not measured, it is natural to ask how reliable that estimate is. This is where the confidence interval comes in.
A confidence interval is a range of values that is likely to contain the true value of a population parameter. In other words, it expresses how closely the result from your sample is expected to reflect what you would find if the whole population were taken into consideration. The confidence interval is tied to a confidence level, which is the percentage of times you would expect the interval to contain the true value if the sampling were repeated many times. The confidence level is expressed as a percentage, while the interval itself is expressed in the units of the estimate.
As mentioned earlier, the confidence interval is the range of values, between a lower and an upper bound, that goes with a chosen confidence level. For instance, a 90% confidence level means that if the study were repeated many times, about 90% of the intervals built this way would contain the true population value.
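One way to make the meaning of the confidence level concrete is to simulate repeated sampling. The sketch below is only an illustration: the population (normal, mean 40, standard deviation 7), the sample size of 49, and the function name `coverage` are all assumptions chosen for the demonstration. Each interval is built as the sample mean plus or minus a z critical value times the standard error, which is the construction described later in this article.

```python
import random
from statistics import NormalDist, mean, stdev


def coverage(true_mean=40.0, true_sd=7.0, n=49, confidence=0.90,
             trials=2000, seed=0):
    """Return the fraction of simulated intervals that contain true_mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Two-tailed critical value: alpha is split evenly between both tails.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(trials):
        # Draw one hypothetical sample from the assumed population.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        centre = mean(sample)
        margin = z * stdev(sample) / n ** 0.5
        if centre - margin <= true_mean <= centre + margin:
            hits += 1
    return hits / trials


print(f"share of intervals containing the true mean: {coverage():.3f}")
```

The printed share should land close to the chosen confidence level of 0.90, though not exactly on it, because the simulation itself is random and the sample standard deviation is used in place of the true one.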
The alpha value is the significance threshold used in statistical testing. For instance, an alpha of 0.1 means a result is called significant only if there is less than a 10% chance that a result at least as extreme would have occurred under the null hypothesis. Coming back to the confidence level, the desired confidence level is normally 1 minus this alpha value.
Therefore, if an alpha value of 0.1 is chosen for statistical significance (p < 0.1), then the confidence level is 1 - 0.1 = 0.9, or 90%.
Confidence intervals are used for various statistical estimates, such as:
- The mean of a population
- Differences between population means or proportions
- Estimates of the variation among groups
As the name suggests, these are all point estimates: each is a single number and says nothing about the uncertainty around that number. The confidence interval is a good way to learn how much the estimate could vary.
To better understand this, let's look at an example. Suppose we survey 49 Ghanaians and 49 Nigerians about their reading habits and discover that both nationalities read an average of 40 hours a month. The Ghanaians varied quite a lot in their number of reading hours, while the Nigerians all read similar amounts. So, even though both nationalities have the same average number of hours read (the same point estimate), the Ghanaians have a wider confidence interval than the Nigerians because of the greater variation in their data.
Confidence Interval Determination
To determine the confidence interval of an estimate, the following should be known:
- The point estimate for which the confidence interval is being constructed
- The critical values
- The standard deviation
- The sample size
Once this information is known, the confidence interval can be calculated by inserting these into the confidence interval formula.
The point estimate is the statistical estimate you are trying to make, which could be a proportion, a population mean, or the variation among groups. In the example about reading hours, the point estimate is the mean number of hours read (40).
Critical Value Determination
Critical values tell you how many standard deviations away from the mean you need to go in order to reach the desired confidence level for the confidence interval. There are three steps to finding this critical value.
1. Choose the alpha value: As indicated above, the alpha value is the probability threshold for statistical significance. Some common alpha values are 0.05, 0.1 and 0.01.
2. Choose either a one-tailed interval or a two-tailed interval: The two-tailed interval is the most commonly used, and it involves dividing the alpha in two, in order to get upper and lower bounds (tails). A one-tailed interval is used only with a one-tailed test.
3. Check the critical value that corresponds with the chosen alpha value: If the dataset follows a normal distribution, or if the sample size is large (roughly more than 40) and the data is nearly normally distributed, the z-distribution can be used to find the critical values. The values in the table are some of the commonly used values for the z-statistic.
If the sample is rather small, that is below about 40 observations, and it is nearly normally distributed, the t-distribution is used instead. The t-distribution is almost identical to the z-distribution in shape, but it corrects for smaller sample sizes, and it requires the degrees of freedom (the sample size minus 1) to be known. For both the t and z distributions, the critical value is the same on both sides of the mean.
For the example about the number of reading hours, there were more than 40 observations and the data follows a normal distribution, which is sometimes known as the bell curve, so the z-distribution can be used. For a two-tailed 90% confidence interval, the alpha value is 0.1, which leaves 0.05 in each tail and corresponds to a critical value of 1.64. This means that in order to calculate the upper and lower bounds of the confidence interval, we take the mean +/- 1.64 standard errors, where the standard error is the standard deviation divided by the square root of the sample size.
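The three steps for finding a z critical value can be sketched with Python's standard library. This is a minimal illustration, not a prescribed method: the helper name `z_critical` and its parameters are assumptions made for this example, and it covers only the z-distribution (a t critical value would need a statistics package that provides the t-distribution).

```python
from statistics import NormalDist


def z_critical(confidence, two_tailed=True):
    """Return the z critical value for a given confidence level."""
    alpha = 1 - confidence                           # step 1: alpha
    tail_area = alpha / 2 if two_tailed else alpha   # step 2: one or two tails
    # Step 3: the z-score that leaves tail_area in the upper tail.
    return NormalDist().inv_cdf(1 - tail_area)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} two-tailed: z* = {z_critical(level):.3f}")
```

For the 90% two-tailed case this gives approximately 1.645, which the worked example in this article rounds to 1.64.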
Standard Deviation Determination
The standard deviation can be found easily using any statistical software. It is the square root of the variance. Mathematically, it is:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
where N refers to the population size, x_i refers to each value in the population, and mu refers to the population mean. In our example, the variance in the Ghanaian estimate is 49 and the variance in the Nigerian estimate is 16. Taking the square root of each gives standard deviations of 7 and 4 respectively.
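As a small illustration of the standard deviation as the square root of the variance, the sketch below computes it by hand and compares the result with Python's `statistics` module. The `hours` list is hypothetical data invented for this example; it is not the survey data from the article.

```python
import math
import statistics


def population_sd(values):
    """Square root of the mean squared deviation from the mean."""
    mu = sum(values) / len(values)
    variance = sum((x - mu) ** 2 for x in values) / len(values)
    return math.sqrt(variance)


# Hypothetical monthly reading hours, for illustration only.
hours = [33, 47, 40, 29, 51, 40, 38, 42]

print(population_sd(hours))       # manual calculation, divides by N
print(statistics.pstdev(hours))   # library equivalent, divides by N
print(statistics.stdev(hours))    # sample version, divides by N - 1

# The article's variances of 49 and 16 give these standard deviations.
print(math.sqrt(49), math.sqrt(16))  # 7.0 4.0
```

When the data is a sample rather than the whole population, the sample version (`statistics.stdev`) is the usual choice, and it is slightly larger than the population version.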
The sample size is the number of observations in the data. In our example, the sample size is 49 for each nationality.
Confidence Interval For Normally Distributed Mean
Data that is normally distributed has a bell shape when it is plotted on a graph, with the mean in the middle and the rest of the data distributed evenly on both sides of the mean. The confidence interval for the data is, therefore:
$$CI = \bar{X} \pm Z^{*}\frac{\sigma}{\sqrt{n}}$$
where CI is the confidence interval, $\bar{X}$ is the mean, $Z^{*}$ is the critical value of the z-distribution, $\sigma$ is the standard deviation, and $n$ is the sample size. The confidence interval for the t-distribution has the same formula, but with $t^{*}$ in place of $Z^{*}$. In a practical setting, the true values of the population are not known unless a complete census of the population is performed, so the population values in the equation above are replaced with the corresponding values from the sample.
To calculate the confidence interval for our example, we use the mean, standard deviation, and sample size. To calculate a 90% confidence interval for Ghana:
$$CI = 40 \pm 1.64 \times \frac{7}{\sqrt{49}} = 40 \pm 1.64$$
So for Ghana, the upper and lower bounds of the 90% confidence interval are 41.64 and 38.36 respectively.
For Nigeria:
$$CI = 40 \pm 1.64 \times \frac{4}{\sqrt{49}} \approx 40 \pm 0.94$$
So for Nigeria, the upper and lower bounds of the 90% confidence interval are 40.94 and 39.06 respectively.
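The bounds quoted for both countries can be reproduced with a few lines of Python. This is a sketch under the article's own numbers (mean 40, standard deviations 7 and 4, sample size 49, critical value 1.64). The function name `mean_confidence_interval` is an assumption made for this example. It computes the mean plus or minus the critical value times the standard deviation divided by the square root of the sample size.

```python
import math


def mean_confidence_interval(mean, sd, n, critical_value=1.64):
    """Return (lower, upper) bounds: mean +/- critical_value * sd / sqrt(n)."""
    margin = critical_value * sd / math.sqrt(n)
    return mean - margin, mean + margin


ghana = mean_confidence_interval(mean=40, sd=7, n=49)
nigeria = mean_confidence_interval(mean=40, sd=4, n=49)

print("Ghana:  ", tuple(round(bound, 2) for bound in ghana))
print("Nigeria:", tuple(round(bound, 2) for bound in nigeria))
```

Passing a different `critical_value` (for example 1.96 for a 95% two-tailed interval) widens or narrows the interval without changing the point estimate.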
Confidence Interval For Proportions
Finding the confidence interval for a proportion is similar to finding the confidence interval for the mean. The only difference is that the standard deviation is replaced with a new term, which can be seen in the equation below:
$$CI = \hat{p} \pm Z^{*}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
where the new term is built from $\hat{p}(1-\hat{p})$, and $\hat{p}$ is the proportion observed in the sample. In our example, it would refer to the proportion of individuals who read books at all.
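A short sketch of the proportion case follows. The count used here (35 of 49 respondents reading at all) is a hypothetical figure invented for illustration, since the article does not report one. The function name `proportion_confidence_interval` is likewise an assumption. The calculation is the sample proportion plus or minus the critical value times the square root of p(1-p)/n.

```python
import math


def proportion_confidence_interval(successes, n, critical_value=1.64):
    """Return (lower, upper) bounds for a sample proportion."""
    p_hat = successes / n
    # p_hat * (1 - p_hat) takes the place of the variance in the mean formula.
    margin = critical_value * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical: 35 of the 49 people surveyed say they read books at all.
lower, upper = proportion_confidence_interval(successes=35, n=49)
print(f"90% interval for the proportion: {lower:.3f} to {upper:.3f}")
```

This normal-approximation interval is usually considered reasonable when the sample is large and the proportion is not too close to 0 or 1.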
Confidence Interval For Non-Normally Distributed Data
To find the confidence interval around the mean of data that is not normally distributed, there are two options. Either find a distribution that matches the shape of your data and use that distribution to calculate the confidence interval, or transform your data so that it becomes approximately normal and find the confidence interval for the transformed data.
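The transformation route can be sketched as follows, assuming the data is right-skewed and positive so that a log transform makes it roughly normal. Everything specific here is an assumption made for illustration: the simulated `skewed` data, the choice of the log transform, and the function name `log_transform_interval`. Note that back-transforming the bounds gives an interval for the geometric mean of the original data, not for its arithmetic mean.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def log_transform_interval(values, confidence=0.90):
    """Build a z-interval on the log scale and map it back with exp."""
    logs = [math.log(v) for v in values]   # requires strictly positive data
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    centre = mean(logs)
    margin = z * stdev(logs) / math.sqrt(len(logs))
    # exp() returns the bounds to the original units (geometric mean scale).
    return math.exp(centre - margin), math.exp(centre + margin)


# Hypothetical right-skewed data, simulated for illustration only.
rng = random.Random(1)
skewed = [rng.lognormvariate(3.5, 0.6) for _ in range(60)]

lower, upper = log_transform_interval(skewed)
print(f"90% interval (geometric mean scale): {lower:.2f} to {upper:.2f}")
```

Whether a log transform is appropriate depends on the data; it is worth checking a histogram of the transformed values before relying on the result.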
Interpretation Of Confidence Intervals
When interpreting confidence intervals, both the upper and lower limits should be reported. In our case study, we can interpret and report the result like this: 'We observed that Ghanaians and Nigerians read books for the same number of hours per month, but there was a lot more variation in the estimate for the Ghanaians (90% confidence interval = 38.36 to 41.64) than for the Nigerians (90% confidence interval = 39.06 to 40.94).'
In this article we covered the confidence interval and the confidence level, and showed how to calculate the confidence interval in a number of scenarios. We looked at the equations and the quantities needed to find the confidence interval, and used a worked example to illustrate how the interval relates to the confidence level. The confidence interval is a key concept in statistics and, just as importantly, in data science.