Understanding Probability Distributions and the Normal Distribution
Statistics is a key component of data science; it deals with gathering data, analyzing it, and drawing conclusions from it. One aspect of statistics is the probability distribution, which describes how likely each possible outcome of an event is. For example, a forecast might say there is an 80% chance of rain tonight.
The common notation for probability is P(X = x), which means the probability that a random variable X takes a particular value x. In the example given, P(X = rain) = 0.8 indicates that there is an 80% chance of rain. Every probability lies between 0 and 1, and the probabilities of all possible outcomes must add up to 1, so if there is a 0.8 chance of rain, there is a 0.2 chance of no rain. There are two types of probability distributions:
- Discrete probability distribution
- Continuous probability distribution
The following sections describe both types in more detail.
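The basic rules of probability described above can be checked with a few lines of Python. This is only a minimal sketch: the dictionary name `forecast` is made up for illustration, and the 0.8 and 0.2 figures are taken from the rain example.

```python
# Hypothetical forecast, using the figures from the rain example.
forecast = {"rain": 0.8, "no rain": 0.2}

# Rule 1: every probability lies between 0 and 1.
all_valid = all(0.0 <= p <= 1.0 for p in forecast.values())

# Rule 2: the probabilities of all possible outcomes add up to 1.
total = sum(forecast.values())

print("all probabilities valid:", all_valid)
print("sum of probabilities:", round(total, 10))
```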
Types Of Probability Distributions
As mentioned above, there are two types of probability distributions. A discrete probability distribution is used when the possible outcomes can be counted one by one. The simplest example is a fair coin toss, where the only possible outcomes are heads or tails, with nothing in between. There are various types of discrete probability distribution, some of which are:
- Poisson, for counting situations, such as the number of televisions sold at a video store per week
- Binomial, for binary situations, such as whether or not it rains; it counts how often one of the two outcomes occurs over a number of independent trials
- Uniform, for situations in which every outcome has the same probability, such as a die roll
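To make the three discrete distributions listed above more concrete, here is a small sketch of their probability mass functions using only Python's `math` module. The function names and all the numbers (seven days, an average of four televisions per week, and so on) are assumptions chosen for illustration, not figures from a real dataset.

```python
import math

def poisson_pmf(k, lam):
    """P(exactly k events) when the average count per interval is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def binomial_pmf(k, n, p):
    """P(exactly k successes) in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def uniform_pmf(n_outcomes):
    """Probability of each outcome when all n_outcomes are equally likely."""
    return 1 / n_outcomes

# Assumed values, for illustration only.
p_two_tvs = poisson_pmf(2, 4.0)               # 2 TVs sold in a week that averages 4
p_three_rainy_days = binomial_pmf(3, 7, 0.8)  # 3 rainy days out of 7, 80% chance each day
p_die_face = uniform_pmf(6)                   # any single face of a fair six-sided die

print(p_two_tvs, p_three_rainy_days, p_die_face)
```

In each case the probabilities over all possible outcomes add up to 1, which is what makes them valid probability distributions.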
A continuous distribution is used when the outcome can take any of an infinite number of values between any two numbers; for example, there are infinitely many values between 0 and 1. The most common continuous distribution is the normal distribution, which is discussed at length in the next sections.
Normal or Gaussian Distribution
The normal distribution is a probability distribution in which most of the data sit around the center, and the probabilities of values away from the center fall off evenly to the left and to the right, with no skew. In this type of distribution, the mean (average), the mode (most frequent value), and the median (middle value) are equal. In the special case of the standard normal distribution, the mean is zero and the standard deviation is 1. The normal distribution is commonly called the bell curve because of its bell shape, as can be seen in the figure above.
The normal distribution has certain characteristics which make it a bit easier to spot, some of which are:
- The mean, median, and mode are equal
- There is no skew (whether left or right), meaning 50% of the values are to the left of the mean and the other 50% are to the right
- The mean and standard deviation are the two parameters that fully describe the distribution
As can be seen in the figure above, the mean, median, and mode are the same. The mean determines where the peak is located; changing the mean shifts the curve along the horizontal axis without altering its shape. A positive adjustment moves the curve to the right and a negative adjustment moves it to the left. The standard deviation, on the other hand, determines how far the data points are spread from the mean: a large standard deviation results in a wider, flatter curve and a small one results in a narrower, taller curve.
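The effect of the mean and the standard deviation can be illustrated with `NormalDist` from Python's standard `statistics` module (available from Python 3.8). This is a rough sketch; the three parameter combinations are assumptions picked only to show the behavior.

```python
from statistics import NormalDist

base = NormalDist(mu=0, sigma=1)
shifted = NormalDist(mu=2, sigma=1)  # larger mean, same spread
wider = NormalDist(mu=0, sigma=2)    # same mean, larger spread

# The height of the curve at its peak (the density at the mean).
peak_base = base.pdf(base.mean)
peak_shifted = shifted.pdf(shifted.mean)
peak_wider = wider.pdf(wider.mean)

# Changing the mean moves the peak but leaves its height unchanged.
print("peak at", base.mean, "height", round(peak_base, 4))
print("peak at", shifted.mean, "height", round(peak_shifted, 4))

# Doubling the standard deviation halves the peak height: the curve is wider and flatter.
print("peak at", wider.mean, "height", round(peak_wider, 4))

# Half of the probability lies below the mean, reflecting the lack of skew.
print("P(X < mean) =", base.cdf(base.mean))
```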
Normal Probability Distribution and Normal Distribution Calculator
With the mean and standard deviation determined, a normal curve can be fitted to the data using the probability density function.
For a probability density function, the area under the curve over a range of values gives the probability of an outcome falling in that range. The normal distribution is described by a probability density function, so the total area under its curve is always 1 (100%). The normal probability density function can be represented mathematically as:
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
Where $f(x)$ is the probability density at $x$, $x$ is a data point, $\mu$ (mu) is the mean, $\sigma$ (sigma) is the standard deviation, and $\sigma^2$ (sigma squared) is the variance. Knowing the mean and standard deviation, you can plug them into the formula to get the value of the density at any $x$.
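As a rough sketch, the density formula can be written directly in Python and the claim about the area under the curve checked numerically. The function names are made up for this example, the mean of 3.5 echoes the baby-weight figure mentioned later in the article, and the standard deviation of 0.5 is purely an assumption.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density at x for mean mu and standard deviation sigma."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma**2)
    return coefficient * math.exp(exponent)

def area_under_curve(mu, sigma, lower, upper, steps=10_000):
    """Approximate the area under the density between lower and upper (trapezoid rule)."""
    width = (upper - lower) / steps
    total = 0.5 * (normal_pdf(lower, mu, sigma) + normal_pdf(upper, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(lower + i * width, mu, sigma)
    return total * width

mu, sigma = 3.5, 0.5  # assumed values, for illustration only

# Integrating far into both tails should give an area very close to 1.
total_area = area_under_curve(mu, sigma, mu - 8 * sigma, mu + 8 * sigma)

# The area within one standard deviation of the mean is a probability.
within_one_sigma = area_under_curve(mu, sigma, mu - sigma, mu + sigma)

print(round(total_area, 6), round(within_one_sigma, 4))
```

Note that `normal_pdf` returns a density, not a probability: probabilities come from areas, which is why the second call integrates over a range.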
Standard Normal Distribution and Bell Curve
The standard normal distribution, also referred to as the z-distribution, is a normal distribution with a mean of 0 and a standard deviation of 1. Every normal distribution can be converted to a standard normal distribution by converting every data point into a z-score. A z-score tells how many standard deviations a value is away from the mean, and can be represented mathematically as:
$$z = \frac{x - \mu}{\sigma}$$
Where $x$ refers to a data point, $\mu$ (mu) is the mean, and $\sigma$ (sigma) is the standard deviation.
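The z-score conversion can be sketched in a few lines of Python. The list of marks below is hypothetical, and the population standard deviation (`pstdev`) is used on the assumption that the list is the whole population of interest.

```python
from statistics import mean, pstdev

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies away from the mean mu."""
    return (x - mu) / sigma

# Hypothetical exam marks, for illustration only.
marks = [52, 60, 68, 75, 85]

mu = mean(marks)
sigma = pstdev(marks)

z_scores = [z_score(m, mu, sigma) for m in marks]

# After the conversion the values have mean 0 and standard deviation 1.
print([round(z, 2) for z in z_scores])
print(round(mean(z_scores), 10), round(pstdev(z_scores), 10))
```

A negative z-score marks a value below the mean and a positive one marks a value above it.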
Normal distributions are converted to z-scores for several reasons, including:
- To compare results from different distributions that have different means and different standard deviations
- To determine the probability that a particular sample's mean is significantly different from a known population mean
- To find the probability of values falling above or below a particular value
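Two of these uses, comparing results across distributions and finding the probability above or below a value, can be sketched with the standard library's `NormalDist`. The two exams and their means and standard deviations are invented for the example, and the marks are assumed to be normally distributed.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies away from the mean mu."""
    return (x - mu) / sigma

# Hypothetical exams with different means and standard deviations.
z_maths = z_score(80, mu=70, sigma=10)   # 1.0 standard deviation above the mean
z_physics = z_score(75, mu=60, sigma=5)  # 3.0 standard deviations above the mean

# The physics mark is lower in absolute terms but stronger relative to its class.
better_subject = "physics" if z_physics > z_maths else "maths"

# Probability of a mark below or above the maths result.
p_below = standard_normal.cdf(z_maths)  # roughly 0.84
p_above = 1 - p_below                   # roughly 0.16

print(better_subject, round(p_below, 4), round(p_above, 4))
```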
Use Cases Of Normal Distributions
Normal distributions are very important in statistics, and these are some cases showing their importance:
- Height of the population. This is a typical example of a normal distribution: within any population, say that of a particular country, roughly as many people are taller than the average as are shorter, so heights take the shape of a normal distribution. Most people are close to the average height, and very tall or very short individuals are comparatively rare
- The income level. Individuals earning more and those earning less fall on either side of the average income. Comparing an income with the average makes it easier for companies and individuals to determine whether they are paying, or being paid, more or less than is typical. In practice, income data are usually skewed to the right by a small number of very high earners, so the normal distribution is only a rough approximation here
- Weight of babies. The average baby weighs around 3.5 kg, and babies weighing less or more than this can easily be identified. A baby whose weight is far below or far above this figure is outside the typical range, which could be key information for doctors
- Students' marks in an examination. Most students' marks would be around the mean. A student whose mark is significantly lower may need extra support to catch up, and a student whose mark is significantly higher can be considered for promotion to a higher class
These are just a few cases where knowledge of normal distribution is important.
In this article, we first looked at probability distributions and their types, then at the normal distribution, the intuition behind it, and various use cases. Knowledge of the normal distribution is essential in statistics, and hence in data science, and it applies to many use cases. Data scientists, analysts, and anyone trying to break into this field should make it a priority to understand this concept, because many other concepts build on it and the foundations need to be strong.