Probability Distributions
Summary
TL;DR: This business analytics lecture focuses on probability distributions, emphasizing fitting distributions to data. It explains discrete and continuous distributions, contrasting probability mass functions with density functions. The session explores three data analysis approaches: trace-driven simulation, theoretical distribution fitting, and empirical distribution creation. The advantages of theoretical distributions over empirical ones are discussed, noting the limitations of relying solely on observed data for predictive modeling.
Takeaways
- Probability distributions are statistical models that show possible outcomes for a given event or action.
- For discrete variables, distributions are represented by possible values with corresponding probabilities; for continuous variables, by a density function.
- The focus of the session is on fitting distributions to data rather than just describing them.
- Trace-driven simulation uses actual collected data directly in simulations without fitting a theoretical distribution first.
- Fitting a theoretical distribution involves checking how well it represents the data, such as normal or uniform distributions.
- If theoretical distributions do not fit well, empirical distributions can be created from the collected data itself.
- Empirical distributions are built from the data collected and are not an attempt to fit a pre-existing model to the data.
- Building an empirical distribution involves arranging data in ascending order and defining a distribution function from rank order statistics.
- For grouped data, a piecewise linear function can represent the distribution function, estimating the proportion of observations in each interval.
- The building blocks of any distribution include density functions, distribution functions, and moments around the mean.
- Empirical distributions are useful when no theoretical distribution fits the data well, but they are limited by the range of the collected data.
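The mass-versus-density distinction in the takeaways can be sketched in a few lines of Python, using `scipy.stats` purely as an illustration (the lecture names no tools, and the specific distributions and parameters below are assumptions):

```python
from scipy.stats import binom, norm

# Discrete: a probability mass function gives a genuine probability at each value.
pmf_at_2 = binom.pmf(2, n=10, p=0.5)         # P(X = 2) for X ~ Binomial(10, 0.5)

# Continuous: a density is not a probability; probabilities live on intervals.
density_at_2 = norm.pdf(2.0)                 # density of the standard normal at x = 2
prob_near_2 = norm.cdf(2.1) - norm.cdf(1.9)  # P(1.9 < X < 2.1), roughly density * width

print(pmf_at_2, density_at_2, prob_near_2)
```

Note that `prob_near_2` is close to `density_at_2` times the interval width 0.2, which is exactly the "density needs a small interval to define a probability" point from the lecture.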
Q & A
What is the primary focus of the second session of the business analytics course?
-The primary focus of the second session is to discuss probability distributions, specifically how to fit a distribution to a given set of data.
What are the two main types of probability distributions discussed in the script?
-The two main types of probability distributions discussed are discrete and continuous distributions.
How are discrete random variables represented in a probability distribution?
-For discrete random variables, the probability distribution is represented by all possible values of the random variable along with the corresponding probabilities for each value.
What is the difference between the representation of a discrete and a continuous probability distribution?
-A discrete probability distribution is represented by probability masses, while a continuous distribution is represented by a density function, where the y-axis represents the probability density instead of the probability itself.
What is the significance of the normal distribution in the context of grades of a course?
-The normal distribution signifies that grades are expected to follow a bell-shaped curve, with a few very high and very low marks, and a majority of students scoring in the middle range.
What is meant by 'trace driven simulation' in the context of using business data?
-Trace driven simulation refers to the direct use of collected data in simulations without fitting a theoretical distribution to the data first. It involves using the actual data points, such as monthly sales volumes, directly in the analysis.
What is a 'theoretical distribution' and how does it differ from an empirical distribution?
-A theoretical distribution is a pre-defined statistical distribution, such as the normal, uniform, binomial, Poisson, or exponential distribution. It differs from an empirical distribution, which is built from the actual data collected, rather than being a pre-defined model.
Why might one choose to create an empirical distribution instead of using a theoretical one?
-One might choose to create an empirical distribution if the collected data does not fit well with any of the available theoretical distributions, allowing for a custom distribution that better represents the data.
What are the building blocks needed to characterize a normal distribution?
-The building blocks needed to characterize a normal distribution include the density function and distribution function, from which parameters like mean, standard deviation, and moments around the mean can be estimated.
How can one build an empirical distribution from ungrouped data?
-To build an empirical distribution from ungrouped data, one can arrange the data in ascending order, calculate rank order statistics, and then define a distribution function based on these ordered values.
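The answer above can be sketched in Python. The sales figures are hypothetical, and the step-function definition F(x) = (number of observations ≤ x) / n is one of several standard ECDF conventions, not necessarily the exact definition used on the lecture slides:

```python
import numpy as np

def empirical_cdf(data):
    """Step ECDF: F(x) = proportion of observations <= x (one common definition)."""
    xs = np.sort(np.asarray(data, dtype=float))  # rank order statistics x_(1) <= ... <= x_(n)
    n = len(xs)
    def F(x):
        # Count how many order statistics are <= x, then normalize.
        return np.searchsorted(xs, x, side="right") / n
    return F

sales = [120, 95, 130, 110, 105, 150]  # hypothetical monthly sales (thousands)
F = empirical_cdf(sales)
print(F(94), F(110), F(151))           # 0 below the minimum, 1 above the maximum
```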
What are the limitations of using empirical distributions compared to theoretical distributions?
-Empirical distributions are limited by the range of data used to create them and may not accurately represent values outside of this range. They can also be biased towards the pattern of the collected data and are not as versatile as theoretical distributions for generating new values for simulations.
Outlines
Introduction to Probability Distributions and Data Fitting
This paragraph introduces the concept of probability distributions as statistical models that represent possible outcomes of events. It distinguishes between discrete and continuous distributions, with the former using probability mass functions and the latter using density functions. The speaker emphasizes the importance of fitting distributions to given data, such as grades following a normal distribution or sales being uniformly distributed, and sets the stage for discussing how to apply this in business analytics.
Methods of Utilizing Business Data in Analysis
The second paragraph delves into how business data can be used in simulations, either through trace-driven simulation, where actual data points are directly used, or by fitting theoretical distributions like normal, uniform, binomial, Poisson, or exponential to the data. The paragraph also introduces the concept of empirical distributions, which are created when theoretical distributions do not fit the data well, and discusses the process of building these distributions from collected data.
Building Empirical Distributions from Data
This section explains how to construct empirical distributions from ungrouped and grouped data. For ungrouped data, the process involves arranging data in ascending order and creating a distribution function based on rank order statistics. For grouped data, a piecewise linear function can be defined using the intervals and the count of data points within each interval. The paragraph highlights the limitations of empirical distributions, such as being restricted by the range of collected data and the potential for bias.
Comparing Data Utilization Approaches and Theoretical Distribution Fitting
The fourth paragraph compares the three approaches to using data: trace-driven simulation, fitting theoretical distributions, and creating empirical distributions. It discusses the use of the first approach for model validation and the preference for theoretical distributions over empirical ones due to their flexibility and lack of bias towards collected data. The speaker also touches on the challenges of empirical distributions, such as their inability to simulate values outside the observed range.
The Limitations and Considerations of Empirical Distributions
The final paragraph focuses on the limitations of empirical distributions, such as their potential bias towards the pattern of collected data and the restriction to the range of observed values. It also discusses the importance of considering theoretical distributions when appropriate, especially in fields like reliability engineering where specific distributions, such as the Weibull distribution, are commonly used and tested for fit.
Keywords
Probability Distribution
Discrete Random Variable
Continuous Random Variable
Density Function
Normal Distribution
Uniform Distribution
Trace Driven Simulation
Theoretical Distribution
Empirical Distribution
Goodness of Fit
Highlights
The session focuses on fitting probability distributions to given data, a crucial aspect of business analytics.
Probability distributions are statistical models representing possible outcomes of events or actions.
Discrete random variables are associated with probabilities for each possible value, while continuous variables are represented by a density function.
The difference between discrete and continuous distributions lies in the representation of probability versus density.
Normal distribution is often assumed for grades in academic settings, indicating a bell-shaped curve with many average scores and fewer extremes.
Uniform distribution is used in business settings, such as sales predictions, where all values within a range are equally likely.
Trace driven simulation uses actual collected data directly in simulations without fitting any distributions.
Fitting a theoretical distribution involves selecting a known distribution like normal, uniform, or Poisson and checking its fit to the data.
Empirical distributions are created when theoretical distributions do not fit the data well, using the actual data to build a distribution.
Building an empirical distribution involves arranging data in ascending order and defining a distribution function from the data.
For grouped data, a piecewise linear function can be created to represent the distribution function of an empirical distribution.
Empirical distributions are built from collected data and are not fitted to the data but rather constructed directly from it.
The building blocks of any distribution include density functions, distribution functions, and moments around the mean.
Trace driven simulation is mainly used to validate existing models by comparing model outputs with actual outcomes.
Theoretical distributions are generally preferred over empirical ones due to their flexibility and lack of bias towards collected data patterns.
Empirical distributions may be limited by the range of the data used to build them, potentially restricting the simulation of values outside observed ranges.
There are compelling reasons to use specific theoretical distributions, such as the Weibull distribution in reliability engineering.
Transcripts
Hi this is the second session of the business analytics course and we are going to discuss
probability distributions.
Most importantly we are going to discuss how are we going to fit a distribution to a given
data.
So, first of all let us do a recap, what are probability distributions?
We have already discussed them in other courses, so what do we recall as probability distributions?
So, essentially probability distributions are some kind of a statistical model that
shows possible outcomes of a particular event or course of action that event may take.
So, essentially probability distributions for a discrete random variable may look like
all the possible values of the random variable along with the corresponding probabilities
that the random variable will take on that particular value.
And for continuous distributions we generally represent that by a density function.
For example if you recall we may have said that the x axis represents the values of the
random variable the y axis represents the probability and for a discrete random variable
we will say that what is the probability that x takes on a value equal to one and then we
would have said some probability what is the probability that x takes on a particular value
2 and we would have said some probability.
So, this is how the probability distribution looks like for a discrete random variable.
Now, for a continuous random variable we still have the same format: the x axis still represents the value of the random variable, and the y axis represents some form of probability, but we do not say probability; if you recall, we say density function. And then we would have drawn something like this for potential values of the random variable x.
What is the difference? In the earlier diagram we had discrete probability masses because the random variable was discrete. Here we have continuous values of the random variable, and therefore we can't really say that there is a probability mass sitting at a particular point.
For example let us say that we are still talking about x taking on a value equal to 2.
We cannot say that this is the probability, this is only the probability density.
So, we only talk about density and for density we need a small interval to actually define
some probability.
So, you recall all of that.
So, the focus of this session is not to re-describe density functions and probability distributions.
The focus of this session is to go one step beyond and say that well I have data now and
how do I fit some distributions to data or what do I do with that data.
So, for example, in academic settings we hear this quite a lot: the grades of a course follow a normal distribution. What do I mean by that? The random variable here is grades, let us say grades out of 100, and it follows a normal distribution, which essentially means that we are going to assume a nice bell-shaped curve
and then some people are going to get a very high mark some people are going to get very
low marks unfortunately and there are whole bunch of people who are going to be in between.
So, that is what we mean by normal distribution once again the y axis represents the density.
So, this is just a recall or sometimes in the business settings we may say something
like this: sales next month are expected to be uniformly distributed.
So, what do we mean by that? I may say that sales next month can be as low as a hundred thousand dollars or as high as two hundred thousand dollars. But instead of assuming a normal distribution, we put sales on the x axis, here is your 100,000, here is your 200,000, and we say that it is uniformly distributed.
So, you know what uniform distribution is once again y axis represents the density.
So, these are essentially probability distributions normal distribution uniform distribution.
We have taken two examples of continuous distribution but you get the idea.
So, that's how we define probability distributions that's how we use probability distributions.
So, now, how are we going to go about using data? Let us say I have collected business data: it may be about sales volumes, about defaulters on loans, or about the salary hikes that employees got in a particular year. It may come from any business context. For this kind of data, we can directly use the data in our simulations; there is no need to fit any distributions.
This is typically called trace driven simulation.
So, let us say that we have collected sales volume over a period of time.
Let us say we have a monthly sales volume for the last three years which essentially
means that I have 36 values in my data-set.
So, instead of first fitting a distribution to the 36 values and then using the distribution
in my further analysis I can directly use these 36 values in my analysis.
So, if I want to simulate I will simulate directly using these 36 values, this is generally
called trace driven simulation.
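A minimal sketch of trace-driven simulation, assuming hypothetical sales figures and a toy cash-flow model (the fixed cost and the model itself are invented for illustration; the point is only that the 36 observed values drive the model as-is, in recorded order, with no distribution fitted in between):

```python
import random

# Hypothetical trace: 36 monthly sales volumes (thousands of dollars).
random.seed(0)
observed_sales = [round(random.uniform(100, 200), 1) for _ in range(36)]

def simulate_cash(monthly_sales, fixed_cost=110.0):
    """Toy model: cumulative cash position after feeding in the trace.
    The fixed_cost figure is an illustrative assumption, not from the lecture."""
    cash = 0.0
    for s in monthly_sales:
        cash += s - fixed_cost   # each observed month is used directly
    return cash

print(simulate_cash(observed_sales))
```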
The second method is to actually fit a theoretical distribution.
What do you mean by theoretical distribution, theoretical distribution is all these things
that we spoke about earlier normal distribution, uniform distribution, binomial distribution
for discrete, Poisson distribution for discrete, exponential distribution for continuous these
are all theoretical distributions.
So, for the sales volume data that I have, I may try to, quote unquote, fit a distribution to my data.
And obviously I cannot simply say OK normal distribution fits very well I have to go beyond
that and I have to actually check whether the fit that I have assumed is actually good.
And I am using these terms in a very deliberate way because these are precisely the technical
terms which are going to be helpful later on.
So, we are always going to say we are going to fit a distribution, and we are going to check how good this fit is.
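The fit-then-check workflow described here can be sketched with `scipy.stats`. The synthetic data, the normal candidate, and the Kolmogorov-Smirnov test are illustrative choices, not the lecture's prescription:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=150.0, scale=20.0, size=36)   # stand-in for 36 sales values

# Step 1: fit the candidate distribution by maximum likelihood.
mu, sigma = stats.norm.fit(data)

# Step 2: check the goodness of fit, e.g. with a Kolmogorov-Smirnov test.
# (Caveat: estimating mu and sigma from the same data makes this p-value optimistic.)
result = stats.kstest(data, "norm", args=(mu, sigma))
print(mu, sigma, result.pvalue)
```

A large p-value here means there is no strong evidence against the assumed fit; a small one suggests trying a different theoretical family, or falling back to an empirical distribution as discussed below.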
Now let us say that the business data we have collected is a particularly tricky data set, and it does not fit well with a lot of theoretical distributions, or, the other way around, most of the theoretical distributions do not fit our data.
What are we going to do?
Well, it is not the end of the world. Instead of trying to fit already available distributions like a negative binomial or a double exponential to the data, we can actually create our own distributions. This is a bit like making up rules as we go along, but we create our own distributions, and those distributions are called empirical distributions.
So, the sales volume data that I already spoke about using that data we say that well what
would be the distribution where these 36 values could have come from.
So, using these 36 values we build our own empirical distribution and use that distribution
in our future analysis.
Now, what are these empirical distributions? Have you discussed empirical distributions in your earlier courses? Most probably you have.
So, let us quickly recall that.
So, what are these empirical distributions?
Empirical distributions are essentially distributions built from the data that we already have collected.
We are not fitting a distribution to the data we are actually building a distribution from
the data that we have collected please notice the difference.
So, let us go beyond that. How does one build a distribution? First of all, what are the building blocks when we say we are building a distribution? Take the simplest example, the normal distribution. If I want to characterize a normal distribution, what would I need? Well, we will need the building blocks. So, what are these building blocks?
The essential building blocks of any distribution are the density function, the distribution function, and some moments: the first moment around the mean, the second moment around the mean, and so on, which can also be built from the density. We have to estimate these parameters. So, essentially, defining a distribution means identifying a density function or a distribution function; from the density function you can derive the building blocks like the moments around the mean, the mean, the standard deviation, and so on.
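The moments mentioned here can be estimated directly from a sample; the data values below are hypothetical:

```python
import numpy as np

data = np.array([120.0, 95.0, 130.0, 110.0, 105.0, 150.0])  # hypothetical sales

mean = data.mean()                    # the mean itself
var = ((data - mean) ** 2).mean()     # second moment around the mean (variance)
std = np.sqrt(var)                    # standard deviation
third = ((data - mean) ** 3).mean()   # third central moment (captures skewness)

print(mean, var, std, third)
```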
Let us take an example of how to build an empirical distribution.
So, let us say the data is ungrouped.
So, let us say that we have collected X1 X2 X3 values.
So, the X1 value, X2 value, X3 value and let us say all the way to X36.
These are our 36 sales volume data for 36 months in our data set.
Now what we are going to do is arrange them in ascending order. X1 was the first value recorded, in the first month, but we are now going to arrange the values in ascending order, where the smallest value is called X bracket 1, the second smallest value is called X bracket 2, and the largest value is called X bracket n, in our case X bracket 36. X bracket 36 may not be the sales volume in the 36th month; it is actually the maximum sales volume that we have found in our data set.
These are called rank order statistics; let us not worry too much about the term.
Once we have arranged the data in ascending order, we can actually define a distribution function in this way. This is not our own creation; these definitions are available in any standard statistics textbook. And this is only one way; by no means are we saying that this is the only way of defining a distribution function.
Now once we get a distribution function we all know how to get a density function and
from density function we know how to get moments around the mean.
This is for ungrouped data.
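A sketch of this construction for ungrouped data, assuming the common textbook convention F(x_(i)) = (i - 1) / (n - 1) at the i-th order statistic with linear interpolation in between (the slide's exact definition is not reproduced in the transcript, so this is one standard choice of many):

```python
import numpy as np

def ecdf_ungrouped(data, x):
    """Piecewise-linear empirical distribution function for ungrouped data:
    F(x_(i)) = (i - 1) / (n - 1) at each order statistic, linear in between."""
    xs = np.sort(np.asarray(data, dtype=float))   # rank order statistics
    knots = np.arange(len(xs)) / (len(xs) - 1)    # 0 at x_(1), ..., 1 at x_(n)
    return float(np.interp(x, xs, knots))         # clamps to 0 below min, 1 above max

data = [12.0, 7.0, 9.0, 15.0, 11.0]   # hypothetical observations
print(ecdf_ungrouped(data, 9.0))      # at the 2nd order statistic: (2 - 1) / (5 - 1)
```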
Now, suppose the data were grouped, meaning that I only know that one interval has ten values, another has eight values, and some other interval has five values. So, let us say we define k intervals, and I know that the first interval has n1 values, the second has n2 values, the third has n3 values, and the kth interval has nk values, which together give me my total sample size of n.
What we can do is create a piecewise linear function G, where each G of aj is essentially the proportion of the observations up to that interval. Once again, this is by no means a unique way of defining a distribution function.
Once again, notice that this is a distribution function. Why? Because the value below the smallest value is 0 and the value beyond the highest value is 1, which matches the typical definition of a distribution function going from zero to one.
And once again our usual methods are going to kick in where we have a distribution function
from there we get the density function and so on.
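A sketch of the grouped-data construction, with hypothetical interval endpoints and counts; G is the piecewise-linear distribution function described above:

```python
import numpy as np

# Hypothetical grouped data: k = 3 intervals with endpoints a_0 .. a_3
# and counts n_1, n_2, n_3; total n = 23.
edges = np.array([0.0, 10.0, 20.0, 30.0])   # interval endpoints a_0 .. a_k
counts = np.array([10, 8, 5])               # n_j observations in each interval

# G(a_j) = proportion of observations up to the j-th endpoint, linear in between.
cum = np.concatenate(([0.0], np.cumsum(counts) / counts.sum()))

def G(x):
    return float(np.interp(x, edges, cum))  # 0 below a_0, 1 above a_k

print(G(10.0), G(25.0), G(30.0))
```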
So, these are examples of how we can build empirical distribution.
Let us go back: why did we build these empirical distributions in the first place? We have collected data, and that data may come from any context: it may be sales or marketing data, financial analysis data, or stock price data.
So, let us say that a technical analyst wants to invest in the stock market. Now, what are technical analysts? Why don't you search for it, and we will describe it in the next sessions.
So, technical analysts let us say that they want to invest and for their investment decisions
they have collected stock prices for the last three months.
Let us say that I actually have tick-level data; tick-level data means I get data not every hour of a trading day, but maybe every minute or every second. So, the data set will be huge.
Now I want to decide whether the stock is going to move up or down. I have a massive data set of all the stock prices for the last three months, and I am asking: tomorrow the market opens at 10 o'clock, what is going to be the opening price of this particular stock for which I have collected data? How do we go about doing this? We said the first option is to just use the three months of plain data that you have collected, the same values.
That would be called trace driven simulation.
The second approach would be: for the three months of data that you have collected, why don't you fit a distribution? There has been plenty of research on what is a good fit for stock price data.
Obviously everybody wants to crack that problem, and very clearly I have not solved it, because if I had, I would not be sitting here at 11 o'clock; I would be using my distribution and playing the market.
So, you can fit a distribution for the three months of data that you have collected and
I have a whole bunch of candidate distributions available.
Normal distribution, uniform distribution, lognormal distribution, Weibull distribution, the full family, not just the full family, the full forest.
And the third way: the three months of data I have collected is for a fairly weird stock, none of the distributions fits the data nicely, and therefore I want to define my own distributions. That is how we got into empirical distributions.
So, these are two examples of how to build empirical distributions from the data that
we have.
Now let us go back and go to step number two what if I want to fit theoretical distributions
how do I go about doing that.
So, before we do that let us quickly take a look at how these three approaches compare
with each other.
Usually approach one, which uses the plain three months of data, is used to validate the models.
You already have a model and its output, and you want to validate whether that output is correct or not. So, you push these three months of data into your model, the model generates an output, and you compare that output with reality, with the existing system, which is what happens tomorrow, and check whether they match.
So, essentially, trace-driven simulation is mainly used to validate a model that you may already have built using some different approach. You have some prior knowledge of how to build models for stock prices, you have already done that, and now you want to check whether that model is correct.
And therefore you feed these three months of data into that model; whatever comes out of the model should match what happened in reality, or at least come close.
The drawback of this approach is that you are going to test your model only with the data that you have collected. Going back to the sales volume data, you only have 36 values, so your model is going to be tested using only the 36 values that you have actually observed and fed into the model, and that may not be enough.
Even with three months of minute-level data on the stock prices, let us say the stock price was fairly stable during these three months and there was no turbulence in the market. How will you test whether your model works well in a turbulent period? The data that you have collected will not give you that simulation, because it was collected from a fairly stable stock market period.
So, those are some of the problems.
Approaches 2 and 3, building your own distribution or using a theoretical distribution, avoid these problems to an extent.
Because what you can do is once you have built a distribution you can generate values from
those distributions which are not restricted to the 36 values that you have actually observed
in your sample.
So, compared to approach 1 I would say approach 2 and 3 are preferable that way.
However if you can actually find a theoretical distribution that fits your data I would generally
avoid building empirical distributions.
Therefore I would say that theoretical distributions are preferred over empirical distributions.
The problem with empirical distributions is very similar to the problem that we have for
approach one.
Now, when you build an empirical distribution from the data that you have, the shape of the distribution is completely governed by the data that you used to build the distribution function.
Remember your distribution functions, your distribution functions are built from the
data that you have.
So, the shape of the distribution will be completely governed by your data.
Now once again if the data is of a particular pattern then quite likely that the distribution
will be biased towards that.
The other problem is that the distributions we build are usually restricted by the smallest and the largest observed value. The distribution is 0 for all values less than the smallest value you have observed, and 1 beyond the maximum value you have observed in the sample, which may not be true. Just because this is the smallest value observed in the sample does not mean sales cannot be lower, and just because this is the largest value observed does not mean sales cannot be higher. However, the distribution built from these data will pretty much say so: the probability of finding a sales volume less than the smallest value is zero, and the probability of finding a sales volume bigger than the maximum value is again almost zero.
So, those are the problems.
So, we are still not able to go beyond whatever we have observed in our sample.
So, those are the problems. If we want to test the validity of our system using data that comes from an empirical distribution, we may have a problem, because we cannot simulate values outside the range that was fed in. Those are some of the issues with empirical distributions.
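The range limitation can be demonstrated by sampling from a piecewise-linear empirical CDF via inverse transform sampling (hypothetical data; this is one of several ways to generate values from an ECDF):

```python
import random

def sample_empirical(sorted_data, n_samples, seed=0):
    """Inverse-transform sampling from the piecewise-linear empirical CDF.
    Every generated value lies inside [min(data), max(data)] by construction,
    which is exactly the limitation discussed in the lecture."""
    rng = random.Random(seed)
    xs = sorted_data
    n = len(xs)
    out = []
    for _ in range(n_samples):
        u = rng.random() * (n - 1)   # uniform position along the CDF knots
        i = int(u)                   # index of the left order statistic
        frac = u - i
        out.append(xs[i] + frac * (xs[i + 1] - xs[i]))
    return out

data = sorted([120.0, 95.0, 130.0, 110.0, 105.0, 150.0])
draws = sample_empirical(data, 1000)
print(min(draws), max(draws))   # never below 95.0 or above 150.0
```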
Now, there may be some compelling reasons for using a particular theoretical distribution. For example, let us say you have data about reliability. The Weibull distribution has very high importance in reliability engineering, so for any data about reliability I would like to test whether it fits the Weibull family, whether it comes close. In those cases too, why not test a theoretical distribution first? So, that is the difference between fitting a theoretical distribution and building an empirical distribution.