Logit model explained: regression with binary variables (Excel)
Summary
TLDR: This video from Nettle, a platform for distance learning in business and finance, is hosted by Sava, who delves into the logistic regression model, a vital tool for estimating regression models with binary dependent variables. The tutorial covers the model's application in credit scoring, predicting loan defaults, and emphasizes the importance of variables like homeownership and full-time employment. It guides viewers through the process of estimating the model, optimizing coefficients, and interpreting results to assess credit risk, concluding with a practical example of scoring a loan application.
Takeaways
- 📚 The script introduces the logit model, also known as logistic regression, as a statistical technique for estimating regression models with a categorical or binary dependent variable.
- 🏦 The logit model is commonly used in finance and economics for applications such as credit scoring, predicting recessions, and analyzing exam success rates.
- 🔢 The model requires a balanced dataset with a significant portion of both zeros (non-defaults) and ones (defaults) to function effectively.
- 🏠 Two important categorical variables considered in credit scoring are homeownership and full-time employment status, which are used as predictors for loan default.
- 💰 Continuous variables like income, expenses, assets, and loan amounts are transformed into interpretable indicators, such as the natural logarithm of the ratio of expenses to income, to assess the likelihood of loan repayment.
- 📈 The logit model uses the logistic distribution function to estimate the probability of the dependent variable, ensuring the output is bounded between 0 and 1, suitable for probability estimation.
- 🔍 The model's coefficients are optimized by maximizing the log likelihood function rather than minimizing the squared sum of residuals, as in ordinary least squares regression.
- 📊 The variance of the estimator is calculated using the inverse of a matrix product that includes the weight matrix, which accounts for the heteroskedasticity inherent in binary outcomes.
- 📉 The significance of the model's predictors is determined by calculating z-statistics and p-values, which help identify variables that are statistically reliable.
- 🏢 The script provides a practical example of how the logit model can be applied in a bank's credit scoring process to assess an individual's creditworthiness.
- 👍 The video concludes by emphasizing the importance of understanding the logit model for various applications in business, finance, and economics.
Q & A
What is the primary purpose of the logit model discussed in the video?
-The primary purpose of the logit model, also known as logistic regression, is to estimate regression models when the response variable is categorical or binary, which is common in fields like finance and economics.
Why is the logit model preferred over multiple linear regression for binary outcomes?
-The logit model is preferred because it restricts the estimated values of the dependent variable to be between zero and one, making it suitable for estimating probabilities, unlike multiple linear regression which can yield values outside this range.
What are some applications of the logit model mentioned in the video?
-Some applications include credit scoring, predicting recessions, and determining success or failure in exams. These applications often involve predicting a binary outcome based on various explanatory variables.
What is the significance of having a balanced sample in the logit model?
-A balanced sample, where there is a roughly equal number of zeros and ones for the binary outcome, is important for the logit model to function properly. An overwhelming majority of one outcome would make the model less effective.
What are the two main categorical variables considered in the video for predicting loan default?
-The two main categorical variables are homeownership and full-time employment status, as these are considered important factors when deciding whether to grant a loan.
How are income, expenses, assets, and loan amount transformed into explanatory variables for the logit model?
-They are transformed into interpretable indicators such as the natural logarithm of the ratio of expenses to income, leverage after the loan is granted, and the natural logarithm of the loan amount over typical income to measure repayment time.
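These transformations can be sketched in Python as follows (the function and variable names here are illustrative, not from the video):

```python
import math

def make_features(income, expenses, assets, debt, loan):
    """Transform raw applicant data into the three indicators described
    above. Names are illustrative, not from the video."""
    spend_ratio = math.log(expenses / income)             # ln(expenses / income)
    leverage = math.log((debt + loan) / (assets + loan))  # ln(total debt / total assets) after the loan
    repay_time = math.log(loan / income)                  # ln(loan amount / typical income)
    return spend_ratio, leverage, repay_time

# Example: household earning 150k, spending 120k, with 1,000k in assets,
# no prior debt, applying for a 1,000k loan (the video's worked example)
print(make_features(150, 120, 1000, 0, 1000))
```

The natural logarithm is used purely for scaling, so that large ratios do not produce extreme outliers at the top of the distribution.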
What is the logit transformation used for in the logit model?
-The logit transformation converts the odds value, the exponential of the weighted sum of explanatory variables and coefficients (which the video refers to as the logit), into an estimated probability that ranges from zero to one.
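A minimal sketch of this step in Python (the feature values in the example call are placeholders, not fitted results):

```python
import math

def logistic_probability(x, betas):
    """Logistic transformation: with z = b0 + b1*x1 + ... + bk*xk,
    the odds are exp(z) (what the video calls the logit) and the
    estimated probability is exp(z) / (1 + exp(z)), always in (0, 1)."""
    z = betas[0] + sum(b * v for b, v in zip(betas[1:], x))
    odds = math.exp(z)
    return odds / (1.0 + odds)

# With all coefficients at zero, the estimate is 0.5 -- a coin toss,
# exactly as in the video's starting point
print(logistic_probability([1, 1, -0.22, -0.69, 1.9], [0, 0, 0, 0, 0, 0]))
```

Because exp(z)/(1 + exp(z)) is bounded between zero and one for any real z, the explanatory variables can be categorical or unbounded real numbers without the estimated probability ever leaving the valid range.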
How is the optimal value of coefficients in the logit model determined?
-The optimal values of coefficients are determined by maximizing the log likelihood function, which is done using a solver to find the values that best fit the model to the data.
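To illustrate what Excel's Solver is doing here, the log likelihood can be maximized with plain gradient ascent (a sketch on a tiny made-up sample; the video's actual data and Solver settings are not reproduced):

```python
import math

def log_likelihood(y, X, betas):
    """Total log likelihood: sum over observations of
    y*ln(p) + (1 - y)*ln(1 - p)."""
    total = 0.0
    for yi, xi in zip(y, X):
        z = sum(b * v for b, v in zip(betas, xi))  # xi starts with 1 for the constant
        p = 1.0 / (1.0 + math.exp(-z))             # same as exp(z)/(1 + exp(z))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def fit_logit(y, X, lr=0.1, steps=5000):
    """Maximize the log likelihood by gradient ascent, standing in for
    Excel's Solver. No sign constraints are imposed on the coefficients."""
    betas = [0.0] * len(X[0])
    n = len(y)
    for _ in range(steps):
        grad = [0.0] * len(betas)
        for yi, xi in zip(y, X):
            z = sum(b * v for b, v in zip(betas, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, v in enumerate(xi):
                grad[j] += (yi - p) * v            # gradient of the log likelihood
        betas = [b + lr * g / n for b, g in zip(betas, grad)]
    return betas

# Tiny illustrative sample: a constant plus one predictor
X = [[1, 0], [1, 1], [1, 2], [1, 3], [1, 4], [1, 5]]
y = [1, 1, 0, 1, 0, 0]
betas = fit_logit(y, X)
print(betas, log_likelihood(y, X, betas))
```

As in the video, the sign of the log likelihood itself is not meaningful; only its increase from the starting value (all coefficients zero) matters.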
What is the purpose of calculating the covariance matrix in the context of the logit model?
-The covariance matrix is used to estimate the variance of the estimator for the coefficients, which in turn allows for the calculation of standard errors and z-statistics to test the significance of the coefficients.
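A sketch of that calculation in Python (observation count, probabilities, and design matrix below are made up for illustration):

```python
import math
import numpy as np

def logit_covariance(X, p):
    """Covariance matrix of the coefficient estimator: inv(X' W X),
    where W is diagonal with entries p_i * (1 - p_i) -- the binomial
    variance of each observation's outcome (the heteroskedasticity
    the logit model accounts for)."""
    W = np.diag(p * (1 - p))
    return np.linalg.inv(X.T @ W @ X)

def two_sided_p(z):
    """Two-tailed p-value under the standard normal: 2 * (1 - Phi(|z|))."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# Illustrative: four observations, a constant plus one predictor,
# with made-up fitted probabilities
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
p = np.array([0.8, 0.6, 0.4, 0.2])
cov = logit_covariance(X, p)
se = np.sqrt(np.diag(cov))   # standard errors = sqrt of the diagonal
print(se, two_sided_p(1.96))
```

This mirrors the spreadsheet's MINVERSE(MMULT(TRANSPOSE(X), MMULT(W, X))) array formula: the z-statistic for each coefficient is the coefficient divided by its standard error, then tested against the standard normal.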
How can the logit model be used to predict an individual's likelihood of defaulting on a loan?
-By inputting an individual's specific data into the model, such as homeownership status, employment status, income, expenses, assets, and loan amount, the model can calculate the probability of default, which can then be used to assess creditworthiness.
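The scoring step can be sketched end to end like this (the coefficient values below are made up for illustration, not the video's estimates):

```python
import math

def score_applicant(betas, homeowner, fulltime,
                    income, expenses, assets, debt, loan):
    """Probability of default for a new applicant: build the feature
    vector (constant first), then apply the logistic transformation.
    The betas are whatever the fitted model produced."""
    x = [1.0, homeowner, fulltime,
         math.log(expenses / income),
         math.log((debt + loan) / (assets + loan)),
         math.log(loan / income)]
    z = sum(b * v for b, v in zip(betas, x))
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical coefficients (NOT the video's estimates), scoring the
# worked example: homeowner, full-time employed, 150k income,
# 120k expenses, 1,000k assets, no prior debt, 1,000k loan requested
betas = [-1.0, -0.3, -0.8, 0.5, 1.2, 0.6]
p = score_applicant(betas, 1, 1, 150, 120, 1000, 0, 1000)
print(round(p, 4))
```

The bank would then compare this probability against its risk threshold (or the sample default rate) to decide whether the applicant is creditworthy.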
Outlines
📚 Introduction to Logit Model in Business Analytics
The video script introduces the logit model, also known as logistic regression, as a fundamental statistical technique for estimating regression models with a categorical or binary dependent variable. Common applications include credit scoring, predicting recessions, and analyzing educational outcomes. The script emphasizes the importance of having a balanced dataset with a significant number of both outcomes (0s and 1s) for the model to be effective. It also outlines the variables considered in the model, such as homeownership, employment status, income, expenses, assets, and loan amounts, and how they are transformed into interpretable indicators for the analysis.
🔍 Exploring Variables and the Logit Model's Application
The script delves into the creation of explanatory variables from raw data, such as the natural logarithm of the ratio of expenses to income, leverage, and the time needed to repay a loan based on income. These variables are then used in the logit model to estimate the probability of default. The logistic distribution function is highlighted for its role in scaling the variables to estimate probabilities between zero and one, making it suitable for categorical variable modeling. The process of calculating the logit and transforming it into an estimated probability is explained, setting the stage for the optimization of coefficients.
📈 Maximizing Log Likelihood in Logit Model Estimation
The script explains the process of optimizing the coefficients in the logit model by maximizing the log likelihood function, which is different from the least squares approach used in multiple regression. The log likelihood is calculated based on the actual and estimated values of the dependent variable, and the total log likelihood is maximized by adjusting the coefficients using a solver tool. The resulting optimal values of coefficients are interpreted to understand their impact on the probability of default, with the script noting the intuitive nature of the relationships found between the variables and the likelihood of default.
📊 Understanding Heteroskedasticity and Variance Estimation
The script discusses the heteroskedastic nature of the data in categorical variable estimation and the need to calculate a weight matrix based on the variance of individual probabilities. This is crucial because the logit model accounts for the different variances across observations, unlike linear probability models. The weight matrix is constructed using the diagonal of the probability estimators, and the covariance matrix of the estimator is derived to obtain standard errors for the coefficients, which are essential for statistical significance testing.
🔑 Statistical Significance and Predictive Power of the Logit Model
The script concludes with an analysis of the statistical significance of the coefficients, using z-statistics and p-values to determine which variables significantly predict the likelihood of default. It highlights full-time employment and leverage as significant predictors, while other variables like homeownership and the expense-to-income ratio are less significant. The script also demonstrates how to use the model to predict the default probability for a new applicant, showcasing the practical application of the logit model in credit scoring.
Keywords
💡Logit Model
💡Categorical Variable
💡Binary Outcome
💡Credit Scoring
💡Dependent Variable
💡Independent Variables
💡Logistic Distribution
💡Odds Ratio
💡Log Likelihood
💡Coefficients
💡Heteroskedasticity
💡Standard Errors
💡Z-Stats
Highlights
Introduction to the Logit Model, also known as Logistic Regression, a technique for estimating regression models with a categorical or binary dependent variable.
The importance of subscribing and supporting the channel for consistent delivery of educational content.
The Logit Model's application in finance and economics for predicting outcomes like credit scoring and recessions.
The significance of the dependent variable being categorical in the Logit Model, allowing for binary outcomes such as 0 or 1.
The necessity of a balanced sample for the Logit Model to function properly, avoiding a majority of one outcome over the other.
Exploration of categorical variables like homeownership and full-time employment as predictors in the Logit Model.
Transformation of continuous variables such as income, expenses, assets, and loan amount into interpretable indicators for the Logit Model.
The use of the natural logarithm to scale variables and mitigate the impact of outliers in the model.
The process of estimating the Logit Model parameters using the odds ratio and logistic distribution function.
The distinction between the Logit Model and multiple linear regression in terms of the dependent variable's bounds.
The optimization of coefficients in the Logit Model by maximizing log likelihood rather than minimizing squared residuals.
The calculation of the covariance matrix and standard errors for the coefficients to test statistical significance.
The use of z-statistics and p-values to determine the significance of the model's predictors.
Practical application of the Logit Model in credit scoring to assess an individual's probability of defaulting on a loan.
The intuitive understanding of the model's findings, such as the positive impact of full-time employment on reducing the probability of default.
The conclusion summarizing the Logit Model's utility in credit scoring and its potential applications in various fields.
Transcripts
hello everyone and welcome again to
nettle
the best platform around for distance
learning in business
finance economics and much much more
please don't forget to subscribe to our
channel and click that bell notification
button below
so that you never miss fresh videos and
tutorials you might be interested in
many thanks to our current patreon
supporters for making this video
possible
and would also greatly appreciate if you
consider supporting us as well so please
check the link in description for more
details
my name is sava and we are going to
investigate the logit model
or as it's also called the logistic
regression
it's a go-to technique for estimating
regression models
when your response variable so your
dependent variable your y
is categorical or binary often in
finance
and economics you have got your
dependent variable
be not a real number but a categorical
number
so it can only be 0 1 or can only take
some limited number of outcomes most
importantly
and some of the most important
applications of the logit model
are credit scoring predicting whether
your borrower would default on their
debt
well zero being non-default
everything goes according to plan
and one being default and you are pretty
interested
in figuring out which characteristics of
your borrower
predict defaults so you could allocate
lending as a bank more efficiently and
minimize your credit risk
alternatively in economics you can be
interested in predicting recessions
zero being no recession and one being
recession and figuring out which
macroeconomic or some other indicators
perhaps
can forecast recessions
quite a legitimate research task or in
education we might be interested in
what predicts or determines uh success
or failure in an exam
how does the number of hours studied in
preparation to an exam
contribute to success rate and all of
these questions are also legitimate and
can be answered
using the logit model so without further
ado
let's try and estimate the logit model
on
a very textbook case for its application
that is credit scoring and predicting
retail
loan default so here we have got a
sample that's a real world
testing data sample of 500 applicants
that either
defaulted or haven't defaulted on their
consumer credit
so we have got zeros denoting
non-defaults so
the individuals repaid their debt and
ones being
default so here you can see that among
the 500
individuals that have applied for the
loan and been granted a loan
127 have defaulted and that's quite
important
for our binary choice models login
included
to be functioning properly you need a
sizeable chunk
of your sample to have either
zeros or ones if the overwhelming
majority like 95 percent
of your sample is either zeros or ones so
if almost everyone have repaid or
almost everyone have defaulted then the
logit model would have been not the best
choice however here as of what roughly a
quarter of our sample
defaulting logit model is appropriate so
that out of the way
we can study other variables that we've
got here so
the two most important perhaps
categorical variables that you could ask
for
when deciding whether to grant someone
a loan or not
is to ask if they have got any real
property on their hands
whether they are homeowner or not
whether they can pledge
their property as a collateral perhaps
for their loan
or whether they have such uh property as
an asset that could back them up if they
encounter some
difficulties in repaying and also
whether they have a full-time job so
whether they are full-time employed
and those two variables are again binary
variables that
are treated here as independent as
predictors as explanatory variables
for our why which is default or not
and we have got also some real numbers
as our explanatory variables
and we'll try and transform them into
some predictors that we could later use in
our logit model
so we've got income and expenses of our
borrower's household in thousands of
dollars per year
so we can see that this particular
household earns roughly
129 000 a year and spends
73 000 out of that we can also
record uh the assets and debt right now
before the loan is granted
of this particular household and the
amount of the loan they're asking for
so this data can be then coded into
other explanatory variables that we can
later use in our logit model
so first of all let's consider three
variables that transform
our income expenses assets debt and loan
amount data
into some more interpretable indicators
for example we can figure out the
natural logarithm of the ratio of
expenses to income
and figure out how thrifty
how likely to save their income this
household is
and we expect that the more they save
and the less they spend
the more likely they would be to
repay their loan as they would be more
disciplined they would have more spare
cash
to meet their monthly payment schedule
and so on and so forth
so this is a legitimate variable to
consider
then we can consider leverage as in
that to assets like we do
for uh corporate analysis but here we've
got individuals
and their outstanding assets and that
and the debt that they're planning to
take
and they plan to fund some asset
purchases with such a debt perhaps they
want to buy
a new house or a new car or the like so
here we can figure out natural logarithm
this is just for
scaling purposes so that we have got no
outliers
at the top so we can figure out the
natural logarithm of the
leverage of our applicant after they
would have been granted
the loan so in the numerator we would have
total debt
which is outstanding debt that before
they were granted a loan
plus the loan amount divided by their
assets after they were granted
a loan so assets plus the loan amount
here we assume that they'll spend the
loan amount to purchase
some assets for themselves
and here we can also figure out one
further variable that would basically
mean
how long would this individual need to
keep
earning to repay their loan and that
can be calculated as natural logarithm
of the loan amount
over their typical income
and here we have got five explanatory
variables which is
more than enough for a typical logit
model especially with
500 observations so here we can fill it
all the way down
and now start figuring out how to
calculate
the optimal values of these parameters
which denote
just the constant the constant term as
in all regression models
and our coefficients for our explanatory
candidate variables
in the logit model what you utilize is
the
odds ratio and you try and
estimate the probability of default
conditional
on these values of explanatory variables
and as it is also called the logistic
regression
it should not be a surprise that the
equation that relates the predicted
values
of y the variable of choice the
probability of default
utilizes the distribution function of
the
logistic distribution and we have got a
video on the logistic distribution our
channel already
so please check this out later on if
you're interested so here we use the
logistic distribution logic
to relate these variables they can be
anything they can be bounded or
unbounded they can be categorical or
real numbers
and so on and so forth so we basically
scale them
to meet the criteria that the
probability can be estimated from
zero to one and this is what
makes logistic regression and logit more
applicable
to categorical variable modeling than
for example
just multiple linear regression because
in multiple linear regression you can
theoretically get
estimated values of your dependent
variable outside of
zero to one bounds and that could be
tricky to interpret it in context of
probability of default for example or
probability of recession a probability
of failing
or passing an exam and the like
this transformation the logit
transformation allows to avoid that
so here we can calculate the logit which
is just the
exponent of the weighted sum of our
explanatory variables and the
coefficients so we can use sumproduct
and refer to our coefficients over here
and lock in the row as coefficients stay
the same for all observations
and referring to explanatory variable
plus a constant
for simplicity over here corresponding
to a particular observation
and here we calculate the value of the
logit and unsurprisingly
as all the coefficients are zeros as for
now the value of the logit is
one as the exponent of zero is one and
here we can use this logistic
transformation
to convert this logit value into our
estimated probability
so we just divide our logit by one plus
the logit
and get a default estimated probability
of 0.5
which is unsurprising if you know
nothing about it
you can just assume it's a coin toss
it's 50 50
either you default or not quite
intuitive isn't it
and then we can drag the fill handle
all the way down and estimate it for
all of our observations all of our 500
loan applicants and here it's
actually important to understand how one
might
optimize these coefficients to arrive at
the best fit
possible and here is where the logit
model
differs from multiple regression in
terms of
the function that you try and optimize
for multiple regression
you try and minimize the squared sum of
residuals
while for the logit model the most
robust approach
possible is to maximize log likelihood
and the log likelihood is defined
in terms of actual variables actual
dependent variables
that are zeros and ones y i over here
and the estimated values the expected
values of y
i denoted here as y bar and we can
estimate our log likelihood for every
single observation
by just multiplying our default
categorical variable
onto the natural logarithm of our
estimated probability
plus one minus the default categorical
variable
times the natural logarithm of one minus
the calculated probability
and we can fill this down all the way
down and then we can calculate the total
log likelihood which would be the sum of
log likelihoods across
the whole sample and now we can
maximize our log likelihood by varying
the coefficient parameters
this can be done using solver so we can
go data solver
and specify our optimization task so we
want to maximize
the value of log likelihood currently at
cell
l1 we want to maximize it so here should
be max
and we need to change variable cells
that correspond to
b0 b1 all the way to b5 that are the
constant term
and the coefficients for all five
explanatory variables so we select
the array c2 to h2 and we don't want to
impose any constraints on our parameters
as theoretically any of these parameters
could impact
probability of default either positively
or negatively we just have some
reasonable suspicions whether for
example
leverage would affect the default
probability positive or negatively
but the whole purpose of logit
estimation is to test
these suspicions these hypotheses so we
need to untick
this box to make uh all parameters
either positive or negative depending on
what maximizes log likelihood
and click solve and wait until the
algorithm converges
to an optimal solution and we can see
that we have converged
to an optimal value of log likelihood we
can see it increased from roughly minus
350
to minus 233 here the negative value of
log likelihood should
not um be of concern it
doesn't really matter whether it's
positive or negative it only
matters in comparative terms whether it
have increased or decreased and we
can see that it increased quite a lot
and here we can already see
what the optimal values of coefficients
are the
closest values to some true parameters
that could be possibly estimated using
maximum likelihood and here we see quite
unsurprisingly that
being a homeowner and being full-time
employed
reduces your probability of default
makes it more likely for you to repay
the loan on time
and in full which is again quite
intuitive
while being more
careless in your spending habits
spending more
in proportion to your income makes you
less likely to repay
being more leveraged makes you less
likely to repay
and also taking on higher loan with
respect to your income
also makes you less likely to repay and
more likely to default
all of those relationships are quite
intuitive
in terms of the logic either
psychological or
economic of the theory that is behind it
but now we need to figure out which of
these relationships
are statistically significant and
reliable and which are perhaps
not significant and uh can be neglected
in further modelling so here we need to
use some procedure perhaps
to estimate the variance of our
estimator
to come up with standard errors for our
coefficients
just as we do with multiple regression
however here
just as with log likelihood instead of
minimizing
uh squared sum of residuals we have got
a slight tweak on the conventional
procedure of estimating the variance of
our estimators
and we have got it covered over here to
estimate the covariance matrix
uh the variance of estimator of b
we need to calculate the inverse matrix
of such matrix product
that has the transposed x the transposed
matrix of our explanatory variables
including the constant
multiplied by the weight matrix and
weight matrix will be explained a little
bit later so stay tuned
and again finally the matrix
of x as in explanatory variables
again and uh what is the weight matrix
well the weight matrix corresponds to
the variance of
individual probabilities and here
is actually one of the another reasons
why
logit model is preferable to linear
probability models
when you just regress your zeros and
ones
in a multiple linear regression as if
you recall in a binomial distribution
when you have
some probability that an event happens
and some probability that doesn't happen
the variance of the probability can be
defined as probability times one minus
probability
so basically in such a categorical
variable estimation
your data is heteroskedastic
by definition you have got different
variances for
different observations for example here
if you were to
regress it using the usual method the
simple multiple linear regression
you would have assumed as per the
gauss-markov assumptions
that variances for all these
observations are the same
while in fact they can be massively
different
and depending on the value of the
estimated probability
and the highest variance of probability
would be observable
for the estimated y being equal to 0.5
as is usually the case as is always the
case with the
bernoulli distribution so here we need
to calculate the
weight matrix that would be the
weighting matrix of our variance
estimator
based on the diagonal of our
probability estimators so here we have
got the template
to calculate the 500 by 500
weight matrix and here we need to first
figure out
whether we are on the diagonal and if we
are on the diagonal we need to input
this particular variance estimation and
if not we need to input zero
so first we check if the
column indicator and row indicates are
the same
we can input the product
of probability and we need to lock the
column here
times one minus probability
log in the column here as well and if
not if we are not on the diagonal
we just return zero and here we can
use fill right and fill down
all the way
we can calculate the whole weighting
matrix and we can see that on the
diagonal we have got
variance estimators them being quite
different
and being the highest the closest they
are
to 0.5 as expected
and now i can finally calculate our
covariance matrix
that can then be used to derive
standard errors
of our coefficients so here we first
need to input minverse
which is the inverse matrix function then we
need to input
our mmult function so matrix
multiplication
the first component would be the
transposed
array of explanatory variables x
starting from the constant
all the way to here so it would be a 6
by 500 array
and as a second component we input
another matrix multiplication
and here first we input our weight
matrix so from here all the way
to here so a 500 by 500 matrix
and as the second component
in this particular product will input
our
x and we don't need to transpose it at
that
particular point so we close all the
parentheses and make sure that we close
them
uh all and then enforce this formula
using shift ctrl enter as it multiplies
a bunch of matrices together basically
and we get our covariance matrix and now
on the diagonal of our covariance matrix
we would have
our standard errors squared so now to
derive our standard errors
we just can calculate the square roots
of the diagonal elements and that's what
we will do
first of all for our constant the
estimator
would be the square root of the very
first element of the matrix
and if we drag it across we'll get
the square roots of the first row
but we need to calculate the square
roots of the first diagonal so what we
can do here is we can
change the references here to refer to
the diagonal
just increasing the row reference by one
each time
and here that's how you get the standard
errors for the coefficients
now we can use the assumption that in
large samples
the distribution of coefficients is
approximately normal
so dividing the coefficients by the
standard errors we get
z stats our usual and uh
well-behaved z-stats that can be tested
for significance
using a two-tailed z-test so two times
one minus standard normal distribution
as
as arguments we input the absolute value
of our z stat
and one for cumulative and here we can
get our p values for
every single coefficient and here we can
see that
the most significant of the predictors
of default
are being full-time employed so if you
are
employed full-time you're much less
likely to default on your
consumer credit and also the significant
positive predictors of
default leverage and repayment
time given income the more leveraged you
are the less likely you are to repay
and the more it would take you to repay
your loan out of your income
the less likely you are to repay while
three other parameters the constant and
most notably
homeownership and expense to income
ratio
are of expected signs but not as
significant as one could imagine given
these p-values
are greater than ten percent and now
we can use our logit model to actually
predict
uh whether a particular individual would
default
imagine we have got this model uh up and
running in our bank
and someone approaches us and asks out
for a loan so we in our credit scoring
procedure
ask them to provide some data about them
so obviously the constant
would be one um as it's always one
that's why it's a constant
uh and we ask the applicant whether they
own a home
and uh imagine they say yes
and we put one for yes then
we ask if they have a full-time job and
they say
yes and provide some evidence for that
then to calculate these three variables
we need to ask them
about their income expenses assets debt
and what is the loan amount they want to
apply for
so imagine that this particular person's
household
makes 150 000 per year
and they spend roughly 120 000 of it
they own their home and it's currently
valued at
one uh one thousand thousand dollars so
one million dollars
and they haven't got any debt taken on
previously
and they want to take out one million
dollars
in loan potentially guaranteed as
collateral
by their property by their house so now
we can actually just copy these formulas
across
to calculate these ratios and we can
also copy this across to calculate our
log it
and our probability can be calculated
just like that
and we can see that our probability
which is
logit divided by one plus logit is less
than 20
so actually we can be reasonably sure
that such a person would not default on
their loan
their default probability being less
than 20 percent
19.19 which is quite good which is also
less than average over here we can see
that
roughly a quarter of our applicant's
default
and the probability of this applicant
defaulting is less than 20
so we can determine that our applicant
is actually credit worthy
and feel quite good about providing them
with a loan and that's all there is for
the logit model
and its application for consumer credit
risk credit scoring
and much more please leave a like on
this video if you found it helpful
in the comments below i'd like to see
any further suggestions for videos in
business economics or finance you would
like me to record
and please don't forget to subscribe to
our channel or consider supporting us on
patreon
thank you very much and stay tuned