Logit model explained: regression with binary variables (Excel)

NEDL
10 Feb 2021 · 24:19

Summary

TLDR: This video from NEDL, a platform for distance learning in business and finance, is hosted by Sava, who delves into the logistic regression model, a vital tool for estimating regression models with binary dependent variables. The tutorial covers the model's application in credit scoring, predicting loan defaults, and emphasizes the importance of variables like homeownership and full-time employment. It guides viewers through the process of estimating the model, optimizing coefficients, and interpreting results to assess credit risk, concluding with a practical example of scoring a loan application.

Takeaways

  • 📚 The script introduces the logit model, also known as logistic regression, as a statistical technique for estimating regression models with a categorical or binary dependent variable.
  • 🏦 The logit model is commonly used in finance and economics for applications such as credit scoring, predicting recessions, and analyzing exam success rates.
  • 🔢 The model requires a balanced dataset with a significant portion of both zeros (non-defaults) and ones (defaults) to function effectively.
  • 🏠 Two important categorical variables considered in credit scoring are homeownership and full-time employment status, which are used as predictors for loan default.
  • 💰 Continuous variables like income, expenses, assets, and loan amounts are transformed into interpretable indicators, such as the natural logarithm of the ratio of expenses to income, to assess the likelihood of loan repayment.
  • 📈 The logit model uses the logistic distribution function to estimate the probability of the dependent variable, ensuring the output is bounded between 0 and 1, suitable for probability estimation.
  • 🔍 The model's coefficients are optimized by maximizing the log likelihood function rather than minimizing the squared sum of residuals, as in ordinary least squares regression.
  • 📊 The variance of the estimator is calculated using the inverse of a matrix product that includes the weight matrix, which accounts for the heteroskedasticity inherent in binary outcomes.
  • 📉 The significance of the model's predictors is determined by calculating z-statistics and p-values, which help identify variables that are statistically reliable.
  • 🏢 The script provides a practical example of how the logit model can be applied in a bank's credit scoring process to assess an individual's creditworthiness.
  • 👍 The video concludes by emphasizing the importance of understanding the logit model for various applications in business, finance, and economics.
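The logistic transformation behind several of these takeaways can be sketched in a few lines of Python (the video works in Excel; this is a direct translation of the same formula):

```python
import math

def logistic_probability(z: float) -> float:
    """Map a linear predictor z (any real number) to a probability in (0, 1)
    using the logistic distribution function: p = e^z / (1 + e^z)."""
    return math.exp(z) / (1.0 + math.exp(z))

# With all coefficients zero the linear predictor is 0, so p = 0.5 (a coin toss).
print(logistic_probability(0.0))  # 0.5
# Even extreme predictors stay strictly inside the (0, 1) bounds.
print(logistic_probability(5.0))
```

This boundedness is exactly why the logit model is preferred over plain linear regression for binary outcomes.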

Q & A

  • What is the primary purpose of the logit model discussed in the video?

    -The primary purpose of the logit model, also known as logistic regression, is to estimate regression models when the response variable is categorical or binary, which is common in fields like finance and economics.

  • Why is the logit model preferred over multiple linear regression for binary outcomes?

    -The logit model is preferred because it restricts the estimated values of the dependent variable to be between zero and one, making it suitable for estimating probabilities, unlike multiple linear regression which can yield values outside this range.

  • What are some applications of the logit model mentioned in the video?

    -Some applications include credit scoring, predicting recessions, and determining success or failure in exams. These applications often involve predicting a binary outcome based on various explanatory variables.

  • What is the significance of having a balanced sample in the logit model?

    -A balanced sample, where there is a roughly equal number of zeros and ones for the binary outcome, is important for the logit model to function properly. An overwhelming majority of one outcome would make the model less effective.

  • What are the two main categorical variables considered in the video for predicting loan default?

    -The two main categorical variables are homeownership and full-time employment status, as these are considered important factors when deciding whether to grant a loan.

  • How are income, expenses, assets, and loan amount transformed into explanatory variables for the logit model?

    -They are transformed into interpretable indicators such as the natural logarithm of the ratio of expenses to income, leverage after the loan is granted, and the natural logarithm of the loan amount over typical income to measure repayment time.
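These three transformations can be sketched in Python; the income and expense figures below follow the household quoted in the video, while the assets, debt, and loan amounts are assumptions for illustration:

```python
import math

def engineered_features(income, expenses, assets, debt, loan):
    """Transform raw household figures (thousands of dollars per year) into
    the three log indicators used as explanatory variables."""
    thrift = math.log(expenses / income)                   # spending vs. income
    leverage = math.log((debt + loan) / (assets + loan))   # post-loan debt-to-assets
    repay_time = math.log(loan / income)                   # years of income to repay
    return thrift, leverage, repay_time

# ~129k income and 73k expenses as in the video; other figures are hypothetical.
print(engineered_features(income=129, expenses=73, assets=300, debt=50, loan=100))
```

The natural logarithm is there purely for scaling, to rein in outliers at the top of the distribution.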

  • What is the logit transformation used for in the logit model?

    -The logit transformation is used to convert the logit value, which is the exponent of the weighted sum of explanatory variables and coefficients, into an estimated probability that ranges from zero to one.

  • How is the optimal value of coefficients in the logit model determined?

    -The optimal values of coefficients are determined by maximizing the log likelihood function, which is done using a solver to find the values that best fit the model to the data.
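The same maximization the video performs with Excel's Solver can be sketched in NumPy via Newton-Raphson (a different numerical method than Solver's, but maximizing the identical log likelihood); the data below is synthetic, not the video's 500-applicant sample:

```python
import numpy as np

def fit_logit(X: np.ndarray, y: np.ndarray, iters: int = 25) -> np.ndarray:
    """Maximize the logit log likelihood by Newton-Raphson.
    X must include a column of ones for the constant term."""
    b = np.zeros(X.shape[1])                   # start, as in the video, from all zeros
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))      # estimated probabilities
        W = p * (1.0 - p)                     # per-observation variance weights
        grad = X.T @ (y - p)                  # gradient of the log likelihood
        hess = X.T @ (X * W[:, None])         # negative Hessian
        b = b + np.linalg.solve(hess, grad)
    return b

def log_likelihood(X, y, b):
    p = 1.0 / (1.0 + np.exp(-X @ b))
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Tiny synthetic example with true parameters (-1, 2):
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (rng.random(200) < 1 / (1 + np.exp(-(-1.0 + 2.0 * X[:, 1])))).astype(float)
b = fit_logit(X, y)
print(b)  # should land near (-1, 2)
```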

  • What is the purpose of calculating the covariance matrix in the context of the logit model?

    -The covariance matrix is used to estimate the variance of the estimator for the coefficients, which in turn allows for the calculation of standard errors and z-statistics to test the significance of the coefficients.
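The covariance calculation reduces to one matrix expression, (X'WX)^-1 with W = diag(p_i(1 - p_i)); a minimal sketch with illustrative numbers (not the video's spreadsheet):

```python
import numpy as np

def coef_standard_errors(X: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Standard errors of logit coefficients: square roots of the diagonal of
    (X' W X)^-1, where W = diag(p_i * (1 - p_i)) captures the
    observation-specific (heteroskedastic) variance of a 0/1 outcome."""
    W = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * W[:, None]))   # covariance matrix of b-hat
    return np.sqrt(np.diag(cov))

# Illustrative design matrix (constant + one regressor) and fitted probabilities:
X = np.column_stack([np.ones(4), [0.2, -0.5, 1.3, 0.7]])
p = np.array([0.3, 0.6, 0.2, 0.45])
print(coef_standard_errors(X, p))
```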

  • How can the logit model be used to predict an individual's likelihood of defaulting on a loan?

    -By inputting an individual's specific data into the model, such as homeownership status, employment status, income, expenses, assets, and loan amount, the model can calculate the probability of default, which can then be used to assess creditworthiness.
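Scoring a new applicant is then a single weighted sum pushed through the logistic transformation; all coefficient and feature values below are hypothetical, for illustration only:

```python
import math

def default_probability(coefs, features):
    """Probability of default from fitted logit coefficients (constant first)
    and an applicant's feature values."""
    z = coefs[0] + sum(b * x for b, x in zip(coefs[1:], features))
    logit = math.exp(z)                 # the "logit" as the video defines it
    return logit / (1.0 + logit)        # logistic transformation to (0, 1)

# Hypothetical fitted coefficients and applicant (homeowner=1, full-time=1,
# plus the three log indicators):
coefs = [-0.8, -0.6, -1.1, 0.9, 1.4, 0.5]
applicant = [1, 1, -0.57, -0.98, -0.25]
print(round(default_probability(coefs, applicant), 3))
```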

Outlines

00:00

📚 Introduction to Logit Model in Business Analytics

The video script introduces the logit model, also known as logistic regression, as a fundamental statistical technique for estimating regression models with a categorical or binary dependent variable. Common applications include credit scoring, predicting recessions, and analyzing educational outcomes. The script emphasizes the importance of having a balanced dataset with a significant number of both outcomes (0s and 1s) for the model to be effective. It also outlines the variables considered in the model, such as homeownership, employment status, income, expenses, assets, and loan amounts, and how they are transformed into interpretable indicators for the analysis.

05:01

🔍 Exploring Variables and the Logit Model's Application

The script delves into the creation of explanatory variables from raw data, such as the natural logarithm of the ratio of expenses to income, leverage, and the time needed to repay a loan based on income. These variables are then used in the logit model to estimate the probability of default. The logistic distribution function is highlighted for its role in scaling the variables to estimate probabilities between zero and one, making it suitable for categorical variable modeling. The process of calculating the logit and transforming it into an estimated probability is explained, setting the stage for the optimization of coefficients.

10:01

📈 Maximizing Log Likelihood in Logit Model Estimation

The script explains the process of optimizing the coefficients in the logit model by maximizing the log likelihood function, which is different from the least squares approach used in multiple regression. The log likelihood is calculated based on the actual and estimated values of the dependent variable, and the total log likelihood is maximized by adjusting the coefficients using a solver tool. The resulting optimal values of coefficients are interpreted to understand their impact on the probability of default, with the script noting the intuitive nature of the relationships found between the variables and the likelihood of default.

15:03

📊 Understanding Heteroskedasticity and Variance Estimation

The script discusses the heteroskedastic nature of the data in categorical variable estimation and the need to calculate a weight matrix based on the variance of individual probabilities. This is crucial because the logit model accounts for the different variances across observations, unlike linear probability models. The weight matrix is constructed using the diagonal of the probability estimators, and the covariance matrix of the estimator is derived to obtain standard errors for the coefficients, which are essential for statistical significance testing.

20:03

🔑 Statistical Significance and Predictive Power of the Logit Model

The script concludes with an analysis of the statistical significance of the coefficients, using z-statistics and p-values to determine which variables significantly predict the likelihood of default. It highlights full-time employment and leverage as significant predictors, while other variables like homeownership and the expense-to-income ratio are less significant. The script also demonstrates how to use the model to predict the default probability for a new applicant, showcasing the practical application of the logit model in credit scoring.


Keywords

💡Logit Model

The Logit Model, also known as logistic regression, is a statistical method used for estimating the relationships between one dependent binary variable and one or more independent variables. In the context of the video, it is used to predict the likelihood of a categorical outcome, such as whether a borrower will default on a loan. The model is essential for applications like credit scoring, where it helps in estimating the probability of default.

💡Categorical Variable

A categorical variable is a type of data that can take on a limited, and usually fixed, number of possible values, assigning each individual or case into different groups. In the video, the categorical variable of interest is the default status of a loan, which can be either 0 (non-default) or 1 (default), and the model predicts the probability of these outcomes.

💡Binary Outcome

A binary outcome refers to a situation where an event can result in one of two possible outcomes, often represented as 0 or 1. In the script, the binary outcome is the default status of a loan, with the logit model being a suitable technique for analyzing such binary data.

💡Credit Scoring

Credit scoring is the process of evaluating the credit risk of a borrower or debtor. It is used by banks and other lending institutions to decide whether to grant loans. In the video, the logit model is applied to credit scoring to predict whether a borrower would default on their debt, which helps in making lending decisions more efficiently.

💡Dependent Variable

The dependent variable, often denoted as 'y', is the variable being analyzed to understand its behavior in relation to other variables. In the context of the video, the dependent variable is the default status of a loan, which the model is trying to predict based on various independent variables.

💡Independent Variables

Independent variables are the factors that are believed to influence the dependent variable. In the video, independent variables include whether the borrower owns a home, whether they are full-time employed, income, expenses, assets, and loan amount, which are used to predict the likelihood of loan default.

💡Logistic Distribution

The logistic distribution is a continuous probability distribution that is used in logistic regression to model the probability of a certain event occurring. The video explains that the logit model utilizes the logistic distribution function to estimate probabilities, ensuring that the predicted values of the dependent variable fall between 0 and 1.

💡Odds Ratio

The odds ratio is a measure of association between two events, often used in logistic regression to describe the relationship between the independent variables and the likelihood of the dependent variable being 1. In the script, the odds ratio is used to estimate the probability of default conditional on the values of the explanatory variables.
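Since the logit is the log of the odds, odds = p / (1 - p), a coefficient b multiplies the odds by exp(b) for a one-unit change in its variable. A small sketch (the coefficient value is hypothetical):

```python
import math

def odds(p: float) -> float:
    """Odds of an event with probability p."""
    return p / (1.0 - p)

p = 0.254                         # the sample's default share from the video
print(round(odds(p), 3))          # odds of default, roughly 0.34 to 1
print(round(math.exp(-1.1), 3))   # a coefficient of -1.1 would scale the odds by ~1/3
```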

💡Log Likelihood

Log likelihood is a measure used in statistical models to quantify the goodness of fit of a particular statistical model. The video describes how, in logistic regression, the model coefficients are optimized by maximizing the log likelihood function, which is based on the observed and estimated probabilities of the dependent variable.
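The per-observation formula the video uses can be written directly:

```python
import math

def obs_log_likelihood(y: int, p: float) -> float:
    """Per-observation log likelihood: y*ln(p) + (1-y)*ln(1-p).
    Close to 0 when the model is confident and right; very negative
    when it is confident and wrong."""
    return y * math.log(p) + (1 - y) * math.log(1 - p)

print(round(obs_log_likelihood(1, 0.9), 3))   # right and confident: -0.105
print(round(obs_log_likelihood(1, 0.1), 3))   # wrong and confident: -2.303
print(round(obs_log_likelihood(0, 0.5), 3))   # coin toss: ln(0.5) = -0.693
```

Summing this over all observations gives the total log likelihood that the Solver maximizes.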

💡Coefficients

In the context of regression analysis, coefficients are the numerical values that represent the relationship between the independent and dependent variables. The video explains how the coefficients in the logit model are estimated to determine their effect on the probability of the dependent variable, such as the likelihood of loan default.

💡Heteroskedasticity

Heteroskedasticity refers to a situation in a dataset where the variability of the unexplained part of the model is different across observations. The video points out that the logit model accounts for heteroskedasticity by using a weight matrix based on the variance of individual probabilities, which is not assumed in standard linear regression models.

💡Standard Errors

Standard errors are a measure of the average distance that the observed values fall from the estimated values. In the video, standard errors are calculated for the coefficients of the logit model to determine the precision of the estimates and to perform hypothesis testing for statistical significance.

💡Z-Stats

Z-statistics are a statistical measure used to express the number of standard deviations a data point is from the mean. In the context of the video, z-statistics are calculated by dividing the coefficients by their standard errors to test the significance of the model's coefficients in predicting the likelihood of default.
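The z-statistic and its two-sided p-value can be computed with only the standard library; the coefficient and standard error below are hypothetical:

```python
import math

def z_stat(coef: float, se: float) -> float:
    """z-statistic: coefficient divided by its standard error."""
    return coef / se

def two_sided_p(z: float) -> float:
    """Two-sided p-value under the standard normal, via the error function."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

z = z_stat(-1.1, 0.4)               # hypothetical coefficient and standard error
print(round(z, 2), round(two_sided_p(z), 4))
```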

Highlights

Introduction to the Logit Model, also known as Logistic Regression, a technique for estimating regression models with a categorical or binary dependent variable.

The importance of subscribing and supporting the channel for consistent delivery of educational content.

The Logit Model's application in finance and economics for predicting outcomes like credit scoring and recessions.

The significance of the dependent variable being categorical in the Logit Model, allowing for binary outcomes such as 0 or 1.

The necessity of a balanced sample for the Logit Model to function properly, avoiding a majority of one outcome over the other.

Exploration of categorical variables like homeownership and full-time employment as predictors in the Logit Model.

Transformation of continuous variables such as income, expenses, assets, and loan amount into interpretable indicators for the Logit Model.

The use of the natural logarithm to scale variables and mitigate the impact of outliers in the model.

The process of estimating the Logit Model parameters using the odds ratio and logistic distribution function.

The distinction between the Logit Model and multiple linear regression in terms of the dependent variable's bounds.

The optimization of coefficients in the Logit Model by maximizing log likelihood rather than minimizing squared residuals.

The calculation of the covariance matrix and standard errors for the coefficients to test statistical significance.

The use of z-statistics and p-values to determine the significance of the model's predictors.

Practical application of the Logit Model in credit scoring to assess an individual's probability of defaulting on a loan.

The intuitive understanding of the model's findings, such as the positive impact of full-time employment on reducing the probability of default.

The conclusion summarizing the Logit Model's utility in credit scoring and its potential applications in various fields.

Transcripts

play00:00

hello everyone and welcome again to

play00:03

nettle

play00:03

the best platform around for distance

play00:05

learning in business

play00:07

finance economics and much much more

play00:09

please don't forget to subscribe to our

play00:11

channel and click that bell notification

play00:12

button below

play00:13

so that you never miss fresh videos and

play00:15

tutorials you might be interested in

play00:16

many thanks to our current patreon

play00:18

supporters for making this video

play00:19

possible

play00:20

and would also greatly appreciate if you

play00:21

consider supporting us as well so please

play00:23

check the link in description for more

play00:25

details

play00:26

my name is sava and we are going to

play00:27

investigate the logit model

play00:29

or as it's also called the logistic

play00:32

regression

play00:33

it's a go-to technique for estimating

play00:35

regression models

play00:36

when your response variable so your

play00:38

dependent variable your y

play00:40

is categorical or binary often in

play00:43

finance

play00:44

and economics you have got your

play00:45

dependent variable

play00:47

be not a real number but a categorical

play00:50

number

play00:50

so it can only be 0 1 or can only take

play00:54

some limited number of outcomes most

play00:57

importantly

play00:58

and some of the most important

play00:59

applications of the logit model

play01:02

are credit scoring predicting whether

play01:04

your borrower would default on their

play01:06

debt

play01:07

well zero bean and non-default

play01:09

everything goes according to plan

play01:11

and one being default and you are pretty

play01:13

interested

play01:14

in figuring out which characteristics of

play01:17

your borrower

play01:17

predict defaults so you could allocate

play01:21

lending as a bank more efficiently and

play01:23

minimize your credit risk

play01:25

alternatively in economics you can be

play01:28

interested in predicting recessions

play01:29

zero being no recession and one being

play01:31

recession and figuring out which

play01:33

macroeconomic or some other indicators

play01:35

perhaps

play01:36

can forecast recessions

play01:39

quite a legitimate research task or in

play01:43

education we might be interested in

play01:45

what predicts or determines uh success

play01:48

or failure in an exam

play01:50

how does the number of hours studied in

play01:53

preparation to an exam

play01:54

contribute to success rate and all of

play01:58

these questions are also legitimate and

play02:01

can be answered

play02:02

using the logit model so without further

play02:04

ado

play02:05

let's try and estimate the logit model

play02:07

on

play02:08

a very textbook case for its application

play02:11

that is credit scoring and predicting

play02:14

retail

play02:15

loan default so here we have got a

play02:17

sample that's a real world

play02:19

testing data sample of 500 applicants

play02:22

that either

play02:23

defaulted or haven't defaulted on their

play02:25

consumer credit

play02:26

so we have got zeros denoting

play02:28

non-defaults so

play02:30

the individuals repaid that that and

play02:32

once being

play02:33

default so here you can see that among

play02:36

the 500

play02:37

individuals that have applied for the

play02:38

loan and being granted alone

play02:41

127 have defaulted and that's quite

play02:44

important

play02:45

for our binary choice models login

play02:47

included

play02:48

to be functioning properly you need a

play02:50

sizeable chunk

play02:51

of your sample to have either

play02:54

zeros or ones if the overwhelming

play02:56

majority like 95 percent

play02:59

of your sample being either zero one so

play03:01

if almost everyone have repaid or

play03:03

almost everyone have defaulted then the

play03:05

logit model would have been not the best

play03:07

choice however here as of what roughly a

play03:10

quarter of our sample

play03:11

defaulting logit model is appropriate so

play03:14

that out of the way

play03:15

we can study other variables that we've

play03:17

got here so

play03:19

the two most important perhaps

play03:20

categorical variables that you could ask

play03:22

for

play03:23

when deciding whether to grant someone

play03:25

alone or not

play03:26

is to ask if they have got any retail

play03:29

property on their hands

play03:30

whether they are homeowner or not

play03:32

whether they can pledge

play03:34

their property as a collateral perhaps

play03:36

for their loan

play03:37

or whether they have such uh property as

play03:42

an asset that could back them up if they

play03:44

encounter some

play03:45

difficulties in repaying and also

play03:47

whether they have a full-time job so

play03:49

whether they are full-time employed

play03:51

and those two variables are again binary

play03:54

variables that

play03:54

are treated here as independent as

play03:57

predictors as explanatory variables

play04:00

for our why which is default or not

play04:03

and we have got also some real numbers

play04:07

as our explanatory variables

play04:08

and we'll try and transform them into

play04:11

some project that we could later use in

play04:13

our logit model

play04:14

so we've got income and expenses of our

play04:17

borrower's household in thousands of

play04:19

dollars per year

play04:20

so we can see that this particular

play04:21

household earns roughly

play04:24

129 000 a year and spends

play04:27

73 000 out of that we can also

play04:30

record uh the assets and debt right now

play04:34

before the loan is granted

play04:36

of this particular household and the

play04:38

amount of the loan they're asking for

play04:40

so this data can be then coded into

play04:44

other explanatory variables that we can

play04:47

later use in our logic model

play04:49

so first of all let's consider three

play04:51

variables that transform

play04:53

our income expenses asset that and loan

play04:56

amount data

play04:57

into some more interpretable indicators

play05:00

for example we can figure out the

play05:02

natural logarithm of the ratio of

play05:04

expenses to income

play05:05

and figure out how thrifty

play05:08

how likely to save their income this

play05:11

household is

play05:12

and we expect that the more they save

play05:15

and the less they spend

play05:17

the more likely they would be to

play05:20

repay their loan as they would be more

play05:22

disciplined they would have more spare

play05:24

cash

play05:24

to meet their monthly payment schedule

play05:27

and so on and so forth

play05:28

so this is a legitimate variable to

play05:30

consider

play05:31

then we can consider leverage as in

play05:35

that to assets like we do

play05:38

for uh corporate analysis but here we've

play05:40

got individuals

play05:42

and their outstanding assets and that

play05:44

and the debt that they're planning to

play05:46

take

play05:47

and they plan to fund some asset

play05:48

purchases with such a debt perhaps they

play05:50

want to buy

play05:51

a new house or a new car or the like so

play05:54

here we can figure out natural logarithm

play05:57

this is just for

play05:58

scaling purposes so that we have got no

play06:01

outliers

play06:02

at the top so we can figure out the

play06:04

natural logarithm of the

play06:06

leverage of our applicant after they

play06:09

would have been granted

play06:10

the low so in the numerator would have

play06:12

total debt

play06:13

which is outstanding debt that before

play06:16

they were granted a loan

play06:17

plus the loan amount divided by their

play06:19

assets after they were granted

play06:21

a low so assets plus the loan amount

play06:25

here we assume that they'll spend the

play06:27

loan amount to purchase

play06:29

some assets for themselves

play06:32

and here we can also figure out one

play06:35

further variable that would basically

play06:38

mean

play06:39

how long would this individual need to

play06:41

keep

play06:42

earning to repay their loan and that

play06:45

can be calculated as natural logarithm

play06:48

of the loan amount

play06:50

over their typical income

play06:53

and here we have got five explanatory

play06:55

variables which is

play06:56

more than enough for a typical logic

play06:57

model especially with

play06:59

500 observations so here we can bottom

play07:02

like it all the way down

play07:03

and now start figuring out how to

play07:06

calculate

play07:07

the optimal values of these parameters

play07:09

which denote

play07:10

just the constant the constant term as

play07:12

in all regression models

play07:14

and our coefficients for our explanatory

play07:17

candidate variables

play07:18

in the logit model what you utilize is

play07:21

the

play07:22

odds ratio and you try and

play07:25

estimate the probability of default

play07:29

conditional

play07:30

on these values of explanatory variables

play07:33

and as it is also called the logistic

play07:36

regression

play07:37

it should not be a surprise that the

play07:41

equation that relates the predicted

play07:43

values

play07:44

of y the variable of choice the

play07:47

probability of default

play07:48

utilizes the distribution function of

play07:52

the

play07:52

logistic distribution and we have got a

play07:55

video on the logistic distribution our

play07:56

channel already

play07:57

so please check this out later on if

play07:59

you're interested so here we use the

play08:01

logistic distribution logic

play08:02

to relate these variables they can be

play08:06

anything they can be bounded or

play08:08

unbounded they can be categorical or

play08:10

real numbers

play08:11

and so on and so forth so we basically

play08:14

scale them

play08:15

to meet the criteria that the

play08:17

probability can be estimated from

play08:18

zero to one and this is what

play08:21

makes logistic regression and logit more

play08:24

applicable

play08:25

to categorical variable modeling than

play08:28

for example

play08:29

just multiple linear regression because

play08:31

in multiple linear regression you can

play08:33

theoretically get

play08:34

estimated values of your dependent

play08:36

variable outside of

play08:38

zero to one bounds and that could be

play08:40

tricky to interpret it in context of

play08:42

probability of default for example or

play08:44

probability of recession a probability

play08:46

of failing

play08:47

or passing an exam and the like

play08:50

this transformation the logic

play08:52

transformation allows to avoid that

play08:55

so here we can calculate the logit which

play08:58

is just the

play08:58

exponent of the weighted sum of our

play09:02

explanatory variables and the

play09:04

coefficients so we can use sumproduct

play09:09

and refer to our coefficients over here

play09:11

and lock in the row as coefficients stay

play09:13

the same for all observations

play09:15

and referring to explanatory variable

play09:18

plus a constant

play09:19

for simplicity over here corresponding

play09:22

to a particular observation

play09:25

and here we calculate the value of log

play09:26

it and unsurprisingly

play09:29

as all the coefficients are zeros as for

play09:31

now the value of log it is

play09:33

one as the exponent of zero is one and

play09:36

here we can use this logistic

play09:37

transformation

play09:39

to convert this logit value into our

play09:42

estimated probability

play09:44

so we just divide our log it by one plus

play09:46

log it

play09:47

and get a default estimated probability

play09:50

of 0.5

play09:51

which is unsurprising if you know

play09:53

nothing about it

play09:54

you can just assume it's a coin toss

play09:57

it's 50 50

play09:58

either you default or not quite

play10:00

intuitive isn't it

play10:02

and then we can bottom right click it

play10:03

all the way down and estimate it for

play10:05

all of our observations all of our 500

play10:08

lone applicants and here it's

play10:11

actually important to understand how one

play10:14

might

play10:15

optimize these coefficients to arrive at

play10:18

the best fit

play10:19

possible and here is where the logit

play10:22

model

play10:22

differs from multiple regression in

play10:25

terms of

play10:26

the function that you try and optimize

play10:28

for multiple regression

play10:29

you try and minimize the squared sum of

play10:32

residuals

play10:33

while for the logit model the most

play10:35

robust approach

play10:36

possible is to maximize log likelihood

play10:40

and the log likelihood is defined

play10:43

in terms of actual variables actual

play10:45

dependent variables

play10:46

that are zeros and ones y i over here

play10:50

and the estimated values the expected

play10:52

values of y

play10:53

i denoted here as y bar and we can

play10:56

estimate our log likelihood for every

play10:58

single observation

play11:00

by just multiplying our default

play11:02

categorical variable

play11:03

onto the natural logarithm of our

play11:06

estimated probability

play11:07

plus one minus the default categorical

play11:10

variable

play11:12

times the natural logarithm of one minus

play11:14

the calculated probability

play11:17

and we can bottom likelihood all the way

play11:19

down and then we can calculate the total

play11:21

log likelihood which would be the sum of

play11:23

log likelihoods across

play11:25

the whole sample and now we can

play11:28

maximize our log likelihood by varying

play11:30

the coefficient parameters

play11:32

this can be done using solver so we can

play11:34

go data solver

play11:36

and specify our optimization task so we

play11:39

want to maximize

play11:40

the value of log likelihood currently at

play11:42

cell

play11:43

l1 we want to maximize it so here should

play11:46

be max

play11:47

and we need to change variable cells

play11:49

that correspond to

play11:51

b0 b1 all the way to b5 that are the

play11:54

constant term

play11:55

and the coefficients for all five

play11:57

explanatory variables so we select

play11:59

the array c2 to h2 and we don't want to

play12:03

impose any constraints on our parameters

play12:06

as theoretically any of these parameters

play12:09

could impact

play12:10

probability of default either positively

play12:12

or negatively we just have some

play12:14

reasonable suspicions whether for

play12:16

example

play12:17

leverage would affect the default

play12:19

probability positive or negatively

play12:20

but the whole purpose of logic

play12:22

estimation is to test

play12:24

these suspicions these hypotheses so we

play12:26

need to untick

play12:28

this box to make uh all parameters

play12:32

either positive or negative depending on

play12:34

what maximizes log likelihood

play12:36

and click solve and wait until the

play12:38

algorithm converges

play12:39

We have converged to an optimal value of the log likelihood: it increased from roughly -350 to -233. The negative value of the log likelihood should not be a concern; it doesn't really matter whether it is positive or negative, only, in comparative terms, whether it has increased or decreased, and we can see that it increased quite a lot. Here we can already see the optimal values of the coefficients, the values closest to the true parameters that could possibly be estimated using maximum likelihood. Quite unsurprisingly, being a homeowner and being full-time employed reduce your probability of default, making it more likely that you repay the loan on time and in full, which is quite intuitive. Meanwhile, being more careless in your spending habits, spending more in proportion to your income, makes you less likely to repay; being more leveraged makes you less likely to repay; and taking on a higher loan relative to your income also makes you less likely to repay and more likely to default. All of these relationships are quite intuitive in terms of the logic, either psychological or economic, of the theory behind them.
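Outside of Excel, the same maximum-likelihood fit can be sketched in a few lines of Python. This is an illustrative Newton-Raphson optimizer on synthetic data, not the video's spreadsheet or dataset; the variable names and toy numbers below are assumptions.

```python
import numpy as np

def logit_log_likelihood(beta, X, y):
    """Log likelihood of the logit model: sum of y*log(p) + (1-y)*log(1-p)."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logit(X, y, iters=25):
    """Maximise the log likelihood by Newton-Raphson (Solver does this numerically)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = np.diag(p * (1 - p))      # Bernoulli variances on the diagonal
        grad = X.T @ (y - p)          # score vector (gradient of the log likelihood)
        hess = X.T @ W @ X            # negative Hessian of the log likelihood
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Tiny synthetic example: a constant plus one explanatory variable
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([np.ones(500), x])
true_beta = np.array([-1.0, 2.0])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat = fit_logit(X, y)
```

As in the video, the fitted coefficients are the ones at which the log likelihood is higher than at any other candidate values, including the starting guess of all zeros.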
But now we need to figure out which of these relationships are statistically significant and reliable, and which are perhaps not significant and can be neglected in further modelling. Here we need a procedure to estimate the variance of our estimator, to come up with standard errors for our coefficients, just as we do in multiple regression. However, just as we maximized the log likelihood instead of minimizing the squared sum of residuals, there is a slight tweak on the conventional procedure for estimating the variance of our estimates, and we have got it covered over here. To estimate the covariance matrix, the variance of the estimator of b, we need to calculate the inverse of the matrix product X'WX: the transposed matrix of our explanatory variables, including the constant, multiplied by the weight matrix (which will be explained a little later, so stay tuned), and finally by the matrix X of explanatory variables.
So what is the weight matrix? The weight matrix corresponds to the variances of the individual probabilities, and here is actually another reason why the logit model is preferable to linear probability models, where you just regress your zeros and ones in a multiple linear regression. As you may recall, in a binomial setting, when an event happens with some probability and fails to happen with the complementary probability, the variance of the outcome is the probability times one minus the probability. So in such a categorical-variable estimation your data is heteroskedastic by definition: different observations have different variances. If you were to regress it using the usual method, simple multiple linear regression, you would have assumed, as per the Gauss-Markov assumptions, that the variances of all observations are the same, while in fact they can be massively different depending on the value of the estimated probability. The highest variance would be observed for an estimated y equal to 0.5, as is always the case with the Bernoulli distribution. So here we need to calculate the weight matrix, the weighting matrix of our variance estimator, which holds the individual probability variances on its diagonal. Here we have got the template to calculate the 500-by-500 weight matrix, and we first need to figure out whether we are on the diagonal: if we are, we input the variance estimate, and if not, we input zero. So first we check whether the column indicator and the row indicator are the same; if so, we input the product of the probability (locking the column reference) times one minus the probability (again locking the column reference), and if we are not on the diagonal, we just return zero. Filling the formula across and all the way down with Ctrl+Right and Ctrl+Down, we can calculate the whole weighting matrix, and we can see that the diagonal holds variance estimates that differ quite a lot across observations and are highest the closer the probabilities are to 0.5, as expected.
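The Bernoulli variance p(1-p) that fills the diagonal can be sketched in Python; the probabilities below are illustrative values, not the spreadsheet's 500 fitted ones:

```python
import numpy as np

# Bernoulli variance p*(1-p) for a few fitted probabilities (illustrative values)
p = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
W = np.diag(p * (1 - p))  # only the diagonal is non-zero, as in the spreadsheet

# The variance peaks at p = 0.5 (where it equals 0.25) and falls off
# symmetrically on both sides of 0.5
```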
And now I can finally calculate our covariance matrix, which can then be used to derive the standard errors of our coefficients. Here we first input MINVERSE, the inverse-matrix function, and inside it MMULT, matrix multiplication. The first component is the transposed array of explanatory variables X, starting from the constant, so a 6-by-500 array; as the second component we input another MMULT, where we first input our weight matrix, a 500-by-500 matrix, and as the second component of this particular product our X, which we do not need to transpose at this point. We close all the parentheses, making sure we close them all, and then enter the formula with Ctrl+Shift+Enter, as it multiplies a bunch of matrices together. We get our covariance matrix, and on its diagonal we have our standard errors squared, so to derive the standard errors we just calculate the square roots of the diagonal elements. First of all, for our constant, the standard error is the square root of the very first element of the matrix; if we simply drag this across, we would get the square roots of the first row, but we need the square roots of the diagonal, so we change the references to refer to the diagonal, increasing the row reference by one each time. And that is how you get the standard errors of the coefficients.
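The spreadsheet steps above, MINVERSE(MMULT(TRANSPOSE(X), MMULT(W, X))) followed by square roots of the diagonal, can be sketched in Python. The design matrix and coefficients here are illustrative assumptions, not the video's data:

```python
import numpy as np

# Illustrative design matrix with a constant column and one regressor,
# plus illustrative fitted coefficients (not the video's values)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
beta = np.array([-0.5, 1.0])
p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities

W = np.diag(p * (1 - p))              # 500x500 weight matrix
cov = np.linalg.inv(X.T @ W @ X)      # MINVERSE(MMULT(TRANSPOSE(X), MMULT(W, X)))
se = np.sqrt(np.diag(cov))            # standard errors = sqrt of the diagonal
```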
Now we can use the assumption that in large samples the distribution of the coefficients is approximately normal, so dividing the coefficients by their standard errors gives us z-stats, our usual well-behaved z-stats that can be tested for significance using a two-tailed test: two times one minus the standard normal distribution, with the absolute value of the z-stat and TRUE for cumulative as the arguments. This gives us the p-value for every single coefficient, and here we can see that the most significant predictors of default are being full-time employed (if you are employed full-time, you are much less likely to default on your consumer credit) and, as significant positive predictors of default, leverage and repayment time given income: the more leveraged you are, the less likely you are to repay, and the longer it would take you to repay your loan out of your income, the less likely you are to repay. The three other parameters, the constant and, most notably, homeownership and the expense-to-income ratio, are of the expected signs but not as significant as one might imagine, given that their p-values are greater than ten percent.
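The z-stat and p-value computation can be sketched in Python. The coefficients and standard errors below are illustrative assumptions, and the standard normal CDF is built from the error function to keep the sketch dependency-free:

```python
import math
import numpy as np

def two_tailed_p(z):
    """Two-tailed p-value: 2 * (1 - Phi(|z|)), with Phi via the error function."""
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))  # standard normal CDF
    return 2.0 * (1.0 - phi)

# Illustrative coefficients and standard errors (not the video's exact numbers)
beta = np.array([-0.8, -1.2, 0.9])
se = np.array([0.5, 0.4, 0.3])
z = beta / se                                        # z-stats
p_values = np.array([two_tailed_p(zi) for zi in z])  # one p-value per coefficient
```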
are greater than ten percent and now

play21:35

we can use our logic model to actually

play21:37

predict

play21:38

uh whether a particular individual would

play21:41

default

play21:42

imagine we have got this model uh up and

play21:44

running in our bank

play21:46

and someone approaches us and asks out

play21:49

for a loan so we in our credit scoring

play21:53

procedure

play21:54

ask them to provide some data about them

play21:58

so obviously the constant

play22:02

would be one um as it's always one

play22:04

that's why it's a constant

play22:06

uh and we ask the applicant whether they

play22:09

own a home

play22:10

and uh imagine they say yes

play22:13

and we put one for yes then

play22:17

we ask if they have a full-time job and

play22:19

they say

play22:20

yes and provide some evidence for that

play22:23

then to calculate these three variables

play22:24

we need to ask them

play22:26

about their income expenses asset that

play22:29

and what is the loan amount they want to

play22:31

apply for

play22:32

so imagine that this particular person's

play22:35

household

play22:36

makes 150 000 per year

play22:39

and they spend roughly 120 000 of it

play22:43

they own their home and it's currently

play22:45

valued at

play22:46

one uh one thousand thousand dollars so

play22:49

one million dollars

play22:50

and they haven't got any debt taken on

play22:53

previously

play22:54

and they want to take out one million

play22:57

dollars

play22:58

in loan potentially guaranteed as

play23:01

collateral

play23:02

by their property by their house so now

play23:05

So now we can just copy the formulas across to calculate these ratios, copy this across to calculate our logit, and our probability can be calculated just like that. We can see that our probability, which is exp(logit) divided by one plus exp(logit), is less than 20 percent, so we can be reasonably sure that such a person would not default on their loan. Their default probability of 19.19 percent is quite good and is also below average: over here we can see that roughly a quarter of our applicants default, while the probability of this applicant defaulting is less than 20 percent. So we can determine that our applicant is actually creditworthy and feel quite good about providing them with a loan. And that is all there is for the logit model and its application to consumer credit risk and credit scoring.
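The scoring step can be sketched in Python. The coefficient values and the exact ratio definitions below are hypothetical placeholders (the video's fitted values live in the spreadsheet); only the logistic transformation exp(logit)/(1+exp(logit)) is taken from the source:

```python
import math

# Hypothetical coefficients b0..b5 (placeholders, not the spreadsheet's fitted values)
b = [-1.0, -0.6, -1.1, 0.8, 0.9, 0.7]

# Applicant from the example: homeowner, full-time employed, income 150k,
# expenses 120k, home worth 1m, no prior debt, applying for a 1m loan.
# The three ratio definitions below are illustrative assumptions.
x = [1.0,                                          # constant
     1.0,                                          # homeowner
     1.0,                                          # full-time employed
     math.log(120_000 / 150_000),                  # ln(expenses / income)
     math.log(1_000_000 / (1_000_000 + 150_000)),  # illustrative leverage ratio
     math.log(1_000_000 / 150_000)]                # ln(loan / income)

logit = sum(bi * xi for bi, xi in zip(b, x))
prob_default = math.exp(logit) / (1.0 + math.exp(logit))  # logistic transformation
```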
And much more! Please leave a like on this video if you found it helpful. In the comments below, let me know any further suggestions for videos on business, economics, or finance that you would like me to record, and please don't forget to subscribe to our channel or consider supporting us on Patreon. Thank you very much, and stay tuned!