The Essential Main Ideas of Neural Networks

StatQuest with Josh Starmer
30 Aug 2020 · 18:54

Summary

TL;DR: This video explains neural networks, which identify patterns in data to make predictions, using the analogy of fitting a 'green squiggle' to a dataset. It introduces core neural network concepts like nodes, connections, hidden layers, and activation functions. Though neural networks seem complicated, they are essentially 'big fancy squiggle fitting machines' that use basic math operations on bent lines to generate new shapes. By adding and manipulating these shapes, neural networks can model incredibly complex datasets for machine learning tasks. This friendly beginner's guide aims to demystify neural networks by breaking things down step-by-step.

Takeaways

  • πŸ˜€ Neural networks fit squiggles to data to make predictions
  • πŸ‘‰ Nodes are connected in layers to transform activation functions into new shapes
  • πŸ“ˆ Weights and biases parameterize connections to reshape activation functions
  • πŸ” Looking inside the neural network black box demystifies how they work
  • πŸ“Š Backpropagation estimates optimal parameters by fitting the network to data
  • βš™οΈ More layers and nodes enable more complex squiggle fitting for tricky data
  • 🧠 Neural networks were named by analogy to biological neurons and synapses
  • ✏️ Simple math and labeled diagrams provide intuition into neural mechanisms
  • πŸŽ“ This tutorial series aims to develop deep understanding of neural networks
  • πŸš€ Even simple networks demonstrate the power and flexibility of the approach

Q & A

  • What is the goal of the StatQuest video series on neural networks?

    -The goal is to take a peek inside the neural network 'black box' by breaking down each concept and technique into its components and walking through how they fit together step-by-step.

  • What type of activation function is used in the example neural network?

    -The soft plus activation function is used in the example neural network.

  • Why are the layers between the input and output nodes called hidden layers?

    -The layers between the input and output nodes are called hidden layers because their values are not directly observed in the training data.

  • What are the two key components that allow a neural network to create new shapes?

    -The two key components are the activation functions, which create bent or curved lines, and the weights and biases on the connections, which reshape and combine these lines.

  • What is backpropagation and when will it be discussed?

    -Backpropagation is the method used to estimate the parameters (weights and biases) when fitting a neural network to data. It will be discussed in Part 2 of the video series.

  • Why are neural networks called 'big fancy squiggle fitting machines'?

    -Because their key functionality is using activation functions and weighted connections to fit complicated squiggly lines (green squiggles) to data.

  • What math notation is used for the log function in the video?

    -The natural log, or log base e, is used in the math notation.

  • How many nodes are in the hidden layer of the example neural network?

    -There are 2 nodes in the single hidden layer of the example neural network.

  • What are some ways you can support the StatQuest channel and quest on?

    -Ways to support include subscribing, contributing to Josh's Patreon campaign, becoming a channel member, buying songs/merchandise, or donating.

  • Why are neural networks still called neural networks if nodes aren't really like neurons?

    -They were originally inspired by and named after neurons and synapses back when they were first invented in the 1940s/50s, even if the analogy isn't perfect.

Outlines

00:00

😊 Introducing Neural Networks

The narrator Josh introduces neural networks, stating they seem complicated but can be broken down into understandable components. He outlines the goals for this video series on neural networks, which includes labeling all parts of a simple neural network to show how it fits a curve to data through basic math.

05:02

🧠 Understanding Activation Functions

The narrator defines the curved lines inside nodes of a neural network as activation functions. Common options include sigmoid, ReLU (rectified linear unit), and soft plus. The choice impacts the shape the neural network can fit. This example uses soft plus.

10:05

πŸ“ˆ Fitting Curves with Neural Networks

The narrator steps through the math to show how a simple neural network with one input node and two hidden nodes can fit a curve to data. Each hidden node slices and scales its activation function differently based on the weights and biases estimated when fitting the network.

15:06

βœ… Making Predictions from the Fitted Curve

The narrator combines the curves from the two hidden nodes into a final fitted curve that matches the data. He shows how new data can be input to generate a prediction from the corresponding point on the final fitted curve. Finally, he explains the biological inspiration for neural networks.
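
As a rough sketch of what this outline describes, the whole example network fits in a few lines of Python, using the parameter values read out later in the transcript (weights -34.4 and -2.52, biases 2.14 and 1.29, output scales -1.3 and 2.28, and a final shift of -0.58). Treat it as an illustration rather than a reference implementation; the function names are our own.

```python
import math

def softplus(x):
    # Soft plus activation function: ln(1 + e^x)
    return math.log(1 + math.exp(x))

def predict_effectiveness(dosage):
    # Top hidden node: multiply by -34.4, add 2.14, apply soft plus, then scale by -1.3
    top = -1.3 * softplus(-34.4 * dosage + 2.14)
    # Bottom hidden node: multiply by -2.52, add 1.29, apply soft plus, then scale by 2.28
    bottom = 2.28 * softplus(-2.52 * dosage + 1.29)
    # Add the two curves and subtract the final bias, 0.58, to get the green squiggle
    return top + bottom - 0.58

# Dosages run from 0 (low) to 1 (high); medium dosages should come out near 1 (effective)
for dosage in (0.0, 0.5, 1.0):
    print(dosage, round(predict_effectiveness(dosage), 2))
```

Running this gives values near 0 for dosages 0 and 1, and roughly 1.03 for dosage 0.5, matching the numbers quoted in the transcript.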

Keywords

πŸ’‘Neural network

A neural network is a type of machine learning algorithm that is designed to recognize patterns in data. It consists of an input layer, one or more hidden layers, and an output layer, with nodes in each layer and connections between the nodes. In the video, neural networks are used to fit a squiggle (nonlinear function) to the drug dosage data to predict treatment effectiveness.

πŸ’‘Activation function

The activation function is a curved or bent line inside each node of the neural network. It transforms the input from the previous layer nonlinearly before passing it to the next layer. Common activation functions mentioned are sigmoid, ReLU (rectified linear unit), and soft plus. The video uses soft plus activation functions to reshape the data.
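
If it helps to see the three activation functions mentioned here as formulas, here is a minimal sketch in Python (the function names are ours, not from the video):

```python
import math

def softplus(x):
    # Soft plus: ln(1 + e^x), the smooth bend used in the video's example
    return math.log(1 + math.exp(x))

def relu(x):
    # ReLU (rectified linear unit): 0 for negative inputs, the input itself otherwise
    return max(0.0, x)

def sigmoid(x):
    # Sigmoid: an S-shaped curve that squashes any input into the range (0, 1)
    return 1 / (1 + math.exp(-x))

# The video plugs 2.14 into soft plus and reads off roughly 2.25
print(round(softplus(2.14), 2))  # 2.25
```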

πŸ’‘Backpropagation

Backpropagation is an algorithm used to train neural networks. It estimates the parameters (weights and biases) for the connections in the network by comparing its predictions on the training data to the actual targets, calculating the error, and adjusting the parameters to reduce the error. The estimated parameters result in different transformations of the activation function curves.
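
Backpropagation itself is left for Part 2, but as a loose sketch of the general idea (a simplified toy example of ours, not the video's method), here is a single gradient-descent update for one weight feeding a soft plus node, using the chain rule on a squared error:

```python
import math

def softplus(x):
    return math.log(1 + math.exp(x))

def sigmoid(x):
    # Conveniently, the derivative of soft plus is the sigmoid function
    return 1 / (1 + math.exp(-x))

# Toy setup: one input, one weight, one bias, one training example (made-up numbers)
w, b = 0.5, 0.0
x, target = 1.0, 1.0
learning_rate = 0.1

# Forward pass
z = w * x + b
prediction = softplus(z)

# Backward pass: d(error)/dw by the chain rule, where error = (prediction - target)^2
d_error_d_prediction = 2 * (prediction - target)
d_prediction_d_z = sigmoid(z)   # derivative of soft plus at z
d_z_d_w = x
gradient = d_error_d_prediction * d_prediction_d_z * d_z_d_w

# Nudge the weight in the direction that reduces the error
w = w - learning_rate * gradient
print(round(w, 4))
```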

πŸ’‘Hidden layer

The hidden layers are the intermediate layers of nodes between the input and output layers in a neural network. They enable the model to learn more complex nonlinear relationships between inputs and outputs. Deciding how many hidden layers and how many nodes in each layer is important for good performance.

πŸ’‘Weight

The weights are parameter values that are multiplied with the input values as they pass from one node to the next. They determine how much influence the input to a node has on the output. The backpropagation algorithm adjusts the weights during training to minimize error.

πŸ’‘Bias

The biases are parameter values that are added to a node's input after it has been multiplied by the weights. Together with the weights, they control the transformation applied to the activation functions.
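
To make the roles of the weights and biases concrete, this small sketch reproduces the transcript's numbers for the top hidden node of the example network (weight -34.4, bias 2.14, output scale -1.3); the function names are ours:

```python
import math

def softplus(x):
    return math.log(1 + math.exp(x))

def top_hidden_node(dosage):
    x_coordinate = -34.4 * dosage + 2.14   # weight times input, plus bias
    y_coordinate = softplus(x_coordinate)  # run it through the activation function
    return -1.3 * y_coordinate             # scale it (and flip it, since -1.3 is negative)

print(round(softplus(-34.4 * 0.0 + 2.14), 2))  # 2.25, the blue dot at dosage 0
print(round(softplus(-34.4 * 0.1 + 2.14), 2))  # 0.24, the blue dot at dosage 0.1
print(round(top_hidden_node(0.0), 2))          # -2.93 once scaled by -1.3
```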

πŸ’‘Overfitting

Overfitting refers to a model that fits the training data very well but fails to generalize to new data. Because neural networks can fit arbitrarily complex squiggles to data, avoiding overfitting is important. Techniques like regularization and dropout help prevent overfitting.

πŸ’‘Generalization

Generalization refers to how well a machine learning model can adapt what is learned from the training data to make accurate predictions on new, unseen data. The video talks about neural networks being universal approximators that can fit almost any dataset, enabling good generalization.

πŸ’‘Nonlinearity

Most real-world data has nonlinear relationships that cannot be modeled with linear functions. Neural networks introduce nonlinearity through the activation functions, allowing them to fit nonlinear squiggles to data and model complex nonlinear patterns.
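
A quick way to see why the bent activation functions matter: stacking layers without them just collapses back into a straight line. A minimal sketch with made-up numbers (not from the video):

```python
import math

def softplus(x):
    return math.log(1 + math.exp(x))

def linear_only(x):
    # Two "layers" with no activation function are still one straight line overall
    hidden = 2.0 * x + 1.0
    return -0.5 * hidden + 3.0  # identical to -1.0 * x + 2.5 for every input

def with_activation(x):
    # The same layers with soft plus in between can bend
    hidden = softplus(2.0 * x + 1.0)
    return -0.5 * hidden + 3.0

for x in (-2.0, 0.0, 2.0):
    print(x, round(linear_only(x), 2), round(with_activation(x), 2))
```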

πŸ’‘Universal approximation

This refers to the ability of neural networks with at least one hidden layer to represent a wide range of nonlinear functions, making them universal approximators. This theoretical capability enables them to model complex real-world data.
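
Written as an equation, the simple network in the video is just a shifted sum of scaled, shifted soft plus curves; adding more hidden nodes adds more terms to the sum (this notation is ours, not the video's):

```latex
f(\text{dosage}) \;=\; b_{\text{out}} + \sum_{i=1}^{n} v_i \,\operatorname{softplus}\!\bigl(w_i \cdot \text{dosage} + b_i\bigr),
\qquad \operatorname{softplus}(x) = \ln\!\bigl(1 + e^{x}\bigr)
```

In the video's two-node case, n = 2 with w1 = -34.4, b1 = 2.14, v1 = -1.3, w2 = -2.52, b2 = 1.29, v2 = 2.28, and b_out = -0.58.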

Highlights

Neural networks cover a broad range of concepts and techniques; however, people call them a black box because it can be hard to understand what they're doing.

The goal of this series is to take a peek into the black box by breaking down each concept and technique into its components and walking through how they fit together, step by step.

In this first part, we will learn about what neural networks do, and how they do it.

I have a new way to think about neural networks that will help beginners and seasoned experts alike gain a deep insight into what neural networks do.

The math will be as simple as possible, while still being true to the algorithm. These differences will help you develop a deep understanding of what neural networks actually do.

We're going to use this super simple dataset and show how this neural network creates this green squiggle.

The curved or bent lines are called activation functions. When you build a neural network you have to decide which activation function, or functions, you want to use.

Although there are rules of thumb for making decisions about the hidden layers, you essentially make a guess and see how well the neural network performs, adding more layers and nodes if needed.

In theory, neural networks can fit a green squiggle to just about any dataset, no matter how complicated, and I think that's pretty cool.

If you want to review statistics and machine learning offline check out the StatQuest study guides at statquest.org. There's something for everyone.

If you like this StatQuest and want to see more, please subscribe.

And if you want to support StatQuest consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs, or a t-shirt, or a hoodie, or just donate.

This neural network starts with two identical activation functions, but the weights and biases on the connections slice them, flip them, and stretch them into new shapes, which are then added together to get a squiggle that is entirely new.

Just imagine what types of green squiggles we could fit with more hidden layers and more nodes in each hidden layer.

I think neural networks should be called big fancy squiggle fitting machines, because that's what they do.

Transcripts

play00:00

Neural networks...

play00:04

seem so complicated, but they're not!

play00:09

StatQuest!

play00:11

Hello!

play00:13

I'm Josh Starmer and welcome to StatQuest!

play00:15

Today, we're going to talk about neural networks, part one: inside the black box!

play00:22

Neural networks, one of the most popular algorithms in machine learning, cover a broad range of concepts and techniques.

play00:31

however, people call them a black box because it can be hard to understand what they're doing.

play00:38

the goal of this series is to take a peek into the black box by breaking down each

play00:44

concept and technique into its components and walking through how they fit together, step by step.

play00:51

in this first part, we will learn about what neural networks do, and how they do it.

play00:57

in part two, we'll talk about how neural networks are fit to data with backpropagation.

play01:04

then, we will talk about variations on the simple neural network presented in this part, including deep learning.

play01:12

note: crazy awesome news!

play01:16

i have a new way to think about neural networks that will help beginners and seasoned

play01:21

experts alike gain a deep insight into what neural networks do.

play01:27

for example, most tutorials use cool looking, but hard to understand graphs, and fancy

play01:34

mathematical notation to represent neural networks.

play01:39

in contrast, i'm going to label every little thing on the neural network to make it easy to keep track of the details.

play01:48

and the math will be as simple as possible, while still being true to the algorithm.

play01:54

these differences will help you develop a deep understanding of what neural networks actually do.

play02:02

so, with that said, let's imagine we tested a drug that was designed to treat an illness

play02:09

and we gave the drug to three different groups of people, with three different dosages: low, medium, and high.

play02:20

the low dosages were not effective so we set them to zero on this graph. in contrast,

play02:27

the medium dosages were effective so we set them to one.

play02:32

and the high dosages were not effective, so those are set to zero.

play02:38

now that we have this data, we would like to use it to predict whether or not a future dosage will be effective.

play02:46

however we can't just fit a straight line to the data to make predictions, because

play02:51

no matter how we rotate the straight line, it can only accurately predict two of the three dosages.

play02:59

the good news is that a neural network can fit a squiggle to the data.

play03:05

the green squiggle is close to zero for low dosages, close to one for medium dosages,

play03:12

and close to zero for high dosages. and even if we have a really complicated dataset

play03:20

like this, a neural network can fit a squiggle to it.

play03:26

in this StatQuest we're going to use this super simple dataset and show how this neural network creates this green squiggle.

play03:36

but first, let's just talk about what a neural network is.

play03:42

a neural network consists of nodes and connections between the nodes.

play03:49

note: the numbers along each connection represent parameter values that were estimated

play03:54

when this neural network was fit to the data.

play03:58

for now, just know that these parameter estimates are analogous to the slope and intercept

play04:04

values that we solve for when we fit a straight line to data.

play04:09

likewise, a neural network starts out with unknown parameter values that are estimated

play04:15

when we fit the neural network to a dataset using a method called backpropagation.

play04:21

and we will talk about how backpropagation estimates these parameters in part 2 in this series.

play04:29

but, for now, just assume that we've already fit this neural network to this specific

play04:35

dataset, and that means we have already estimated these parameters.

play04:41

also, you may have noticed that some of the nodes have curved lines inside of them.

play04:48

these bent or curved lines are the building blocks for fitting a squiggle to data.

play04:55

the goal of this StatQuest is to show you how these identical curves can be reshaped

play05:01

by the parameter values and then added together to get a green squiggle that fits the data.

play05:09

note: there are many common bent or curved lines that we can choose for a neural network.

play05:16

this specific curved line is called soft plus, which sounds like a brand of toilet paper.

play05:23

alternatively, we could use this bent line, called ReLU, which is short for rectified linear unit, and sounds like a robot.

play05:33

or, we could use a sigmoid shape, or any other bent or curved line.

play05:39

oh no!

play05:40

it's the dreaded terminology alert!

play05:43

the curved or bent lines are called activation functions.

play05:48

when you build a neural network you have to decide which activation function, or functions, you want to use.

play05:57

when most people teach neural networks they use the sigmoid activation function.

play06:03

however, in practice, it is much more common to use the ReLU activation function, or the soft plus activation function.

play06:13

so we'll use the soft plus activation function in this StatQuest.

play06:18

anyway, we'll talk more about how you choose activation functions later in this series.

play06:25

note: this specific neural network is about as simple as they get.

play06:31

it only has one input node, where we plug in the dosage, only one output node to tell

play06:37

us the predicted effectiveness, and only two nodes between the input and output nodes.

play06:44

however, in practice, neural networks are usually much fancier and have more than

play06:51

one input node, more than one output node, different layers of nodes between the

play06:57

input and output nodes, and a spider web of connections between each layer of nodes.

play07:06

oh no!

play07:07

it's another terminology alert!

play07:09

these layers of nodes between the input and output nodes are called hidden layers.

play07:16

when you build a neural network one of the first things you do is decide how many

play07:21

hidden layers you want and how many nodes go into each hidden layer.

play07:26

although there are rules of thumb for making decisions about the hidden layers, you

play07:31

essentially make a guess and see how well the neural network performs, adding more layers and nodes if needed.

play07:40

now, even though this neural network looks fancy, it is still made from the same parts

play07:47

used in this simple neural network, which has only one hidden layer with two nodes.

play07:55

so let's learn how this neural network creates new shapes from the curved or bent

play08:00

lines in the hidden layer, and then adds them together to get a green squiggle that fits the data.

play08:08

note: to keep the math simple, let's assume dosages go from zero, for low, to one, for high.

play08:18

the first thing we are going to do is plug the lowest dosage, zero, into the neural network.

play08:25

now, to get from the input node to the top node in the hidden layer, this connection

play08:32

multiplies the dosage by negative 34.4 and then adds 2.14, and the result is an x-axis coordinate for the activation function.

play08:47

for example, the lowest dosage 0 is multiplied by negative 34.4, and then we add 2.14,

play08:57

to get 2.14 as the x-axis coordinate for the activation function.

play09:06

to get the corresponding y-axis value we plug 2.14 into the activation function, which in this case is the soft plus function.

play09:18

note: if we had chosen the sigmoid curve for the activation function then we would

play09:23

plug 2.14 into the equation for the sigmoid curve.

play09:29

and if we had chosen the ReLU bent line for the activation function, then we would plug 2.14 into the ReLU equation.

play09:39

but, since we are using soft plus for the activation function, we plug 2.14 into the soft plus equation.

play09:48

and the log of one plus e raised to the 2.14 power is 2.25.

play09:57

note: in statistics, machine learning, and most programming languages, the log function

play10:04

implies the natural log, or the log base e. anyway, the y-axis coordinate for the

play10:11

activation function is 2.25, so let's extend this y-axis up a little bit and put

play10:19

a blue dot at 2.25 for when dosage equals zero.

play10:26

now, if we increase the dosage a little bit and plug 0.1 into the input, the x-axis

play10:34

coordinate for the activation function is negative 1.3, and the corresponding y-axis

play10:41

value is 0.24. so, let's put a blue dot at 0.24 for when dosage equals 0.1. and,

play10:52

if we continue to increase the dosage values all the way to 1, the maximum dosage, we get this blue curve.

play11:02

note: before we move on I want to point out that the full range of dosage values,

play11:08

from 0 to 1, corresponds to this relatively narrow range of values from the activation function.

play11:16

in other words, when we plug dosage values, from 0 to 1, into the neural network,

play11:23

and then multiply them by negative 34.4 and add 2.14, we only get x-axis coordinates that are within the red box.

play11:35

and thus, only the corresponding y-axis values in the red box are used to make this new blue curve.

play11:44

bam!

play11:45

now we scale the y-axis values for the blue curve by negative 1.3. for example, when

play11:53

dosage equals zero the current y-axis coordinate for the blue curve is 2.25, so we

play12:01

multiply 2.25 by negative 1.3 and get negative 2.93. and negative 2.93 corresponds to this position on the y-axis.

play12:16

likewise, we multiply all of the other y-axis coordinates on the blue curve by negative

play12:23

1.3 and we end up with a new blue curve.

play12:27

bam!

play12:30

now, let's focus on the connection from the input node, to the bottom node in the hidden layer.

play12:37

however, this time, we multiply the dosage by negative 2.52, instead of negative 34.4,

play12:47

and we add 1.29, instead of 2.14, to get the x-axis coordinate for the activation function.

play12:58

remember, these values come from fitting the neural network to the data with backpropagation,

play13:05

and we'll talk about that in part two in this series.

play13:09

now, if we plug the lowest dosage, zero, into the neural network, then the x-axis

play13:16

coordinate for the activation function is 1.29.

play13:21

now we plug 1.29 into the activation function to get the corresponding y-axis value,

play13:30

and get 1.53. and that corresponds to this yellow dot.

play13:37

now, we just plug in dosage values from 0 to 1 to get the corresponding y-axis values, and we get this orange curve.

play13:48

note: just like before, i want to point out that the full range of dosage values,

play13:54

from 0 to 1, corresponds to this narrow range of values from the activation function.

play14:01

in other words, when we plug dosage values from 0 to 1 into the neural network we

play14:08

only get x-axis coordinates that are within the red box.

play14:14

and thus, only the corresponding y-axis values in the red box are used to make this new orange curve.

play14:22

so we see that fitting a neural network to data gives us different parameter estimates

play14:28

on the connections and that results in each node in the hidden layer using different

play14:34

portions of the activation functions to create these new and exciting shapes.

play14:40

now, just like before, we scale the y-axis coordinates on the orange curve, only this

play14:46

time we scale by a positive number: 2.28.

play14:50

[beep boop beep, for every number written on the screen] and that gives us this new orange curve.

play14:59

now the neural network tells us to add the y-axis coordinates from the blue curve

play15:06

to the orange curve, and that gives us this green squiggle.

play15:11

then, finally, we subtract 0.58 from the y-axis values on the green squiggle, and

play15:19

we have a green squiggle that fits the data.

play15:23

bam!

play15:25

now, if someone comes along and says that they are using dosage equal to 0.5 we can

play15:32

look at the corresponding y-axis coordinate on the green squiggle and see that the dosage will be effective.

play15:39

or, we can solve for the y-axis coordinate by plugging dosage equals 0.5 into the neural network, and do the math.
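
The on-screen arithmetic is not transcribed here; reconstructing it from the parameters given earlier in the transcript, the calculation works out as follows:

```latex
\begin{aligned}
y &= -1.3\,\operatorname{softplus}(-34.4 \times 0.5 + 2.14) + 2.28\,\operatorname{softplus}(-2.52 \times 0.5 + 1.29) - 0.58 \\
  &= -1.3\,\operatorname{softplus}(-15.06) + 2.28\,\operatorname{softplus}(0.03) - 0.58 \\
  &\approx -1.3 \times 0.00 + 2.28 \times 0.708 - 0.58 \\
  &\approx 1.03
\end{aligned}
```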

play16:08

[Music] and we see that the y-axis coordinate on the green squiggle is 1.03, and since

play16:19

1.03 is closer to 1 than 0, we will conclude that a dosage equal to 0.5 is effective.

play16:35

double bam!

play16:39

now, if you've made it this far you may be wondering why this is called a neural network.

play16:45

instead of a big fancy squiggle fitting machine.

play16:49

the reason is that way back in the 1940s and 50s, when neural networks were invented,

play16:56

they thought the nodes were vaguely like neurons, and the connections between the nodes were sort of like synapses.

play17:04

however, i think they should be called big fancy squiggle fitting machines, because that's what they do.

play17:11

note: whether or not you call it a squiggle fitting machine, the parameters that we

play17:17

multiply are called weights, and the parameters that we add are called biases.

play17:23

note: this neural network starts with two identical activation functions, but the

play17:29

weights and biases on the connections slice them, flip them, and stretch them into

play17:35

new shapes, which are then added together to get a squiggle that is entirely new.

play17:41

and then the squiggle is shifted to fit the data.

play17:45

now, if we can create this green squiggle with just two nodes in a single hidden layer,

play17:50

just imagine what types of green squiggles we could fit with more hidden layers and more nodes in each hidden layer.

play17:58

in theory, neural networks can fit a green squiggle to just about any dataset, no

play18:04

matter how complicated, and i think that's pretty cool.

play18:08

triple bam!

play18:12

now it's time for some shameless self-promotion!

play18:17

if you want to review statistics and machine learning offline check out the StatQuest study guides at statquest.org.

play18:25

there's something for everyone.

play18:28

hooray!

play18:29

we've made it to the end of another exciting StatQuest.

play18:32

if you like this StatQuest and want to see more, please subscribe.

play18:36

and if you want to support StatQuest consider contributing to my patreon campaign,

play18:41

becoming a channel member, buying one or two of my original songs, or a t-shirt, or a hoodie, or just donate.

play18:48

the links are in the description below.

play18:50

alright, until next time.

play18:52

quest on!