How a machine learns
Summary
TL;DR: This script delves into the workings of neural networks, the cornerstone of machine learning. It explains how ANNs learn through layers of neurons, utilizing weights and activation functions to process inputs and produce outputs. The importance of non-linearity introduced by activation functions like ReLU, sigmoid, and softmax is highlighted. The script also covers the backpropagation process using gradient descent to minimize loss functions like MSE and cross-entropy, adjusting weights for improved predictions. Hyperparameters' role in guiding the learning process is also underscored.
Takeaways
- **Neural Networks Fundamentals**: Neural networks like DNN, CNN, RNN, and LLMs are based on the basic structure of the ANN.
- **Structure of ANN**: ANNs consist of an input layer, hidden layers, and an output layer, with neurons connected by synapses.
- **Learning Process**: ANNs learn by adjusting weights through the training process to make accurate predictions.
- **Weighted Sum Calculation**: The first step in learning is calculating the weighted sum of inputs multiplied by their weights.
- **Activation Functions**: These are crucial for introducing non-linearity into the network, allowing complex problem-solving.
- **Output Layer Calculation**: The weighted sum is calculated for the output layer, potentially using different activation functions.
- **Prediction Representation**: The predicted result is denoted as \( \hat{y} \), while the actual result is \( y \).
- **Activation Functions Explained**: ReLU, sigmoid, and Tanh are common functions, with softmax used for multi-class classification.
- **Loss and Cost Functions**: These measure the difference between predicted and actual values, guiding the learning process.
- **Backpropagation**: A method for adjusting weights and biases when there is a significant difference between predictions and actual results.
- **Gradient Descent**: An optimization algorithm used to minimize the cost function by iteratively adjusting weights.
- **Iteration and Epochs**: The training process is repeated through epochs until the cost function reaches an optimum value.
- **Hyperparameters**: Parameters like learning rate and number of epochs that determine the learning process, set before training begins.
Q & A
What is the fundamental structure of an artificial neural network (ANN)?
-An ANN has three layers: an input layer, a hidden layer, and an output layer. Each node in these layers represents a neuron.
How do neural networks learn from examples?
-Neural networks learn from examples by adjusting the weights through a training process, aiming to minimize the difference between predicted and actual results.
What is the purpose of the weights in a neural network?
-The weights in a neural network retain the information learned through the training process and are used to calculate the weighted sum of inputs.
Why are activation functions necessary in neural networks?
-Activation functions are necessary to introduce non-linearity into the network, allowing it to learn complex patterns that linear models cannot capture.
What is the difference between a cost function and a loss function?
-A loss function measures the error for a single training instance, while a cost function measures the average error across the entire training set.
How does gradient descent help in training a neural network?
-Gradient descent is an optimization algorithm that adjusts the weights and biases to minimize the cost function by iteratively moving in the direction of the steepest descent.
What is the role of the learning rate in gradient descent?
-The learning rate determines the step size during the gradient descent process, influencing how quickly the neural network converges to the optimal solution.
Why might a neural network with only linear activation functions not be effective?
-A neural network with only linear activation functions would essentially be a linear model, which cannot capture complex, non-linear relationships in the data.
What is the significance of the number of epochs in training a neural network?
-The number of epochs determines how many times the entire training dataset is passed through the network. It affects the thoroughness of the training process.
How does backpropagation contribute to the learning process of a neural network?
-Backpropagation is the process of adjusting the weights and biases in the opposite direction of the gradient to minimize the cost function when there is a significant difference between predicted and actual results.
What is the softmax activation function used for, and how does it differ from the sigmoid function?
-The softmax activation function is used for multi-class classification problems, outputting a probability distribution across multiple classes. The sigmoid function, on the other hand, is used for binary classification, outputting a probability between 0 and 1 for a single class.
Outlines
Neural Networks Learning Process
This paragraph introduces the fundamental concepts of machine learning, particularly focusing on how neural networks learn. It explains the structure of an artificial neural network (ANN), which consists of an input, hidden, and output layer. Each neuron in these layers is connected by synapses, analogous to the human brain. The learning process is described through the calculation of weighted sums, application of activation functions to introduce non-linearity, and the prediction of outcomes. Activation functions such as ReLU, sigmoid, and softmax are mentioned, each serving different purposes in the learning process. The paragraph also touches on the importance of activation functions in preventing linearity and enabling the network to handle complex problems.
Evaluating Neural Network Performance
This section delves into how neural networks assess their learning effectiveness. It introduces the concepts of loss and cost functions, which are used to measure the discrepancy between predicted and actual results. The paragraph explains that these functions are crucial for guiding the learning process by identifying areas where the model's predictions are inaccurate. The use of mean squared error (MSE) for regression and cross-entropy for classification problems is highlighted. The paragraph also discusses the backpropagation algorithm, which is used to adjust the weights and biases of the network to minimize the cost function. Gradient descent is introduced as the method for finding the optimal direction and step size for these adjustments, with the learning rate being a key hyperparameter influencing the training speed and convergence. The iterative nature of learning is emphasized, where multiple epochs of training are performed until the cost function reaches an optimum.
Hyperparameters and Automated Machine Learning
The final paragraph discusses the role of hyperparameters in determining the learning process of a neural network. It mentions that data scientists typically select these parameters through experimentation to find the best combination for model performance. However, tools like AutoML can automate this process, saving time and resources. The paragraph also reviews the key terms introduced in the script, such as weights, biases, activation functions, learning rate, and epochs. It emphasizes that understanding these components is essential for building effective machine learning models. The script concludes by encouraging learners to revisit these concepts in upcoming lessons and labs.
Keywords
Neural Networks
Artificial Neural Network (ANN)
Weights
Activation Functions
Backpropagation
Gradient Descent
Learning Rate
Epochs
Cost Function
Hyperparameters
AutoML
Highlights
Neural networks are fundamental to machine learning.
Different types of neural networks solve various problems.
Artificial Neural Networks (ANN) are the basis for all neural network models.
ANNs consist of an input, hidden, and output layer.
Neurons and synapses are the building blocks of ANNs.
Neural networks learn through examples and make predictions.
The learning process involves calculating the weighted sum.
Activation functions introduce non-linearity to neural networks.
Activation functions like ReLU, sigmoid, and Tanh are widely used.
Softmax is used for multi-class classification.
Loss functions measure the difference between predicted and actual results.
Backpropagation is used to adjust weights and biases.
Gradient descent is a method to minimize the cost function.
The learning rate determines the step size in gradient descent.
An epoch represents one complete pass of the training process.
Weights are adjusted iteratively until the cost function is optimized.
Hyperparameters are set by humans and determine how a machine learns.
AutoML can automatically select hyperparameters.
The learning process of neural networks is iterative and continuous.
Transcripts
To understand machine learning, you must first understand how neural networks learn.
This includes exploring this learning process and the terms associated with it.
If you are already familiar with the ML theories and terminologies, feel free to skip this
lesson.
How do machines learn?
And how do they assess their learning?
Before you dive into building an ML model, let's take a look at how a neural network
learns.
You may already know about various neural networks, such as deep neural networks (or
DNN), convolutional neural networks (or CNN), recurrent neural networks (or RNN), and more
recently large language models (LLMs).
These networks are used to solve different problems.
All of these models stem from the most basic: artificial neural network (or ANN).
ANNs are also referred to as neural networks or shallow neural networks.
Let's focus on ANN to see how a neural network learns.
An ANN has three layers: an input layer, a hidden layer, and an output layer.
Each node represents a neuron.
The lines between neurons simulate synapses, which is how information is transmitted in
a human brain.
For instance, if you input article titles from multiple resources, the neural network
can tell you which media outlet or platform the article belongs to, such as GitHub, New
York Times, and TechCrunch.
How does an ANN learn from examples and then make predictions?
Let's examine how it works in depth.
Let's assume you have two input neurons or nodes, one hidden neuron, and one
output neuron.
Above the links between neurons are weights.
The weights retain the information that a neural network learns through the training process;
they are the unknowns that training aims to discover.
The first step is to calculate the weighted sum.
This is done by multiplying each input value by its corresponding weight, and then summing
the products.
The weighted sum normally also includes a bias component b. However, to focus on the core idea, ignore it for now.
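The first step can be sketched in a few lines of Python (the numbers and variable names are illustrative, not from the lesson):

```python
# Step 1, sketched: multiply each input by its weight, then sum the products.
# The bias component b is omitted here, as in the text.
def weighted_sum(inputs, weights):
    return sum(x * w for x, w in zip(inputs, weights))

# Two input neurons feeding one hidden neuron (example values).
z = weighted_sum([1.0, 2.0], [0.5, -0.25])  # 1.0*0.5 + 2.0*(-0.25) = 0.0
```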
The second step is to apply an activation function to the weighted sum.
What is an activation function, and why do you need it?
Let's set that question aside for a moment and come back to it soon.
In the third step, the weighted sum is calculated for the output layer, assuming multiple neurons
in the hidden layers.
The fourth step is to apply an activation function to the weighted sum.
This activation function can be different from the one applied to the hidden layers.
The result is the predicted y, which forms the output layer.
You use y hat to represent the predicted result and y as the actual result.
Now let's get back to activation functions.
What does an activation function do?
Well, an activation function is used to prevent linearity or add non-linearity.
What does that mean?
Think about a neural network.
Without activation functions, the predicted result y hat will always be a linear function
of the input x, regardless of the number of layers between input and output.
Let's walk through this for clarity.
Without the activation function, the value of the hidden neuron h equals w1 times x1
plus w2 times x2.
Please note that, to keep this illustration simple, we ignored the bias component b, which
you often see in other ML materials.
The output y hat therefore equals w3 times h, which expands to a constant a times x1 plus
a constant b times x2, where a is w3 times w1 and b is w3 times w2.
In other words, the output y hat is a linear combination of the input x.
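A quick numerical check of this collapse, with arbitrary example weights:

```python
# Without activation functions, a two-layer network collapses
# into a single linear map of the inputs.
w1, w2, w3 = 0.4, -0.7, 2.0  # arbitrary example weights

def two_layer(x1, x2):
    h = w1 * x1 + w2 * x2  # hidden neuron, no activation
    return w3 * h          # output neuron, no activation

def one_layer(x1, x2):
    # Equivalent direct map with constants a = w3*w1 and b = w3*w2.
    return (w3 * w1) * x1 + (w3 * w2) * x2

# The two networks agree on every input, so the hidden layer adds nothing.
assert abs(two_layer(3.0, 5.0) - one_layer(3.0, 5.0)) < 1e-9
```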
If y hat is a linear function of x, you don't need all the hidden layers; a single input
and output layer would suffice.
You might already know that linear models do not perform well when handling complex
problems.
That's why you must use activation functions to convert a linear network into a non-linear
one.
What are the widely used activation functions?
You can use the rectified linear unit (or ReLU) function, which turns an input value
to zero if it's negative, or keeps the original value if it's positive.
You can use the sigmoid function, which turns the input to a value between 0 and 1.
And the hyperbolic tangent (Tanh) function, which rescales and shifts the sigmoid curve
to generate a value between -1 and +1.
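These three functions are simple enough to sketch directly in Python (a minimal illustration, not a production implementation):

```python
import math

# The three activation functions described above.
def relu(z):
    return max(0.0, z)  # negative inputs become zero; positive pass through

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))  # squashes any input into (0, 1)

def tanh(z):
    return math.tanh(z)  # squashes any input into (-1, 1)

relu(-2.0)    # 0.0
sigmoid(0.0)  # 0.5
tanh(0.0)     # 0.0
```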
Another interesting and important activation function is called softmax.
Think about sigmoid: it generates a value from zero to one and is used for binary classification
in logistic regression models.
An example for this would be deciding whether an email is spam.
What if you have multiple categories, such as GitHub, NYTimes, and TechCrunch?
Here you must use softmax, which is the activation function for multi-class classification.
It maps each output to a [0,1] range in a way that the total adds up to 1.
Therefore, the output of softmax is a probability distribution.
Skipping the math, you can conclude that softmax is used for multi-class classification, whereas
sigmoid is used for binary classification in logistic regression models.
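A minimal softmax sketch in Python (the scores are illustrative; subtracting the maximum score first is a standard numerical-stability trick not mentioned in the lesson):

```python
import math

# Softmax maps raw scores to a probability distribution:
# each output lands in [0, 1] and the outputs sum to 1.
def softmax(scores):
    m = max(scores)  # subtract the max first, for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three classes, e.g. GitHub, NYTimes, TechCrunch.
probs = softmax([2.0, 1.0, 0.1])
# probs sums to 1, and the largest score gets the largest probability.
```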
Also note that you don't need to have the same activation function across different
layers.
For instance, you can have ReLU for hidden layers and softmax for the output layer.
Now that you understand the activation function and get a predicted y, how do you know if
the result is correct?
You use an assessment called a loss function or cost function to measure the difference
between the predicted y and the actual y.
A loss function is used to calculate errors for a single training instance, whereas a cost
function is used to calculate errors across the entire training set.
Therefore, in step five, you calculate the cost function, which the network will then try to minimize.
If the difference is significant, the neural network knows that it did a bad job in predicting
and must go back to learn more and adjust parameters.
Many different cost functions are used in practice.
For regression problems, mean squared error, or MSE, is a common one used in linear regression
models.
MSE equals the average of the squared differences between y hat and y.
For classification problems, cross-entropy is typically used to measure the difference
between the predicted and actual probability distributions in logistic regression models.
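Both cost functions can be sketched in a few lines of Python (the example values are illustrative):

```python
import math

# Mean squared error: the average of the squared differences
# between predictions (y_hat) and actual values (y).
def mse(y_hat, y):
    return sum((p - a) ** 2 for p, a in zip(y_hat, y)) / len(y)

# Cross-entropy for one instance: compares the actual distribution
# (y_true) with the predicted distribution (y_pred).
def cross_entropy(y_true, y_pred):
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

mse([2.0, 3.0], [1.0, 3.0])                # 0.5
cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])  # -log(0.8), about 0.223
```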
If the difference between the predicted and actual results is significant, you must go
back to adjust weights and biases to minimize the cost function.
This potential sixth step is called backpropagation.
The challenge now is how to adjust the weights.
The solution is slightly complex, but indeed the most interesting part of a neural network.
The idea is to take cost functions and turn them into a search strategy.
That's where gradient descent comes in.
Gradient descent refers to the process of walking down the surface formed by the cost
function and finding the bottom.
It turns out that the problem of finding the bottom can be divided into two different and
important questions: The first is: which direction should you take?
The answer involves the derivative.
Let's say you start from the top left.
You calculate the derivative of the cost function and find itâs negative.
This means the angle of the slope is negative and you are at the left side of the curve.
To get to the bottom, you must go down and right.
Then, at one point, you are on the right side of the curve, and you calculate the derivative
again.
This time the value is positive, and you must slide again to the left.
You calculate the derivative of the cost function every time to decide which direction to take.
Repeat this process, according to gradient descent, and you will eventually reach the
bottom of that region, known as a local minimum.
The second question in finding the bottom is what size should the steps be?
The step size depends on the learning rate, which determines the learning speed of how
fast you bounce around to reach the bottom.
Step size, or "learning rate," is a hyperparameter that is set before training.
If the step size is too small, your training might take too long.
If the step size is too large, you might bounce from wall to wall or even bounce out of the
curve entirely, without converging.
When the step size is just right, you're set.
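The direction-and-step-size idea can be sketched with a one-dimensional cost curve (the cost function and numbers here are illustrative, not from the lesson):

```python
# Gradient descent on a bowl-shaped cost, cost(w) = (w - 3)**2,
# whose derivative is 2*(w - 3) and whose bottom is at w = 3.
def gradient_descent(w, learning_rate, epochs):
    for _ in range(epochs):
        gradient = 2 * (w - 3)            # which direction? follow the slope
        w = w - learning_rate * gradient  # step size set by the learning rate
    return w

gradient_descent(w=10.0, learning_rate=0.1, epochs=100)  # ends near 3
# With a learning rate that is too large (say 1.1), w bounces
# farther away on every step instead of converging.
```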
The seventh and last step is iteration.
One complete pass of the training process from step 1 to step 6 is called an epoch.
You can set the number of epochs as a hyperparameter in training.
Weights or parameters are adjusted until the cost function reaches its optimum.
You can tell that the cost function has reached its optimum when the value stops decreasing,
even after many iterations.
This is how a neural network learns.
It iterates the learning by continuously adjusting weights to improve behavior until it reaches
the best result.
This is similar to a human learning lessons from the past.
We have illustrated a simple example with two input neurons (nodes), one hidden neuron,
and one output neuron.
In practice, you might have many neurons in each layer.
Regardless of the number of neurons in the input, hidden, and output layer, the fundamental
process of how a neural network learns remains the same.
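Putting steps one through seven together, here is a deliberately tiny end-to-end sketch: one neuron, two inputs, no bias or activation, trained with gradient descent on MSE. All numbers and the target function are illustrative assumptions, not from the lesson:

```python
# One neuron, two inputs, no bias or activation: y_hat = w1*x1 + w2*x2.
# The "mystery" weights to discover are 2 and 1 (y = 2*x1 + 1*x2).
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0)]
w = [0.0, 0.0]       # initial weights, set before training
learning_rate = 0.1  # hyperparameter
epochs = 200         # hyperparameter

for _ in range(epochs):
    # Gradient of the MSE cost with respect to each weight,
    # averaged over the whole training set.
    grads = [0.0, 0.0]
    for x, y in data:
        y_hat = w[0] * x[0] + w[1] * x[1]  # step 1: weighted sum
        error = y_hat - y                  # compare prediction to actual
        grads[0] += 2 * error * x[0] / len(data)
        grads[1] += 2 * error * x[1] / len(data)
    # Backpropagation step: move each weight against its gradient.
    w = [wi - learning_rate * g for wi, g in zip(w, grads)]

# After training, w is close to the target weights [2.0, 1.0].
```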
Learning about neural networks can be exciting, but also overwhelming with the large number
of new terms.
Let's take a moment to review them.
In a neural network, weights and biases are parameters learned by the machine during training.
You have no control over the parameters except to set their initial values.
The number of layers and neurons, activation functions, learning rate, and epochs are hyperparameters,
which are decided by a human before training.
The hyperparameters determine how a machine learns.
For example, the learning rate decides how fast a machine learns, and the number of epochs
defines how many times the learning iterates.
Normally, data scientists choose the hyperparameters and experiment with them to find the optimum
combination.
However, if you use a tool like AutoML, it automatically selects the hyperparameters
for you and saves plenty of experimentation time.
You also learned about cost or loss functions, which are used to measure the difference between
the predicted and actual values.
They are used to minimize error and improve performance.
You use backpropagation to modify the weights and biases if the difference is significant,
and gradient descent to decide how to tune them and when to stop.
These terms are your best friends when building an ML model.
You'll revisit them in upcoming lessons and labs.