Watching Neural Networks Learn

Emergent Garden
17 Aug 2023 · 25:27

Summary

TL;DR: This video explores the concept of function approximation in neural networks, emphasizing their role as universal function approximators. It delves into how neural networks learn by adjusting weights and biases to fit data points, using examples like curve fitting and image recognition. The video also discusses the challenges of high-dimensional problems and the 'curse of dimensionality,' comparing neural networks to other mathematical tools like Taylor and Fourier series for function approximation. It concludes by highlighting the potential of these methods for real-world applications and poses a challenge to the audience to improve the approximation of the Mandelbrot set.

Takeaways

  • 🧠 Neural networks are universal function approximators, capable of learning and modeling complex relationships in data.
  • 🌐 Functions are fundamental to describing the world, with everything from sound to light being represented through mathematical functions.
  • 📊 The goal of artificial intelligence is to create programs that can understand, model, and predict the world, often through self-learning functions.
  • 🔍 Neural networks learn by adjusting weights and biases to minimize error between predicted and actual outputs, a process known as backpropagation.
  • 📉 Activation functions like ReLU and leaky ReLU play a crucial role in shaping the output of neurons within neural networks.
  • 🔱 Neural networks can handle both low-dimensional and high-dimensional data, although the complexity of learning increases with dimensionality.
  • 📊 The Fourier series and Taylor series are mathematical tools that can be used to approximate functions and can be integrated into neural networks to improve performance.
  • 🌀 The Mandelbrot set demonstrates the infinite complexity that can be captured even within low-dimensional functions, challenging neural networks to approximate it accurately.
  • 📈 Normalizing data and reducing the learning rate are practical techniques for optimizing neural network training and improving approximation quality.
  • 🚀 Despite the theoretical ability of neural networks to learn any function, practical limitations and the curse of dimensionality can affect their effectiveness in higher dimensions.

Q & A

  • Why are functions important in describing the world?

    -Functions are important because they describe the world by representing relationships between numbers. Everything can be fundamentally described with numbers and the relationships between them, which we call functions. This allows us to understand, model, and predict the world around us.

  • What is the goal of artificial intelligence in the context of function approximation?

    -The goal of artificial intelligence in function approximation is to create programs that can understand, model, and predict the world, or even have them write themselves. This involves building their own functions that can fit data points and accurately predict outputs for inputs not in the data set.

  • How does a neural network function as a universal function approximator?

    -A neural network functions as a universal function approximator by adjusting its weights and biases through a training process to minimize error. It can fit any data set by bending its output to match the given inputs and outputs, effectively constructing any function.

  • What is the role of the activation function in a neural network?

    -The activation function in a neural network defines the mathematical shape of a neuron. It determines how the neuron responds to different inputs by introducing non-linearity into the network, allowing it to learn and represent complex functions.

  • Why is backpropagation a crucial algorithm for training neural networks?

    -Backpropagation is crucial for training neural networks because it efficiently computes the gradient of the loss function with respect to the weights, allowing the network to update its weights in a way that minimizes the loss, thus improving its predictions over time.
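    In symbols, the update that backpropagation makes possible is a plain gradient-descent step (η is the learning rate; the video does not say which optimizer variant it actually uses):

```latex
w \;\leftarrow\; w - \eta \,\frac{\partial L}{\partial w}
```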

  • How does normalizing inputs improve the performance of a neural network?

    -Normalizing inputs improves the performance of a neural network by scaling the values to a range that is easier for the network to deal with, such as -1 to 1. This makes the optimization process more stable and efficient, as the inputs are smaller and centered at zero.
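    A minimal sketch of the kind of linear shift-and-scale described here; the function name and the 1400-pixel example range are illustrative, not taken from any code shown in the video:

```python
import numpy as np

def to_unit_range(x, lo, hi):
    """Linearly rescale values from [lo, hi] to [-1, 1] (shift and scale)."""
    return 2.0 * (np.asarray(x, dtype=np.float32) - lo) / (hi - lo) - 1.0

# e.g. pixel indices 0..1399 mapped into [-1, 1] before being fed to the network
coords = to_unit_range(np.arange(1400), 0, 1399)
```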

  • What is the curse of dimensionality and how does it affect function approximation?

    -The curse of dimensionality refers to the phenomenon where the volume of the input space increases so fast that the available data becomes sparse. This makes function approximation and machine learning tasks computationally impractical or impossible for higher-dimensional problems, as the number of computations needed grows exponentially with the dimensionality of the inputs.

  • How do Fourier features enhance the performance of a neural network in function approximation?

    -Fourier features enhance the performance of a neural network by providing additional mathematical building blocks in the form of sine and cosine terms. These terms allow the network to approximate functions more effectively, especially in low-dimensional problems, by capturing the wave-like nature of the data.

  • What is the difference between a Taylor series and a Fourier series in the context of function approximation?

    -In the context of function approximation, a Taylor series is an infinite sum of polynomial functions that approximate a function around a specific point, while a Fourier series is an infinite sum of sine and cosine functions that approximate a function within a given range of points. Both can be used to enhance neural networks by providing additional input features.
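    For reference, the truncated forms of the two series being contrasted, written out; the coefficients a_n and b_n are what a formula or a trained network has to supply:

```latex
\text{Taylor (around } x = 0\text{):}\quad f(x) \approx \sum_{n=0}^{N} a_n x^n
\qquad\qquad
\text{Fourier (one period):}\quad f(x) \approx a_0 + \sum_{n=1}^{N} \left( a_n \cos(nx) + b_n \sin(nx) \right)
```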

  • Why might using Fourier features in high-dimensional problems lead to overfitting?

    -Using Fourier features in high-dimensional problems might lead to overfitting because the network can become too tailored to the specific data points, capturing noise and irregularities rather than the underlying function. This happens when the network has too many parameters relative to the amount of data, leading to a poor generalization to new, unseen data.

Outlines

00:00

🧠 Understanding Neural Networks as Function Approximators

The paragraph introduces the concept of neural networks as universal function approximators, emphasizing their ability to learn and model complex relationships in data. It explains that functions are fundamental to describing the world, and neural networks aim to approximate these functions based on given data points. The video's purpose is to explore neural networks learning in complex spaces, their limitations, and alternative machine learning methods. The speaker admits to being a programmer with a dislike for math but acknowledges its importance. The process of function approximation in neural networks is described, where the goal is to find a function that fits a given dataset and can predict outputs for new inputs. The concept of 'curve fitting' is introduced, and the video shows an actual neural network learning to fit a curve to data points.

05:00

📈 Deeper Dive into Neural Network Function Approximation

This paragraph delves deeper into how neural networks learn through the training process, using backpropagation to adjust weights and minimize error. It discusses higher-dimensional problems, such as learning an entire image, where the inputs are pixel coordinates and the outputs are pixel values. The paragraph explains the use of activation functions like ReLU and leaky ReLU, and how they contribute to the network's learning process. The speaker also touches on the importance of normalizing inputs and using appropriate activation functions in the output layer to ensure the network's predictions fall within the desired range. The paragraph concludes with a demonstration of how neural networks can learn complex shapes, such as parametric surfaces, despite the challenges involved.

10:04

🌀 Tackling Complex Shapes with Neural Networks

The speaker attempts to use a neural network to approximate complex shapes, such as a spiral shell surface, and acknowledges the challenges in getting the network to accurately model such intricate forms. The paragraph also introduces the Mandelbrot set as an example of an infinitely complex fractal that is difficult for neural networks to approximate due to its detailed and intricate nature. Despite the network's efforts, it struggles to capture all the details of the Mandelbrot set, highlighting the limitations of neural networks in approximating certain types of functions. The speaker then explores alternative mathematical tools like the Taylor series for function approximation, which is likened to a single-layer neural network.

15:05

🔍 Exploring Fourier Series for Function Approximation

The paragraph discusses the use of the Fourier series as a tool for function approximation, contrasting it with the Taylor series. The Fourier series is shown to be effective for approximating functions within a given range, particularly when dealing with periodic data. The speaker demonstrates how computing additional Fourier features and feeding them into a neural network can significantly improve the network's performance in learning and approximating functions. The success of this method is illustrated through its application to image data, where the Fourier-enhanced network outperforms a standard neural network. The paragraph also addresses the 'curse of dimensionality,' noting that while neural networks handle high-dimensional data well, other methods like the Fourier series can become computationally impractical.

20:07

🌐 Applying Fourier Features to Real-World Data

The speaker applies the concept of Fourier features to real-world data, specifically the MNIST dataset of handwritten digits. Despite the initial success of Fourier features in lower-dimensional problems, the paragraph reveals that their effectiveness diminishes with higher-dimensional data, such as full images. The use of Fourier features leads to overfitting, where the network performs well on training data but fails to generalize to new data. The speaker concludes by emphasizing that no single method is universally best for all tasks and that the exploration of function approximation methods, even in low-dimensional toy problems, can provide valuable insights for more complex, real-world applications.

25:08

🚀 Concluding Thoughts and a Challenge

In the final paragraph, the speaker wraps up the discussion by reiterating the importance of function approximation in machine learning and the versatility of neural networks. They issue a challenge to the viewers to improve upon the Mandelbrot set approximation, emphasizing that there is always room for discovery and innovation. The speaker expresses optimism that the exploration of these methods could lead to better solutions in the field of machine learning, encouraging a collaborative and curious approach to problem-solving.

Keywords

💡Neural Networks

Neural networks are a series of algorithms modeled loosely after the human brain. They are designed to recognize patterns. In the context of the video, neural networks are described as 'Universal function approximators,' capable of learning and modeling complex relationships between inputs and outputs. The video uses neural networks to demonstrate how they can be trained to approximate various functions and patterns, such as recognizing handwritten digits from the MNIST dataset.

💡Universal Function Approximators

This term refers to the ability of neural networks to model any function given enough data and complexity. The video emphasizes the importance of this concept by stating that functions describe the world, and thus, the capacity to approximate functions is crucial for understanding and predicting real-world phenomena. The video provides examples of how neural networks can learn to approximate complex shapes and patterns, showcasing their universal approximating capabilities.

💡Backpropagation

Backpropagation is a method used to calculate the gradient of the loss function with respect to all the weights in the network, which is essential for training neural networks. The video mentions backpropagation as the algorithm that helps in adjusting the weights of the network to minimize error and improve predictions over time. It's a fundamental concept in neural network training, although the video does not delve into the specifics of the algorithm.

💡Activation Functions

Activation functions are mathematical functions used to add non-linear properties to a neural network. They determine the output of a neuron and are crucial for the network to learn complex patterns. The video discusses different types of activation functions such as ReLU and leaky ReLU, explaining how they shape the mathematical form of neurons and thus the overall function that the neural network can approximate.
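As a small illustration of the shapes being described, here are the two activations in plain NumPy; the 0.01 negative slope is a common default, not necessarily the one used in the video:

```python
import numpy as np

def relu(x):
    """ReLU: zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: lets a small fraction of negative inputs pass through."""
    return np.where(x > 0, x, alpha * x)
```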

💡Loss Function

A loss function is a measure of how well the neural network is performing. It calculates the difference between the predicted output and the actual output. The video explains that the goal of training a neural network is to minimize this loss, which is achieved by adjusting the weights through backpropagation. The loss function guides the learning process by quantifying the network's error.
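One common choice for this kind of regression-style curve fitting is mean squared error; the video does not name its exact loss, so take this as a representative example rather than the author's formula:

```latex
L = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2
```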

💡Feature Vectors

In the context of machine learning, a feature vector is a list of numbers that describe an object or an event. The video mentions that inputs and outputs of a neural network are sometimes referred to as features and predictions, respectively. These feature vectors are arrays of numbers that are used by the network to make predictions or decisions, such as recognizing patterns in images.

💡Fully Connected Feedforward Network

This is a type of neural network where each neuron in one layer is connected to every neuron in the next layer. The video discusses this architecture as it pertains to the learning process, where each neuron processes its inputs, and the resulting outputs are passed as inputs to the next layer. This structure allows the network to learn complex functions by combining the outputs of many neurons.
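A minimal NumPy sketch of such a forward pass; the layer sizes, random weights, and leaky-ReLU slope are illustrative assumptions, not the video's configuration:

```python
import numpy as np

def forward(x, layers, alpha=0.01):
    """Pass input x through each (W, b) layer; leaky ReLU on hidden layers only."""
    for i, (W, b) in enumerate(layers):
        x = W @ x + b                           # weighted sums for the whole layer
        if i < len(layers) - 1:
            x = np.where(x > 0, x, alpha * x)   # leaky ReLU on hidden layers
    return x

# toy network: two inputs -> 16 hidden units -> 1 output
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(16, 2)), np.zeros(16)),
          (rng.normal(size=(1, 16)), np.zeros(1))]
y = forward(np.array([0.3, -0.7]), layers)
```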

💡Curse of Dimensionality

The curse of dimensionality refers to the various problems that occur when analyzing and organizing data in high-dimensional spaces. The video touches on this concept when discussing the challenges of using certain mathematical tools like the Fourier series for high-dimensional data. It illustrates how the complexity of function approximation increases exponentially with the number of dimensions.

💡Fourier Series

The Fourier series is a mathematical tool used to represent a function as a sum of sine and cosine functions. In the video, the Fourier series is used to create additional 'Fourier features' that can be fed into a neural network to improve its ability to approximate certain types of functions, particularly those that can be represented as waves or combinations of waves.

💡Mandelbrot Set

The Mandelbrot set is a set of complex numbers that is famous for its fractal boundary. It is used in the video to illustrate the challenges of function approximation. The video attempts to approximate the Mandelbrot set using a neural network, highlighting the difficulty of capturing its infinite complexity even with a low-dimensional input.

Highlights

Neural networks are universal function approximators, capable of learning almost anything.

Functions are crucial as they describe the world, including sound and light.

The goal of AI is to create programs that can understand, model, and predict the world.

Neural networks build their own functions through function approximation.

Neural networks learn by adjusting weights to minimize error between predicted and true outputs.

The architecture used in the video is a fully connected feed-forward network.

Neurons in a network learn individual features of the overall function.

The training process involves backpropagation, which is not explained in detail in the video.

Higher-dimensional problems, such as learning an entire image, are approached by treating each pixel as an output.

Normalization and activation functions like leaky ReLU can improve neural network performance.

The video explores the limitations of neural networks when approximating complex shapes like spiral surfaces.

The Mandelbrot set is used to demonstrate the challenge of approximating infinitely complex fractals.

Alternative function approximation methods like Taylor and Fourier series are discussed.

Fourier features, derived from the Fourier series, can significantly improve neural network performance on certain tasks.

The curse of dimensionality affects many function approximation methods, but neural networks handle it well.

The video concludes with a challenge to approximate the Mandelbrot set using a universal function approximator.

Transcripts

You are currently watching a neural network learn.

About a year ago I made a video about how neural networks can learn almost anything, and this is because they are universal function approximators. Why is that so important? Well, you might as well ask why functions are important. They are important because functions describe the world. Everything is described by functions. That's right: the sound of my voice on your eardrum? Function. The light that's hitting your eyeballs right now? Function. Different classes in mathematics, different areas in mathematics, study different kinds of functions: high school math studies second-degree, one-variable polynomials; calculus studies smooth one-variable functions; and it goes on and on. Functions describe the world.

Yes, correct, thanks Thomas. He gets a little excited, but he's right: the world can fundamentally be described with numbers and relationships between numbers. We call those relationships functions, and with functions we can understand, model, and predict the world around us. The goal of artificial intelligence is to write programs that can also understand, model, and predict the world, or rather, have them write themselves. So they must be able to build their own functions. That is the point of function approximation, and that is what neural networks do: they are function-building machines. In this video I want to expand on the ideas of my previous video by watching actual neural networks learn strange shapes in strange spaces. Here we will encounter some very difficult challenges, discover the limitations of neural networks, and explore other methods from machine learning and mathematics to approach this open problem.

Now, I am a programmer, not a mathematician, and to be honest I kind of hate math. I've always found it difficult and intimidating, but that's a bad attitude, because math is unavoidably useful and occasionally beautiful. I'll do my best to keep things simple and accurate for an audience like me, but know that I'm going to have to brush over a lot of things and I'm going to be pretty informal.

I recommend you watch my previous video, but to summarize: functions are input-output machines. They take an input set of numbers and output a corresponding set of numbers, and the function defines the relationship between those numbers. The particular problem that neural networks solve is when we don't know the definition of the function that we're trying to approximate. Instead we have a sample of data points from that function: inputs and outputs. This is our data set. We must approximate a function that fits these data points and allows us to accurately predict outputs given inputs that are not in our data set. This process is also called curve fitting, and you can see why. Now, this is not some handcrafted animation; it is an actual neural network attempting to fit the curve to the data, and it does so by sort of bending the line into shape. This process is generalizable such that it can fit the curve to any data set and thus construct any function. This makes it a universal function approximator.

The network itself is also a function and should approximate some unknown target function. The particular neural architecture we're dealing with in this video is called a fully connected feed-forward network. Its inputs and outputs are sometimes called features and predictions, and they take the form of vectors (arrays of numbers). The overall function is made up of lots of simple functions called neurons that take many inputs but only produce one output. Each input is multiplied by its own weight and added up along with one extra weight called a bias. Let's rewrite this weighted sum with some linear algebra: we can put our inputs into a vector with an extra one for the bias, put our weights into another vector, and then take what is called the dot product. Let's just make up some example values. To take the dot product we multiply each input by each weight and then add them all up. Finally, this dot product is passed to a very simple activation function, in this case a ReLU, which here returns zero. We could use a different activation function, but a ReLU looks like this. The activation function defines the neuron's mathematical shape, while the weights shift and squeeze and stretch that shape.
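A minimal NumPy version of the single-neuron computation just described; the example values here are made up rather than taken from the video, but they likewise land below zero, so the ReLU outputs zero:

```python
import numpy as np

x = np.array([0.5, -1.2, 1.0])    # two inputs plus an extra 1 for the bias
w = np.array([0.8,  0.4, -0.3])   # two weights plus the bias weight

z = np.dot(x, w)                  # weighted sum: 0.4 - 0.48 - 0.3 = -0.38
y = max(0.0, z)                   # ReLU clamps negatives to zero -> 0.0
```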

We feed the original inputs of our network to a layer of neurons, each with their own learned weights and each with their own output value. We stack these outputs together into a vector and then feed that output vector as inputs to the next layer, and the next, and the next, until we get the final output of the network. Each neuron is responsible for learning its own little piece, or feature, of the overall function, and by combining many neurons we can build an ever more intricate function. With an infinite number of neurons we can provably build any function.

The values of the weights, or parameters, are discovered through the training process. We give the network inputs from our data set and ask it to predict the correct outputs, over and over and over. The goal is to minimize the network's error, or loss, which is some measurement of the difference between the predicted outputs and the true outputs. Over time the network should do better and better as loss goes down. The algorithm for this is called backpropagation, and I am again not going to explain it in this video; I'll make a video on it eventually, I promise. It's a pretty magical algorithm.
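The video shows no code and names no framework, so purely as a hedged sketch, a curve-fitting training loop of the kind described (predict, measure loss, backpropagate, update) might look like this in PyTorch; the target function, network size, loss, and optimizer are all stand-ins:

```python
import torch
from torch import nn

# toy 1D data set: noisy samples of a stand-in target function (not the video's)
x = torch.linspace(-1, 1, 200).unsqueeze(1)
y = torch.sin(3 * x) + 0.05 * torch.randn_like(x)

model = nn.Sequential(
    nn.Linear(1, 64), nn.LeakyReLU(),
    nn.Linear(64, 64), nn.LeakyReLU(),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(2000):
    pred = model(x)            # predicted outputs for the whole data set
    loss = loss_fn(pred, y)    # error between predictions and true outputs
    optimizer.zero_grad()
    loss.backward()            # backpropagation computes the gradients
    optimizer.step()           # nudge the weights to reduce the loss
```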

However, this is a baby problem. What about functions with more than just one input or output, that is to say, higher-dimensional problems? The dimensionality of a vector is defined by the number of numbers in that vector. For a higher-dimensional problem, let's try to learn an image. The input vector is the row-column coordinates of a pixel, and the output vector is the value of the pixel itself; in math-speak, we would say that this function maps from R2 to R1. Our data set is all of the pixels in an image; let's use this unhappy man as an example. A pixel value of 0 is black and 1 is white, although I'm going to use different color schemes because it's pretty. As we train, we take snapshots of the learned function as the approximation improves. That's what you're seeing now, and that's what you saw at the beginning of this video. But to clarify, this image is not a single output from the network; rather, every individual pixel is a single output. We are looking at the entire function all at once, and we can do this because it is very low dimensional.
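One way to build the data set just described, assuming the image has already been loaded as a 2D grayscale array with values in [0, 1]; the helper name and details are illustrative:

```python
import numpy as np

def image_to_dataset(img):
    """img: 2D array of grayscale values in [0, 1].
    Returns (inputs, targets): (row, col) coordinate pairs and their pixel values."""
    h, w = img.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    inputs = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(np.float32)
    targets = img.ravel().astype(np.float32).reshape(-1, 1)
    return inputs, targets
```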

You'll also notice that the learning seems to slow down; it's not changing as abruptly as it was at the beginning. This is because we periodically reduce the learning rate, a parameter that controls how much our training algorithm alters the current function. This allows it to progressively refine details.

Now, even though our neural network should theoretically be able to learn any function, there are things we can do to practically improve the approximation and optimize the learning process. For instance, one thing I'm doing here is normalizing the row-column inputs, which means I'm moving the values from a range of 0 to 1400 to the range of negative one to one. I do this with a simple linear transformation that shifts and scales the values. The negative-one-to-one range is easier for the network to deal with because it's smaller and centered at zero. Another trick is that I'm not using a ReLU as my activation function but rather something called a leaky ReLU. A leaky ReLU can output negative values while still being non-linear, and has been shown to generally improve performance. So I'm using a leaky ReLU in all of my layers except for the last one. Because the final output is a pixel value, it needs to be between 0 and 1. To enforce this, in the final layer we can use a sigmoid activation function, which squishes its inputs between 0 and 1. Except there is a different squishing function, called tanh, that squishes its inputs between negative one and one; I can then normalize those outputs into the final range of 0 to 1. Why go through the trouble? Well, tanh just tends to work better than sigmoid. Intuitively, this is because tanh is centered at zero and plays much nicer with backpropagation, but ultimately the reasoning doesn't matter as much as the results. Both networks here are theoretically universal function approximators, but practically one works much better than the other. This can be measured empirically by calculating and comparing the error rates of both networks. I think of this as the science of math, where we must test our ideas and validate them with evidence rather than providing formal proofs. It'd be great if we could do both, but that is not always possible, and it is often much easier to just try and see what happens, and that's my kind of math.
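A small sketch of the output-side trick described above: squash the final layer's value with tanh, then rescale its (-1, 1) output into the (0, 1) pixel range. The function name is illustrative:

```python
import numpy as np

def pixel_output(z):
    """Squash the final-layer value with tanh, then map (-1, 1) to (0, 1)."""
    return (np.tanh(z) + 1.0) / 2.0
```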

Let's make it harder. Here we have a function that takes two inputs, u and v, and produces three outputs, x, y, and z. It's a parametric surface function, and we'll use the equation for a sphere. We can learn it the same way as before: take a random sample of points across the surface of the sphere and ask our network to approximate it. Now, this is clearly a very silly way to make a sphere, but the network is trying its best to sort of wrap the surface around the sphere to fit the data points. I hope this also gives you a better view of what a parametric surface is: it takes a flat 2D sheet and contorts it in 3D space according to some function. This does okay, though it never quite closes up around the poles. For a real challenge, let's try this beautiful spiral shell surface. I got the equation for this from this wonderful little website that lets you play with all kinds of shell surfaces. See what I mean when I say that functions describe the world? Anyway, let's sample some points across the spiral surface and start learning.

Well, it's working, but clearly we're having some trouble here. I'm using a fairly big neural network, but this is a complicated shape and it seems to be getting a little bit confused. We'll come back to this one.

We can also make the problem harder not by increasing dimensionality but by increasing the complexity of the function itself. Let's use the Mandelbrot set, an infinitely complex fractal. We can simply define a Mandelbrot function as taking two real-valued inputs and producing one output, the same dimensionality as the images we learned earlier. I have defined my Mandelbrot function to output a value between 0 and 1, where 1 is in the Mandelbrot set and anything less than 1 is not. Under the hood it's iteratively operating on complex numbers, and I added some stuff to output smooth values between 0 and 1, but I'm not going to explain it much more than that. After all, a neural network doesn't know the function definition either, and it shouldn't matter; it should be able to approximate it all the same. The data set here is randomized points drawn uniformly from this range. Now, this has actually been a pet project of mine for some time, and I've made several videos trying this exact experiment over the years. I hope you can see why it's interesting: despite being so low dimensional, the Mandelbrot function is infinitely complex (literally made with complex numbers) and is uniquely difficult to approximate. You can just keep fitting and fitting and fitting the function, and you will always come up short. Now, you could do this with any fractal; I just use the Mandelbrot set because it's so well known. So after training for a while we've made some progress, but clearly we're still missing an infinite amount of detail. I've gotten this to look better in the past, but I'm not going to waste any more time training this network. There are better ways of doing this.
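The video does not give its exact smoothing formula, so the following is only a common smooth-escape-time variant that produces the same kind of [0, 1] target values; the sampling window is a typical view of the set, not necessarily the one used in the video:

```python
import numpy as np

def mandelbrot_value(x, y, max_iter=200):
    """Return 1.0 for points in the Mandelbrot set and a smooth value in
    [0, 1) for points that escape (a common smooth-coloring variant)."""
    c = complex(x, y)
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            # fractional escape count, scaled into [0, 1)
            smooth = n + 1 - np.log(np.log(abs(z))) / np.log(2.0)
            return float(max(0.0, min(1.0, smooth / max_iter)))
    return 1.0

# data set: random points in a typical viewing window with their target values
pts = np.random.uniform([-2.5, -1.5], [1.0, 1.5], size=(10000, 2))
vals = np.array([mandelbrot_value(px, py) for px, py in pts])
```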

Are there different methods for approximating functions besides neural networks? Yes, many actually. There are always many ways to solve the same problem, though some ways are better than others. Another mathematical tool we can use is called the Taylor series. This is an infinite sum of a sequence of polynomial functions: x plus x squared plus x cubed plus x to the fourth, up to x to the n, where n is the order of the series. Each of these terms is multiplied by its own value called a coefficient, and each coefficient controls how much that individual term affects the overall function. Given some target function, by choosing the right coefficients we can approximate that target function around a specific point, in this case zero. The approximation gets better the more terms we add, and an infinite sum of terms is exactly equivalent to the target function. If we know the target function, we can actually derive the exact coefficients, using a general formula to calculate each coefficient for each term. But of course, in our particular problem we don't know the function; we only have a sample of data points. So how do we find the coefficients?

Well, do you see anything familiar in this weighted sum of terms? We can put all of the x-to-the-n terms into an inputs vector, put all of the coefficients into a weights vector, and then take the dot product: a weighted sum. The Taylor series is effectively a single-layer neural network, but one where we compute a bunch of additional inputs: x squared, x cubed, and so on. We'll call these additional inputs Taylor features. We can then learn the coefficients, or weights, with backpropagation. Of course, we can only compute a finite number of these (the partial Taylor series up to some order), but the higher the order, the better it should do. Let's use this simple Taylor network to learn this function, using eight orders of the Taylor series. Here's our data set, and here's the approximation.

That's not great. Polynomials are pretty touchy, as their values can explode very quickly, so I think backpropagation has a tough time finding the right coefficients. But we can do better: rather than using a single-layer network, let's just give these Taylor features to a full multi-layered network. Let's give it a shot. It's a bit wonky, but this performs much better. This trick of computing additional features to feed to the network is a well-known and commonly used one. Intuitively, it's like giving the network different kinds of mathematical building blocks to build a more diverse, complex function.
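A minimal sketch of the Taylor-feature idea: compute x, x squared, up to x to the order for each input column and hand them to the network as extra inputs. The order of 8 matches the experiment mentioned above; the learned coefficients or weights are left to the network:

```python
import numpy as np

def taylor_features(x, order=8):
    """Augment inputs with powers x, x^2, ..., x^order ('Taylor features').
    x: array of shape (n_samples, n_inputs)."""
    feats = [x ** n for n in range(1, order + 1)]
    return np.concatenate(feats, axis=1)
```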

Let's try this on an image data set. Well, that's pretty good, it's learning, but it doesn't seem to work any better than just using a good old-fashioned neural network. The Taylor series is made to approximate a function around a single given point, while we want to approximate within a given range of points. A better tool for this is the Fourier series.

The Fourier series acts very much like the Taylor series, but is an infinite sum of sines and cosines. Each order n of the series is made up of sine(nx) plus cosine(nx). Each sine and cosine is multiplied by its own coefficient, again controlling how much that term affects the overall function. The n values, those inner multipliers, control the frequency of each wave function: the higher the frequency, the more hills the curve has. By combining weighted waves of different frequencies, we can approximate a function within a range of 2 pi, one full period. Again, if we know the function we can compute the weights, and even if we don't, we could use something called the discrete Fourier transform, which is really cool, but we're not dealing with it in this video.

I hope you see where I'm going with this. Let's just jump ahead and do what we did before: compute a bunch of terms of the Fourier series and feed them to a multi-layer network as additional inputs, Fourier features. Note that we have twice as many Fourier features as Taylor features, since we have a sine and a cosine.
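A minimal sketch of computing these Fourier features for each input column; it assumes the inputs have already been normalized to [-pi, pi], as the video points out shortly after. Whether the raw inputs are also kept alongside the features is my assumption:

```python
import numpy as np

def fourier_features(x, order=8):
    """Append sin(n*x) and cos(n*x) for n = 1..order to the inputs,
    giving twice as many extra features as the Taylor version.
    x: array of shape (n_samples, n_inputs), assumed in [-pi, pi]."""
    feats = [x]
    for n in range(1, order + 1):
        feats.append(np.sin(n * x))
        feats.append(np.cos(n * x))
    return np.concatenate(feats, axis=1)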

Let's try it on this data set. This works pretty well; it's a little wavy, but not too shabby. Note that for this to work we need to normalize our inputs between negative pi and positive pi, one full period. Let's try this on an image. That looks strange at first, almost like static coming into focus, but it works, and it works really well. If we compare it to networks of the same size trained for the same amount of time, we can see the Fourier network learns much better and faster than the network without Fourier features, or the one with Taylor features. Just look at the level of detail in those curly locks; you can hardly tell the difference from the real image.

Now, I've glossed over a very important detail. The example Fourier series I gave had one input; this function has two inputs. To handle this properly we have to use the two-dimensional Fourier series, one that takes an input of x and y. What do we do with that extra y? Here are the terms for the 2D Fourier series up to two orders. We are now multiplying the x and y terms together and end up with sine x cosine y, sine x sine y, cosine x cosine y, and cosine x sine y: every combination of sine and cosine and y and x. Not only that, we also have every combination of frequencies, that inner multiplier: so sine 2x times cosine 1y, and so on and so forth. Here's up to three orders, now four. That is a lot of terms: we have to calculate this many terms per order, and this number grows very quickly as we increase the order, much faster than it would for the 1D series. And this is just for a baby 2D input; for a 3D, 4D, 5D input, forget it. The number of computations needed for higher-dimensional Fourier series explodes as we increase the dimensionality of our inputs. We have encountered the curse of dimensionality: lots of methods of function approximation and machine learning break down as dimensionality grows. These methods might work well on low-dimensional problems, but they become computationally impractical or impossible for higher-dimensional problems. Neural networks, by contrast, handle the dimensionality problem very well; comparatively, it is trivial to add additional dimensions.

But we don't need to use the 2D Fourier series. We can just treat each input as its own independent variable and compute 1D Fourier features for each input. This is less theoretically sound but much more practical to compute. It's still a lot of additional features, but it's manageable, and it's worth it: it drastically improves performance. That's what I've been using to get these image approximations.
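One way to make the term-count contrast concrete, under the assumption that the full multi-dimensional series takes every sine/cosine and frequency combination up to the given order; the video's exact counting isn't spelled out, so treat the left column as illustrative. The per-dimension count, however, is consistent with the 2-input, 256-order network mentioned later, which has 2 x 2 x 256 = 1024 extra features:

```python
def full_product_feature_count(n_inputs, order):
    """Features if we take every sine/cosine and every frequency combination
    across all inputs (one way to count the full multi-dimensional series;
    it grows exponentially with the number of inputs)."""
    return (2 * order) ** n_inputs

def per_dimension_feature_count(n_inputs, order):
    """Features if each input just gets its own independent 1D sines and
    cosines, as described above."""
    return n_inputs * 2 * order

for d in (1, 2, 3, 4):
    print(d, full_product_feature_count(d, 8), per_dimension_feature_count(d, 8))
# d=1:     16 vs 16
# d=2:    256 vs 32
# d=3:   4096 vs 48
# d=4:  65536 vs 64
```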

It really shouldn't be surprising that Fourier features help so much here, since the Fourier series and transform are used to compress images; it's how the JPEG compression algorithm works. Turns out lots of things can be represented as combinations of waves. So let's apply it to our Mandelbrot data set. Again, it looks a little weird, but it is definitely capturing more detail than the previous attempt.

Well, that's fun to watch, but let's evaluate. For comparison, here is the real Mandelbrot set. Actually, no: this is not the real Mandelbrot set. It is an approximation from our Fourier network. Now, you might be able to tell if you're on a 4K monitor, especially when I zoom in. This network was given 256 orders of the Fourier series, which means 1024 extra Fourier features being fed to the network, and the network itself is pretty damn big. When we really zoom in, it becomes very obvious that this is not the real deal; it is still missing an infinite amount of detail. Nonetheless, I am blown away by the quality of the Fourier network's approximation. Fourier features are of course not my idea; they come from this paper, which was suggested by a Reddit commenter who I think actually may have been a co-author. I'm still missing details from this. Adding Fourier features was one of, if not the, most effective improvements to the approximation I've applied, and it was really surprising. To return to the tricky spiral shell surface, we can see that our Fourier network does way better than our previous attempt, although the target function is literally defined with sines and cosines, so of course it will do well.

So if Fourier features help so much, why don't we use them more often? They hardly ever show up in real-world neural networks. To state the obvious, all of the approximations in this video so far are completely useless: we know the functions and the images; we don't need a massive neural network to approximate them. But I hope that you can see that we're not studying the functions, we're studying the methods of approximation. Because these toy problems are so low dimensional, we can visualize them and hopefully gain insights that will carry over into higher-dimensional problems. So let's bring it back to Earth with a real problem that uses real data.

This is the MNIST data set: images of hand-drawn numbers and their labels. Our input is an entire image flattened out into a vector, and our output is a vector of 10 values representing a label as to which number, 0 through 9, is in the image. There is some unknown function that describes the relationship between an image and its label, and that's what we're trying to discover. Even for tiny 28 by 28 black-and-white images, that is a 784-dimensional input. That is a lot, and this is still a very simple problem. For real-world problems we must address the curse of dimensionality: our method must be able to handle huge-dimensional inputs and outputs. We also can't visualize the entire approximation all at once as before; any idea what a 700-dimensional space looks like?

But a normal neural network can handle this problem just fine; it's pretty trivial. We can evaluate it by measuring the accuracy of its predictions on images from the data set that it did not see during training. We'll call this evaluation accuracy, and a small network does pretty well. What if we use Fourier features on this problem, say up to eight orders? Well, it does do a little better, but we're adding a lot of additional features: for only eight orders we're computing a total of 13,328 input features, which is a lot more than 784, and it's only two percent more accurate. When we use 32 orders of the Fourier series it actually seems to harm performance, and up at 64 orders it's downright ruinous.
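Assuming the per-dimension Fourier construction from earlier, the input count works out exactly as quoted for 8 orders, and presumably grows the same way for 32 and 64:

```python
# 28x28 MNIST images flattened: 784 raw pixel inputs.
# The per-dimension construction adds 2 * order sine/cosine features per pixel.
def total_inputs(n_pixels=784, order=8):
    return n_pixels + n_pixels * 2 * order

print(total_inputs(order=8))    # 13328, the figure quoted above
print(total_inputs(order=32))   # 50960
print(total_inputs(order=64))   # 101136
```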

This may be due to something called overfitting, where our approximation learns the data really well (too well) but fails to learn the underlying function. Usually this is a product of not having enough data, but our Fourier network seems to be especially prone to it. This seems consistent with the conclusions of the paper I mentioned earlier. Ultimately, our Fourier network seems to be very good for low-dimensional problems but not very good for high-dimensional problems. No single architecture, model, or method is the best fit for all tasks; indeed, there are all kinds of problems that require different approaches than the ones discussed here. Now, I'd be surprised if the Fourier series didn't have more to teach us about machine learning, but this is where I'll leave it.

I hope this video has helped you appreciate what function approximation is and why it's useful, and maybe sparked your imagination with some alternative perspectives. Neural networks are a kind of mathematical clay that can be molded into arbitrary shapes for arbitrary purposes.

I want to finish by opening up the Mandelbrot approximation problem as a fun challenge for anyone who's interested: how precisely and deeply can you approximate the Mandelbrot set given only a random sample of points? There are probably a million things that could be done to improve on my approximation, and the internet is much smarter than I am. The only rule is that your solution must still be a universal function approximator, meaning it could still learn any other data set of any dimensionality. Now, this is just for fun, but potentially solutions to this toy problem could have uses in the real world. There is no reason to think that we've found the best way of doing this, and there may be far better solutions waiting to be discovered.

Thanks for watching.


Related Tags

Neural Networks, Function Approximation, Machine Learning, Universal Approximators, Data Science, Mathematics, AI Learning, Mandelbrot Set, Fourier Features, Model Optimization