ESRGAN Paper Walkthrough
Summary
TLDR: This video explores ESRGAN, an enhanced version of SRGAN for single image super-resolution. It addresses SRGAN's artifact issues, primarily linked to batch normalization, by refining the network architecture, the adversarial loss, and the perceptual loss. Key updates include the RRDB (Residual in Residual Dense Block) without batch normalization, a relativistic GAN loss that predicts relative realness, and a modified VGG perceptual loss applied before the ReLU activation. The video also critiques discrepancies between the paper and its source code, suggesting improvements for clarity.
Takeaways
- 📜 The video discusses ESRGAN (Enhanced Super-Resolution Generative Adversarial Networks), an improvement over the SRGAN model for single image super-resolution.
- 🔍 ESRGAN addresses issues like unpleasant artifacts in SRGAN, which were linked to batch normalization.
- 🏗️ The architecture of ESRGAN includes a Residual in Residual Dense Block (RRDB) without batch normalization.
- 🛠️ A key contribution is the use of a relativistic GAN loss function, which predicts relative realness instead of an absolute value.
- 🎨 The perceptual loss function was modified to operate before the ReLU activation instead of after, enhancing texture quality.
- 📈 Network interpolation is used to reduce noise and improve perceptual quality in the generated images.
- 📊 The training process involves two stages: first, training with an L1 loss for PSNR, and then incorporating the GAN loss and perceptual loss.
- 🔢 The paper suggests that smaller initialization weights and a beta residual scaling parameter can improve the training stability and output quality.
- 🔧 The video script points out discrepancies between the paper's descriptions and the actual source code, indicating potential areas of confusion.
- 🌐 The training data for ESRGAN includes the DIV2K and Flickr2K datasets, with data augmentation techniques like horizontal flipping and random rotations applied.
Q & A
What does ESRGAN stand for?
-ESRGAN stands for Enhanced Super Resolution Generative Adversarial Networks.
What is the primary issue addressed by ESRGAN over SRGAN?
-ESRGAN addresses the issue of unpleasant artifacts in SRGAN, which were associated with the use of batch normalization.
What are the three key components of SRGAN that ESRGAN studies and improves?
-The three key components are the network architecture, the adversarial loss, and the perceptual loss.
What is the RRDB block used in ESRGAN's architecture?
-The RRDB block stands for Residual in Residual Dense Block. In ESRGAN's network architecture, all batch normalization layers are removed and the original basic block is replaced with the RRDB block.
How does the relativistic GAN in ESRGAN differ from the standard GAN?
-In ESRGAN, the relativistic GAN allows the discriminator to predict relative realness instead of an absolute value, which is a change from the standard GAN approach.
What is the difference between the perceptual loss used in SRGAN and ESRGAN?
-In ESRGAN, the perceptual loss is applied before the ReLU activation (before the non-linearity), whereas in SRGAN, it is applied after the ReLU activation.
What is the role of the beta residual scaling parameter in ESRGAN?
-The beta residual scaling parameter is used in the residual connections of the RRDB block, where the output is scaled by beta (0.2 in their setting) before being added to the original input, aiming to correct improper initialization and avoid magnifying input signal magnitudes.
Why does ESRGAN use network interpolation during training?
-Network interpolation is used in ESRGAN to remove unpleasant noise while maintaining good perceptual quality, achieved by interpolating between a model trained on L1 loss and a model trained with GAN and perceptual loss.
How does ESRGAN handle the issue of artifacts during training?
-ESRGAN handles artifacts by removing batch normalization and using network interpolation, which is claimed to produce results without introducing artifacts.
What are the training details mentioned for ESRGAN that differ from SRGAN?
-ESRGAN uses a larger patch size of 128x128 (SRGAN used 96x96), trains on the DIV2K and Flickr2K datasets, employs horizontal flips and 90-degree random rotations for data augmentation, and divides the training process into two stages similar to SRGAN.
What is the significance of the smaller initialization mentioned in the ESRGAN paper?
-Smaller initialization is used in ESRGAN, where the original initialization weights are multiplied by a scale of 0.1, which is found to work well in their experiments.
Outlines
😲 Introduction to ESRGAN
This paragraph introduces Enhanced Super-Resolution Generative Adversarial Networks (ESRGAN), an advancement over the previous SRGAN model. The speaker recommends watching a previous video on SRGAN before this one. ESRGAN aims to improve upon SRGAN by addressing issues like unpleasant artifacts associated with batch normalization and by enhancing image quality. The focus is on three key components: network architecture, adversarial loss, and perceptual loss. The architecture uses a Residual in Residual Dense Block (RRDB) without batch normalization, the loss function is changed to a relativistic GAN loss, and the perceptual loss is modified to be applied before the ReLU activation instead of after.
🔍 Deep Dive into ESRGAN Architecture
The speaker delves into the architecture of ESRGAN, highlighting the use of RRDB blocks, which replace the basic blocks used in SRGAN. Each RRDB block contains three dense blocks, significantly increasing the network's size and complexity compared to SRGAN. The paragraph discusses the use of skip connections and channel concatenation, inspired by DenseNets, to enhance feature propagation. The speaker also mentions a beta residual scaling parameter used in the residual connections, which prioritizes the original input over the processed output. There's a critique of the paper for not clearly detailing some architectural changes that are evident in the source code, such as kernel sizes and padding, which differ from those used in SRGAN.
🌟 Unique Features of ESRGAN
This section discusses unique features of ESRGAN, including the use of a relativistic discriminator, which predicts relative realness instead of an absolute value. The speaker expresses uncertainty about the importance of this feature and suggests that other methods like WGAN-GP might yield similar results. The paragraph also covers the loss function used in ESRGAN, which includes a perceptual loss applied before ReLU activation, an L1 loss, and a relativistic GAN loss. The speaker points out inconsistencies in the paper regarding the constants used in the loss function and the need for clarity in the presentation of these details.
🔧 Network Interpolation and Training Details
The speaker talks about the network interpolation technique used in ESRGAN to reduce noise and artifacts. This method involves training the generator on an L1 loss and then interpolating with a GAN-trained model to achieve a balance between noise reduction and perceptual quality. The paragraph also covers the training process, which involves downsampling high-resolution images using MATLAB's bicubic kernel function, a choice that the speaker finds questionable. Details about the training datasets, patch sizes, and the two-stage training process are provided, with an emphasis on the benefits of a larger patch size for capturing more semantic information.
📚 Appendix and Final Thoughts
In the final paragraph, the speaker reviews the appendix of the ESRGAN paper, which discusses the artifacts associated with batch normalization and the use of residual learning with a scaling factor to correct initialization issues. The speaker also mentions that smaller initialization weights were found to work well in experiments. The paragraph wraps up with a teaser for the next video, where the speaker plans to implement ESRGAN and provide a more hands-on exploration of its features and performance. There's a call for viewer engagement, inviting thoughts and questions on the presented material.
Keywords
💡ESRGAN
💡Super-Resolution
💡Generative Adversarial Networks (GANs)
💡Residual in Residual Dense Block (RRDB)
💡Adversarial Loss
💡Perceptual Loss
💡Batch Normalization
💡Pixel Shuffle
💡Relativistic GAN
💡Network Interpolation
💡L1 Loss
Highlights
ESRGAN stands for Enhanced Super Resolution Generative Adversarial Networks, building on the previous work of SRGAN.
SRGAN was capable of generating realistic textures during single image super-resolution but had issues with unpleasant artifacts.
The artifacts were associated with batch normalization, prompting improvements in ESRGAN.
ESRGAN focuses on three key components: network architecture, adversarial loss, and perceptual loss.
The architecture uses a Residual in Residual Dense Block (RRDB) without batch normalization.
A relativistic GAN loss is introduced, allowing the discriminator to predict relative realness.
The VGG perceptual loss is modified to be applied before ReLU activation instead of after.
ESRGAN demonstrates better texture generation compared to SRGAN in example images.
The basic architecture of SRResNet is maintained but with significant changes in ESRGAN.
All batch normalization layers are removed, and the original basic block is replaced with the proposed RRDB block.
Each RRDB block contains three dense blocks, significantly increasing the network size compared to SRGAN.
Skip connections are used between different paths in the network, inspired by DenseNets.
A beta residual scaling parameter is introduced to balance the contribution of different paths.
The training process is divided into two stages: PSNR with L1 loss and then GAN training with additional loss terms.
Network interpolation is used to remove unpleasant noise while maintaining perceptual quality.
The paper suggests that smaller initialization weights work well in experiments.
The training uses the DIV2K and Flickr2K datasets, emphasizing the importance of rich textures for natural results.
The appendix discusses the removal of batch normalization to address artifacts and the use of residual learning.
Transcripts
In this video we will be taking a look at ESRGAN, which builds on the previous implementation and walkthrough of SRGAN that I recommend you check out before this one. In this video we'll go through the paper, and then in the next video we'll implement it in PyTorch.

ESRGAN stands for Enhanced Super-Resolution Generative Adversarial Networks. Essentially, it builds on the work of SRGAN, which was able to generate realistic textures during single image super-resolution, but there was a problem: there were some unpleasant artifacts, which they found to be associated with batch norm. They also just did things that made the quality better. So they study three key components of SRGAN: the network architecture, the adversarial loss, and the perceptual loss, and they made improvements to all three.
That gives them this enhanced SRGAN. For the architecture they use a Residual in Residual Dense Block, RRDB as they call it, and we'll see the details of it, but it doesn't use batch normalization. For the loss they instead use something called a relativistic GAN, which lets the discriminator predict relative realness instead of an absolute value. They also do a very minor thing, which is to change the VGG perceptual loss to be computed before activation, so before the ReLU instead of after, and I guess that works better.

I'm going to go through the important details of this, but here's one example where they show that ESRGAN has better texture than SRGAN, and the results do look a bit better. I'm going to skip the introductory parts because we just want to see what they did.
For the proposed method, they still use the basic architecture of SRResNet that was used in SRGAN, where most of the computation is done in the low-resolution feature space: we do a bunch of computation on the low-resolution image and only upsample at the end. Now, it's actually not entirely true that they employ the basic architecture of SRResNet, because they change what I would say are major things in the implementation, and I'll go through those later on. The paper doesn't mention some key details that they changed, but it's visible in the source code, which didn't quite feel right to me; they should definitely have mentioned those in the paper. Anyway, we'll get to those.
For the network architecture, they remove all the batch norm layers and replace the original basic block with the proposed RRDB block. The way it works is that in the original basic block we had conv, batch norm, ReLU, conv, batch norm; they remove the batch norm, and then each of those residual blocks becomes one of these RRDB blocks, where one block contains three dense blocks and each dense block contains all of this stuff. So it's going to be a much, much larger network, almost insanely much larger than SRGAN, because this is one residual block and they use 23 of them. You can imagine 23 of these, where you run the input through a dense block and then use a residual connection back to the main path, the one in the center here. That's one RRDB block, and one dense block is conv-ReLU, conv-ReLU, conv-ReLU, conv-ReLU, and then a conv.
What's kind of the key here, I guess, is that they also have skip connections between all the different paths. In the beginning there's a skip connection to after the first pair of conv-ReLU, then to the second, the third, and so on, so it's sort of an all-to-all pattern in the forward path. And, I guess inspired by DenseNets, instead of doing a skip connection with an element-wise sum, they do a concatenation: all of these connections concatenate along the channel dimension. That will become clearer when we actually implement it in the next video, but that's the major change to the residual block.

You can imagine this is a lot bigger: SRGAN had 16 of these residual blocks, each with, I guess, two conv layers, and now we have 23 blocks, each with three dense blocks of five convs, so 15 convs each. That's 23 times 15 versus 16 times 2, which is a big difference in the number of conv layers they actually use.
Okay, one other interesting detail, which they mention in the appendix, concerns these residual connections. They are element-wise, standard skip connections, but they do an interesting thing, which is that they use a beta residual scaling parameter. They don't just do x plus the residual; they take x, the one going down the main path, plus the residual, the one that has gone through the dense block, multiplied by a parameter beta, which is equal to 0.2 in their setting. That's interesting, because it means we're prioritizing the input from before the dense block: we take one times that, and only 0.2 times the part that has gone through the dense block. They mention, sort of intuitively, that this scaling on the dense-block path corrects the initialization. That's just an interesting part of it.
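To make the block structure concrete, here is a minimal PyTorch sketch of a dense block and an RRDB with that beta scaling. The channel sizes (64 feature channels, 32 growth channels) follow the official code, but the class names and exact layout are my own simplification, not the authors' implementation:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """One dense block: five 3x3 convs, where each conv sees the concatenation
    of the block input and every previous conv output (DenseNet-style)."""
    def __init__(self, channels=64, growth=32, beta=0.2):
        super().__init__()
        self.beta = beta
        self.convs = nn.ModuleList(
            nn.Conv2d(channels + i * growth,
                      growth if i < 4 else channels,  # last conv maps back to `channels`
                      kernel_size=3, stride=1, padding=1)
            for i in range(5)
        )
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        features = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(features, dim=1))
            if i < 4:  # no activation after the final conv
                out = self.lrelu(out)
            features.append(out)
        # beta residual scaling: the dense path is weighted by 0.2, the input by 1
        return x + self.beta * features[-1]

class RRDB(nn.Module):
    """Residual in Residual Dense Block: three dense blocks plus an outer
    skip connection with the same beta scaling."""
    def __init__(self, channels=64, beta=0.2):
        super().__init__()
        self.beta = beta
        self.dense_blocks = nn.Sequential(*(DenseBlock(channels) for _ in range(3)))

    def forward(self, x):
        return x + self.beta * self.dense_blocks(x)
```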
They mention here that when the statistics of the training and testing datasets differ a lot, batch norm layers tend to introduce unpleasant artifacts, which is what I observed as well when training SRGAN: there were some random artifacts that just appeared during training, particularly when there was some odd-looking image, for example one with a black background, that could introduce these artifacts as well.
One thing I didn't want to miss is that this is what they state to be the change in the architecture, but they actually did some other things as well, which they weren't clear on, that could impact the performance a bit, I would say. So I'm going to show you the code and we can see some of the differences.

Here is the source code for ESRGAN that accompanies the paper, and this is just the generator architecture, because the full code is kind of massive to go through. Looking at the entire network, RRDBNet, which is the generator, we can see that they use a kernel size of three, a stride of one, and a padding of one in the first conv, which they didn't do in SRGAN, where they used a kernel size of nine in the beginning with a padding of four; and for the last conv they also use a kernel size of three here. Another big thing is that they used pixel shuffle in SRGAN, but in ESRGAN they use F.interpolate, doing a nearest-neighbor upsampling, which is also quite different. Those are definitely things they should have mentioned in the paper, in my opinion.
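As a quick sketch of that difference, this is roughly what the ESRGAN-style upsampling block looks like: nearest-neighbor interpolation followed by a conv, instead of SRGAN's PixelShuffle. The exact channel count and activation here are assumptions on my part:

```python
import torch.nn as nn
import torch.nn.functional as F

class UpsampleBlock(nn.Module):
    """Nearest-neighbor upsampling followed by a conv (ESRGAN style).
    SRGAN instead used a conv followed by nn.PixelShuffle(2).
    Two of these blocks give the overall 4x upscaling."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="nearest")  # the F.interpolate call
        return self.lrelu(self.conv(x))
```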
So that is one key part, and the other is the relativistic discriminator. I haven't seen much about this in other papers, so this is actually the first time I've seen it, and I'm not really sure it's that important. I think using WGAN-GP or something would probably give similar results, and in fact their implementation did also have a WGAN-GP variant that they tried; I think they mentioned that it took longer to train but didn't give a significant improvement, though it didn't seem to do anything worse either.

The idea, and I'm going to skip through this a little bit because I'm not too familiar with it, is as follows. In a standard GAN we just take the sigmoid of the discriminator's output, and similarly for the fake ones, so the real one should be one and the fake one should be zero. For the relativistic GAN, they instead take the sigmoid of the real output minus the expected value of the discriminator's output on the fake images: we run the fake images through the discriminator, take the torch.mean of that across the batch we currently have, and subtract it from the discriminator's output on the real one. I don't want to go into more detail, and I don't feel this is super important, but it's another key part that they used. As they write, the standard discriminator in SRGAN can be expressed as D(x) = sigmoid(C(x)), where sigma is the sigmoid function and C(x) is the non-transformed discriminator output.
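As a rough sketch, the relativistic average formulation could be implemented along these lines in PyTorch. I'm writing this from the formulas above, so treat the exact averaging and label choices as assumptions rather than the authors' exact training code:

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(real_logits, fake_logits):
    """Discriminator loss: real images should look *more* real than the
    average fake, and fakes *less* real than the average real."""
    real_rel = real_logits - fake_logits.mean()  # C(x_r) - E[C(x_f)]
    fake_rel = fake_logits - real_logits.mean()  # C(x_f) - E[C(x_r)]
    loss_real = F.binary_cross_entropy_with_logits(real_rel, torch.ones_like(real_rel))
    loss_fake = F.binary_cross_entropy_with_logits(fake_rel, torch.zeros_like(fake_rel))
    return (loss_real + loss_fake) / 2

def relativistic_g_loss(real_logits, fake_logits):
    """Generator loss: the same comparisons with the targets flipped."""
    real_rel = real_logits - fake_logits.mean()
    fake_rel = fake_logits - real_logits.mean()
    loss_real = F.binary_cross_entropy_with_logits(real_rel, torch.zeros_like(real_rel))
    loss_fake = F.binary_cross_entropy_with_logits(fake_rel, torch.ones_like(fake_rel))
    return (loss_real + loss_fake) / 2
```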
Then they mention the loss function that they use, and then the perceptual loss, which is kind of funny, because the only difference here is that they compute it before the activation, before the ReLU. They write that they develop a more effective perceptual loss by constraining features before activation rather than after activation as practiced in SRGAN. What this means concretely is that in the implementation of SRGAN we took vgg.features up to index 36; the thing that's different now is that we need to change that to 35. That is the difference in the perceptual loss.
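In code, that index change might look like the following sketch. I'm using torchvision's VGG19 and an L1 distance on the features, which matches the general recipe but may differ in details from the official implementation:

```python
import torch.nn as nn
from torchvision.models import vgg19

class VGGLoss(nn.Module):
    """Perceptual loss on VGG19 features *before* the last ReLU:
    features[:35] stops at the conv5_4 layer, whereas features[:36]
    (the SRGAN choice) would include the ReLU after it."""
    def __init__(self):
        super().__init__()
        self.vgg = vgg19(pretrained=True).features[:35].eval()
        for p in self.vgg.parameters():
            p.requires_grad = False  # the loss network stays frozen
        self.criterion = nn.L1Loss()

    def forward(self, sr, hr):
        return self.criterion(self.vgg(sr), self.vgg(hr))
```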
Then there's the total loss. One big thing as well is that they include the L1 loss during training. We discussed for SRGAN that it wasn't really clear whether they used that, because it seemed like they replaced the L2 loss with a VGG feature (perceptual) loss. Here they do two things: they first train with this L1 loss, and then they keep it when they add the actual perceptual loss. So now they have three loss terms: one for the perceptual loss, where the output is run through VGG; one for the relativistic GAN, which is multiplied by a 5e-3 constant; and the other term for the L1, which is multiplied by 1e-2. They mention those things later, the L1 loss and these constants; I feel like they didn't state the constants here but only later on, and it would have been clearer if they had just given the constants where they actually introduce these loss terms.
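Putting the three terms together with the constants the paper gives later, the generator objective is roughly the following. Here `disc`, `VGGLoss`, and `relativistic_g_loss` refer to the sketches above, so this is a composition of those assumptions rather than the authors' exact code:

```python
import torch.nn as nn

l1 = nn.L1Loss()
vgg_loss = VGGLoss()  # from the sketch above

def generator_loss(disc, fake_hr, real_hr):
    """Total generator loss: perceptual + 5e-3 * relativistic GAN + 1e-2 * L1."""
    adv = relativistic_g_loss(disc(real_hr), disc(fake_hr))
    return vgg_loss(fake_hr, real_hr) + 5e-3 * adv + 1e-2 * l1(fake_hr, real_hr)
```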
If we move along, they also use another trick, which is network interpolation. I have some comments about this, but they mention that, to remove unpleasant noise in GAN-based methods while maintaining good perceptual quality, they propose a flexible and effective strategy: network interpolation. What they do is take the generator trained for PSNR, meaning trained only on an L1 loss (or an L2 loss, but I think they used L1), and the GAN-trained generator, where the discriminator and the additional perceptual loss terms are introduced, and then they interpolate between the two: some constant times the GAN weights plus one minus that constant times the weights of the model trained only on L1. In that way they found they could remove unpleasant noise.
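The interpolation itself is just a per-parameter weighted average of the two checkpoints, something like this sketch (the function name and the example alpha of 0.8 are mine; the paper treats alpha as a free parameter):

```python
import torch

def interpolate_networks(psnr_path, gan_path, alpha=0.8):
    """Blend the L1/PSNR-trained generator weights with the GAN-trained ones:
    (1 - alpha) * psnr + alpha * gan, for each parameter tensor."""
    psnr_state = torch.load(psnr_path)
    gan_state = torch.load(gan_path)
    return {k: (1 - alpha) * psnr_state[k] + alpha * gan_state[k] for k in psnr_state}

# usage sketch:
# generator.load_state_dict(interpolate_networks("psnr.pth", "esrgan.pth", alpha=0.8))
```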
Now, I'm not really sure what they mean by unpleasant noise; hopefully it doesn't mean artifacts, because artifacts were the reason we removed batch norm. Let's see... "the interpolated model is able to produce meaningful results for any feasible alpha without introducing artifacts." Okay, so I kind of missed that, but "without introducing artifacts" is, I think, what they mean, which is unfortunate, because then you question the point of removing the batch norms if you still have artifacts that you now have to solve with network interpolation. It kind of felt like they said "we removed batch norm, which solved the artifacts," and then you get to this part and they say they introduced an additional network interpolation because it removes artifacts; the question is, didn't you already remove those? So honestly, I have some doubts about this paper, because the code and the paper don't always match, and in my opinion there are some things that could definitely be improved to make it clearer. Let me know if you have any thoughts.
Then, in the training details, they mention that they obtain the low-resolution images by downsampling high-resolution images using the MATLAB bicubic kernel function. This doesn't make sense to me either: they used PyTorch, you can downsample in PyTorch, and there definitely exist libraries for that, so why do it in MATLAB? They also mention in their GitHub source code that you might not get the same results as they do if you train from scratch without using the MATLAB bicubic kernel function. At the very least there should be some comment added as to why they did that and why it's so important. What is the difference between MATLAB's bicubic kernel and torchvision's?
They also mention that the mini-batch size is set to 16, the same as SRGAN, and that they use a larger patch size of 128 (they actually went even higher than that, to 192 as well), whereas SRGAN used 96 by 96. They observed that training a deeper network benefits from a larger patch size, since an enlarged receptive field helps capture more semantic information, and I guess that makes sense. Then they mention that the training process is divided into two stages, similarly to SRGAN: first they train for PSNR with the L1 loss, and then they give some details on the learning rate and mini-batch updates, which is nice.
For the generator, they employ the PSNR-trained model as initialization, just as SRGAN did, and then it's trained with this new loss function; here they introduce the constant terms that we looked at before. They use a 1e-4 learning rate and halve it after every 50k update steps. They also mention that they use Adam with the same beta1 and beta2. They train one model with 16 residual blocks, which is what SRGAN used, and one with 23 blocks, which is the one they mainly use. But it also doesn't feel correct to compare the residual blocks of SRGAN and ESRGAN like that, since, as we saw, the difference in the number of conv layers is massive; they pretty much completely changed the architecture.
For training they use the DIV2K dataset and also the Flickr2K dataset. They mention that they empirically find that using this large dataset with richer textures helps the generator produce more natural results. They also mention that for data augmentation they use horizontal flips and 90-degree random rotations.
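A paired-augmentation sketch of those two operations, applied identically to the LR and HR patches so they stay aligned, might look like this (the helper name is mine):

```python
import random
import torchvision.transforms.functional as TF

def augment(lr, hr):
    """Random horizontal flip plus a random 90-degree rotation,
    applied to the LR/HR pair together so they stay aligned."""
    if random.random() < 0.5:
        lr, hr = TF.hflip(lr), TF.hflip(hr)
    k = random.randint(0, 3)  # rotate by 0, 90, 180, or 270 degrees
    if k:
        lr, hr = TF.rotate(lr, 90 * k), TF.rotate(hr, 90 * k)
    return lr, hr
```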
All right, let's go down and see if there's anything else. Right, I think I just want to go to the appendix now to look at some details there. Here they talk about the batch norm artifacts, and this is kind of what they looked like, appearing randomly during training sometimes for SRGAN.
Let's see... right, I wanted to mention the residual learning, where they multiply the part that has gone through the block by the 0.2 constant and keep the original with a constant of one. They mention here that it scales down the residuals by multiplying constants between zero and one: "in our settings, for each residual block, the residual features after the last convolutional layer are multiplied by 0.2. Intuitively, the residual scaling can be interpreted as correcting the improper initialization, thus avoiding magnifying the magnitudes of input signals in residual networks." I'm not really sure how much I buy into this actually mattering; I wonder if you could just do a normal residual without multiplying by 0.2. But that is, I guess, one detail of what they did.
All right, in the next video I'll try to implement this one, and we'll see exactly what it looks like and the details of its implementation. Hopefully this leaves you with a solid understanding of ESRGAN: the update to the network, the relativistic GAN, and the other things.

One thing I actually missed, I thought I had covered that part, is that they also say they found that a smaller initialization worked well in their experiments: in the source code they multiply the original initialization weights by a scale of 0.1. That is just one more thing to keep in mind.
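That smaller initialization could be sketched like this: initialize the conv weights and then scale them down by 0.1. The 0.1 scale matches what's described; the Kaiming choice and the helper itself are my own sketch, not a quote of their code:

```python
import torch.nn as nn

def small_init(module, scale=0.1):
    """Initialize conv weights, then scale them down by 0.1."""
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight)
        module.weight.data *= scale
        if module.bias is not None:
            module.bias.data.zero_()

# usage sketch: generator.apply(small_init)
```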
I'll go through that in the implementation as well. All right, thank you so much for watching, and I hope to see you next time.