ESRGAN Paper Walkthrough

Aladdin Persson
23 Sept 2022 · 22:39

Summary

TLDR: This video explores ESRGAN, an enhanced version of SRGAN for single image super-resolution. It addresses SRGAN's artifact issues, primarily linked to batch normalization, by refining the network architecture, the adversarial loss, and the perceptual loss. Key updates include the Residual in Residual Dense Block (RRDB) without batch normalization, a relativistic GAN loss that predicts relative realness, and a modified VGG perceptual loss applied before the ReLU activation. The video also critiques discrepancies between the paper and its source code, suggesting improvements for clarity.

Takeaways

  • 📜 The video discusses ESRGAN (Enhanced Super-Resolution Generative Adversarial Networks), an improvement over the SRGAN model for single image super-resolution.
  • 🔍 ESRGAN addresses issues like unpleasant artifacts in SRGAN, which were linked to batch normalization.
  • 🏗️ The architecture of ESRGAN includes a Residual in Residual Dense Block (RRDB) without batch normalization.
  • 🛠️ A key contribution is the use of a relativistic GAN loss function, which predicts relative realness instead of an absolute value.
  • 🎨 The perceptual loss function was modified to operate before the ReLU activation instead of after, enhancing texture quality.
  • 📈 Network interpolation is used to reduce noise and improve perceptual quality in the generated images.
  • 📊 The training process involves two stages: first, training with an L1 loss for PSNR, and then incorporating the GAN loss and perceptual loss.
  • 🔢 The paper suggests that smaller initialization weights and a beta residual scaling parameter can improve the training stability and output quality.
  • 🔧 The video script points out discrepancies between the paper's descriptions and the actual source code, indicating potential areas of confusion.
  • 🌐 The training data for ESRGAN includes the DIV2K and Flickr2K datasets, with data augmentation techniques like horizontal flipping and random rotations applied.

Q & A

  • What does ESRGAN stand for?

    -ESRGAN stands for Enhanced Super Resolution Generative Adversarial Networks.

  • What is the primary issue addressed by ESRGAN over SRGAN?

    -ESRGAN addresses the issue of unpleasant artifacts in SRGAN, which were associated with the use of batch normalization.

  • What are the three key components of SRGAN that ESRGAN studies and improves?

    -The three key components are the network architecture, the adversarial loss, and the perceptual loss.

  • What is the RRDB block used in ESRGAN's architecture?

    -The RRDB block stands for Residual in Residual Dense Block. In ESRGAN's network architecture, the batch normalization layers are removed and the original basic block is replaced with the RRDB block.

  • How does the relativistic GAN in ESRGAN differ from the standard GAN?

    -In ESRGAN, the relativistic GAN allows the discriminator to predict relative realness instead of an absolute value, which is a change from the standard GAN approach.

  • What is the difference between the perceptual loss used in SRGAN and ESRGAN?

    -In ESRGAN, the perceptual loss is applied before the ReLU activation (before the non-linearity), whereas in SRGAN, it is applied after the ReLU activation.

  • What is the role of the beta residual scaling parameter in ESRGAN?

    -The beta residual scaling parameter is used in the residual connections of the RRDB block, where the output is scaled by beta (0.2 in their setting) before being added to the original input, aiming to correct improper initialization and avoid magnifying input signal magnitudes.

  • Why does ESRGAN use network interpolation during training?

    -Network interpolation is used in ESRGAN to remove unpleasant noise while maintaining good perceptual quality, achieved by interpolating between a model trained on L1 loss and a model trained with GAN and perceptual loss.

  • How does ESRGAN handle the issue of artifacts during training?

    -ESRGAN handles artifacts by removing batch normalization and using network interpolation, which is claimed to produce results without introducing artifacts.

  • What are the training details mentioned for ESRGAN that differ from SRGAN?

    -ESRGAN uses a larger patch size of 128x128 (versus 96x96 in SRGAN), trains on the DIV2K and Flickr2K datasets, employs horizontal flips and 90-degree random rotations for data augmentation, and divides the training process into two stages, similar to SRGAN.

  • What is the significance of the smaller initialization mentioned in the ESRGAN paper?

    -Smaller initialization is used in ESRGAN: in the source code the original initialization weights are multiplied by a scale of 0.1, which the authors found to work well in their experiments (see the sketch below).
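
As a rough illustration of that initialization trick, here is a minimal PyTorch sketch, assuming a Kaiming-style init on the convolution layers; the helper name and the exact init scheme are mine, and only the 0.1 scale comes from the discussion above.

```python
import torch.nn as nn

def scaled_kaiming_init(module, scale=0.1):
    # Hypothetical helper: initialize conv weights as usual, then shrink them.
    # The video reports that ESRGAN's source code scales the initial weights by 0.1.
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, a=0, mode="fan_in")
        module.weight.data *= scale
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# usage sketch: generator.apply(scaled_kaiming_init)
```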

Outlines

00:00

😲 Introduction to ESRGAN

This paragraph introduces Enhanced Super-Resolution Generative Adversarial Networks (ESRGAN), an advancement over the earlier SRGAN model. The speaker recommends watching the previous video on SRGAN before this one. ESRGAN aims to improve upon SRGAN by addressing issues like the unpleasant artifacts associated with batch normalization and by enhancing image quality. The focus is on three key components: the network architecture, the adversarial loss, and the perceptual loss. The architecture uses a Residual in Residual Dense Block (RRDB) without batch normalization, the adversarial loss is changed to a relativistic GAN loss, and the perceptual loss is modified to be applied before the ReLU activation instead of after.

05:00

🔍 Deep Dive into ESRGAN Architecture

The speaker delves into the architecture of ESRGAN, highlighting the RRDB blocks that replace the basic residual blocks used in SRGAN. Each RRDB block contains three dense blocks, making the network substantially larger and more complex than SRGAN. The paragraph discusses the use of skip connections and channel concatenation, inspired by DenseNets, to enhance feature propagation. The speaker also mentions a beta residual scaling parameter used in the residual connections, which prioritizes the original input over the processed output. There is a critique of the paper for not clearly detailing some architectural changes that are evident in the source code, such as kernel sizes and padding that differ from those used in SRGAN.
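
To make the dense-block structure concrete, here is a minimal PyTorch sketch of one dense block with DenseNet-style channel concatenation and the 0.2 residual scaling described above. The class name, the 64/32 channel counts, and the LeakyReLU slope are assumptions for illustration, not the paper's exact code.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """One dense block: five 3x3 convs, each seeing the concatenation of all
    earlier feature maps (DenseNet-style), with a scaled residual at the end.
    Channel counts and the 0.2 scale follow the values discussed above; the
    exact names and defaults are illustrative."""

    def __init__(self, channels=64, growth=32, beta=0.2):
        super().__init__()
        self.beta = beta
        self.convs = nn.ModuleList()
        for i in range(5):
            out_ch = channels if i == 4 else growth  # last conv maps back to `channels`
            self.convs.append(nn.Conv2d(channels + i * growth, out_ch, 3, 1, 1))
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        features = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(features, dim=1))
            if i < 4:  # no activation after the final conv
                out = self.lrelu(out)
            features.append(out)
        # residual scaling: keep the identity path at full strength, damp the dense path
        return x + self.beta * features[-1]
```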

10:02

🌟 Unique Features of ESRGAN

This section discusses distinctive features of ESRGAN, including the relativistic discriminator, which predicts relative realness rather than an absolute value. The speaker expresses uncertainty about how important this feature is and suggests that other methods such as WGAN-GP might yield similar results. The paragraph also covers the loss function used in ESRGAN, which combines a perceptual loss applied before the ReLU activation, an L1 loss, and a relativistic GAN loss. The speaker points out inconsistencies in how the paper presents the constants used in the loss function and calls for clearer presentation of these details.
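
For reference, the three generator loss terms mentioned above might be combined as follows; the 5e-3 and 1e-2 weights are the constants discussed in the video, while the function and variable names are placeholders.

```python
# Hedged sketch: l_percep, l_adv, l_pixel are assumed to be precomputed scalar
# tensors (VGG feature loss, relativistic GAN loss, L1 pixel loss).
LAMBDA_ADV = 5e-3  # weight on the relativistic adversarial term
ETA_L1 = 1e-2      # weight on the L1 (pixel) term

def total_generator_loss(l_percep, l_adv, l_pixel):
    # total objective for the generator during the GAN stage of training
    return l_percep + LAMBDA_ADV * l_adv + ETA_L1 * l_pixel
```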

15:04

🔧 Network Interpolation and Training Details

The speaker talks about the network interpolation technique used in ESRGAN to reduce noise and artifacts. This method involves training the generator on an L1 loss and then interpolating with a GAN-trained model to achieve a balance between noise reduction and perceptual quality. The paragraph also covers the training process, which involves downsampling high-resolution images using MATLAB's bicubic kernel function, a choice that the speaker finds questionable. Details about the training datasets, patch sizes, and the two-stage training process are provided, with an emphasis on the benefits of a larger patch size for capturing more semantic information.
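
Below is a rough stand-in for that data pipeline using torchvision: horizontal flips, 90-degree rotations, and bicubic downsampling. The paper relies on MATLAB's bicubic kernel, so this sketch will not reproduce its low-resolution images exactly; function and variable names are mine.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def make_lr_hr_pair(hr_patch, scale=4):
    # Augmentation: random horizontal flip and a random multiple of 90 degrees.
    # Assumes hr_patch is a square PIL image crop (e.g. 128x128).
    if random.random() < 0.5:
        hr_patch = TF.hflip(hr_patch)
    hr_patch = TF.rotate(hr_patch, angle=90 * random.randint(0, 3))

    # Bicubic downsampling stand-in for the MATLAB imresize used in the paper.
    w, h = hr_patch.size
    lr_patch = TF.resize(hr_patch, [h // scale, w // scale],
                         interpolation=InterpolationMode.BICUBIC)
    return lr_patch, hr_patch
```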

20:07

📚 Appendix and Final Thoughts

In the final paragraph, the speaker reviews the appendix of the ESRGAN paper, which discusses the artifacts associated with batch normalization and the use of residual learning with a scaling factor to correct initialization issues. The speaker also mentions that smaller initialization weights were found to work well in experiments. The paragraph wraps up with a teaser for the next video, where the speaker plans to implement ESRGAN and provide a more hands-on exploration of its features and performance. There's a call for viewer engagement, inviting thoughts and questions on the presented material.

Keywords

💡ESRGAN

ESRGAN stands for Enhanced Super-Resolution Generative Adversarial Networks. It is an improvement over the original SRGAN (Super-Resolution Generative Adversarial Networks), which was capable of generating realistic textures during single image super-resolution. ESRGAN addresses some of the issues found in SRGAN, such as unpleasant artifacts, and enhances the quality of the generated images. The video script discusses how ESRGAN builds upon SRGAN by studying and improving three key components: network architecture, adversarial loss, and perceptual loss.

💡Super-Resolution

Super-Resolution is the process of increasing the resolution of an image or video. In the context of the video, it refers to the technique used by ESRGAN to upscale low-resolution images to high-resolution ones while maintaining or improving the quality of the image. The script mentions that ESRGAN is designed to generate more realistic textures and reduce artifacts compared to previous methods.

💡Generative Adversarial Networks (GANs)

GANs are a class of machine learning models consisting of two neural networks, a generator and a discriminator, that are trained together. In the video, ESRGAN utilizes GANs to create high-resolution images. The generator network produces images, while the discriminator network evaluates them, providing feedback that helps the generator to improve.
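
As background for how the two networks interact, here is a minimal, generic adversarial training step in PyTorch (standard binary cross-entropy rather than ESRGAN's relativistic variant); all names are placeholders.

```python
import torch
import torch.nn.functional as F

def gan_step(gen, disc, g_opt, d_opt, lr_imgs, hr_imgs):
    """Minimal alternating GAN update, just to illustrate the
    generator/discriminator interplay; names and structure are mine."""
    # Discriminator: real HR images should score high, generated SR images low.
    sr_imgs = gen(lr_imgs)
    real_logits, fake_logits = disc(hr_imgs), disc(sr_imgs.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: try to make the discriminator score SR images as real.
    fake_logits = disc(sr_imgs)
    g_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```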

💡Residual in Residual Dense Block (RRDB)

RRDB is a network architecture used in ESRGAN, which is an enhancement over the basic ResNet architecture used in SRGAN. The script explains that RRDB does not use batch normalization and is composed of multiple dense blocks, each containing several convolutional layers. This architecture allows for a much larger network, which contributes to the improved performance of ESRGAN.
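
A minimal sketch of how the residual-in-residual wrapper might look, assuming a DenseBlock module like the one sketched in the architecture outline above; the constructor signature and the placement of the 0.2 scale follow the video's description rather than the official code verbatim.

```python
import torch.nn as nn

class RRDB(nn.Module):
    # Residual in Residual Dense Block: three dense blocks in sequence, with the
    # combined output scaled by beta before being added back to the block input.
    def __init__(self, dense_block_cls, channels=64, beta=0.2):
        super().__init__()
        self.beta = beta
        self.blocks = nn.Sequential(*[dense_block_cls(channels) for _ in range(3)])

    def forward(self, x):
        return x + self.beta * self.blocks(x)

# The generator trunk stacks 23 of these, e.g.:
# trunk = nn.Sequential(*[RRDB(DenseBlock) for _ in range(23)])
```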

💡Adversarial Loss

Adversarial loss is a type of loss function used in GANs to train the generator and discriminator networks. In the video, the script discusses how ESRGAN uses a modified version of adversarial loss called relativistic GAN loss, which allows the discriminator to predict relative realness instead of an absolute value, improving the training process.

💡Perceptual Loss

Perceptual loss is a loss function that measures the difference between features extracted from the generated image and the target image. The script mentions that ESRGAN changes the perceptual loss from using features after the ReLU activation (as in SRGAN) to before the activation, which is claimed to work better.
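
Concretely, with torchvision's VGG19 this amounts to truncating the feature extractor one layer earlier: slicing features[:35] stops after conv5_4 but before its ReLU, whereas SRGAN-style implementations typically slice [:36], after the ReLU. A hedged sketch, with the feature distance (L1 here) chosen for illustration:

```python
import torch.nn as nn
from torchvision.models import vgg19

class VGGPerceptualLoss(nn.Module):
    """Feature-space loss sketch. Inputs are assumed to be already normalized
    for VGG. Some implementations use MSE instead of L1 for the feature distance."""

    def __init__(self, before_activation=True):
        super().__init__()
        cutoff = 35 if before_activation else 36  # 35: before conv5_4's ReLU; 36: after it
        self.vgg = vgg19(pretrained=True).features[:cutoff].eval()
        for p in self.vgg.parameters():
            p.requires_grad = False
        self.criterion = nn.L1Loss()

    def forward(self, sr, hr):
        return self.criterion(self.vgg(sr), self.vgg(hr))
```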

💡Batch Normalization

Batch normalization is a technique used in neural networks to stabilize and speed up the training by normalizing the input to a layer. The script points out that ESRGAN removes batch normalization from the network architecture, as it was found to introduce artifacts in the generated images.

💡Pixel Shuffle

Pixel shuffle is an upsampling technique used in image processing to increase the resolution of an image. The script contrasts SRGAN's use of pixel shuffle with ESRGAN's use of F.interpolate, which performs nearest-neighbor upsampling.
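
The two upsampling styles look roughly like this in PyTorch; the channel counts and the conv placed after the interpolation are illustrative, not the paper's exact layer configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 24, 24)  # low-resolution feature map

# SRGAN-style: learn 4x the channels, then rearrange them into 2x spatial size.
pixel_shuffle_up = nn.Sequential(nn.Conv2d(64, 64 * 4, 3, 1, 1), nn.PixelShuffle(2))
y1 = pixel_shuffle_up(x)  # -> (1, 64, 48, 48)

# ESRGAN-style (per its source code): nearest-neighbour interpolation, then a conv.
conv = nn.Conv2d(64, 64, 3, 1, 1)
y2 = conv(F.interpolate(x, scale_factor=2, mode="nearest"))  # -> (1, 64, 48, 48)
```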

💡Relativistic GAN

Relativistic GAN is a concept introduced in the script where the discriminator's output is not an absolute measure of realness but a relative one. This approach is said to improve the training stability and the quality of the generated images in ESRGAN.
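
A minimal sketch of the relativistic average discriminator loss, assuming real_logits and fake_logits are raw (pre-sigmoid) discriminator outputs for a batch; the fake batch should be detached when updating the discriminator, and the generator side mirrors this with the targets swapped.

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(real_logits, fake_logits):
    # Each real logit is judged relative to the mean fake logit, and vice versa,
    # instead of being classified as real or fake in absolute terms.
    real_rel = real_logits - fake_logits.mean()
    fake_rel = fake_logits - real_logits.mean()
    loss_real = F.binary_cross_entropy_with_logits(real_rel, torch.ones_like(real_rel))
    loss_fake = F.binary_cross_entropy_with_logits(fake_rel, torch.zeros_like(fake_rel))
    return loss_real + loss_fake
```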

💡Network Interpolation

Network interpolation is a technique mentioned in the script where models trained with different objectives are combined to produce a final model. In ESRGAN, it is used to remove noise from the generated images by interpolating between a model trained with a perceptual loss and one trained with an L1 loss.
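
In practice this can be done directly on the two checkpoints' parameters; a minimal sketch, where alpha is the interpolation weight on the GAN-trained model and the file names are hypothetical:

```python
import torch

def interpolate_networks(psnr_state_dict, gan_state_dict, alpha=0.8):
    # Blend the PSNR-oriented generator (L1-only training) with the GAN-trained
    # generator, parameter by parameter.
    return {
        key: (1 - alpha) * psnr_state_dict[key] + alpha * gan_state_dict[key]
        for key in psnr_state_dict
    }

# usage sketch:
# blended = interpolate_networks(torch.load("psnr_gen.pth"), torch.load("gan_gen.pth"))
# generator.load_state_dict(blended)
```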

💡L1 Loss

L1 loss, also known as the absolute error loss, is a loss function that measures the absolute difference between the predicted and actual values. The script discusses how ESRGAN includes an L1 loss term in its training process to help reduce noise and improve image quality.

Highlights

ESRGAN stands for Enhanced Super Resolution Generative Adversarial Networks, building on the previous work of SRGAN.

SRGAN was capable of generating realistic textures during single image super-resolution but had issues with unpleasant artifacts.

The artifacts were associated with batch normalization, prompting improvements in ESRGAN.

ESRGAN focuses on three key components: network architecture, adversarial loss, and perceptual loss.

The architecture uses a Residual in Residual Dense Block (RRDB) without batch normalization.

A relativistic GAN loss is introduced, allowing the discriminator to predict relative realness.

The VGG perceptual loss is modified to be applied before ReLU activation instead of after.

ESRGAN demonstrates better texture generation compared to SRGAN in example images.

The basic architecture of SRResNet is maintained but with significant changes in ESRGAN.

All batch normalization layers are removed, and the original basic block is replaced with the proposed RRDB block.

Each RRDB block contains three dense blocks, significantly increasing the network size compared to SRGAN.

Skip connections are used between different paths in the network, inspired by DenseNets.

A beta residual scaling parameter is introduced to balance the contribution of different paths.

The training process is divided into two stages: PSNR with L1 loss and then GAN training with additional loss terms.

Network interpolation is used to remove unpleasant noise while maintaining perceptual quality.

The paper suggests that smaller initialization weights work well in experiments.

The training uses the DIV2K and Flickr2K datasets, emphasizing the importance of rich textures for natural results.

The appendix discusses the removal of batch normalization to address artifacts and the use of residual learning.

Transcripts

In this video we will be taking a look at ESRGAN, which builds on the previous implementation and the previous walkthrough of SRGAN that I recommend you check out before taking a look at this one. In this video we'll be looking at the paper, and then in the next video we'll implement it in PyTorch. ESRGAN stands for Enhanced Super-Resolution Generative Adversarial Networks, and essentially it builds on the work of SRGAN, which was able to generate realistic textures during single image super-resolution. But there was a problem, which is that there were some unpleasant artifacts, and that was associated with what they found to be batch norm. They also just did things that made the quality better. So they study three key components of SRGAN: the network architecture, the adversarial loss, and the perceptual loss. They made improvements to all three, and then they have this enhanced SRGAN. For the architecture they use the Residual in Residual Dense Block, RRDB as they call it, and we'll see the details of it, but it doesn't use batch normalization. For the loss they use something called a relativistic GAN, which lets the discriminator predict relative realness instead of an absolute value. And then they also do a very minor thing, which is changing the VGG perceptual loss to be before activation, so before the ReLU instead of after, and I guess that works better. I'm going to go through the important details of this, but here is perhaps one example where they've shown that ESRGAN has better texture than SRGAN, and here the results do look to be a bit better.

I'm going to skip these introductory parts because we want to see what they actually did. The proposed method is that they still use the basic architecture of SRResNet that they used in SRGAN, where most of the computation is done in the low-resolution feature space: we do a bunch of computation on the low-resolution image and then upsample at the end. Now, this is actually not entirely true. They mention that they employ the basic architecture of SRResNet, but they change some major things, I would say, in the implementation, and I'll go through that later on. The paper doesn't mention some key details that they changed, but they are visible in the source code, which didn't feel right to me; they should definitely have mentioned those in the paper. Anyway, we'll get to those. So, the network architecture: they remove all the batch norm layers, and they also replace the original basic block with the proposed RRDB block. How it works is that in the beginning we had conv, batch norm, ReLU, conv, batch norm; they remove the batch norms, and then one of those residual blocks is now one of these RRDB blocks, where one block contains three dense blocks, and one dense block contains all of this stuff. So it's going to be a much, much larger network, pretty much insanely much larger than SRGAN, because this is one residual block and they use 23 of them. You can imagine 23 of these, where you run the input through a dense block and then use a residual connection back to the main path, which is the one in the center here. That's what they do for one RRDB block, and one dense block is conv-ReLU, conv-ReLU, conv-ReLU, conv-ReLU, and then a conv. What's key here, I guess, is that they also do skip connections between all of the different paths: in the beginning they do a skip connection to after the first pair of conv-ReLU, then the second, the third, and so on, so it's an all-to-all pattern in the forward path. And, I guess inspired by DenseNets, instead of doing a skip connection where they element-wise sum, they do a concatenation, so all of these are concatenations of the channels. That will become clearer when we actually implement it in the next video, but that's the major change to the residual block. You can imagine this is a lot bigger, because SRGAN had 16 residual blocks with, I guess, two conv layers each; now we have 23 of these with three times five, so 15, conv layers each. So 23 times 15 versus 16 times 2: that's a big difference in the number of conv layers they actually use.

One other detail, which is kind of interesting and which they mention in the appendix, is that these residual connections are element-wise, standard skip connections, but they do an interesting thing: they use a beta residual scaling parameter. They don't just do x plus the residual; they take the one that goes through the dense block and multiply it by a parameter beta equal to 0.2, while x continues along the main path. That's interesting, because it means we are prioritizing the input from before the dense block, since we take one times that amount and only 0.2 times the amount that has gone through the dense block. They intuitively mention that this corrects the initialization, I guess, but that's just an interesting part of it. They also mention that when the statistics of training and testing datasets differ a lot, batch norm layers tend to introduce unpleasant artifacts, which is what I observed as well when training SRGAN: there were some random artifacts that just appeared during training, particularly when you had an odd-looking image, for example one with a black background, which could introduce these artifacts as well.

One thing I didn't want to miss is that this is what they mention as the change in the architecture, but they actually did some other things as well that they weren't clear on, which could impact the performance a bit, I would say. So I'm going to show you the code and we can see some of the differences there. Here is the source code for ESRGAN that accompanies the paper, and this is just the generator architecture, because the full code is kind of massive to go through. Just looking at the generator, the RRDBNet, we can see that they use a kernel size of three, stride of one, padding of one, which they didn't do in SRGAN; there they used a kernel size of nine in the beginning with padding of four, and for the last conv they also use a kernel size of three here. Another big thing is that they used pixel shuffle in SRGAN, but in ESRGAN they use F.interpolate, so they're doing nearest-neighbor upsampling, which is also quite different. Those are definitely things they should have mentioned in the paper, in my opinion.

So that is one key part, and the other is the relativistic discriminator. I haven't seen much about this in other papers, so this is kind of the first time I've seen it, and I'm not really sure if it's that important. I think using WGAN-GP or something similar would probably give similar results, and in fact their implementation did also have a WGAN-GP variant which they tried; I think they mentioned that it took longer but didn't give a significant improvement, though it didn't seem to do anything worse either. The idea, and I'm going to skip over this a little bit because I'm not too familiar with it, is that in the standard GAN we just take the sigmoid of the output from the discriminator, and similarly for the fake ones, so the real one should be one and the fake one should be zero. For the relativistic GAN they instead take the sigmoid of the output but subtract the expected value over the fake images: we run the fake images through the discriminator, take the torch.mean of that across the current batch, and subtract it from the discriminator's output on the real image. I don't want to go into more detail, and I don't feel this is super important, but that's one key part they used. The standard discriminator in SRGAN can be expressed as D(x) = sigmoid(C(x)), where sigma is the sigmoid function and C(x) is the non-transformed discriminator output, and then they describe the loss function they use here.

Then there is the perceptual loss, which is kind of funny, because the only difference is that they use it before the ReLU activation: "we develop a more effective perceptual loss by constraining on features before activation rather than after activation as practiced in SRGAN." What this means concretely is that in the implementation of SRGAN we took vgg.features and sliced up to 36; the thing that's different now is that we need to change that to 35. That is the difference in the perceptual loss. Then there is the total loss. One big thing as well is that they include the L1 loss during training, which, as we discussed for SRGAN, wasn't really clear there, because it seemed like they replaced the L2 loss with a VGG feature perceptual loss. Here they do two things: they first introduce this L1 loss for pretraining, and then they keep it during training when they add the actual perceptual loss. So now they have three loss terms: one for the perceptual loss, which is computed after running through VGG; one for the relativistic GAN, which is multiplied by a 5e-3 constant; and then the L1 term, which is multiplied by 1e-2. They do mention these things later, but I feel they didn't state the constants here; it would have been clearer if they had given the constants where they actually introduce these loss terms.

Moving along, they also use another trick, which is network interpolation. I have some comments about this, but they state that "to remove unpleasant noise in GAN-based methods while maintaining good perceptual quality, we propose a flexible and effective strategy: network interpolation." What they do is take the generator trained for PSNR, meaning they only train the generator on an L1 loss (or an L2 loss; I think they trained it on L1), and then they train the GAN, where they introduce the discriminator and the additional perceptual loss terms, and then they interpolate those two: they take some constant times the GAN weights plus one minus that constant times the weights of the model that was only trained on L1. In that way they found they could remove unpleasant noise. I'm not really sure what they mean by unpleasant noise; hopefully that doesn't mean artifacts, because that was the reason we removed batch norm. Let's see: "the interpolated model is able to produce results for any feasible alpha without introducing artifacts." Okay, so I kind of missed that, but "without introducing artifacts" is, I think, what they mean, and that is unfortunate, because then you question the point of removing the batch norms if you still have artifacts that you now have to solve with network interpolation. It felt like they said "we removed batch norm, which solved the artifacts," but then you come to this part and they say "we introduce an additional network interpolation because it removes artifacts," and then the question is: I thought you removed those? So, honestly, I have some doubts about this paper, because the code and the paper don't always match, and in my opinion there are things that could definitely be improved to make it clearer. Let me know if you have any thoughts.

Then, for the training details, they mention that "we obtain low-resolution images by downsampling high-resolution images using the MATLAB bicubic kernel function," and this doesn't make sense to me either. They used PyTorch; you can downsample in PyTorch, and there definitely exist libraries for that, so why do it in MATLAB? They also mention in their GitHub source code that you might not get the same results as they do if you train from scratch without using the MATLAB bicubic kernel function. At the very least there should be some comment as to why they did that and why it's so important: what is the difference between MATLAB's and torchvision's bicubic? They also mention that the mini-batch size is set to 16, same as SRGAN, and that they use a larger patch size of 128 (they actually went even higher than that as well), whereas SRGAN used 96 by 96. They state that "we observe that training a deeper network benefits from a larger patch size, since an enlarged receptive field helps to capture more semantic information," and I guess that makes sense. Then they mention that the training process is divided into two stages, similarly to SRGAN: first train the PSNR-oriented model with L1 loss, and they give some details about the learning rate and mini-batch updates, which is nice. The generator then uses the PSNR-oriented model as initialization, just as SRGAN did, and is trained with the new loss function, and here they introduce the constant terms we looked at before. They use a 1e-4 learning rate and then halve it after every 50k update steps, and they mention that they use Adam with the same beta1 and beta2. They use one model with 16 residual blocks, which is what SRGAN had, and one with 23 blocks, which is the one they mainly use. But it also doesn't feel correct to compare the residual blocks of SRGAN and ESRGAN, since, as we saw, the difference in the number of conv layers is massive; they kind of completely changed the architecture. For training they use the DIV2K dataset and also the Flickr2K dataset; they mention that they empirically find that using this larger dataset with richer textures helps the generator produce more natural results. They also mention that they use horizontal flips and 90-degree random rotations.

Let's go down and see if there is anything else; I think I just want to go to the appendix now to look at some details. Here they talk about the batch norm artifacts, and this is kind of what it looked like randomly during training sometimes for SRGAN. I also wanted to mention the residual learning, where they basically multiply the path that has gone through the block by the 0.2 constant and keep the original input with a constant of one. They mention that it scales down the residuals by multiplying constants between zero and one: "in our settings, for each residual block, the residual features after the last convolutional layer are multiplied by 0.2. Intuitively, the residual scaling can be interpreted as correcting the improper initialization, thus avoiding magnifying the magnitudes of input signals in residual networks." I'm not really sure how much I buy into this actually mattering; I wonder if you could just do a normal residual without multiplying by 0.2. But that is, I guess, one detail of what they did.

In the next video I'll try to implement this one, and we'll see exactly how it looks and the details of its implementation, but hopefully this leaves you with a solid understanding of ESRGAN: the update of the network, the relativistic GAN, and the other changes. One thing I actually missed is that they also mention that they found a smaller initialization worked well in their experiments, so in the source code they multiply the original initialization weights by a scale of 0.1. That's just one more thing to keep in mind, and I'll go through that in the implementation as well. All right, thank you so much for watching, and hope to see you next time.
