Michał Kudelski (TCL): Inpainting using Deep Learning: from theory to practice

ML in PL
20 Mar 2019 · 32:31

Summary

TL;DR: The speaker from TCL Research Europe introduces their AI project focused on image and video inpainting using deep learning, a technique to reconstruct lost or deteriorated parts of visual media. Applications include restoring old photos, scene editing, and even uncensoring animations. The talk covers the use of partial convolutions, challenges in training, and practical issues like batch normalization and high-resolution inpainting. The presentation concludes with sample results and an invitation to learn more about TCL's innovative projects.

Takeaways

  • 📍 The speaker is from TCL Research Europe, a new R&D center focusing on AI methods, particularly in computer vision for smart devices like TVs and smartphones.
  • 🎨 'Inpainting' is the process of reconstructing lost or deteriorated parts of images or videos, which is the main topic of the presentation.
  • 🤖 Deep learning, specifically partial convolutions, is the approach used in the speaker's project for image inpainting, which is more advanced than traditional methods.
  • 🔍 The project's practical applications include restoring old photos, automatic scene editing, and even uncensoring images, demonstrating the versatility of inpainting.
  • 🛠️ Training data for inpainting models can be obtained from existing databases or by generating random masks to simulate missing parts of images.
  • 🌟 The architecture of the inpainting model is based on an encoder-decoder structure, with partial convolutions accounting for missing data in the input image.
  • 🔧 The model's loss function is a combination of several elements, including pixel-wise loss, perceptual loss, style loss, and total variation loss, each contributing to the quality of the inpainted output.
  • 🚀 Challenges in inpainting include issues with batch normalization due to varying mask sizes and the increased computational demand of high-resolution images.
  • 🔍 Solutions to these challenges include training with diversified masks, using instance normalization, or removing normalization layers altogether.
  • 🔄 The speaker also discusses the potential of using adversarial losses and a loss function called ID-MRF (implicit diversified Markov random field) to improve the realism and diversity of inpainted images.
  • 📈 TCL Research Europe is actively working on advancing inpainting technology, with a focus on practical applications and overcoming technical hurdles for real-world use.

Q & A

  • What is TCL Research Europe and what is its primary focus?

    -TCL Research Europe is a new R&D center established by TCL in Warsaw. It primarily focuses on AI methods, specifically in the area of computer vision, as TCL is a major manufacturer of Smart TVs and smartphones.

  • What is the concept of 'inpainting' in the context of the presented project?

    -Inpainting refers to the process of reconstructing lost or deteriorated parts of images or videos. It involves using an input image with a mask indicating the missing parts, and then reconstructing those parts based on the surrounding context.

  • Why is the topic of inpainting considered interesting and important?

    -Inpainting is considered interesting due to its applications in various fields such as restoring old photos and videos, automatic scene editing, retouching, denoising, and even entertainment like uncensoring Japanese animations. It was also a topic at the prestigious NIPS conference, indicating its significance in the AI community.

  • What is the role of deep learning in the inpainting project presented?

    -Deep learning is used to build an inpainting model that can effectively reconstruct missing image parts. It is based on a recent paper introducing partial convolutions, which is a technique that takes into account the masks indicating missing areas during the convolution process.

  • What are partial convolutions and how do they differ from traditional convolutions?

    -Partial convolutions are a modification of traditional convolutions that account for missing data by multiplying the input patch with a mask before performing the convolution. This means that during the convolution, only the pixels outside of the mask are considered, and the mask is updated after each layer to reflect the reconstructed pixels.
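
A minimal PyTorch sketch of this idea (assuming the convention from the partial-convolutions paper, where the mask is 1 for valid pixels and 0 for holes; layer sizes and normalization details are simplified, not the paper's reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Sketch of a partial convolution: convolve only valid pixels,
    re-normalize by the fraction of valid pixels per window, and
    update the mask so reconstructed pixels count as valid later."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        # Fixed all-ones kernel, used only to count valid pixels per window.
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding
        self.window = kernel_size * kernel_size

    def forward(self, x, mask):
        out = self.conv(x * mask)                      # holes contribute zeros
        with torch.no_grad():
            valid = F.conv2d(mask[:, :1], self.ones,
                             stride=self.stride, padding=self.padding)
        scale = self.window / valid.clamp(min=1.0)     # mask-size normalization
        bias = self.conv.bias.view(1, -1, 1, 1)
        out = (out - bias) * scale + bias
        new_mask = (valid > 0).float().expand(-1, out.size(1), -1, -1)
        return out * new_mask, new_mask                # mask shrinks layer by layer

layer = PartialConv2d(3, 16)
img = torch.rand(1, 3, 64, 64)
mask = torch.ones(1, 3, 64, 64)
mask[:, :, 20:40, 20:40] = 0                           # a square hole
feat, updated_mask = layer(img, mask)
```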

  • What are some practical issues encountered during the inpainting project?

    -Some practical issues include problems with batch normalization due to varying mask sizes, difficulties with high-resolution inpainting due to increased computational cost, and challenges with reconstructing detailed textures at higher resolutions.

  • How can batch normalization issues be addressed in the inpainting model?

    -Batch normalization issues can be addressed by using techniques such as freeze training, where batch normalization layers are frozen after initial training, allowing the model to adapt to different mask sizes during fine-tuning. Other methods include using instance normalization or removing batch normalization layers altogether.
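
A minimal PyTorch illustration of the freezing step (an assumption-level sketch, not the exact procedure from the paper, which freezes the normalization in the encoder only):

```python
import torch.nn as nn

def freeze_batchnorm(model: nn.Module):
    """Freeze every BatchNorm layer: keep using the stored running
    statistics and stop training the affine (gamma/beta) parameters."""
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            module.eval()                      # running stats no longer updated
            for p in module.parameters():
                p.requires_grad = False        # gamma and beta frozen

# Phase 1: ordinary training. Phase 2: freeze_batchnorm(model), then
# fine-tune; call it again after every model.train(), which would
# otherwise switch the BatchNorm layers back into training mode.
```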

  • What are some approaches to handle high-resolution inpainting challenges?

    -To handle high-resolution inpainting challenges, one can reduce model size, optimize the model for inference, use quantization techniques, or leverage specialized hardware like DSP processors. Additionally, increasing the receptive fields of the model or using architectures with different receptive field sizes can help improve results.
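
As one concrete example of moving to lower precision, a half-precision inference sketch (illustrative only: the tiny network below is a placeholder for a trained inpainting model, and fp16 is assumed to run on a CUDA device):

```python
import torch
import torch.nn as nn

# Placeholder standing in for a trained inpainting network.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1))

image = torch.rand(1, 3, 1080, 1920)           # a Full HD frame
if torch.cuda.is_available():
    # fp16 roughly halves memory traffic and often speeds up inference.
    model = model.cuda().half()
    image = image.cuda().half()

model.eval()
with torch.no_grad():                          # inference only
    out = model(image)
```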

  • What is the significance of mask generation in the inpainting process?

    -Mask generation is crucial as it defines the areas of the image that need to be inpainted. Specialized masks can be generated using techniques like semantic segmentation or object detection to focus on specific elements like faces or objects, which can be useful for automatic scene editing or fine-tuning the model for specific applications.
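
Before such specialized masks, the simplest option mentioned in the talk is random masks for training; one possible free-form generator (a hedged sketch, not the generator from the paper; OpenCV's line drawing is just a convenient way to rasterize strokes):

```python
import numpy as np
import cv2  # OpenCV, used here only to draw random strokes

def random_stroke_mask(height, width, max_strokes=8, seed=None):
    """Random free-form mask: 1 = valid pixel, 0 = pixel to inpaint.
    Thick random polylines give masks of varied shape, size and position."""
    rng = np.random.default_rng(seed)
    mask = np.ones((height, width), dtype=np.float32)
    for _ in range(int(rng.integers(1, max_strokes + 1))):
        x, y = int(rng.integers(0, width)), int(rng.integers(0, height))
        for _ in range(int(rng.integers(1, 6))):       # a few joined segments
            nx = int(np.clip(x + rng.integers(-80, 81), 0, width - 1))
            ny = int(np.clip(y + rng.integers(-80, 81), 0, height - 1))
            cv2.line(mask, (x, y), (nx, ny), 0.0, int(rng.integers(5, 30)))
            x, y = nx, ny
    return mask

mask = random_stroke_mask(256, 256, seed=0)    # diversified irregular holes
```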

  • Can you provide an example of how the inpainting model can be applied to facial images?

    -The inpainting model can be trained on facial images to reconstruct missing parts of faces realistically. It can also be used for facial retouching, such as smoothing out wrinkles or removing imperfections, resulting in a retouched and more aesthetically pleasing facial image.

Outlines

00:00

📚 Introduction to AI Research and Inpainting at TCL

The speaker introduces TCL Research Europe, a new R&D center in Warsaw focusing on AI, specifically computer vision for applications like Smart TVs and smartphones. The main topic, 'inpainting,' is presented as a process to reconstruct missing or deteriorated parts of images or videos using AI methods. The speaker outlines the talk's structure, which includes explaining inpainting, showcasing a deep learning approach, discussing practical issues, and presenting results.

05:02

🎨 Deep Learning Approach to Image Inpainting

This paragraph delves into the specifics of using deep learning for inpainting. The speaker discusses the advantages of deep learning over traditional methods, particularly in handling complex tasks like reconstructing faces or objects. The architecture of the model is described, emphasizing the use of partial convolutions that take into account the masks indicating missing areas. The process of training the model, including the importance of mask diversity and the structure of the encoder-decoder model, is explained.
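
A toy version of such an encoder-decoder (ordinary convolutions are used here for brevity; in the model described they would be the partial convolutions sketched earlier, with more levels and filters):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyInpaintUNet(nn.Module):
    """Toy 2-level encoder-decoder in the spirit of the talk: strided
    convolutions downscale, nearest-neighbor upsampling + convolution
    upscales, with skip connections from encoder to decoder."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 32, 3, stride=2, padding=1)    # 1/2 res
        self.enc2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)   # 1/4 res
        self.dec2 = nn.Conv2d(64 + 32, 32, 3, padding=1)
        self.dec1 = nn.Conv2d(32 + 3, 3, 3, padding=1)

    def forward(self, x):
        e1 = F.relu(self.enc1(x))
        e2 = F.relu(self.enc2(e1))
        d2 = F.interpolate(e2, scale_factor=2, mode="nearest")
        d2 = F.relu(self.dec2(torch.cat([d2, e1], dim=1)))
        d1 = F.interpolate(d2, scale_factor=2, mode="nearest")
        # The last layer sees both processed features and the raw input,
        # so original pixels outside the mask can be passed through.
        return torch.sigmoid(self.dec1(torch.cat([d1, x], dim=1)))

out = TinyInpaintUNet()(torch.rand(1, 3, 64, 64))   # -> (1, 3, 64, 64)
```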

10:02

🔍 Advanced Architectures and Loss Functions in Inpainting

The speaker presents various advanced architectures for inpainting, including those from Adobe and recent NIPS conferences. Different approaches like multi-column convolutional neural networks and the use of specialized convolutions to increase receptive fields are discussed. The paragraph also covers the composition of the loss function, which includes pixel-wise loss, perceptual loss, style loss, and total variation loss, highlighting their roles in optimizing the model's performance.

15:03

🛠️ Practical Challenges in Inpainting

The speaker addresses practical issues encountered during the project, such as problems with batch normalization due to varying mask sizes and the challenges of training at different resolutions. Solutions like freeze training, using diversified masks, and considering alternative normalization techniques are suggested. The discussion also touches on the removal of batch normalization layers to avoid artifacts and color coherence issues.

20:04

🖼️ High-Resolution Inpainting and Its Challenges

This paragraph focuses on the challenges of high-resolution inpainting, including increased computational demands and memory consumption. Strategies to address these issues, such as model optimization, quantization, and leveraging mobile device hardware like DSP units, are presented. The speaker also discusses the 'big mask problem,' where high-resolution images require reconstructing many more pixels, and suggests increasing receptive fields and using multi-stream models as potential solutions.

25:06

🔧 Enhancing Inpainting with Advanced Techniques

The speaker discusses methods to improve inpainting results, particularly at higher resolutions. Techniques such as training on high-resolution images, using adversarial loss to avoid artifacts, and combining inpainting with super-resolution are suggested. The importance of detailed textures in high-resolution inpainting is highlighted, and the potential for post-processing to blend original and reconstructed patches for realism is explored.
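
A rough sketch of the "inpaint at lower resolution, then upscale" pipeline (a hypothetical helper: bicubic upsampling stands in for a real super-resolution model, and inpaint_model for a trained network):

```python
import torch
import torch.nn.functional as F

def inpaint_highres(image, mask, inpaint_model, work=512):
    """image: (1, 3, H, W) in [0, 1]; mask: (1, 1, H, W), 1 = keep, 0 = hole.
    Inpaint at a cheap square working resolution, upscale the result
    (bicubic here; a real pipeline would use a super-resolution model),
    and composite so original pixels are kept outside the hole."""
    h, w = image.shape[-2:]
    small_img = F.interpolate(image, size=(work, work), mode="bilinear",
                              align_corners=False)
    small_mask = F.interpolate(mask, size=(work, work), mode="nearest")
    with torch.no_grad():
        small_out = inpaint_model(small_img * small_mask, small_mask)
    out = F.interpolate(small_out, size=(h, w), mode="bicubic",
                        align_corners=False)
    return image * mask + out * (1 - mask)
```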

30:07

🎭 Applications and Future of Inpainting Technology

The final paragraph showcases sample results of the inpainting model, demonstrating its effectiveness in removing unwanted objects and reconstructing faces realistically. The speaker emphasizes the potential applications of inpainting in automatic scene editing and face retouching. The paragraph concludes with a summary of the importance of inpainting, the journey from research to practical application, and an invitation for interested individuals to engage with TCL Research Europe.

Keywords

💡Inpainting

Inpainting is a technique used in image processing to reconstruct missing or deteriorated parts of images or videos. In the context of the video, it is an important topic as it can be applied to restore old photos, remove unwanted objects from images, and even for denoising and compression. The script mentions various applications of inpainting, such as restoring old photos by masking defects and reconstructing the masked areas based on the surrounding image content.

💡Deep Learning

Deep Learning is a subset of machine learning that uses artificial neural networks with multiple layers to model and understand complex patterns. In the video, deep learning is the primary method discussed for solving the inpainting problem. The script describes how deep learning captures high-level semantics of images, allowing for the realistic reconstruction of missing content, unlike traditional methods which may struggle with complex tasks.

💡Computer Vision

Computer vision is an interdisciplinary field that focuses on enabling computers to interpret and understand visual information from the world. The script mentions that TCL Research Europe, where the speaker is from, focuses on AI methods, particularly in the area of computer vision, which is the foundation for projects like inpainting that deal with image and video analysis.

💡Partial Convolution

Partial Convolution is a type of convolution operation that is aware of the presence of missing data, as indicated by masks. The script explains that in the inpainting model, partial convolutions are used instead of normal convolutions to take into account the masked areas in the images during the reconstruction process. This allows the model to focus only on the available pixels when performing convolutions.

💡Mask

In the context of inpainting, a mask is a selection of pixels in an image that are to be reconstructed. The script discusses the importance of masks in the inpainting process, as they define the areas of the image that need to be filled in. The speaker also mentions the need for diverse masks during training to ensure the model can handle various shapes and sizes of missing regions.

💡Loss Function

A loss function is a measure of error used to train and optimize models in machine learning. In the script, different components of the loss function are discussed, such as pixel-wise loss, perceptual loss, style loss, and total variation loss, which are all used to guide the inpainting model to produce more accurate and realistic reconstructions.
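
A hedged sketch of how these components combine (the weights are the ones quoted from the partial-convolutions paper and, as the talk stresses, must be re-tuned per dataset; the feats_* arguments are assumed to be lists of VGG-16 feature maps computed elsewhere):

```python
import torch
import torch.nn.functional as F

def gram(feat):
    """Gram matrix (channel autocorrelation) of a (N, C, H, W) feature
    map, normalized by C*H*W as in the style-loss formulation."""
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def inpainting_loss(out, target, mask, feats_out, feats_comp, feats_gt):
    """mask: 1 = valid pixel, 0 = hole. feats_* are matching lists of
    VGG-16 feature maps for the raw output, the composite image, and
    the ground truth."""
    comp = mask * target + (1 - mask) * out          # composite image
    l_hole = F.l1_loss((1 - mask) * out, (1 - mask) * target)
    l_valid = F.l1_loss(mask * out, mask * target)
    l_perc = sum(F.l1_loss(fo, fg) + F.l1_loss(fc, fg)
                 for fo, fc, fg in zip(feats_out, feats_comp, feats_gt))
    l_style = sum(F.l1_loss(gram(fo), gram(fg)) + F.l1_loss(gram(fc), gram(fg))
                  for fo, fc, fg in zip(feats_out, feats_comp, feats_gt))
    # Total variation: smoothness penalty (the paper restricts it to the
    # dilated hole region; computed over the whole composite for brevity).
    l_tv = (comp[..., :, 1:] - comp[..., :, :-1]).abs().mean() + \
           (comp[..., 1:, :] - comp[..., :-1, :]).abs().mean()
    return l_valid + 6.0 * l_hole + 0.05 * l_perc + 120.0 * l_style + 0.1 * l_tv
```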

💡Normalization

Normalization is a technique used in neural networks to stabilize and accelerate training by adjusting the activations. The script describes issues with batch normalization when dealing with varying mask sizes, which can lead to artifacts in the inpainted results. The speaker suggests techniques such as freeze training or removing normalization layers to mitigate these issues.
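
The speaker notes instance normalization had not been tried yet in their pipeline; purely as an illustration, swapping it in could look like this:

```python
import torch.nn as nn

def batchnorm_to_instancenorm(model: nn.Module):
    """Recursively replace BatchNorm2d with InstanceNorm2d, which
    normalizes each image separately and so is insensitive to how
    mask sizes vary across a batch."""
    for name, child in model.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(model, name,
                    nn.InstanceNorm2d(child.num_features, affine=True))
        else:
            batchnorm_to_instancenorm(child)
```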

💡High-Resolution

High-resolution refers to the level of detail and pixel density in an image. The script discusses the challenges of performing inpainting at high resolutions, such as increased computational cost and memory consumption. The speaker also mentions strategies to address these challenges, like model optimization and using techniques like super-resolution.

💡Generative Adversarial Networks (GANs)

GANs are a class of artificial intelligence algorithms used in unsupervised learning, consisting of two parts: a generator that creates data and a discriminator that evaluates it. The script mentions the use of adversarial loss in the inpainting model, where a discriminator is trained to distinguish between real and generated images, helping to improve the realism of the inpainted results.
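
A minimal sketch of this adversarial term with a single global patch discriminator (the setup described in the talk additionally uses a local discriminator restricted to the masked region):

```python
import torch
import torch.nn as nn

# Hypothetical patch discriminator: a real/fake score per local region.
disc = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, stride=1, padding=1))

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(real_img, fake_img):
    """Train D to score real images 1 and inpainted images 0."""
    real_score = disc(real_img)
    fake_score = disc(fake_img.detach())       # don't backprop into generator
    return bce(real_score, torch.ones_like(real_score)) + \
           bce(fake_score, torch.zeros_like(fake_score))

def generator_adv_loss(fake_img):
    """Adversarial term for the inpainting model: fool D into scoring 1."""
    fake_score = disc(fake_img)
    return bce(fake_score, torch.ones_like(fake_score))
```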

💡Receptive Field

In neural networks, the receptive field is the region of the input that a given output is dependent on. The script discusses the importance of increasing the receptive field size in the inpainting model to better capture the context needed for reconstructing missing high-resolution details. Techniques such as multi-column convolutional neural networks are mentioned to achieve this.
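
Dilated convolutions, one of the techniques mentioned, grow the receptive field without extra parameters or resolution loss; a small PyTorch illustration:

```python
import torch.nn as nn

# A 3x3 convolution with dilation 2 covers a 5x5 area with the same
# number of weights; stacking growing dilations enlarges the receptive
# field quickly while keeping the spatial resolution unchanged.
layers = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1, dilation=1),  # sees 3x3
    nn.Conv2d(64, 64, 3, padding=2, dilation=2),  # sees 7x7 overall
    nn.Conv2d(64, 64, 3, padding=4, dilation=4),  # sees 15x15 overall
)
```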

💡Mask Generation

Mask generation refers to the process of creating masks that define the areas to be inpainted. The script touches on the use of techniques like semantic segmentation and object detection to automatically generate specialized masks for inpainting tasks. This is useful for applications like automatic scene editing, where objects can be removed from images without manual mask drawing.
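
As an illustration of a segmentation-driven mask (assuming torchvision's pretrained DeepLabV3 with VOC-style labels, where class index 15 is 'person'; inputs must be normalized with the usual ImageNet statistics):

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# Pretrained semantic segmentation model from torchvision.
seg_model = deeplabv3_resnet50(weights="DEFAULT").eval()

def person_mask(image):
    """image: (1, 3, H, W), already normalized with ImageNet statistics.
    Returns an inpainting mask: 0 over detected people, 1 elsewhere."""
    with torch.no_grad():
        logits = seg_model(image)["out"]       # (1, 21, H, W) class scores
    labels = logits.argmax(dim=1, keepdim=True)
    return (labels != 15).float()
```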

Highlights

Introduction of TCL Research Europe, a new R&D center focusing on AI methods, specifically in computer vision.

Inpainting is the process of reconstructing lost or deteriorated parts of images or videos.

Inpainting can be used for restoring old photos, automatic scene editing, and denoising.

Deep learning is used in inpainting to capture high-level semantics of images, unlike traditional methods.

The importance of training data and the use of partial convolutions in the inpainting model.

Different architectures for inpainting, including encoder-decoder models and multi-column convolutional neural networks.

Loss functions used for optimizing the inpainting model, including pixel loss, perceptual loss, and total variation loss.

The use of adversarial loss and generative adversarial networks for refining inpainting results.

Practical issues encountered during the inpainting project, such as artifacts caused by batch normalization.

Approaches to address high-resolution inpainting challenges, including model optimization and the use of DSP processors.

The problem of detailed textures in high-resolution inpainting and methods to improve realism.

The generation of specialized masks for inpainting using techniques like semantic segmentation and object detection.

Sample results showcasing the effectiveness of the inpainting model in removing objects and reconstructing faces.

The potential of inpainting for entertainment, such as uncensoring images and revealing details in animations.

The need for further improvement in inpainting models to address artifacts and enhance detail reconstruction.

Summary emphasizing the value of inpainting, the journey from research to practical application, and the ongoing projects at TCL Research Europe.

Transcripts

play00:02

hello everyone as we have heard I'm from

play00:06

TCL research Europe which is a new R&D

play00:09

center we started in August here in

play00:11

Warsaw and we are mostly focusing on the

play00:15

AI actually only on the AI methods and

play00:18

mostly in the area of computer vision

play00:20

because TCL is a is a big manufacturer

play00:23

Chinese manufacturer of Smart TVs and

play00:25

smartphones as well and I would like to

play00:28

present you one of our one of our

play00:31

projects namely inpainting and the plan

play00:36

is simple so first I will tell you in

play00:39

simple words what inpainting is and I

play00:42

will try to show you why it is

play00:43

interesting then I will I will show you

play00:49

one sample approach based on deep

play00:51

learning deep learning this is the one

play00:53

that we are building on in our project I

play00:56

will also mention about some other

play00:58

approaches and modifications possible

play00:59

modifications then I will also say a few

play01:02

words about some practical issues that

play01:04

we encountered during during the project

play01:06

and now I'll I will show some sample

play01:08

results and summarize at the end so let

play01:11

me start what actually is

play01:14

inpainting so the answer is quite

play01:18

simple so it is the process of

play01:19

reconstructing lost or

play01:21

deteriorated parts of images or videos

play01:25

so like in this example we have an input

play01:27

image then we put some masks on it and

play01:30

we are trying to reconstruct the missing

play01:32

parts of the image basing on the

play01:35

neighborhood so that's more or less the

play01:38

topic and why it's interesting so the

play01:40

first answer is it was on the nips

play01:42

recent on recent nips conference nips is

play01:45

the top AI conference so it's the answer

play01:48

itself it's if it's on nips then okay it

play01:51

has to be interesting but believe me or

play01:53

not there are some other reasons so in

play01:55

other applications useful so for example

play01:58

it can be used to restore old old photos

play02:02

or videos like here we have some defects

play02:04

on on a photo we can put a mask on it

play02:06

and then try to restore what the photo

play02:09

should look like without the defect

play02:11

another application an obvious one is

play02:14

automatic scene editing

play02:16

and retouching so for example we have

play02:19

some photos with with objects we want to

play02:21

remove the objects so we put put the

play02:23

mask on objects and then we have a clear

play02:25

photo without without the objects on it

play02:28

there are also some other applications

play02:31

like inpainting can be used for

play02:33

denoising as well so as a kind of a side

play02:36

effect the inpainting results tend to

play02:39

be smooth even if we even if we put

play02:44

noisy input and there are works

play02:48

working on it on it and trying to figure

play02:51

out what the mask should look like to

play02:53

achieve good denoising results but

play02:56

but we will not focus on that also it can

play02:58

be used for a compression so here are

play03:01

some interesting interesting results so

play03:05

from the only the 5% of pixels of course

play03:08

if we choose those pixels in a

play03:11

smart way

play03:11

we are able to reconstruct the whole

play03:14

image so yes it can be used for

play03:17

compression clearly here also I was

play03:21

considering to remove it but we have a

play03:23

weekend after all so let's also talk

play03:25

about entertainment so this is a there

play03:28

was a there was a recent model published

play03:30

which does uncensoring of images in

play03:35

particular we can do uncensoring

play03:37

of Japanese animations like

play03:41

like here and reveal some reveal some

play03:44

interesting details out of

play03:46

such images so ok now I hope you

play03:50

are convinced that this is an important

play03:51

problem so let me let me start with

play03:56

describing how we can solve it solve it

play03:58

with deep learning our baseline approach

play04:01

is based on an NVIDIA paper quite

play04:03

recent one introducing partial

play04:05

convolutions so I will tell about these

play04:07

partial convolutions and describe the

play04:08

whole pipeline of

play04:11

training an inpainting model but let me

play04:14

start with the with the answer to the

play04:16

question why deep learning so there exist

play04:18

many many classical methods to do

play04:22

inpainting for example example-

play04:25

based inpainting or some

play04:27

patches

play04:29

and there are also commercial solutions

play04:31

like adobe of course it's working on it

play04:34

they work pretty well but also they have

play04:38

they have some problems so first of all

play04:41

it's hard to be accepted to nips if you

play04:43

don't do the deep learning so this is

play04:44

one reason to do this to do this with

play04:47

deep learning if you want to go to nips

play04:48

but again there are other other reasons

play04:52

as well so traditional traditional

play04:54

methods they usually work well for

play04:57

specific tasks like for example

play04:59

background inpainting when you can just

play05:01

simply repeat some patches from the

play05:03

neighborhood to reconstruct the missing

play05:05

part and they have problems with let's

play05:09

say hallucinating the the missing

play05:11

content if we are talking about

play05:13

challenging tasks like complex objects

play05:16

or faces for example and deep learning

play05:19

in contrast does quite well because it

play05:22

also captures some high level semantics

play05:24

of images and for example here you can

play05:26

see this is output from our model where

play05:28

we we are able to reconstruct face

play05:31

realistically so if we if you use

play05:34

traditional methods then probably this

play05:36

would not look like a face anymore okay

play05:40

so how to how to do this step by step

play05:43

first of all we need training data it's

play05:46

quite this is a good news it's quite

play05:47

simple to get the data because you can

play05:49

use any photos actually so we can use

play05:50

any existing databases like image net

play05:53

places and so on or any kind of photos

play05:56

the the the simplest option is to simply

play05:59

generate some random masks like this one

play06:02

and try to learn to restore the missing

play06:06

parts of the of the of the images one

play06:10

important thing here to mention is that

play06:12

masks do matter so for example in the

play06:14

original paper that I mentioned they

play06:16

proposed a way to to create diversified

play06:19

masks because they need to be

play06:22

diversified during training as much as

play06:24

possible so they have different shapes

play06:26

they cover different areas

play06:29

of the image and so on and also it is

play06:33

also well worth considering to use some

play06:34

specialized masks like masks put on some

play06:37

face landmarks or on objects I will

play06:39

mention about it later as well so when

play06:42

we have masks we have the training data we

play06:45

have images what we need is a model so

play06:48

this is a one architecture that we are

play06:52

building on it's quite popular it's

play06:54

based on U-Net which is an encoder and

play06:59

decoder based architecture used

play07:01

for example for image segmentation with

play07:04

many successes the difference the

play07:07

difference is here is that instead of

play07:08

using normal convolution we are

play07:10

using so called partial convolution that

play07:13

that takes into account also

play07:18

masks I will talk about it later in the

play07:21

next slide so it more or less looks like

play07:24

that in the encoder part part we get an

play07:26

image then we use a strided convolution

play07:29

so the image during the

play07:31

convolution is down-

play07:34

scaled then we have batch normalization

play07:36

and we have let's say another layer here

play07:38

another layer here and again

play07:40

convolution strided so going here in

play07:43

the encoder we are decreasing the

play07:44

resolution of the image and we are

play07:46

adding some more feature maps to it and

play07:48

then in the decoder phase we do the

play07:51

upscaling here we don't use any kind of

play07:55

transpose convolution or deconvolution

play07:56

we just use a simple upscaling here

play07:59

based on the nearest

play08:02

neighbor approach and then we do the

play08:04

partial convolution again we also have

play08:07

the skip connections which can be quite

play08:09

important in the case of inpainting

play08:10

because in particular in the last layer

play08:12

our model can can produce the output

play08:15

basing on the whole processed image here

play08:17

and reconstructed image here and also we

play08:19

it can take and map

play08:23

the original pixels from the

play08:25

original image in the area which is

play08:28

outside of the mask so that's more or

play08:30

less how the actual architecture looks

play08:32

like let me tell you a few words about

play08:34

this partial convolution because it's

play08:36

quite quite simple idea so actually it's

play08:40

like it's it's a simple convolution but

play08:43

before doing the convolution we are

play08:46

multiplying our input patch of the image

play08:49

with masks so everywhere where the mask

play08:53

is you

play08:55

everywhere where the mask is we are

play08:57

setting the pixels to zeros and then we

play08:59

are doing the convolution so we are only

play09:01

considering the that the pixels outside

play09:05

of the masks and then we are doing that

play09:07

the normalization because convolution is

play09:10

based on sums so if we are removing some

play09:12

elements then we need some normalization

play09:14

component here which just puts our

play09:19

activations back to the same level

play09:22

irrespective of the mask size so that's

play09:26

that's the that's the difference with

play09:28

with a convolution and also one

play09:30

important thing is that the mask is also

play09:33

updated so after each layer so after

play09:37

each layer we are updating the mask if

play09:40

we if we in one layer we reconstructed

play09:42

some pixels so when we are from the

play09:45

point of view of a given pixels when our

play09:47

receptive field was covering some real

play09:51

pixels in the place of the previous

play09:53

layer not not the mask then we are able

play09:55

to to calculate some activation and then

play09:58

we are we are updating this removing

play10:00

mask from this pixel so we are

play10:02

considering the information that we

play10:03

reconstructed as normal

play10:05

information in the subsequent

play10:08

layers so our mask is shrinking from

play10:10

layer to layer and then it usually

play10:11

disappears in the encoder part okay so

play10:17

that's that's how how this partial

play10:20

convolution works okay I want also to

play10:24

mention about some other architectures

play10:26

here from from from different papers

play10:28

there are two recent approaches one from

play10:30

Adobe and one from from the recent nips

play10:33

conference so you can see some

play10:35

modifications here like this

play10:37

architecture consists of two two parts

play10:40

first there's the encoder and decoder

play10:42

part which performs some coarse

play10:46

reconstruction based only on only on

play10:51

let's say per pixel reconstruction error

play10:56

and then there is another part of the

play10:58

network which performs refinement using

play11:01

some adversarial loss and generative

play11:04

adversarial network

play11:06

framework so this is one possible

play11:09

extension here another one this is

play11:12

called multi-column

play11:15

convolutional neural network and we have

play11:18

several different streams and

play11:22

they operate with different filter sizes

play11:25

so they have different receptive fields

play11:27

they take the same input and then then

play11:30

at some point they they are combined

play11:33

with each other and then the decoding is

play11:36

is common for all the older streams and

play11:39

also there are some other layers used

play11:41

like for example

play11:43

dilated convolution here which is a

play11:45

modification of convolution with

play11:47

increased receptive fields

play11:50

without coming into details because as

play11:53

we will see later receptive fields are

play11:54

crucial here in the code in the problem

play11:58

of inpainting so okay that was about

play12:01

the architecture so what else do we need

play12:04

of course we need the loss function to

play12:06

optimize to optimize our model

play12:10

parameters the loss functions in the

play12:13

loss function in the original paper is

play12:14

composed of many elements

play12:17

the first one is based on a simple per

play12:21

pixel per pixel loss so per pixel

play12:25

reconstruction

play12:27

error and here we are considering two

play12:31

two elements one is calculated inside

play12:34

the mask and another one outside of the

play12:36

mask so these are two per pixel loss

play12:39

components we also have something like

play12:41

perceptual loss which looks at two

play12:44

images like in this in this part the

play12:46

ground truth image and the output image

play12:49

but not in a pixel space but in a higher

play12:52

level feature space so we are extracting

play12:55

some features from a pre-trained model

play12:57

like vgg 16 model for example and here

play13:00

we are calculating this part comparing

play13:03

these features for the two

play13:05

outputs taking the l1 norm and summing

play13:08

over these three layers here in this

play13:11

case and also we we do this not only for

play13:14

our output image but also for the in the

play13:16

the so called composite image which is

play13:18

composed of

play13:19

the reconstructed let's say masked parts and

play13:22

original pixels put around so like here

play13:27

in the whole the whole formulation we

play13:29

put more attention to the inside of

play13:32

mask reconstruction error and also

play13:36

there is a similar style loss which is

play13:40

similar to perceptual loss but before

play13:43

taking the l1 norm we are performing

play13:47

autocorrelation using some gram matrix

play13:49

and then after the autocorrelation we we

play13:53

do the same more or less with some

play13:54

normalization factor depending on the

play13:57

size of our feature map taken from

play13:59

vgg which is number of channels height and width

play14:01

of our feature map and these two

play14:03

components perceptual loss and style

play14:05

loss they are used also in

play14:07

other problems like style transfer for

play14:09

example and they are more more in line

play14:13

with human perception than the simple

play14:15

reconstruction error from from the

play14:17

previous component and the last

play14:19

component here the total variation loss

play14:21

which is also quite popular in other

play14:23

applications it is a kind of a penalty

play14:25

for non smooth output so we are we are

play14:31

calculating it we are calculating it

play14:34

in the area P which is our mask slightly

play14:38

enlarged slightly by a

play14:40

dilation operation and here we want to

play14:43

we want the output to be smooth inside

play14:46

the mask and on the boundary between the

play14:48

mask and the original original image so

play14:51

the total loss looks some something like

play14:55

like this so this is the this is a

play14:58

simple weighted weighted sum of the

play15:01

whole all of these components these are

play15:03

weights taken directly from from from

play15:05

the paper but you have to keep in mind

play15:07

that they depend on many factors so for

play15:11

example of course on your data and on

play15:14

the model that you use to

play15:16

compute the perceptual and style loss so you

play15:18

have to actually tune the weights to

play15:21

your particular problem and just monitor

play15:23

the contribution of each loss component

play15:27

during the training of course there are

play15:31

some other loss components possible

play15:33

which we are working on right now and

play15:35

trying to add it to our pipeline so as I

play15:38

mentioned already the adversarial loss

play15:39

can be helpful here like in this

play15:42

pipeline in this pipeline we are

play15:44

training two discriminators one local

play15:47

looking at the whole picture

play15:50

one local looking at the at the mask and

play15:52

one global looking at the whole picture

play15:53

picture and they are trained to

play15:56

distinguish between original picture

play15:58

original images and images generated by

play16:01

our inpainting model and they are

play16:03

trained together with the generator in a

play16:06

standard adversarial setup a generative

play16:09

adversarial networks

play16:12

setup so this loss can be quite

play16:15

helpful and another kind of loss is the

play16:18

ID-MRF loss introduced in the recent

play16:21

nips paper which is implicit diversified

play16:26

Markov random field loss so the name is

play16:28

quite quite impressive but it's quite

play16:32

interesting as well so

play16:36

without going into details the idea is

play16:39

that our reconstructed patches should be

play16:43

similar to the nearest neighbors of of

play16:48

their of of these patches in the

play16:51

original image so we are taking a patch

play16:53

we are looking for some nearest

play16:55

neighbour in a feature space so in a

play16:57

higher-level feature space and then we

play17:00

want our reconstructed output like grass

play17:04

in this case look like the real grass

play17:07

around and also it is constructed the

play17:10

loss is constructed in a way that this

play17:12

is the diversified part of the name so

play17:15

we don't want one patch from the from

play17:18

the original image to be repeated many

play17:20

times we want to look for different

play17:22

patches around all similar but all

play17:24

different and we want our our output to

play17:27

be realistic and also diversified it's

play17:29

not a simple repeating pattern so we are

play17:32

also adding this to our model right now

play17:35

okay so mmm that's more or less the

play17:39

whole pipeline so then

play17:40

having these components data model and

play17:43

loss we train with standard

play17:46

SGD algorithms like Adam for example

play17:49

the problem is that with this

play17:52

architecture training time is quite long

play17:54

so on the whole image net data

play17:57

for example it takes a week to

play17:59

train a reasonable model on a single GPU

play18:02

machine so let me come right now to to

play18:07

some practical issues I would like to

play18:09

share with you here so the first one is

play18:13

with batch normalization in general

play18:16

masks cause some problems with

play18:21

batch normalization because various mask

play18:24

sizes in general affect activation

play18:28

distributions and you can observe it as

play18:31

several problems so for example you can

play18:33

observe I'm not sure if you are able to

play18:35

see it but this kind of artifacts so in

play18:37

the place of masks you see some non

play18:40

smoothness and some kind of these

play18:43

artifacts here so this is an example of

play18:47

batch normalization related artifact and

play18:50

our problem is that actually our model

play18:53

treats the boundaries of the image also

play18:56

as masks and there is a problem if you

play18:58

train the model in a lower resolution

play19:00

like 500 by 500 pixels for example and

play19:04

then as it is a fully convolutional

play19:08

model you can apply it to a high

play19:10

resolution but then when the model is

play19:13

processing the input of the image that

play19:15

the middle part then it gets some

play19:18

different activations because it is

play19:21

used to seeing a boundary around and

play19:23

here there is no boundary there is still

play19:25

image so the activations slightly differ

play19:27

and in this extreme case when we do the

play19:30

reconstruction without a mask with a

play19:32

empty mask we see that here are some

play19:36

problems with with normalization also

play19:39

are visible so what we can do about it

play19:41

first of all and this was proposed in

play19:44

the original paper that I mentioned we

play19:48

can use two-phase training so first we do

play19:50

the training with batch normalization and

play19:52

then we freeze batch normalization layers

play19:56

the trainable parameters of

play19:59

the

play20:00

normalization in the encoder part and then we

play20:02

do the fine tuning with the

play20:04

batch normalization frozen so the model can

play20:05

just adapt itself to this different

play20:09

different activations coming from

play20:10

different masks that's this one

play20:13

technique then we observe that also

play20:15

using diversified mask sizes

play20:18

including also empty masks can can help

play20:21

with this then of course you can replace

play20:24

standard batch normalization with some

play20:26

other normalizations like

play20:28

instance normalization for example which

play20:30

does normalization not on batches but

play20:33

on single images this could help but we

play20:36

haven't tried yet but there are some

play20:38

papers showing that maybe this could be

play20:40

a good direction and also it's it's

play20:43

quite a good idea it can make sense to

play20:45

remove batch normalization layers at all

play20:47

because all of these problems and also

play20:49

some other problems with with color

play20:52

coherence mentioned in many many papers

play20:54

some recent papers remove the batch normal-

play20:57

ization completely and it can make

play20:59

sense also because usually with these

play21:02

kind of models we are training on small

play21:03

batches so because the model size is

play21:05

huge

play21:06

we are training on a single

play21:09

GPU we can we can train using the batch

play21:12

size of 4 for example where the

play21:14

benefits of batch normalization are not that

play21:18

visible so it also works without

play21:21

batch normalization actually quite

play21:23

well now I would like to tell you about

play21:26

several issues which are related to high

play21:29

resolution inpainting now what is high

play21:32

resolution inpainting most of the

play21:36

papers claim that they do actually high

play21:37

resolution so they call 512 by

play21:41

512 pixels a high resolution because the

play21:44

the first works on inpainting were on

play21:47

much much smaller images like 64 by 64

play21:50

pixels for example but if you are a

play21:52

smart phone or smart TV manufacturer

play21:54

or like TCL for you high

play21:57

resolution is at least this one and

play22:00

some problems appear because

play22:03

moving from this resolution to this

play22:05

resolution even though only changing it

play22:08

twice in a single dimension then we have

play22:11

4 times longer prediction time

play22:14

because it is proportional to the number

play22:15

of pixels and also the memory

play22:17

consumption is is bigger for this so the

play22:22

first problem is with CPU and memory on

play22:24

especially on a mobile devices and what

play22:26

what can we do about it

play22:28

of course we can reduce the model size

play22:30

and train smaller models we can optimize

play22:33

the model for inference for example we

play22:37

can use the quantization techniques and

play22:39

move from higher precision to lower

play22:41

precision and in our calculations during

play22:45

prediction we can also optimize the

play22:47

critical the critical parts of the

play22:49

inference code and we also in our in our

play22:52

R&D centre we also have a group working

play22:54

on it so optimizing the convolutions and

play22:57

so on for mobile devices and we are also

play23:02

extensively trying to verify the

play23:05

possibility of launching our models on

play23:08

mobile devices using their GPU

play23:11

and DSP digital signal processing units

play23:15

so for example Qualcomm claims that you

play23:18

can receive you can get up to eight

play23:20

times speed-up using a DSP processor so

play23:24

we are trying with this but it's you

play23:26

have to know that it's not that simple

play23:27

actually to use this DSP even if you are

play23:30

the if we are the phone manufacturer we

play23:32

need to use some developed developer's

play23:35

boards it's not that simple to just run

play23:38

it on on a normal phone and test it yes

play23:42

and whenever possible probably you

play23:43

should do the inpainting in lower

play23:45

resolution so you should play with some

play23:47

crops and rescaling techniques and maybe

play23:49

super-resolution in your pipeline just

play23:51

to avoid high resolution because

play23:54

it's just expensive and not only it's

play23:57

expensive but it's it's also difficult

play23:59

so another problem related to to higher

play24:04

resolution is the big mask problem we call

play24:07

it the big mask problem because when we

play24:09

have the same picture in the same image

play24:11

like here and we try to remove this

play24:13

mountain there is a mountain here

play24:15

actually in this resolution it's

play24:18

much simpler and it works

play24:20

better than in the case of a higher

play24:22

resolution because here we need to

play24:23

reconstruct many many more pixels and

play24:26

it becomes really difficult for a model

play24:28

so how can we help this basically we

play24:33

need to increase the receptive fields of

play24:36

our models we can achieve this by

play24:39

increasing the size of the convolutional

play24:42

filters or increasing the number of

play24:44

layers but again this is expensive and

play24:47

or we can use some other other kind of

play24:51

modifications of convolutional layers

play24:53

like as I mentioned the dilated

play24:55

convolution or we can play with

play24:59

architectures I also mentioned about it

play25:01

so we can use some initial coarse part

play25:05

of the network and then again refining

play25:09

Network or this multi stream model with

play25:13

different sizes of receptive fields from

play25:16

small to bigger ones and the last

play25:20

problem related to high resolution I'm

play25:22

talking about this high resolution

play25:23

because it's really important from the

play25:25

practical point of view if you want to

play25:27

apply it within your product is a

play25:31

detailed textures issue so

play25:33

what looks nice in a lower resolution

play25:35

as I mentioned most models most

play25:37

publications show results in this

play25:39

resolution then it becomes unacceptable

play25:41

if you move to the high resolution so

play25:43

like here we are reconstructing

play25:44

reconstructing this part this part of

play25:47

the image and we clearly see the

play25:48

difference between the reconstructed

play25:50

level of details and the texture around

play25:53

so somehow we need to address it address

play25:56

it as well so first of all we should

play25:59

train on higher resolution images at

play26:01

least on crops of high resolution images

play26:04

of course and then right now we are

play26:07

playing as I mention a lot with

play26:09

different with different loss functions

play26:12

for example this adversarial loss is

play26:14

quite promising here because you know

play26:16

the discriminator trying to distinguish

play26:18

between real and generated photos it

play26:20

should learn somehow to detect this

play26:22

these artifacts these patterns here

play26:24

inside and then our generator in

play26:29

generative adversarial training should

play26:31

learn to fool the discriminator so it

play26:34

should avoid this kind of patterns so we

play26:36

believe that this kind of loss can help

play26:38

also this

play26:39

MRF-like loss seems to be a good idea

play26:42

to improve here and also as I mentioned

play26:46

we can combine inpainting with some

play26:50

other techniques like super resolution

play26:51

for example so we can use either super

play26:55

resolution as a post-processing or we

play26:57

can build in super resolution to our

play27:01

model then specialized for

play27:03

the inpainting that's one idea and

play27:06

after all if if nothing helps then we

play27:09

can do some post-processing and it is

play27:10

also post-processing similar to

play27:13

traditional techniques so then after

play27:15

after finishing inpainting we can

play27:17

somehow analyze our patches and look for

play27:21

search for some similar patches around

play27:23

and maybe try to blend the original high

play27:25

resolution patches with our

play27:27

reconstructed patches to to make it more

play27:29

realistic

play27:31

okay the last the last issue I want to

play27:34

mention is the issue of mask generation

play27:37

so in fact you may need some special

play27:40

kind of masks and you can use many

play27:42

techniques like semantic segmentation

play27:44

object matting salient object detection

play27:47

facial landmark detection to

play27:49

generate some kind of

play27:51

specialized masks on objects or on

play27:54

particular elements of faces and what

play27:58

can you use it for of course for

play27:59

automatic scene editing it would be a

play28:01

nice feature if you don't need to draw a

play28:04

mask you just point an object and it

play28:06

disappears from your photo so it's quite

play28:08

quite obvious and also during training

play28:10

you can use this smart masks to let's

play28:15

say make your training more in line with

play28:18

the business application so if you want

play28:19

to remove objects in your with your

play28:21

model in your business application then

play28:23

you can use this this mask at the at

play28:26

least to fine-tune your model and

play28:28

similar in the cases of faces if you

play28:30

want to do the inpainting and face

play28:32

retouching removing some defects or

play28:34

wrinkles on faces and probably you don't

play28:37

need to train your model to reconstruct

play28:38

eyes and nose because that's much much

play28:41

more difficult

play28:42

and maybe people don't want to just

play28:44

reconstruct their eyes because then they

play28:47

don't look that similar to them so smart

play28:52

masks can be also helpful okay let me

play28:56

come to some some examples some sample

play28:59

results of our inpainting model so here

play29:03

are two examples we have a nice scene

play29:05

here on the Left we want to remove some

play29:08

people and some buildings from that

play29:10

scene because we don't like them and

play29:12

this is the the output generated by our

play29:15

model it looks pretty nice

play29:18

just remember it's in low resolution so

play29:20

it's 512 by 512 here you have the Lewan-

play29:24

dowski family you also have

play29:28

Klara here and you can just remove it

play29:32

from the picture if you don't like

play29:34

if you prefer Lewandowski without Klara

play29:37

for example and this is actually a nice

play29:39

nice example showing benefit of of deep

play29:42

learning approach because if you do the

play29:44

same with a classical approach some

play29:46

strange things happen because they

play29:49

usually the classical approach tries to

play29:52

get some some patches from the

play29:54

neighborhood and what you can see is I

play29:56

don't have an example here but you can

play29:57

see the third leg of the Lewan-

play29:59

dowski for example in this place so

play30:01

it's that also shows how how it works

play30:04

and as I'm not planning to sell it to

play30:07

you right now so I also show some more

play30:09

difficult and not that beautiful results

play30:12

so we still work we are still working on

play30:14

improving this like here we are removing

play30:16

the lamp and something and the results

play30:21

is also different than the neighborhood

play30:23

details around I mentioned about it and

play30:26

here we are removing a big object a

play30:28

table in a quite complex scene and we

play30:32

get something like this so when you look

play30:34

your first look may may say okay it's

play30:36

quite okay there is a floor and and so

play30:38

on but when you look closer you will

play30:40

you'll see some strange artifacts and

play30:42

also this chair here is not

play30:43

reconstructed perfectly so still there

play30:46

is a there is a big big place for

play30:49

improvement here and okay some face

play30:52

example

play30:53

face inpainting examples actually it

play30:56

really works well so we trained the face

play30:58

model on celebrities photos and as you

play31:01

can see the reconstruction is really

play31:02

nice so we can reconstruct complex

play31:05

semantic parts of faces like nose and

play31:07

eyes in a realistic way so this is

play31:09

original this is inpainted just looking

play31:12

on into this image and also we can use

play31:14

this model to to do the face retouching

play31:17

like in this case we are smoothing the

play31:20

area under the eyes and removing some

play31:23

wrinkles and we get a smooth celebrity

play31:26

face out of your face so that's that's

play31:30

the idea okay let me summarize quickly

play31:34

so inpainting is a cool and useful

play31:38

topic and it can be solved with deep

play31:42

learning as I showed and it's worth

play31:49

remembering that there's a long way

play31:51

always from the initial results from the

play31:53

paper to production if you want to

play31:55

actually make a function for a

play31:58

smartphone for example for a smartphone

play32:00

gallery for example and also I would

play32:04

like you to remember that we are doing

play32:06

some pretty cool projects in TCL

play32:08

research Europe so if you are interested

play32:10

don't hesitate to visit our webpage or

play32:13

contact me directly or we have a stand

play32:16

one floor up from here where you

play32:20

can talk to us

play32:22

during the breaks as well

play32:25

ok thank you very much

play32:28

[Applause]
