43 R-CNNs, SSDs, and YOLO

22 Apr 2021, 15:49

Summary

TL;DR: This video explores the evolution of object detection in computer vision, from traditional methods like HOG with SVMs to deep learning approaches. It covers the breakthrough of R-CNNs, their evolution into Fast R-CNN and Faster R-CNN, and introduces Single Shot Detectors (SSDs) and the revolutionary YOLO algorithm. The video explains how these advanced detectors work, their speed and accuracy, and touches on metrics like Intersection over Union (IoU) and Mean Average Precision (mAP). It concludes with a look at the latest advancements in YOLO, setting the stage for implementing these detectors using Python and OpenCV.

Takeaways

  • 🔎 **Object Detection Overview**: Object detection is a crucial part of computer vision that combines object classification and localization to identify and locate objects within images.
  • 🐾 **Historical Context**: Early object detection methods like Haar cascade classifiers and Histogram of Oriented Gradients (HOG) with SVMs have been largely replaced by deep learning approaches due to their inefficiency with multiple object classes.
  • 🚀 **Deep Learning Breakthrough**: In 2014, the introduction of Region-based Convolutional Neural Networks (R-CNNs) marked a significant advancement in object detection, achieving high performance in the PASCAL VOC challenge.
  • 📈 **Selective Search Algorithm**: R-CNNs use Selective Search to propose regions of interest within an image, which are then classified by a CNN and an SVM, with linear regression producing tighter bounding boxes.
  • 📏 **Intersection over Union (IoU)**: IoU is a metric used to evaluate the accuracy of object detection by measuring the overlap between the predicted bounding box and the ground truth.
  • 🏎️ **Fast R-CNN**: An evolution of R-CNN, Fast R-CNN improved speed by running the CNN once over the whole image and sharing the resulting features via region of interest (ROI) pooling, instead of training and running three separate models.
  • 🚀 **Faster R-CNN**: Building on Fast R-CNN, Faster R-CNN further increased speed by eliminating the need for selective search, thus speeding up the region proposal stage.
  • 🏁 **Single Shot Detectors (SSDs)**: SSDs introduced multiscale features and default boxes to significantly increase detection speed, achieving near real-time performance with minimal accuracy loss.
  • 🌐 **YOLO (You Only Look Once)**: YOLO revolutionized object detection by using a single neural network applied to the full image, allowing for fast and efficient detection by predicting bounding boxes and class probabilities directly.
  • 🏆 **YOLOv2 and YOLOv3**: YOLOv2 added batch normalization and higher-resolution training; YOLOv3 added multi-scale training, enhancing the detection of objects at various sizes and resolutions.

Q & A

  • What is object detection in the context of computer vision?

    -Object detection is a crucial aspect of computer vision that combines object classification and localization. It not only identifies the presence of an object within an image but also determines the region or bounding box of the object, allowing for multi-class detection unlike earlier single-class detectors.

  • How do traditional non-learning methods for object detection compare to deep learning methods?

    -Traditional methods such as Haar cascade classifiers and Histogram of Oriented Gradients (HOG) with linear SVMs are considered outdated compared to deep learning methods. Deep learning object detectors significantly outperform them, offering better accuracy and applicability across a wider range of scenarios.

  • What was the breakthrough in deep learning object detection in 2014?

    -In 2014, the introduction of Regions with CNNs (R-CNNs) marked a significant breakthrough in deep learning object detection. R-CNNs achieved remarkably high performance in the PASCAL VOC challenge, a benchmark in computer vision for object detection.

  • How does the Selective Search algorithm contribute to object detection?

    -The Selective Search algorithm contributes to object detection by segmenting the image into different regions based on similarities in color or texture. It proposes regions of interest that are then passed through a CNN for classification, thus streamlining the process of generating bounding box proposals.
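
As an illustration, here is a minimal sketch of generating region proposals with the Selective Search implementation shipped in OpenCV's contrib package (opencv-contrib-python); the image path is a placeholder, and this is a standalone demo rather than the exact pipeline from the video:

```python
import cv2

img = cv2.imread("input.jpg")  # placeholder path

# Selective Search lives in the ximgproc contrib module.
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()  # faster, fewer proposals; Quality() yields more

rects = ss.process()  # array of (x, y, w, h) region proposals
print(f"{len(rects)} region proposals")

# Draw the first 100 proposals; in R-CNN each proposal would be cropped
# and sent to the CNN for feature extraction.
for (x, y, w, h) in rects[:100]:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)
```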

  • What is Intersection over Union (IoU) and why is it important in object detection?

    -Intersection over Union (IoU) is a metric used to measure the accuracy of an object detector by calculating the overlap between the predicted bounding box and the ground truth box. An IoU score above 0.5 is generally considered acceptable, indicating a good match between the predicted and actual object location.
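
Since IoU is simple to compute, a small self-contained sketch may help; boxes here are assumed to be in (x1, y1, x2, y2) corner form:

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction shifted 10 px off a 100x100 ground truth still clears 0.5:
print(iou((50, 50, 150, 150), (60, 60, 160, 160)))  # ~0.68
```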

  • How does the Mean Average Precision (mAP) metric help in evaluating object detectors?

    -Mean Average Precision (mAP) is a metric used to evaluate the performance of object detectors, especially when dealing with multiple detections for the same object. It provides a measure of accuracy by considering both the precision and recall of the detector across various classes of objects.
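
As a rough illustration of how AP is computed for one class, here is a simplified sketch that assumes detections have already been matched to ground-truth boxes at IoU >= 0.5; PASCAL VOC actually used an 11-point interpolated variant, so treat this as the idea rather than the benchmark's exact formula:

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truths):
    """AP for one class: area under the precision-recall curve.

    scores            -- confidence of each detection
    is_true_positive  -- 1 if the detection matched an unmatched ground
                         truth at IoU >= 0.5, else 0 (matching done beforehand)
    num_ground_truths -- total ground-truth boxes for this class
    """
    order = np.argsort(scores)[::-1]  # rank detections by confidence
    hits = np.asarray(is_true_positive)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1 - hits)
    recall = np.concatenate(([0.0], tp / num_ground_truths))
    precision = np.concatenate(([1.0], tp / (tp + fp)))
    # Area under the (uninterpolated) precision-recall curve.
    return float(np.sum(np.diff(recall) * precision[1:]))

# mAP is then just the mean of the per-class APs.
print(average_precision([0.9, 0.8, 0.6], [1, 0, 1], num_ground_truths=2))
```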

  • What improvements did Fast R-CNN bring over the original R-CNN?

    -Fast R-CNN improved upon the original R-CNN by reducing the computational overhead. It eliminated the need for separate models for feature extraction, classification, and bounding box regression by running the CNN only once over the entire image and then applying Region of Interest (ROI) pooling.
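
To make the ROI pooling idea concrete, here is a minimal NumPy sketch: the feature map is computed once per image, and each proposed region is max-pooled into a fixed grid so the downstream layers always see the same shape. Shapes and ROI coordinates are illustrative, not taken from the paper:

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=(7, 7)):
    """Max-pool one ROI of a (H, W, C) feature map into a fixed out_size grid.

    roi -- (x1, y1, x2, y2) in feature-map coordinates
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2, :]
    oh, ow = out_size
    # Split the region into an oh x ow grid and max-pool each cell.
    ys = np.linspace(0, region.shape[0], oh + 1, dtype=int)
    xs = np.linspace(0, region.shape[1], ow + 1, dtype=int)
    out = np.zeros((oh, ow, region.shape[2]), dtype=feature_map.dtype)
    for i in range(oh):
        for j in range(ow):
            cell = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1], :]
            out[i, j] = cell.max(axis=(0, 1)) if cell.size else 0
    return out

# One feature map, shared by every region proposal:
fmap = np.random.rand(32, 32, 256).astype(np.float32)
pooled = roi_pool(fmap, (4, 4, 20, 18))  # -> shape (7, 7, 256)
```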

  • How do SSDs achieve near real-time performance in object detection?

    -SSDs (Single Shot MultiBox Detectors) achieve near real-time performance by using multiscale features and default boxes, which allow for efficient processing without the need for region proposal networks. They also reduce the resolution of images fed into the classifier, which contributes to their speed.
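
A toy sketch of the default-box idea follows: each cell of a feature map anchors a few boxes of fixed scale and aspect ratio, and a finer grid means more boxes (better coverage of small objects, but slower inference). The grid sizes, scales, and aspect ratios below are illustrative, not the paper's exact configuration:

```python
import numpy as np

def default_boxes(grid_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """(cx, cy, w, h) of the default boxes anchored on one feature map,
    in relative [0, 1] image coordinates."""
    boxes = []
    step = 1.0 / grid_size
    for i in range(grid_size):
        for j in range(grid_size):
            cx, cy = (j + 0.5) * step, (i + 0.5) * step  # cell center
            for ar in aspect_ratios:
                boxes.append((cx, cy, scale * np.sqrt(ar), scale / np.sqrt(ar)))
    return np.array(boxes)

# A coarse map for large objects and a finer one for small objects:
coarse = default_boxes(grid_size=4, scale=0.6)   # 4 * 4 * 3   = 48 boxes
fine = default_boxes(grid_size=16, scale=0.2)    # 16 * 16 * 3 = 768 boxes
```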

  • What is the core concept behind YOLO (You Only Look Once) object detection?

    -The core concept behind YOLO is to use a single neural network that processes the full image at once, allowing it to reason globally about the image content. It divides the image into a grid and predicts bounding boxes and class probabilities for each grid cell, which simplifies the detection process and enables fast performance.
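
To make the grid idea concrete, here is a schematic sketch of decoding such an output, using the S=7, B=2, C=20 dimensions from the original YOLO paper; the random tensor stands in for a real network's prediction:

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)
# Per cell: B boxes of (x, y, w, h, confidence), then C class probabilities.
output = np.random.rand(S, S, B * 5 + C)

detections = []
threshold = 0.5
for row in range(S):
    for col in range(S):
        cell = output[row, col]
        class_probs = cell[B * 5:]  # P(class | object) for this cell
        for b in range(B):
            x, y, w, h, conf = cell[b * 5:(b + 1) * 5]  # conf ~ P(object) * IoU
            class_scores = class_probs * conf  # class-specific confidence
            if class_scores.max() >= threshold:
                detections.append((row, col, b, int(class_scores.argmax()),
                                   float(class_scores.max())))

# The surviving detections would then go through non-maximum suppression.
print(len(detections), "candidate detections")
```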

  • How does YOLO's architecture differ from other object detection methods?

    -YOLO's architecture differs by using a single, unified model that combines region proposal, object recognition, and classification into one fully convolutional neural network. This design allows for fast and efficient object detection without the need for multiple stages or models, which is a departure from methods like R-CNNs.

Outlines

00:00

🔎 Introduction to Object Detection

This paragraph introduces the concept of object detection as the 'holy grail' of computer vision, emphasizing its importance in identifying and localizing multiple classes of objects within images. It contrasts single-class object detection with the more complex multi-class detection, highlighting the inefficiency of traditional methods like sliding windows. The paragraph also briefly mentions the advent of deep learning object detectors like R-CNNs, SSDs, and YOLO, which have revolutionized the field by offering efficient and accurate detection across various object classes.

05:01

📊 Understanding Object Detection Metrics

The second paragraph delves into the metrics used to evaluate object detection models, focusing on the Intersection over Union (IoU) metric. It explains how IoU measures the overlap between the predicted bounding box and the ground truth, with an IoU of over 0.5 being considered acceptable. The discussion includes the challenges of handling multiple bounding boxes for the same object and introduces the Mean Average Precision (mAP) metric as a way to assess model performance in such scenarios. The paragraph also touches on the evolution of deep learning object detectors, starting from R-CNNs to Fast R-CNNs and Faster R-CNNs, each improving upon the previous in terms of speed and efficiency.

10:02

🚀 Speeding Up Object Detection with SSDs

Paragraph three discusses the Single Shot Detectors (SSDs) and their significant improvement in speed over the R-CNN family, achieving near real-time performance. It explains how SSDs use multiscale features and default boxes, along with reduced image resolution for classification, to enhance speed without a substantial loss in accuracy. The paragraph also mentions the trade-off between increasing the number of default boxes for better accuracy and the consequent decrease in speed. The discussion concludes with an overview of SSD's structure, including the feature map extractor and the convolutional filter used for object detection.

15:03

🏆 The Rise of YOLO: Real-Time Object Detection

The final paragraph introduces the You Only Look Once (YOLO) object detection framework, which uses a single neural network applied to the full image, allowing for global reasoning and real-time detection. It outlines the evolution of YOLO from its initial version to YOLOv3, highlighting improvements such as batch normalization, higher-resolution training, and multi-scale training to enhance the detection of smaller objects. The paragraph also provides a brief overview of YOLO's architecture, consisting of 24 convolutional layers followed by two fully connected layers, and encourages viewers to read the original paper for a deeper understanding.


Keywords

💡Object Detection

Object detection is a fundamental concept in computer vision that involves identifying and locating objects within an image or video frame. It is described in the video as the 'holy grail' of computer vision, highlighting its significance. The video script mentions object detection in various contexts such as face, pedestrian, and vehicle detection, emphasizing its broad applications. Object detection is crucial as it not only classifies the objects but also localizes them within a frame, which is essential for tasks like autonomous driving or surveillance systems.

💡RCNNs (Regions with CNNs)

RCNNs, or Regions with CNNs, are a type of deep learning object detector that were a breakthrough in the field, achieving high performance in the PASCAL VOC challenge. The video script explains how RCNNs work by proposing bounding boxes and passing these to an image classifier. RCNNs were significant in advancing object detection by addressing the exhaustive search problem previously performed by sliding windows, thus improving efficiency and accuracy.

💡Selective Search

Selective Search is an algorithm used in the context of RCNNs to generate region proposals for object detection. The video script describes how it segments the image into different groups by combining similar areas such as colors or textures, which are then proposed as regions of interest. This process is crucial for RCNNs as it reduces the number of regions that need to be classified, thus improving the speed and efficiency of the object detection process.

💡CNNs (Convolutional Neural Networks)

CNNs are a class of deep learning models that are widely used in image recognition tasks, including object detection. The video script mentions how CNNs are used to classify the regions proposed by the Selective Search algorithm. CNNs are fundamental to the operation of RCNNs, as they extract features from the proposed regions and classify them, which is a critical step in the object detection pipeline.

💡SVM (Support Vector Machine)

SVM, or Support Vector Machine, is a machine learning algorithm used for classification tasks. In the context of the video, SVM is used after the CNN has extracted features from the regions proposed by Selective Search to classify the objects. The video script explains that the CNN features are fed into an SVM classifier, which then determines the class of the objects within the proposed regions.

💡Bounding Box

A bounding box is a rectangular frame used to localize an object within an image. The video script discusses how bounding boxes are proposed by the Selective Search algorithm, classified via CNN features and an SVM, and then tightened with linear regression. Bounding boxes are essential in object detection as they provide the spatial information needed to identify the exact location of objects within an image.

💡Intersection over Union (IoU)

IoU, or Intersection over Union, is a metric used to evaluate the accuracy of object detection models. It is defined as the area of overlap between the predicted bounding box and the ground truth box, divided by the area of their union. The video script uses IoU to illustrate how well a predicted box aligns with the actual object's location. An IoU score over 0.5 is generally considered acceptable, indicating a good match between the predicted and true bounding boxes.

💡Faster R-CNN

Faster R-CNN is an evolution of Fast R-CNN, designed to further improve speed and efficiency. Fast R-CNN had already reduced computation by running the CNN across the entire image just once and sharing features through Region of Interest (ROI) pooling; Faster R-CNN went further by replacing the slow Selective Search stage with a learned region proposal network, removing the main remaining bottleneck in the detection pipeline.

💡SSD (Single Shot Detector)

SSD, or Single Shot Detector, is another deep learning object detector that is highlighted in the video script for its speed and efficiency. Unlike RCNNs, SSDs use multiscale features and default boxes, which allow them to operate at near real-time speeds with minimal loss in accuracy. The video script mentions SSDs in the context of their high frame rates and their ability to handle multiple objects and scales within an image.

💡YOLO (You Only Look Once)

YOLO, or You Only Look Once, is a state-of-the-art object detection algorithm that is praised in the video script for its speed and accuracy. YOLO operates by dividing the image into a grid and predicting bounding boxes and class probabilities for each grid cell. The video script describes YOLO's architecture and how it uses a single neural network applied to the full image, allowing for global reasoning and fast detection. YOLO's approach results in a balance between speed and accuracy, making it suitable for real-time applications.

Highlights

Introduction to deep learning object detectors like R-CNNs, SSDs, and YOLO.

Object detection is the 'holy grail' of computer vision, enabling multi-class detection.

Object detection combines classification and localization.

Traditional non-learning methods like HOG with SVM are outdated compared to deep learning detectors.

Deep learning object detectors are more efficient than sliding window approaches.

R-CNNs were a breakthrough in 2014, achieving high performance in the PASCAL VOC challenge.

Selective Search algorithm is used to generate region proposals for R-CNNs.

SVM is used to classify features extracted by CNNs in R-CNNs.

Linear regression is applied to refine the bounding boxes in R-CNNs.

Intersection over Union (IoU) is a key metric for evaluating object detection.

Mean Average Precision (mAP) is used to evaluate detector accuracy, including when multiple boxes are predicted for the same object.

Fast R-CNN improved the speed of R-CNNs by reducing the need for separate models.

Faster R-CNN further increased speed by eliminating the selective search bottleneck.

SSDs use multiscale features and default boxes for faster detection.

YOLO uses a single neural network applied to the full image for real-time object detection.

YOLO divides the image into an SxS grid and predicts bounding boxes and probabilities for each region.

YOLOv2 introduced batch normalization and was fine-tuned for higher resolutions.

YOLOv3 introduced multi-scale training for better detection of smaller objects.

SSDs and YOLO are currently among the top-performing object detectors.

Upcoming video will cover implementing object detection with SSDs and YOLO in Python using OpenCV.
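
As a preview of that upcoming video, here is a minimal sketch of running a pretrained YOLOv3 model through OpenCV's dnn module. The file names are placeholders for the standard Darknet release files (which must be downloaded separately), and confidence thresholding and non-maximum suppression are omitted:

```python
import cv2

# Placeholder file names for the standard Darknet YOLOv3 release files.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

img = cv2.imread("input.jpg")  # placeholder image path

# YOLOv3 expects a scaled, square RGB blob; 416x416 is the standard size.
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

# Each row of each output is (cx, cy, w, h, objectness, class scores...);
# in a full pipeline these are thresholded and passed to cv2.dnn.NMSBoxes.
```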

Transcripts

[00:00] Hi, and welcome to this video, where we take a look at some deep learning object detectors such as R-CNNs, SSDs, and YOLO. So let's start: what exactly is object detection? It is actually the holy grail of computer vision. It allows us to do things like this, and you will have seen this before in our face, pedestrian, and vehicle object detectors in earlier videos. However, those were limited to just one class, and if you wanted to add multiple classes, you would have to pile on multiple object detectors, which would slow things down significantly. So we are now taking a look at some very advanced deep learning object detectors that make things like this possible. This is pretty cool, isn't it? Car, dog, horse, person in the back over here.

[00:45] Object detection, as I said in earlier chapters, is a mix of object classification and localization. Classification means determining what the object is, and localization means identifying the region, or bounding box, of the object; it's used in face detection quite often, which we've seen in previous chapters. The difference between object detection and classification is that object detection doesn't just tell you whether an image contains a cat or not; it actually tells you where the cat, or other objects or animals, could be in the image. So that's pretty cool.

[01:15] Non-learning methods of object detection include what we discussed earlier, Haar cascade classifiers, and another one we didn't cover in earlier chapters: Histogram of Oriented Gradients (HOG) with linear SVMs (support vector machines, a machine learning classifier). The reason we didn't discuss these is that they are pretty much outdated now; deep learning object detectors are significantly better and can be used in a wide variety of applications, which makes them very powerful. These older methods are technically still very useful when you are detecting just one object, which is why they remain popular for face detection. However, when you try to apply them to multiple object classes, they break down: they don't work too well, and they can often be quite slow, because they involve an approach called sliding windows, where you have to slide a window at various scales across the image. That is why it is not a very efficient method.

[02:10] In 2014, deep learning object detectors had a huge breakthrough with something called Regions with CNNs, or R-CNNs for short. This achieved remarkably high performance in the PASCAL VOC challenge, an object detection dataset that researchers and computer vision journals use to compete and to assess how good their object detection methods are, basically an ImageNet-style benchmark for detection. In 2014, as I said, R-CNNs achieved remarkable success on it.

[02:42] Now let's take a look at how R-CNNs work. R-CNNs attempted to solve the exhaustive search problem previously handled by sliding windows by proposing bounding boxes and passing these extracted boxes to an image classifier. So how do we get these bounding box proposals? That's where we use an algorithm called Selective Search. Imagine we have an input image and we extract interesting regions; we send these regions to a CNN to classify, and we get different classifications of what it thinks each one is. The Selective Search algorithm attempts to segment the image into different groups by combining similar areas, such as colors or textures, which can be interpreted as blobs or contours. It proposes which of these regions are interesting using metrics within the algorithm, produces boxes from this segmentation-style task, and feeds those boxes to the CNN to classify.

[03:38] Once Selective Search has found these interesting boxes, it passes them through the CNN, one that was trained on the ImageNet dataset, so it is quite well tuned and has been exposed to a lot of data. We don't actually use the CNN directly for classification; instead, we compute the CNN features and feed them into another classifier, an SVM (support vector machine), which classifies the CNN-extracted features. After the region proposal has been classified, we then use a simple linear regression to generate a tighter bounding box.

[04:15] But what exactly makes a good box? How do we know that this box, say over this guy here, is a good box? Before we move on to other types of deep learning object detectors, we're going to discuss some metrics for assessing object detectors. This brings us to something called Intersection over Union, or the IoU metric. IoU is the area of overlap between the predicted box and the ground truth box, divided by the area of their union, and typically an IoU over 0.5 is considered acceptable.

[04:47] Let's look at this metric in a little more detail. Imagine this green box over the car is our true, human-labeled box identifying the car. Remember, we want localized boxes: we don't want a box that captures only half the car and stops at the back wheel, and we don't want a box that covers the car but also a lot of extra area around it. We want the box to be nice and tight, exactly over the object we detected. Now imagine this red box is the one our object detector proposed. It looks fairly good; it covers roughly 80 percent of the object. Now look at the one on the right: this box covers even more of the car, almost 95 percent or more, missing just this little back piece, but it also spills over a much larger area outside the car. So technically, shouldn't it be classified as a better box than the first one? That's not what we want, is it? We don't want our box sprawling all over the place. That is the intuition behind the IoU metric: we want the red box to cover the green box as much as possible without extending too far beyond it or shrinking too far inside it. Anything over 0.5 is considered fairly acceptable in my opinion; there are stricter guidelines depending on the type of detector you are building, but essentially 0.5 is a reasonable score.

[06:43] Let's estimate the IoU in these examples. In the box on the left, the overlap is maybe 50 percent of the union, so you end up with an IoU of roughly 0.5. In the example on the right, the overlap makes up something like 80 percent of the union, so you can see how it earns a much higher IoU under this metric.

[07:23] Now there's another problem. With object detectors, we will often have multiple boxes proposed over the same object, and the detector may not even know it is doing it more than once, because remember, it doesn't actually know what is in the image; it is predicting interesting boxes that may or may not be correct. It is very hard for the algorithm to determine whether it is looking at two cars parked side by side or just one car. So if we have our ground truth label, the green box here, we may often get multiple windows proposed over it when testing our detector. There is another metric, called mean average precision, or mAP for short, that we can use when we have multiple boxes over the same ground truth label, and it gives a measure of the accuracy of our object detector. This one is actually a bit confusing; there are some pretty long blog posts explaining the metric that will help you understand it, but for now just remember it is a metric used to determine the performance of object detectors.

[08:25] Now that you understand the metrics, we can go back to the evolution of deep learning object detectors and look at Fast R-CNN, the evolution of R-CNN by the same researchers. They basically made some speed improvements over the original R-CNN. The original R-CNN was quite slow because it required three models trained separately and used in conjunction at execution time: a feature extraction model, an SVM to predict a class, and a linear regression to tighten the bounding boxes. You can see that in real time this would be quite slow. Fast R-CNN solved this problem by removing the overlap this generated: they ran the CNN across the image just once, using a technique called Region of Interest (ROI) pooling. Essentially, instead of running the CNN to extract features on every identified region, they ran it once over the entire image and then reused those features in the later stages of the pipeline.

[09:31] Even that wasn't fast enough, so the researchers developed an even faster R-CNN and named it, appropriately, Faster R-CNN; note that there is no "Fastest R-CNN" just yet. What the Microsoft Research team did to give Fast R-CNN its "faster" flavor was eliminate the bottleneck involved in using Selective Search, which sped up the region proposal stage significantly as well.

[09:58] This brings us to Single Shot Detectors, or SSDs. We've just discussed the R-CNN family and seen how successful it can be and how it evolved; however, these models typically still run at roughly 7 frames per second, even on fairly powerful hardware. SSDs, and even YOLO, are significantly faster; you can see some of the stats right here, including the FPS figures, with YOLO and SSD topping the chart at 45 and 46 FPS, alongside the different accuracy scores and resolutions. So how did SSDs improve speed so significantly? SSDs use multiscale features and something called default boxes instead of searching for interesting regions, and furthermore they drop the resolution of the images fed into the classifier itself. This allowed SSDs to achieve near real-time performance with almost no drop in accuracy; well-trained SSDs sometimes even saw improved accuracy.

[10:55] This is the general SSD structure. It is composed of two main parts: a feature map extractor (VGG-16 was used in the published paper, though ResNet or DenseNet may provide better results now) and a convolutional filter for the object detection part. You can read the paper at the link here if you want an in-depth explanation, but to summarize: SSDs are definitely faster than Faster R-CNN, though they are less accurate at detecting smaller objects. Accuracy increases if we increase the number of default boxes, which gives a finer grid over the image; however, that slows the detector down. The multi-scale feature maps improve object detection at various scales, which is why increasing the density of the grid can increase accuracy.

[11:45] Now let's move on to YOLO. The idea behind YOLO is that instead of using many different models, networks, and stages to do different jobs in the object detector, YOLO uses a single neural network applied to the full image, which allows it to reason globally across the image when generating its predictions. It is a direct development and evolution of something called MultiBox: it takes MultiBox, which was used for region proposal, turns it into object recognition, and adds a softmax layer in parallel with the box regressor, combining everything into a box classifier as well, so the entire object detector is built into one network. It works by dividing the image into regions and predicting bounding boxes and probabilities for each region. YOLO uses a fully convolutional neural network, allowing inputs of various sizes. I must say, YOLO is one of the most impressive object detectors ever built, and it is very pleasant to use, train, and adapt to different situations. By the way, if you want to learn more about YOLO, you can click the link here or go to the site to read the actual paper; it's quite good, and surprisingly entertaining as well.

[12:54] So how exactly does YOLO work? The image is first divided into an S-by-S grid, and if the center of an object falls into a grid cell, that cell is responsible for detecting the object. Say this dog falls into the center of this cell; then this cell is responsible for saying that this is a dog. Each grid cell predicts a number of bounding boxes and confidence scores for those boxes, where confidence is defined as the probability of an object multiplied by the thresholded IoU score; IoU scores below 0.5 are typically just given a confidence of zero. Then, by multiplying the conditional class probability and the individual box confidence predictions, we get the class-specific confidence score for each box. You can see the steps here: all these bounding boxes are proposed, a box region forms around the dog based on the class probability map, and once the final boxes are selected you can see the final detections. YOLO is quite effective in practice, and quite fast.

[14:02] You can take a look at the YOLO architecture here as well, or, as I said, I encourage you to read the paper. The architecture has 24 convolutional layers followed by two fully connected layers, with alternating 1x1 convolutional layers to reduce the feature space from preceding layers. This may not make much sense yet, but the paper explains the reasoning behind this design.

[14:23] Now let's talk about the evolution of YOLO. It won the People's Choice award at CVPR, a big computer vision conference. Later, YOLO version 2 was released, introducing batch normalization, which improved the mAP score by two percent, and it was also fine-tuned to work at higher resolutions, fairly high resolution for a real-time computer vision detector, I must say; that gave a four percent increase in mAP overall. YOLO version 3 was fine-tuned even further and introduced multi-scale training to help better detect smaller objects.

[14:58] This concludes our discussion of YOLO; I hope you appreciate what it has brought to the table. YOLO and SSDs are basically neck and neck as the two best types of object detectors out there. Let's summarize what we just learned: the evolution of object detectors from Haar cascade classifiers and HOG to the first deep learning object detector, R-CNN; then Fast R-CNN and Faster R-CNN; then single-shot detection with SSDs, which also give quite good performance; and finally a high-level overview of the YOLO object detectors, where YOLO version 3 is the latest and quite good right now. Now let's move on to the next video, where we start implementing object detection with SSDs and YOLO in Python using OpenCV 4. So stay tuned, and thank you!


Tags

Object Detection, Deep Learning, Computer Vision, RCNNs, SSDs, YOLO, Image Classification, Localization, AI Technology, Real-time Detection