Reading text from images with either Tesseract or Darknet/YOLO

Stephane Charette
26 Feb 2022, 11:31

Summary

TL;DR: This video script discusses the differences and limitations of using Tesseract and YOLO for Optical Character Recognition (OCR). The presenter demonstrates Tesseract's effectiveness on simple, black and white text but shows its struggles with complex images. They then showcase a YOLO model trained to detect street signs and text, highlighting its ability to identify objects but not read them as text. The script concludes with a sorted YOLO example that improves readability, emphasizing the need for sufficient training data for better accuracy.

Takeaways

  • 📖 The presenter is comparing Tesseract OCR and YOLO for text recognition, highlighting the limitations of both.
  • 🖼️ Tesseract performs well on simple, black and white text images but struggles with complex images.
  • 🚫 Tesseract's limitations are evident when it fails to recognize text in images with more complex backgrounds.
  • 📈 The presenter demonstrates a neural network trained to read street signs, showcasing its ability to detect but not necessarily read text correctly.
  • 🔍 YOLO identifies text as objects within an image, which can be reconstructed but is not as straightforward as Tesseract's output.
  • 🛠️ With additional code, the presenter sorts YOLO's detection results to improve readability, simulating a more coherent text output.
  • 🔄 The presenter emphasizes the importance of training data quantity and quality for neural network performance.
  • 🔒 YOLO's text recognition is hindered by a lack of diverse training images, leading to misinterpretations.
  • 💻 The presenter provides a simple CMake setup for compiling the code, showcasing the ease of setting up the projects.
  • 🔗 Source code and additional resources are offered for those interested in experimenting with the presented methods.

Q & A

  • What are the two text recognition techniques discussed in the script?

    -The two text recognition techniques discussed are Tesseract OCR and YOLO object detection.

  • What is the primary difference between Tesseract and YOLO when it comes to reading text?

    -Tesseract is an OCR engine that reads text as text, while YOLO is an object detection system that identifies and reads text as objects within an image.

  • What type of images does Tesseract perform well on according to the script?

    -Tesseract performs well on simple black and white images with clear text on a white background, typically images that have been processed through a fax machine or a flatbed scanner.

  • What limitations does Tesseract have when processing complex images?

    -Tesseract struggles with complex images that have text in various colors and backgrounds, or where there are many other elements in the image.

  • How does the script demonstrate the limitations of Tesseract?

    -The script demonstrates Tesseract's limitations by showing instances where it fails to recognize text in images that are not simple black and white, highlighting its inability to process complex images effectively.

  • What is the approach used to improve YOLO's text recognition as described in the script?

    -The script describes training a neural network to read street signs and then sorting the detected text objects based on their x and y coordinates to reconstruct the text in a readable order.

  • What additional lines of code were added to the YOLO application to improve its output?

    -A few lines of code were added to sort the detection results from left to right, making the text more readable and understandable.

  • How many classes were used in the neural network trained for the YOLO application?

    -The neural network used in the YOLO application had 26 classes for the letters of the alphabet and a few more for signs like 'yield', 'speed', and 'stop', totaling 30 classes.

  • What was the size of the image dataset used to train the YOLO model in the script?

    -The YOLO model was trained with 156 images, which the script suggests is not enough for the number of classes it has.

  • What is the script's recommendation for the minimum number of images needed to train a robust YOLO model?

    -The script does not specify an exact number but implies that more images are better, particularly highlighting the issue of misclassification due to a relatively small training set.

  • What is the script's conclusion about the effectiveness of YOLO for text recognition?

    -The script concludes that while YOLO is not perfect for text recognition, it can be effective for certain applications, such as reading street names, especially when the results are sorted correctly.

Outlines

00:00

🖥️ Tesseract vs. YOLO OCR Limitations

The speaker begins by contrasting Tesseract, an optical character recognition (OCR) tool, with YOLO, an object detection system, highlighting their respective capabilities and limitations in reading text. They demonstrate Tesseract's proficiency on a simple black and white image containing text, then contrast that with its struggles on more complex images. The speaker then showcases a directory of images, emphasizing the challenges faced when the text is not in a binary format or when additional elements are present in the images. The script also details the process of using Tesseract with OpenCV to load and process an image, extracting text with varying degrees of success depending on the image's complexity.

05:00

🚦 Training YOLO for Street Sign Recognition

The second paragraph delves into the speaker's experience training a neural network, specifically YOLO, to identify street signs. They explain how YOLO interprets text as objects within an image, which can lead to incorrect text sequences. The speaker then describes a method to reconstruct the text using the x and y coordinates of the detected objects. They present an improved version of their YOLO application that sorts the detected text from left to right, making it more readable. The speaker also discusses the challenges of training with a limited dataset and the impact on the model's accuracy, particularly its tendency to misread characters, such as a 'y' detected as a 'v' and a 'd' detected as an 'o'. Lastly, they touch on the broader application of YOLO for detecting various signs and street names, despite the need for a more extensive training set.

10:01

πŸ” YOLO Object Detection in Practice

In the final paragraph, the speaker discusses the practical application of YOLO for object detection, focusing on its use in a loop that processes a series of images. They provide insights into the code structure, mentioning the use of CMake files for simplicity and the creation of three executables corresponding to different functionalities. The speaker emphasizes the use of YOLOv4-tiny due to its efficiency and adequacy for most scenarios. They also share their approach to marking up images for training, which involves identifying and categorizing various elements such as speed limits, street names, and traffic signs. The paragraph concludes with an offer to share the source code and a brief overview of the code's functionality, including loading images, obtaining predictions, and displaying annotated images.

Keywords

💡Tesseract

Tesseract is an open-source Optical Character Recognition (OCR) engine that is capable of recognizing and 'reading' text within images. In the context of the video, Tesseract is used to demonstrate its effectiveness on simple, black and white text images, where it performs well, but also to highlight its limitations when dealing with more complex images that contain text in various colors and backgrounds.

💡YOLO (You Only Look Once)

YOLO is a popular real-time object detection system that is capable of identifying objects within images. In the video, YOLO is used to illustrate its application in reading text on street signs by recognizing text as objects within an image. However, it is noted that YOLO does not read the text sequentially but rather identifies individual characters, which can lead to incorrect interpretations unless sorted correctly.

💡OCR (Optical Character Recognition)

OCR is a technology that allows the conversion of various types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data. The video discusses the use of Tesseract for OCR, emphasizing its strengths with simple text images and its challenges with more complex scenarios.

💡OpenCV

OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library. It is used in the video to load and process images before they are passed to Tesseract for OCR or to YOLO for object detection, highlighting its role as a foundational tool in image processing for both OCR and object detection tasks.

💡Binary Image

A binary image is a type of image where each pixel is either black or white, with no shades of gray. The video mentions binary images in the context of Tesseract's preference for simple, high-contrast images for effective text recognition.

💡Neural Network

A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the operation of the human brain. In the video, a neural network is trained to read street signs, demonstrating how machine learning can be applied to improve text recognition in complex scenarios.

💡Darknet

Darknet is an open-source neural network framework, written in C and CUDA, which YOLO is based on. The video references Darknet in the context of using YOLO for object detection, indicating the underlying technology that powers YOLO's capabilities.

💡Image Annotation

Image annotation is the process of labeling the contents of an image, which is a crucial step in training machine learning models for tasks like object detection. The video script mentions the process of annotating images for training the YOLO model to recognize various signs and street names.

💡Training Data

Training data is the set of data used to 'teach' a machine learning model to perform a specific task. The video discusses the importance of having a sufficient amount of training data, as the limited dataset used for training the YOLO model affected its accuracy in recognizing text on street signs.

💡Inference

Inference in machine learning refers to the process of making predictions or decisions based on a trained model. The video describes running inference on images using a trained YOLO model to demonstrate its ability to detect and 'read' text on street signs.

💡CMake

CMake is a build system generator and cross-platform native build environment. In the video, CMake is used to configure the build process for the applications that utilize Tesseract and YOLO, showing how it simplifies the compilation of complex projects involving multiple libraries and dependencies.

Highlights

Comparison between Tesseract OCR and YOLO for text reading capabilities.

Tesseract performs well on simple black and white text images.

Tesseract struggles with complex images that are not black and white.

YOLO is used to detect and read text on street signs.

YOLO identifies text as objects within an image rather than reading it sequentially.

Adding sorting code to YOLO improves text readability.

YOLO's performance is dependent on the similarity between training and test images.

The presenter trained a neural network with a limited dataset of 156 images.

YOLOv4-tiny is preferred for its efficiency in most situations.

The presenter provides insights on how to mark up images for training YOLO.

The limitations of Tesseract are highlighted through various image examples.

YOLO's ability to detect text is showcased through annotated images of street signs.

The presenter demonstrates how to sort YOLO's detection results for better text reconstruction.

Misidentifications by YOLO are attributed to a lack of diverse training images.

The simplicity of the code used for YOLO detection and annotation is emphasized.

Source code and CMake files for the demonstrations are made available for public use.

Transcripts

00:01

I want to show the difference between using Tesseract to do OCR, show the limitations of Tesseract, and then do the same thing with YOLO. Both YOLO and Tesseract can read text, but there are huge limitations on both of them.

00:24

So I'm going to start by showing a directory of files, a directory of images here. This one is an image that has just a bunch of text. It's really simple black and white; it's not a binary image, but it's just black text on a white background, and it doesn't have anything else. Meanwhile, I have other images that have text on them, but they're not black and white. They're complex; there's lots of things going on in the images.

01:03

All right, so knowing this, here's the first application. There's not too much code here: I instantiate one of these Tesseract objects, I load up an image, I tell Tesseract "here's the image I want you to use" and pass in the OpenCV image that I read in, and then this is how you get the text.
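
The steps just described map onto only a handful of Tesseract API calls. The sketch below is a minimal reconstruction under those assumptions, not the video's actual source; the file name "sign.jpg" is a placeholder.

#include <opencv2/opencv.hpp>
#include <tesseract/baseapi.h>
#include <iostream>

int main()
{
    // instantiate one of these Tesseract objects and load the English data
    tesseract::TessBaseAPI ocr;
    if (ocr.Init(nullptr, "eng") != 0)
    {
        std::cerr << "failed to initialize Tesseract" << std::endl;
        return 1;
    }

    // load up an image using OpenCV
    cv::Mat image = cv::imread("sign.jpg");
    if (image.empty())
    {
        return 1;
    }

    // tell Tesseract "here's the image I want you to use"
    ocr.SetImage(image.data, image.cols, image.rows,
                 image.channels(), static_cast<int>(image.step));

    // and this is how you get the text
    char * text = ocr.GetUTF8Text();
    std::cout << text << std::endl;
    delete [] text;

    ocr.End();
    return 0;
}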

01:34

And Tesseract does really well with that image I showed, this one here. All right, this is the image in this window, and this is the text that it extracted from that particular image. You can see that even though there are a few formatting differences, like here there's a blank line before "fundamental freedoms" and here there's no blank line before "fundamental freedoms", otherwise Tesseract does a really good job of reading this text.

02:17

However, if I do the same thing but tell it to use these other images instead: this one didn't find anything. Nothing there. Nothing there. On several of them you can see a little bit of the text. I don't think any of these ones are good examples... oh, there you go, on that previous one you can see "maximum"; it went by too fast. Nothing to read there. You can see where it says "kilometers per hour" on this one. Anyway, it's not very exciting; Tesseract does not do a good job on these. So Tesseract does a really good job on things that have gone through something like a fax machine or a flatbed scanner, just black and white text, and that's it.

03:51

So the next thing that I did, let me go back to my IDE here. All right, this is the next example: I trained a neural network to read those street signs. What I do is print out all the results, show the annotated image, and then wait for a key. So let's do the same thing; we'll run option number two. There we go. Let me move this over so we can see. Anyway, this is the whole thing, the street name. Let me cycle through a few of these images. You can see I trained it to find stop signs, yield signs, speed signs, and the street names. And then within the street names I created a class for each letter from A to Z, and you can see the results here.

05:00

So it can find the text, but it doesn't read it as text; it reads it as objects within an image. And if you try to read this, you'll see it doesn't make sense; it's not in the right order. The order is actually there: if you take a look at the x and y coordinates and the width and height of all of these boxes, you can rebuild that text. It's not great, it's not like Tesseract, but for something like a street name, if that's what you have to do, it can be done.

05:36

And that's what I want to show. Going back to the IDE for the third example: the third one is the same as the second, with one difference. I added just a few lines of code, these lines here, to sort the results. There's a big block of text here; you can pause the YouTube video if you want to look at it, but let me jump straight to the results. So what we want is the third application here. Let me move the window aside so you can see that it's now sorted from left to right; it makes a lot more sense. An application could also look for blank spaces between two boxes, and then it would know that there's actually a blank space here, between the L and the R. So you could split words up this way, as long as what you're looking for is relatively simple.
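
As a sketch of what those "few lines of code" might look like: sort the detections by the left edge of each bounding box, and treat an unusually wide horizontal gap as a word break. The field names (rect, name) follow DarkHelp's PredictionResult structure as I understand it; treat them as assumptions and verify against the DarkHelp headers.

#include <DarkHelp.hpp>
#include <algorithm>
#include <string>

std::string rebuild_text(DarkHelp::PredictionResults results)
{
    // sort the boxes from left to right using the x coordinate
    // (this assumes a single line of text; multi-line signs
    // would need to be grouped by y coordinate first)
    std::sort(results.begin(), results.end(),
        [](const auto & lhs, const auto & rhs)
        {
            return lhs.rect.x < rhs.rect.x;
        });

    std::string text;
    int previous_right_edge = -1;
    for (const auto & prediction : results)
    {
        // a gap wider than the current letter suggests a space between words
        if (previous_right_edge >= 0 &&
            prediction.rect.x - previous_right_edge > prediction.rect.width)
        {
            text += ' ';
        }
        text += prediction.name;
        previous_right_edge = prediction.rect.x + prediction.rect.width;
    }
    return text;
}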

06:49

So it's not perfect. Let me look for an example... here, for example: it thinks this is a "v", so if you take a look it reads "b v l a", when it's actually "b y l a", the start of the street name. And the "d" here, it incorrectly thinks is an "o". This is not a problem with YOLO; this is a problem due to me not having enough images to train with. I had a relatively small set of training images, and the images we're looking at right now are the ones that I kept out of training.

07:39

But in all, let's see what I have. Remember, I have 26 classes for the letters, and then a few more classes for yield, speed, stop, that kind of thing. There you go: if I bring up the menu here, you can see stop, yield, street name, speed limit, back of stop sign, and then everything from A to Z. In all I had 156 images that I used to train, which, for the number of classes I have, 30 classes, is definitely not enough. The only way this works is the similarity between the test images I'm using and the images I used to train.

08:34

And I'll just scroll through some of these, so that if people are looking for hints as to how to mark up images, this is how I did it. All right, so that's DarkMark.
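
For anyone unfamiliar with what DarkMark produces: Darknet training annotations are plain .txt files, one per image, with one line per object giving the class index followed by the box centre and size, all normalized to the image dimensions. The two lines below are made-up values for illustration, assuming the .names order listed above (stop, yield, street name, speed limit, back of stop sign, then the letters): class 0 would be a stop sign, class 5 the letter "a".

0 0.512 0.334 0.210 0.185
5 0.250 0.610 0.030 0.045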

08:54

And these are the results of running inference on those images that I kept out of training. The speed limits are not very interesting; the interesting ones are the ones with the letters spelling out "Shannon", so Shannon Lake Road. You can see here that it was found.

09:17

I'm using YOLOv4-tiny in this case. I very rarely use the full version of YOLO; I find that YOLO tiny is normally perfect for almost every situation.

09:37

And the code is relatively simple. Let me show the one that doesn't sort, so you can see there are only a few lines of code. It loads the network here, it configures just a few items, and then I have a for loop that loops through all of the images. What I do is load an image, I get the predictions, I display the predictions, which is what we see here being displayed, I show the annotated image on the screen, and then I wait for a key to be pressed. It just keeps looping until it's gone through all of the images. And that's what it looks like.
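
A minimal sketch of that non-sorting loop, assuming the public DarkHelp C++ API (DarkHelp::NN, predict(), annotate()); the network and image file names are placeholders, not the project's actual files.

#include <DarkHelp.hpp>
#include <opencv2/opencv.hpp>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    // load the network once: configuration, weights, and class names
    DarkHelp::NN nn("signs.cfg", "signs_best.weights", "signs.names");

    const std::vector<std::string> images = { "sign1.jpg", "sign2.jpg" };
    for (const auto & filename : images)
    {
        cv::Mat frame = cv::imread(filename);    // load an image
        const auto results = nn.predict(frame);  // get the predictions
        std::cout << results << std::endl;       // display the predictions

        cv::imshow("annotated", nn.annotate());  // show the annotated image
        cv::waitKey(-1);                         // wait for a key to be pressed
    }
    return 0;
}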

10:29

So I will include a link in the description below to the source code, if people want to play with it. I've got some CMake files that are really, really simple; actually, let me bring that up real quick. This is the CMake file here. I look for OpenCV, I look for Tesseract, DarkHelp, and Darknet of course, and then these are the three executables that I create from the three .cpp files that are used. The first one is the Tesseract one, the second one is a very simple YOLO one using DarkHelp and Darknet, and the last one is the same as the second one, but I added a sort function in the middle to make it easier to read.
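
The CMakeLists.txt would follow the usual find-the-libraries, add-the-executables pattern. This is a rough reconstruction of the structure described, not the author's actual file; the target and source names are placeholders.

cmake_minimum_required(VERSION 3.10)
project(ReadingText CXX)

find_package(OpenCV REQUIRED)        # look for OpenCV
find_library(TESSERACT tesseract)    # look for Tesseract
find_library(DARKHELP darkhelp)      # look for DarkHelp
find_library(DARKNET darknet)        # ...and Darknet, of course

add_executable(example1 example1.cpp)    # the Tesseract one
target_link_libraries(example1 ${OpenCV_LIBS} ${TESSERACT})

add_executable(example2 example2.cpp)    # the very simple YOLO one
target_link_libraries(example2 ${OpenCV_LIBS} ${DARKHELP} ${DARKNET})

add_executable(example3 example3.cpp)    # same as the second, plus the sort
target_link_libraries(example3 ${OpenCV_LIBS} ${DARKHELP} ${DARKNET})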

11:27

Hope this was helpful!


Related Tags

OCR, Tesseract, YOLO, Text Recognition, Image Processing, Machine Learning, Neural Networks, Computer Vision, Data Analysis, AI Applications