Reading text from images with either Tesseract or Darknet/YOLO
Summary
TL;DR: This video script discusses the differences and limitations of using Tesseract and YOLO for Optical Character Recognition (OCR). The presenter demonstrates Tesseract's effectiveness on simple, black-and-white text but shows its struggles with complex images. They then showcase a YOLO model trained to detect street signs and text, highlighting its ability to identify objects but not read them as text. The script concludes with a sorted YOLO example that improves readability, emphasizing the need for sufficient training data for better accuracy.
Takeaways
- The presenter compares Tesseract OCR and YOLO for text recognition, highlighting the limitations of both.
- Tesseract performs well on simple, black-and-white text images but struggles with complex images.
- Tesseract's limitations are evident when it fails to recognize text in images with more complex backgrounds.
- The presenter demonstrates a neural network trained to read street signs, showcasing its ability to detect but not necessarily read text correctly.
- YOLO identifies text as objects within an image, which can be reconstructed but is not as straightforward as Tesseract's output.
- With a few additional lines of code, the presenter sorts YOLO's detection results to improve readability, producing a more coherent text output.
- The presenter emphasizes the importance of training-data quantity and quality for neural network performance.
- YOLO's text recognition is hindered by a lack of diverse training images, leading to misinterpretations.
- The presenter provides a simple CMake setup for compiling the code, showcasing how easy the projects are to build.
- Source code and additional resources are offered for those interested in experimenting with the presented methods.
Q & A
What are the two text recognition techniques discussed in the script?
-The two text recognition techniques discussed are Tesseract OCR and YOLO object detection.
What is the primary difference between Tesseract and YOLO when it comes to reading text?
-Tesseract is an OCR engine that reads text as text, while YOLO is an object detection system that identifies and reads text as objects within an image.
What type of images does Tesseract perform well on according to the script?
-Tesseract performs well on simple black and white images with clear text on a white background, typically images that have been processed through a fax machine or a flatbed scanner.
What limitations does Tesseract have when processing complex images?
-Tesseract struggles with complex images that have text in various colors and backgrounds, or where there are many other elements in the image.
How does the script demonstrate the limitations of Tesseract?
-The script demonstrates Tesseract's limitations by showing instances where it fails to recognize text in images that are not simple black and white, highlighting its inability to process complex images effectively.
What is the approach used to improve YOLO's text recognition as described in the script?
-The script describes training a neural network to read street signs and then sorting the detected text objects based on their x and y coordinates to reconstruct the text in a readable order.
What additional lines of code were added to the YOLO application to improve its output?
-A few lines of code were added to sort the detection results from left to right, making the text more readable and understandable.
How many classes were used in the neural network trained for the YOLO application?
-The neural network used in the YOLO application had 26 classes for the letters of the alphabet, plus a few more for signs such as 'yield', 'speed limit', and 'stop', for roughly 30 classes in total.
What was the size of the image dataset used to train the YOLO model in the script?
-The YOLO model was trained with 156 images, which the script suggests is not enough for the number of classes it has.
What is the script's recommendation for the minimum number of images needed to train a robust YOLO model?
-The script does not specify an exact number but implies that more images are better, particularly highlighting the issue of misclassification due to a relatively small training set.
What is the script's conclusion about the effectiveness of YOLO for text recognition?
-The script concludes that while YOLO is not perfect for text recognition, it can be effective for certain applications, such as reading street names, especially when the results are sorted correctly.
Outlines
Tesseract vs. YOLO OCR Limitations
The speaker begins by contrasting Tesseract, an optical character recognition (OCR) tool, with YOLO, an object detection system, highlighting their respective capabilities and limitations in reading text. They demonstrate Tesseract's proficiency with a simple black and white image containing text against its struggles with more complex images. The speaker then showcases a directory of images, emphasizing the challenges faced when the text is not in a binary format or when additional elements are present in the images. The script also details the process of using Tesseract with OpenCV to load and process an image, extracting text with varying degrees of success depending on the image's complexity.
Training YOLO for Street Sign Recognition
The second paragraph delves into the speaker's experience training a neural network, specifically YOLO, to identify street signs. They explain how YOLO interprets text as objects within an image, which can lead to incorrect text sequences. The speaker then describes a method to reconstruct the text using the x and y coordinates of the detected objects. They present an improved version of their YOLO application that sorts the detected text from left to right, making it more readable. The speaker also discusses the challenges of training with a limited dataset and the impact on the model's accuracy, such as misreading a 'y' as a 'v' and a 'd' as an 'o'. Lastly, they touch on the broader application of YOLO for detecting various signs and street names, despite the need for a more extensive training set.
YOLO Object Detection in Practice
In the final paragraph, the speaker discusses the practical application of YOLO for object detection, focusing on its use in a loop that processes a series of images. They provide insights into the code structure, mentioning the use of CMake files for simplicity and the creation of three executables corresponding to different functionalities. The speaker emphasizes the use of YOLO v4 tiny due to its efficiency and adequacy for most scenarios. They also share their approach to marking up images for training, which involves identifying and categorizing various elements such as speed limits, street names, and traffic signs. The paragraph concludes with an offer to share the source code and a brief overview of the code's functionality, including loading images, obtaining predictions, and displaying annotated images.
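Based on the CMake description above (find OpenCV, Tesseract, DarkHelp, and Darknet, then build three executables), a minimal sketch of such a file might look like the following. The file names, target names, and exact find commands are assumptions for illustration, not the presenter's actual file:

```cmake
cmake_minimum_required(VERSION 3.10)
project(OCRExamples)

# Locate the four dependencies the video mentions. The library names
# passed to find_library are assumptions and may differ per platform.
find_package(OpenCV REQUIRED)
find_library(TESSERACT_LIB tesseract)
find_library(DARKHELP_LIB  darkhelp)
find_library(DARKNET_LIB   darknet)

# Three executables, one per .cpp file:
# 1) Tesseract OCR, 2) plain YOLO detection, 3) YOLO with sorted output.
add_executable(example1 example1.cpp)
add_executable(example2 example2.cpp)
add_executable(example3 example3.cpp)

target_link_libraries(example1 ${OpenCV_LIBS} ${TESSERACT_LIB})
target_link_libraries(example2 ${OpenCV_LIBS} ${DARKHELP_LIB} ${DARKNET_LIB})
target_link_libraries(example3 ${OpenCV_LIBS} ${DARKHELP_LIB} ${DARKNET_LIB})
```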
Keywords
Tesseract
YOLO (You Only Look Once)
OCR (Optical Character Recognition)
OpenCV
Binary Image
Neural Network
Darknet
Image Annotation
Training Data
Inference
CMake
Highlights
Comparison between Tesseract OCR and YOLO for text reading capabilities.
Tesseract performs well on simple black and white text images.
Tesseract struggles with complex images that are not black and white.
YOLO is used to detect and read text on street signs.
YOLO identifies text as objects within an image rather than reading it sequentially.
Adding sorting code to YOLO improves text readability.
YOLO's performance is dependent on the similarity between training and test images.
The presenter trained a neural network with a limited dataset of 156 images.
YOLO v4 tiny is preferred for its efficiency in most situations.
The presenter provides insights on how to mark up images for training YOLO.
The limitations of Tesseract are highlighted through various image examples.
YOLO's ability to detect text is showcased through annotated images of street signs.
The presenter demonstrates how to sort YOLO's detection results for better text reconstruction.
Misidentifications by YOLO are attributed to a lack of diverse training images.
The simplicity of the code used for YOLO detection and annotation is emphasized.
Source code and CMake files for the demonstrations are made available for public use.
Transcripts
I want to show the difference between using Tesseract to do OCR, show the limitations of Tesseract, and then do the same thing with YOLO. Both YOLO and Tesseract can read text, but there are huge limitations on both of them.

So I'm going to start by showing a directory of files, a directory of images here. This one is an image that has just a bunch of text. It's really simple black and white; it's not a binary image, but it's just black text on a white background, and it doesn't have anything else. Meanwhile, I have other images that have text on them, but they're not black and white. They're complex; there's lots going on in those images.
All right, so knowing this, here's the first application. There's not too much code here: I instantiate one of these Tesseract objects, I load up an image, I tell Tesseract "here's the image I want you to use" and pass in the OpenCV image that I read in, and then this is how you get the text.

Tesseract does really well with that image I showed, this one here. All right, this is the image in this window, and this is the text that it extracted from that particular image. You can see that even though there are a few formatting differences, like here there's a blank line before "fundamental freedoms" and here there's no blank line before "fundamental freedoms", otherwise Tesseract does a really good job of reading this text.

However, if I do the same thing but tell it to use these other images instead: this one didn't find anything. Nothing there. Nothing there. On several of them you can see a little bit of the text. I don't think any of these ones are good examples... oh, there you go. On that previous one you can see "maximum"; it went by too fast. Nothing to read there. You can see where it says "kilometers per hour" on this one. Anyway, it's not very exciting; Tesseract does not do a good job on these. So Tesseract does a really good job on things that have gone through something like a fax machine or a flatbed scanner: just black and white text, that's it.
So the next thing that I did, let me go back to my IDE here. All right, this is the next example: I trained a neural network to read those street signs. What I do is print out all the results, show the annotated image, and then wait for a key. So let's do the same thing, we'll do option number two. There we go. Let me move this over so we can see: M, C... anyway, this is the whole street name. Let me cycle through a few of these images here. You can see I trained it to find stop signs, yield signs, speed signs, and the street names, and then within the street names I created a class for each letter from A to Z (zed), and you can see the results here.

So it can find the text, but it doesn't read it as text; it reads it as objects within an image, and if you try to read this you'll see it doesn't make sense. It's not in the right order. The order is actually there, though: if you take a look at the x and y coordinates and the width and height of all of these boxes, you can rebuild that text. It's not great, it's not like Tesseract, but for something like a street name, if that's what you have to do, it can be done.
And that's what I want to show. Go back to the IDE for the third example. The third one is the same as the second, with one difference: I added just a few lines of code, these lines here, to sort the results. There's a big block of text here; you can pause the YouTube video if you want to look at it, but let me jump straight to the results. So what we want is the third application. Let me move the window aside, so you can see that now it's sorted from left to right, and it makes a lot more sense. An application could also look for blank spaces between two boxes, and then it would know that there's actually a blank space here, between the L and the R. So you could split words up this way, as long as what you're looking for is relatively simple.

So it's not perfect. Let me look for an example... here we go. For example, it thinks this is a V, so if you take a look: B, V, L, A. It's actually B, Y, L, A, "by land". And the D here, it incorrectly thinks that that is an O.
So this is not a problem with YOLO; this is a problem due to me not having enough images to train. I had a relatively small set of images to train with, and these are the images that I kept out of training that we're looking at right now. But in all, let's see what I have. Remember, I have 26 classes for the letters, and then I have a few more classes for yield, speed, stop, that kind of thing. There you go. And if I bring up the menu here, you can see: stop, yield, street name, speed limit, back of stop sign, and then everything from A to Z. In all, I had 156 images that I used to train, which, for the number of classes that I have, for 30 classes, 156 images is definitely not enough. The only way this works is the similarity between the test images that I'm using and the images that I used to train. And I'll just scroll through some, so that if people are looking for hints as to how to mark up images, this is how I did it.
All right, so that's DarkMark, and these are the results of running inference on those images that I kept out of training. The speed limits are not very interesting; these are the ones that are interesting, the ones that have "sha" and "nno", so Shannon Lake Road, you can see that here, that was found. I'm using YOLOv4-tiny in this case. I very rarely use the full version of YOLO; I find that YOLO-tiny is normally perfect for almost every situation.

And the code is relatively simple. This is the one that doesn't sort, so you can see there are only a few lines of code: it loads the network here, it configures just a few items, and I have a for loop that loops through all of the images. What I do is load an image, get the predictions, display the predictions, which is what we see being displayed here, show the annotated image on the screen, and then wait for a key to be pressed. It just keeps looping until it's gone through all of the images. And that's what it looks like.
So I will include a link in the description below to the source code, if people want to play with it. I've got some CMake files that are really, really simple. Actually, let me bring that up real quick. This is the CMake file here. I look for OpenCV, I look for Tesseract, DarkHelp, and Darknet of course, and then these are the three executables that I create from the three .cpp files that are used. This first one is the Tesseract one, the second one is a very simple YOLO one using DarkHelp and Darknet, and then the last one is the same as the second one, but I added a sort function here in the middle to make it easier to read.

Hope this was helpful.