Training Tesseract 5 for a New Font
Summary
TL;DR: This video tutorial shows how to train Tesseract OCR on a custom font for improved recognition. It covers generating ground-truth data (images, text files, and box files) with the 'text2image' tool from Tesseract's training tools. A provided script automates creating single-line text files and images; training is then driven by the tesstrain Makefile, and the model's performance is evaluated at the end. Tips on adjusting parameters for better training outcomes are included.
Takeaways
- 😀 The video is a tutorial on training Tesseract, an optical character recognition engine, with a custom font to improve its recognition capabilities.
- 📄 To train Tesseract, you need to provide ground truth data which includes generating images with the custom font and corresponding text and box files that describe the content and location of the text.
- 🖼️ The script uses 'text2image' application from Tesseract's training tools to generate images from text files.
- 📝 The generated images should be in TIFF or PNG format, accompanied by a '.txt' file for the text and a '.box' file that describes the location of each character.
- 🔍 The video creator developed a Python script to automate the process of creating single-line text files from a large text file and generating the corresponding images and box files.
- 🔢 The script uses the 'unicharset' file from Tesseract to define the rules of English, which helps the neural network understand how words are formed.
- 💻 The video demonstrates how to set up the folder structure and run commands for training Tesseract with the new data.
- 🔄 The training process involves running iterations where Tesseract learns from the provided data, and the number of iterations can be adjusted based on the desired accuracy and time frame.
- 📊 The video shows how to evaluate the trained model using a test image and compares the results before and after training to demonstrate improvement.
- 🔧 The creator encourages viewers to modify the provided script to suit their needs, such as changing the font, model name, and output directory, to ensure a deeper understanding of the training process.
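The splitting step the takeaways describe can be sketched in Python. This is a minimal sketch, not the author's actual script: the `apex_` file-name prefix, the `max_lines` cap, and the output layout are assumptions based on what the video shows.

```python
import pathlib


def split_training_text(training_text, out_dir, max_lines=100):
    """Split a big training-text file into single-line ground-truth
    files (.gt.txt), one per non-empty line, as Tesseract 5 expects."""
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    with open(training_text, encoding="utf-8") as f:
        for i, raw in enumerate(f):
            if i >= max_lines:  # cap so a demo run finishes quickly
                break
            line = raw.strip()
            if not line:
                continue  # skip blank lines in the corpus
            gt_file = out_dir / f"apex_{i}.gt.txt"
            gt_file.write_text(line + "\n", encoding="utf-8")
            created.append(gt_file)
    return created
```

Each `.gt.txt` file produced here would then be fed to `text2image` to render the matching line image and box file.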
Q & A
What is the main purpose of training Tesseract with a custom font?
-The main purpose is to improve Tesseract's recognition capabilities for a specific font that it may not recognize well by default.
What does 'ground truth' mean in the context of training Tesseract?
-'Ground truth' refers to the correct or expected output that Tesseract should recognize from the images, which includes both the text and the location of each character.
What file formats are typically used for the image and ground truth data when training Tesseract?
-For the image, TIFF or PNG formats are used, while a .txt file is used for the ground truth text and a .box file describes the location and character of each element in the image.
How can one generate images with a custom font for training Tesseract?
-One can use the 'text2image' application that comes with Tesseract's training tools to generate images with the custom font.
What is a 'box file' and why is it necessary for training Tesseract?
-A 'box file' describes the location and identity of each character in the image. It is necessary because it provides the ground truth data that Tesseract uses to learn where to find characters in the images.
What is the source of the training text used in the script?
-The training text is sourced from the langdata_lstm repository, specifically its English folder, which contains a large file full of English text.
How does the script separate the training text into single-line text files?
-The script reads the large training text file and creates a separate text file for each line, then uses 'text2image' to generate an image and a box file for each line.
What is the significance of using the 'unicharset' file in training?
-The 'unicharset' file contains the rules of English that help the neural network understand how words are formed, which is crucial for accurate recognition.
How can one evaluate the performance of a trained Tesseract model?
-One can evaluate the model by running a command that uses the trained model to recognize text from an image and compares it against the expected output.
What is the recommended approach for increasing the accuracy of the trained model?
-The recommended approach is to increase the number of training iterations and to use a larger dataset of images, while being cautious not to overfit the model.
How can one install and use a custom font for training Tesseract?
-One can install the font on their system and then specify the font in the training script. On Linux, this might involve updating the font cache, while on Windows, it's as simple as installing the font and using it in the script.
Outlines
🖥️ Training Tesseract OCR with Custom Fonts
This paragraph introduces the process of training Tesseract, an optical character recognition engine, with a custom font to improve its recognition capabilities. The speaker explains the need to generate images with the custom font and accompanying text and box files that describe the content and location of the characters on the image. The speaker also mentions the use of the 'text2image' application that comes with Tesseract's training tools to generate these images and files. The focus is on creating a ground truth for Tesseract to learn from, which involves using a large text file and a script to split it into single-line text files, each generating an image and a corresponding box file.
📂 Organizing Data and Training Setup
The speaker discusses the folder structure and setup required for training Tesseract with custom data. They mention creating a 'data' folder and a subfolder named after the model, in this case, 'Apex'. The paragraph details the process of using a script to split a large text file into single-line files, generating images and box files for each. The speaker also covers the use of a specific font, downloaded from the web, to be used in the training process. They explain the command to initiate the training process, which includes specifying the model name, using English as a base model, and setting the number of training iterations. The speaker emphasizes the importance of not overfitting the network by experimenting with the number of iterations and suggests generating more line files for training to improve results.
🔧 Fine-Tuning the Training Process
In this paragraph, the speaker delves into the training process, explaining how to use the tesstrain Makefile to run training commands with specified variables such as the model name and tessdata folder. They discuss starting with a high error rate and reducing it through iterations, suggesting an increase from 100 to 400 iterations for better results. The speaker also talks about generating more line files for training to improve the model's accuracy and provides a command to evaluate the trained model. They emphasize the importance of understanding the process rather than just using a script with arguments, encouraging viewers to modify the script to suit their needs and contribute to its improvement.
🏁 Wrapping Up and Further Customization
The final paragraph wraps up the training process, explaining how to use Tesseract with the tessdata folder and the trained model. The speaker discusses the possibility of training with different fonts by installing them and specifying them in the training command. They also mention the potential for optimizing the image size and character spacing for better training results. The speaker invites viewers to download the provided script, make changes, and contribute to its improvement. They conclude by encouraging viewers to ask questions, subscribe for more content, and engage with the community for support and further learning.
Keywords
💡Tesseract
💡Ground truth
💡Box file
💡Text2image
💡Line images
💡Training text
💡Python script
💡Model training
💡Iterations
💡Error rate
💡Custom font
Highlights
Introduction to training Tesseract with custom fonts for improved recognition.
Explanation of providing ground truth to Tesseract for custom font training.
Generation of images with custom fonts and corresponding text and box files.
Documentation review for training Tesseract with TIFF or PNG images and .gt.txt ground-truth files.
Challenges with Tesseract's automatic generation of box files, and the solution of generating them ourselves.
Using the text2image application to generate images compatible with Tesseract 5.
Script development to automate the generation of single-line text files and corresponding images.
Use of the langdata_lstm repository for training text and creation of a langdata folder.
Customization of the script to generate images with specific fonts like Apex Legends font.
Folder structure explanation for organizing training data and scripts.
Command crafting for training Tesseract with custom data and model names.
Importance of not overfitting the network and the iterative training process.
Evaluation of the trained model using a single line of text and its accuracy.
Availability of the script as a repository for easy access and modification.
Encouragement for users to understand and modify the script for their custom needs.
Discussion on generating ground truth data manually for scenarios, such as handwriting, where it cannot be auto-generated.
Practical tips on optimizing image size and character spacing for better training results.
Final thoughts on the training process, potential improvements, and community engagement.
Transcripts
Hello there. So you decided to train Tesseract with your custom font so it recognizes things a little bit better? Then this is the video for you, so let me jump right into it.

Let's first start with how you train Tesseract at all. I'm going to show a little bit of the documentation and fill in the gaps, basically. I think the most important thing is how you even provide the ground truth, that is, how you tell Tesseract what you consider correct. Basically, what we want to do for a custom font is generate images with that custom font, and attached to the same file name you generate a text file that describes what is written on the image, plus a box file that describes, for each character on the image, where it is located and which character it is. With that ground truth, Tesseract can go and train itself on the new images. I think that's the first thing you'll be trying to figure out, so let's start.
You can see here in the documentation that it says you need a TIFF or PNG file for the image I just mentioned, and a .gt.txt file for the ground truth, which is whatever is written on the image. I'm going to show later how exactly that looks. But you also have to provide something called the box file. In my experience, Tesseract's automatic generation of box files can be a little bit finicky, and since we're generating the ground truth ourselves from a custom font, we're going to produce all three ourselves: the box file, the text file, and the image file. So let me show you how exactly that looks.
The first thing I had to do was figure out a way to generate those images. It turns out you use the text2image application, which comes with the Tesseract training tools; watch my video in the description if you missed that. After you install Tesseract with its training tools, you're going to have text2image on your path.

The problem with text2image is that it generates images in a way that was compatible with Tesseract 4, but Tesseract 5 needs something called line images, and those are nothing more than images containing just a single line of text instead of a full page.

So the first thing I had to do was find training text, some corpus on the web we can grab text from. In this case I used the training text from the langdata_lstm repository: if you go to its English folder, you'll see the eng.training_text file, which is just a big file full of English text. That's where I got it from. The problem is that text2image takes all of this text and generates an insane number of pages and pages of text, and we don't want that; remember, Tesseract 5 wants line images. So I wrote a quick script that takes this training text file and splits it into a bunch of files that have only one line each. I'm going to show this in a second. After you generate those files, you can run text2image just fine, and it will generate the box file and the image itself for you.
Cool. So the first thing I did was create a langdata folder, which you can see at the top left here, and all it is is this folder with those files; we're actually only going to use the training text and the unicharset, and that's about it. Then I wrote this Python script, which will also be available in a repository I'll link in the description. You can just run it and it does exactly what I said: it takes the big text file, takes each line, creates a separate text file for it, and then calls text2image on that new file to generate everything for us.

Let me explain some things here. We're going to use the langdata unicharset to generate the new images. The unicharset is basically the rules of English; it helps the neural network figure out exactly how words are formed in a given language, English in this case. You can use whatever you want here; there's plenty available.

The arguments I mostly took from the tesstrain.sh file that Tesseract 4 used. The only things I changed were the ysize, because instead of a full page it's now just a small line, the character spacing, so it's not too tight, and a single page, of course. In this case, I didn't mention it yet, but I'm going to use the Apex Legends font, which I just downloaded from the web. Let me see if I can find it: it's a Regular OTF file I downloaded from the first Google search result I found, and that's what I'm going to try using.
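The text2image call the script makes for each line file can be sketched like this. The flags shown (`--text`, `--outputbase`, `--font`, `--fonts_dir`, `--unicharset_file`, `--max_pages`, `--ysize`, `--char_spacing`, `--exposure`) are real text2image options, but the exact numeric values here are assumptions, not necessarily the video's:

```python
def build_text2image_cmd(gt_file, font, fonts_dir,
                         unicharset="langdata/eng/eng.unicharset"):
    """Compose the text2image invocation for one single-line .gt.txt
    file. One page, a small ysize, and loosened character spacing
    mirror the video's choices; exact values are illustrative."""
    outputbase = gt_file[: -len(".gt.txt")]  # .tif and .box share this stem
    return [
        "text2image",
        f"--text={gt_file}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
        f"--unicharset_file={unicharset}",
        "--max_pages=1",       # a line image, not a full page
        "--ysize=480",         # smallest height that did not fail in the video
        "--char_spacing=1.0",  # loosen spacing so glyphs do not touch
        "--exposure=0",
    ]
```

Running the returned list through `subprocess.run` would produce the `.tif` image and `.box` file next to each ground-truth text file.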
Okay, so let me show you. I'm going to walk through exactly how my whole folder structure is laid out and how everything works in a second, but for now let me just run the script to give you an idea of what's going on. First, I'm going to create the data folder and the ground-truth folder. I think the best documentation for training I've found is the readme of the tesstrain repository; instead of some high-level ideas, it actually tells you exactly what to do, and you get training going. One thing it says is that you have to create a folder called data, and inside it a folder named after the model you're going to train, suffixed with "-ground-truth"; in my case that's Apex-ground-truth, so that's what I created here.

Now if I run split_training_text.py, it should do exactly what I described: it takes the training text file, which is an insane number of words, splits it into a bunch of small single-line text files, and generates an image and a box file for each. Let me show you how that looks. This is Windows WSL with Ubuntu, by the way, so I'm going to copy those files over to my actual Windows machine. That looks good; I'm going to go to my download folder, and done.
So this is what those files look like. You can see it's just an image with a single line of words from the corpus, and this is the text file I generated with the same content. Most importantly, it also generates a box file. The box file is basically each letter and its exact position on the image, and I imagine size, rotation, scale, maybe, I don't know. You can see "Windows" written here vertically, so this is each letter, then a space, then "we", and so on. This is all generated by text2image; there's no Tesseract here, nothing running, no machine learning, anything. This is just us generating a ground truth, something we know is right and completely trust, not generated by AI, and we're going to use that to train. Okay, cool. So let's begin training, I guess.
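The box-file layout the speaker describes is simple enough to parse by hand. Each line reads `<glyph> <left> <bottom> <right> <top> <page>`, with pixel coordinates measured from the bottom-left corner of the image. A small sketch (the sample coordinates in the usage test are made up):

```python
from typing import NamedTuple


class BoxEntry(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_file(text):
    """Parse the contents of a Tesseract .box file into BoxEntry
    tuples, one per glyph on the line image."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # split from the right so a space glyph at the start survives
        char, l, b, r, t, page = line.rsplit(" ", 5)
        if char == "":
            char = " "  # the glyph itself was a space
        entries.append(BoxEntry(char, int(l), int(b), int(r), int(t), int(page)))
    return entries
```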
Let me go to tesstrain, and this is what it looks like. I crafted this command, which I'll also put in the description of the video; it's going to be this bad boy. Okay, so, explaining some things here. It uses the tessdata folder inside the Tesseract repository. What is that? Let me show you. I basically cloned Tesseract, and inside the repository there's a folder called tessdata with some default English network settings. This eng.traineddata file wasn't there originally, I put it there, but everything else was, including the configs folder, which is what we're actually after, with the lstm.train config and a bunch of others. So basically, if you git clone Tesseract, this folder will be in there, but it's missing the eng.traineddata file, so we also have to get that. To get it, you clone the tessdata_best repository, and in that repository you'll find the traineddata file; you just take it and put it inside the tessdata folder of your Tesseract clone. This whole folder structure is going to be on GitHub again, so don't worry too much about the exact layout; just keep watching the video so you understand what I'm doing, and then you can arrange your own folder structure however you want.
Okay, so back to explaining exactly what I'm doing with that command. We're saying this Tesseract tessdata directory is where some of the tessdata lives, like the English model and so on, and then we're going to run this Makefile here. Oh right, this is the tesstrain repository, the one I was talking about with the good readme, so you definitely want to clone that as well. Again, it's going to be on my GitHub, but you would clone it yourself if you want to do this from scratch. Inside tesstrain there's a Makefile, and that Makefile runs a bunch of commands, one of which is the training command. Then there are a bunch of variables you can specify: I set the model name to Apex and the start model to English, which means I'm going to train on top of the English model. Then, for some reason, it also needs the tessdata folder again, same path as before. And then I set max iterations to 100. That means I begin with English and run 100 iterations, which is very low, really not a lot. You should definitely bump that to something that finishes within a reasonable amount of time; you have to experiment, but something like ten or twenty thousand might be good. You don't want to overfit, so don't overdo it either. The best thing here is to experiment: look at the results, see what you can improve, and so on. Let me actually do a little more; I'm going to do 400 iterations so you can see the progress.
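The make invocation described above boils down to a handful of tesstrain Makefile variables. A sketch that assembles it (`MODEL_NAME`, `START_MODEL`, `TESSDATA`, and `MAX_ITERATIONS` are real tesstrain variables; the tessdata path is a placeholder for wherever you cloned Tesseract):

```python
def build_training_cmd(model_name="Apex", start_model="eng",
                       tessdata="../tesseract/tessdata",
                       max_iterations=400):
    """Compose the 'make training' call against the tesstrain
    Makefile. START_MODEL=eng fine-tunes on top of the English
    model; TESSDATA must contain eng.traineddata."""
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
```

The list is what you would hand to `subprocess.run` from inside the tesstrain checkout, where the Makefile and the `data/Apex-ground-truth` folder live.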
You begin with a 66 percent error rate, which is far from optimal, so you can see 100 iterations is really not great. I bumped this to 400; let's see if that improves anything. Now it's at an error rate of 53, so that's what I'm saying: if you let your computer run this for hours, maybe even a couple of days, you can probably get a very low error rate. Again, just be careful not to overfit the network. One thing you can do, instead of generating a hundred files, is generate far more line files for the training data, and I'll show that later. I can tune the script I created: instead of going through this whole file, which has a total of 193,000 lines, I only did a hundred, because we have to finish within the video. But if you remove this limitation, if you go into my script and comment out those few lines, it will generate 193,000 images for your network to train on, which might give you better results. You can see the error rate dropping roughly 10 each time, so if I kept going for many more iterations, it would get a really good result.
And if I try actually evaluating it, this is the command to evaluate. I'm going to run Tesseract on a test image and print to standard output, using as the tessdata directory the data folder we just created, which now contains, check it out, Apex.traineddata. This is our model, the finished model from the newly generated data. This file is just a single line of text, which we know it is, and I'm going to set the language to the one we just created, Apex, and set the log level, and you can see it works pretty well.
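That evaluation call can be sketched as follows. The `--tessdata-dir`, `-l`, and `--loglevel` flags are real Tesseract 5 options; the `line_0.tif` file name and the `OFF` log level are assumptions, since the video's exact values are garbled in the transcript:

```python
def build_eval_cmd(image, tessdata_dir="data", lang="Apex"):
    """Compose the tesseract call used to sanity-check the new
    model: recognize one line image and print the text to stdout.
    --tessdata-dir points at the folder holding Apex.traineddata."""
    return [
        "tesseract", image, "stdout",
        "--tessdata-dir", tessdata_dir,
        "-l", lang,
        "--loglevel", "OFF",  # assumption: silence warnings for clean output
    ]
```

Comparing this output against the same image recognized with `-l eng` is how the video demonstrates the improvement.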
So that's kind of it. I'm going to get this repository up at the URL so you can just download it and start messing with the script. One of the reasons I didn't add any command-line arguments to this script is because I want you to change it. I want you to go in and change the output directory, the name of the model, the name of the font. I don't want to create a tool that someone just calls with a bunch of arguments, calls it a wrap, and when it fails has no idea why. I want you to understand what's going on here, so you can do far more than just train on your custom font. You can do more: you can use ground truth that isn't generated by text2image, ground truth generated by yourself, by a human. There are many, many options here.
So now let me explain exactly how it works. Let me see, where do I even begin? I think I explained the box file well: you have the image, a ground-truth text file containing what the image has written in it, and a box file that says, for each character, where it is positioned on the image and which character it represents. That's basically how you train Tesseract, period, not just for a custom font, for anything really. The trick is how you get the ground truth. In our case, since it's just a custom font, we can generate it automatically, robotically, and then train on a massive amount of data. But say you want to train on handwritten data: you cannot automatically generate handwriting, so you'd have to scan things and so on, and you would write the ground-truth text files yourself, and not only that, you would also have to produce the box files. There are tools for that: if you Google "box file generator Tesseract" you should find some applications that can do that for you. And that's kind of the gist of it.
Let me see what else I can explain. There's tesstrain, and I think that's basically the gist of it: it's Tesseract with the tessdata folder, and you put eng.traineddata inside; that's basically all you do. The langdata you get from the English folder of the langdata_lstm repository. So I think that wraps it up. If you want to train a different font than the one I trained, you can see I specified the font as the Apex one, and here you just replace it; you can select any font you want. And how do you install that font? In my case it was something like going to ~/.local/share/fonts and putting the font file there, the one I downloaded from the web, and then running fc-cache -f -v, which forces Ubuntu to re-evaluate the font cache. It finds the Apex Legends font and caches it, and now it's a recognized font. On Windows, you just double-click a font and you'll see there's an install button; you hit install, and after installing the font you can specify it here and use it to train.
I'm giving a bunch of extra tips here as well. You can see I generated the images a little wider than necessary, so if you really want to optimize, you could reduce the image size a bit; there's a lot of white space on the right you could remove. You can see there's also a lot of space below, but I couldn't remove that: text2image was failing with anything too small, so 480 worked and I left it at that. You could also change the exposure and the character spacing; you can see the characters are very spaced out in my generation, there's a lot of space between them, and you could reduce that. Overall, whatever you want.

That's kind of the gist of it; there's not a lot of secret here. Feel free to download my repository and change it as you wish, and if you want to make an improvement, feel free to make changes and submit a pull request. I hope this was helpful. If you have any problems or questions, let me know in the comments below and I'll try to reply in a timely manner. Please subscribe if you want to see more content, leave a like, and godspeed.