Extract Key Information from Documents using LayoutLM | LayoutLM Fine-tuning | Deep Learning

Karndeep Singh
28 Mar 2022 · 28:40

Summary

TLDR: This YouTube video tutorial introduces LayoutLM, a state-of-the-art model for understanding document layouts and extracting entities. It covers the limitations of traditional OCR-plus-NER pipelines and explains how LayoutLM combines text, positional, and image information for more accurate document processing. The presenter uses the FUNSD dataset to fine-tune the model for key-value pair extraction, walking through data preparation with tools like Label Studio, model training, and inference with Hugging Face's Transformers library.

Takeaways

  • πŸ“„ The video introduces LayoutLM, a document understanding model that excels at extracting entities from structured documents.
  • πŸ” Traditional OCR and NER methods struggle with changing document structures, whereas LayoutLM considers both text and layout for better accuracy.
  • πŸ’Ύ The script uses the FUNSD dataset to demonstrate how LayoutLM can extract key-value pairs from documents.
  • πŸ–ΌοΈ LayoutLM processes images of documents, identifies text, and determines the position of each word within the image.
  • πŸ”Ž The model generates embeddings that incorporate both text and positional information to understand document structure.
  • πŸ”§ A Faster R-CNN model is used in conjunction with LayoutLM to detect regions of interest within the document images.
  • πŸ“Š The video outlines the architecture of LayoutLM, explaining how it handles text, positional, and image embeddings.
  • πŸ› οΈ The tutorial covers the steps to train LayoutLM using the 'funds' dataset, emphasizing the importance of maintaining document structure.
  • πŸ“ˆ The presenter demonstrates how to preprocess data, train the model, and evaluate its performance, achieving 75% accuracy with five epochs.
  • πŸ”— The video provides a link to a GitHub repository containing the code for preprocessing and training the LayoutLM model.
  • πŸ”Ž The final part of the script shows how to use the trained LayoutLM model to infer and extract information from new, unstructured document images.

Q & A

  • What is the main topic of the video?

    -The main topic of the video is about a document understanding model called LayoutLM, which helps in understanding documents and extracting relevant entities.

  • What does LayoutLM do differently compared to traditional OCR and NER?

    -LayoutLM takes into account more information than just text from OCR and named entity recognition (NER). It considers the layout and structure of the document to better understand and extract entities.

  • What kind of data set is used to demonstrate LayoutLM in the video?

    -The dataset used in the video is FUNSD, a collection of scanned forms used here to extract relevant information such as key-value pairs from documents.

  • How does LayoutLM handle documents where the structure keeps changing?

    -LayoutLM keeps the layout information of a document intact, so entities can still be extracted when the document structure changes, a situation where a plain OCR-based pipeline tends to fail.

  • What are the three key pieces of information that LayoutLM uses for training?

    -LayoutLM uses text information, the position of the text in a particular image, and the image embedding itself as the three key pieces of information for training.

  • What role does the Faster R-CNN model play in the LayoutLM architecture?

    -The Faster R-CNN model helps detect the region of interest where the words are located within the document.

  • How does the video demonstrate the process of training the LayoutLM model?

    -The video demonstrates training the LayoutLM model by fine-tuning it on the FUNSD dataset and evaluating its performance with metrics such as loss, precision, recall, and F1 score.

  • What is the significance of the unique labels in the training process?

    -The unique labels are significant as they represent the different classes or categories that the model needs to learn to identify and classify during training.

  • How can one improve the accuracy of the LayoutLM model as shown in the video?

    -One can improve the accuracy of the LayoutLM model by increasing the number of training epochs, which allows the model more opportunities to learn from the data.

  • What is the final output the video aims to achieve using LayoutLM?

    -The video aims to produce a trained LayoutLM model that can extract and annotate information from structured documents, such as invoices, with high accuracy.

Outlines

00:00

πŸ“„ Introduction to LayoutLM Model

The speaker introduces the LayoutLM model, a state-of-the-art document understanding model that excels at extracting entities from various document types. Traditional methods like OCR followed by NER are described as less effective because they cannot handle changing document structures. The speaker contrasts this with LayoutLM's ability to understand and extract information while preserving the document layout. The FUNSD dataset is introduced to demonstrate the model's application in extracting key-value pairs from documents, which is particularly useful in the finance and retail sectors. The limitations of OCR for processing structured documents are discussed, emphasizing the need for a model like LayoutLM that can handle layout variations.

05:02

πŸ–₯️ LayoutLM Architecture and Data Processing

The speaker delves into the architecture of the LayoutLM model, explaining how it processes document images. The model relies on OCR to extract the text and word positions, creating embeddings that include both text and positional information. These embeddings are combined with image embeddings from a Faster R-CNN model that detects the regions of interest. The result is a comprehensive set of features that the LayoutLM model uses for training. The speaker also discusses using the FUNSD dataset for training, the need for GPU resources, and the installation of the necessary libraries. The process of downloading and preparing the data is outlined, including the use of Hugging Face's resources and the structure of the dataset.
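
As a concrete illustration of the positional input described above, here is a minimal sketch of how OCR word boxes are typically normalized to the 0-1000 coordinate range that LayoutLM expects. The helper name and box format are illustrative assumptions, not code shown in the video.

```python
def normalize_box(box, page_width, page_height):
    """Scale an absolute (x0, y0, x1, y1) pixel box to LayoutLM's 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# Example: the word "Date" found at pixel box (120, 80, 200, 110) on a 1000x1414 page
print(normalize_box((120, 80, 200, 110), 1000, 1414))  # -> [120, 56, 200, 77]
```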

10:04

πŸ—οΈ Building and Training the LayoutLM Model

The speaker describes the process of building and training the LayoutLM model using annotated datasets. The dataset includes text, bounding box information, and labels for various elements within the document images. The speaker explains how to use tools like Label Studio to annotate documents and prepare datasets. The importance of understanding document structure for accurate information extraction is emphasized. The speaker also provides a link to Label Studio and discusses the steps for preparing the dataset, including pre-processing and mapping labels to ID codes.
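
A minimal sketch of the label-to-ID mapping step mentioned above, assuming a labels.txt file with one class name per line (the file name follows the transcript; the exact format and variable names are assumptions):

```python
# Read the unique labels (e.g. O, B-QUESTION, I-QUESTION, B-ANSWER, ...) from labels.txt
with open("labels.txt", encoding="utf-8") as f:
    labels = [line.strip() for line in f if line.strip()]

label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

num_labels = len(labels)
print(num_labels, label2id)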

15:05

πŸ” Data Preparation and Model Training

The speaker outlines the steps for preparing the data for training: a LayoutLM tokenizer from the Transformers library, dataset classes from the LayoutLM source code, and PyTorch data loaders. The process involves mapping labels to ID codes and converting the dataset into a format suitable for training. The speaker then covers the training itself, using the token classification class from the Transformers library. Training is demonstrated with five epochs, and the speaker notes that more epochs could improve accuracy.
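
This is not the notebook from the video, but a minimal sketch of the fine-tuning loop it describes, assuming `num_labels` from the label-map sketch above and a `train_dataloader` whose batches expose `input_ids`, `attention_mask`, `token_type_ids`, `bbox`, and `labels` tensors (those names are assumptions):

```python
import torch
from torch.optim import AdamW
from transformers import LayoutLMForTokenClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=num_labels  # num_labels from labels.txt
)
model.to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(5):                      # the video trains for five epochs
    for batch in train_dataloader:          # assumed PyTorch DataLoader over the FUNSD features
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(
            input_ids=batch["input_ids"],
            bbox=batch["bbox"],             # word boxes scaled to the 0-1000 range
            attention_mask=batch["attention_mask"],
            token_type_ids=batch["token_type_ids"],
            labels=batch["labels"],
        )
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```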

20:06

πŸ“Š Evaluating and Saving the Model

After training, the speaker evaluates the model on the test dataset. The evaluation metrics, including loss, precision, recall, and F1 score, are presented, showing an F1 score of about 75% after five epochs of training. The speaker suggests that training for more epochs could further improve the results. Saving the trained model with PyTorch's `torch.save` method is also covered.
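
A hedged sketch of the evaluation-and-save step, continuing the variables from the previous sketches (`model`, `device`, `id2label`) and assuming an `eval_dataloader` over the test split. The metric computation uses the seqeval package, a common choice for token-level precision/recall/F1, not necessarily the exact code in the video; the checkpoint name layoutlm.pt mirrors the one shown on screen.

```python
import torch
from seqeval.metrics import precision_score, recall_score, f1_score

model.eval()
all_preds, all_refs = [], []

with torch.no_grad():
    for batch in eval_dataloader:                    # assumed test-split DataLoader
        batch = {k: v.to(device) for k, v in batch.items()}
        logits = model(
            input_ids=batch["input_ids"],
            bbox=batch["bbox"],
            attention_mask=batch["attention_mask"],
            token_type_ids=batch["token_type_ids"],
        ).logits
        preds = logits.argmax(dim=-1).cpu().numpy()
        refs = batch["labels"].cpu().numpy()
        for p_row, r_row in zip(preds, refs):
            keep = r_row != -100                      # skip padding/special tokens (label id -100)
            all_preds.append([id2label[i] for i in p_row[keep]])
            all_refs.append([id2label[i] for i in r_row[keep]])

print("precision", precision_score(all_refs, all_preds))
print("recall   ", recall_score(all_refs, all_preds))
print("f1       ", f1_score(all_refs, all_preds))

# Save the fine-tuned weights as a state dict, as done in the video
torch.save(model.state_dict(), "layoutlm.pt")
```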

25:08

πŸ”Ž Inferencing with the Trained Model

The speaker explains how to use the trained LayoutLM model for inferencing on new images. This involves cloning a GitHub repository for preprocessing steps, installing Python Tesseract for OCR processing, and using the trained model to make predictions on new documents. The speaker demonstrates the process of loading the model, processing an image, and visualizing the predictions. The model's ability to classify different elements of the document, such as questions and answers, is shown. The speaker concludes by encouraging viewers to train their own models and seek further clarification in the comments if needed.
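
A minimal inference sketch in the same spirit: OCR a new page with pytesseract, build token boxes, and run the fine-tuned model. It is a simplified stand-in for the repository's preprocessing script, reusing `normalize_box`, `model`, `device`, and `id2label` from the earlier sketches; the image file name is hypothetical.

```python
import pytesseract
import torch
from PIL import Image
from transformers import LayoutLMTokenizer

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")

image = Image.open("new_invoice.png").convert("RGB")   # hypothetical input image
width, height = image.size

# OCR: words plus pixel boxes via Tesseract
ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
words, boxes = [], []
for text, left, top, w, h in zip(ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]):
    if text.strip():
        words.append(text)
        boxes.append(normalize_box((left, top, left + w, top + h), width, height))

# Tokenize word by word, repeating each word's box for its sub-tokens
input_ids, token_boxes = [tokenizer.cls_token_id], [[0, 0, 0, 0]]
for word, box in zip(words, boxes):
    ids = tokenizer.encode(word, add_special_tokens=False)
    input_ids.extend(ids)
    token_boxes.extend([box] * len(ids))
input_ids.append(tokenizer.sep_token_id)
token_boxes.append([1000, 1000, 1000, 1000])

input_ids = torch.tensor([input_ids])
bbox = torch.tensor([token_boxes])
attention_mask = torch.ones_like(input_ids)

model.eval()
with torch.no_grad():
    logits = model(input_ids=input_ids.to(device), bbox=bbox.to(device),
                   attention_mask=attention_mask.to(device)).logits
pred_labels = [id2label[i] for i in logits.argmax(dim=-1)[0].tolist()]
print(list(zip(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()), pred_labels)))
```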

Keywords

πŸ’‘LayoutLM

LayoutLM is a state-of-the-art model for document understanding, designed to process document images and extract relevant entities while preserving the layout information. It is integral to the video's theme as it is the primary technology being discussed and demonstrated. The script uses LayoutLM on the FUNSD dataset to extract key-value pairs from documents.

πŸ’‘OCR (Optical Character Recognition)

OCR is a technology that scans printed or written text from documents and converts it into machine-readable text. In the video, OCR is initially used to extract text from documents, but it's noted that it falls short when dealing with complex layouts, which is where LayoutLM excels as it takes into account not just the text but also the layout.
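
For context, plain OCR on a page can be as simple as the snippet below (pytesseract is the OCR tool used later in the video; the file name here is illustrative). Note that the output is a single stream of text with no layout information, which is exactly the limitation the video points out.

```python
import pytesseract
from PIL import Image

# Plain OCR: returns one block of text, reading the page roughly line by line
text = pytesseract.image_to_string(Image.open("scanned_form.png"))  # hypothetical file
print(text)
```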

πŸ’‘NER (Named Entity Recognition)

NER is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into predefined categories such as person names, organizations, locations, etc. The script discusses how NER is used in conjunction with OCR to extract entities, but it is also mentioned that this process can be improved with LayoutLM.
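
And a traditional NER pass over that flat OCR text might look like the sketch below; spaCy and its small English model are an illustrative choice here, not tools used in the video. The entities are found, but their position on the page is lost.

```python
import spacy

# Generic NER over flat OCR text (requires the en_core_web_sm model to be installed)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Date: March 28, 1995. Supervisor: John Smith, R&D Department.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "March 28, 1995" DATE, "John Smith" PERSON
```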

πŸ’‘Faster R-CNN

Faster R-CNN is a deep learning model for object detection in images. In the context of the video, it is used alongside LayoutLM to detect regions of interest within a document image, which aids in the extraction of relevant entities. It is part of the architecture that enables LayoutLM to understand the document's structure.

πŸ’‘Text Embeddings

Text embeddings are a representation of textual data into a vector space that allows for complex numerical operations. In the video, text embeddings are generated by LayoutLM by combining text information and positional information, which are then used to understand the context and meaning of the words in documents.
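
A simplified, illustrative sketch of this idea (not LayoutLM's actual implementation): each word embedding is summed with embeddings of its box coordinates, so identical words at different positions get different representations.

```python
import torch
import torch.nn as nn

class ToyLayoutEmbedding(nn.Module):
    """Toy illustration: word embedding + embeddings of the (x0, y0, x1, y1) box coordinates."""
    def __init__(self, vocab_size=30522, hidden=768, coord_range=1001):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)
        self.x = nn.Embedding(coord_range, hidden)   # shared table for x0 and x1
        self.y = nn.Embedding(coord_range, hidden)   # shared table for y0 and y1

    def forward(self, input_ids, bbox):
        x0, y0, x1, y1 = bbox[..., 0], bbox[..., 1], bbox[..., 2], bbox[..., 3]
        return (self.word(input_ids)
                + self.x(x0) + self.y(y0)
                + self.x(x1) + self.y(y1))

emb = ToyLayoutEmbedding()
ids = torch.tensor([[2001, 2002]])                               # two token ids
boxes = torch.tensor([[[120, 56, 200, 77], [300, 56, 380, 77]]])  # their 0-1000 boxes
print(emb(ids, boxes).shape)                                      # torch.Size([1, 2, 768])
```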

πŸ’‘Dataset

A dataset in this context refers to a collection of documents used to train and test the LayoutLM model. The script specifically uses the FUNSD dataset to demonstrate how LayoutLM can extract information from documents.

πŸ’‘Fine-tuning

Fine-tuning is a machine learning technique in which a pre-trained model is further trained on a specific task. The video describes fine-tuning the LayoutLM model on the FUNSD dataset to adapt it to the task of extracting key information from documents.

πŸ’‘Token Classification

Token classification is the task of classifying each token (word or subword unit) in a sequence into a predefined set of categories. In the video, token classification is used as the method for the model to predict the class of each word in the document, such as 'question', 'answer', or 'other'.
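
As a tiny illustration of how token-classification output is read (names like id2label follow the earlier sketches and are assumptions): the model emits one score vector per token, and the predicted class is simply the argmax mapped back to its label string.

```python
import torch

# logits: (batch, sequence_length, num_labels), as returned by a token classification head
logits = torch.randn(1, 4, 4)                                        # dummy scores: 4 tokens, 4 classes
id2label = {0: "O", 1: "B-HEADER", 2: "B-QUESTION", 3: "B-ANSWER"}   # illustrative label set

pred_ids = logits.argmax(dim=-1)[0].tolist()          # best class per token
print([id2label[i] for i in pred_ids])                # e.g. ['B-QUESTION', 'O', 'B-ANSWER', 'O']
```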

πŸ’‘Inference

Inference in machine learning refers to the process of making predictions or decisions based on a trained model. The video script describes inference as the final step where the trained LayoutLM model is used to predict the class of words in a new, unseen document.

πŸ’‘Hugging Face

Hugging Face is an organization known for its contributions to natural language processing, including the Transformers library. In the video, Hugging Face's resources are used to import the LayoutLM model directly and to access the FUNSD dataset for training and demonstration.

πŸ’‘Label Studio

Label Studio is a data annotation tool mentioned in the script for preparing datasets with annotated information. It is used to label the data, which is essential for training models like LayoutLM to understand and classify different parts of documents.

Highlights

Introduction to LayoutLM model for document understanding and entity extraction.

LayoutLM is a state-of-the-art model that processes documents more effectively than traditional OCR and NER methods.

LayoutLM considers both text and layout information for document understanding.

Demonstration of the FUNSD dataset used for extracting key-value pairs from documents.

LayoutLM's ability to handle documents with changing structures where OCR might fail.

Explanation of how LayoutLM preserves document structure during entity extraction.

Architecture of LayoutLM, including OCR and positional embeddings.

Role of Faster R-CNN in detecting regions of interest within documents.

Process flow of information within LayoutLM for extracting entities.

Importance of text positioning information in image documents for accurate extraction.

How LayoutLM training is facilitated using the Hugging Face library.

Demonstration of data extraction and preparation for training the LayoutLM model.

Use of Label Studio for annotating and preparing datasets for training.

Explanation of the data processing steps required for training and inferencing with LayoutLM.

Training process of the LayoutLM model using the FUNSD dataset.

Evaluation of the trained LayoutLM model's performance on test data.

Instructions on saving the trained LayoutLM model for future use.

Inferencing process using the trained LayoutLM model on a new document image.

Potential applications of LayoutLM in finance, retail, and invoice processing.

Final thoughts on the power and utility of the LayoutLM model for document understanding.

Transcripts

00:00

Hello all, and welcome to my YouTube channel. Today in this video we are going to look at a very good document understanding model, the LayoutLM model, which helps us understand documents and extract the relevant entities from them. It is a state-of-the-art model that lets us process documents in a very easy way. Earlier, to extract the relevant entities from tabular data, from a document, or from any kind of unstructured data, we would run OCR on the documents, then run NER to extract the relevant entities, and then do the necessary processing to get the results into the required format. LayoutLM takes in much more information than just the OCR text and NER output when understanding a document and extracting its entities.

Before we go into the LayoutLM architecture and how it helps us, I want to show the dataset we are going to use, which is called FUNSD. From this dataset we want to extract the kind of information present in these documents, often called key and value pairs, just as we would want to extract information from a document in a real-world scenario. This is helpful in the finance world, in retail, or anywhere you want to extract information from invoices. If you want to extract this kind of information from table-like data, plain OCR tends to fail, because OCR starts reading the data in a single-line format, and applying NER on top of that becomes very difficult. Keeping the structure of the document intact is where LayoutLM helps us. Just running OCR on this kind of document and then doing NER will not work in a real scenario, because the document structure keeps changing and OCR will keep producing new orderings, so any downstream job finds it very difficult to extract the relevant entities; it does not take care of the layout information, and the layout keeps changing from document to document. To preserve the information about the structure, or layout, of the document, LayoutLM keeps that information intact, and that is how the model is able to extract information as shown here. So this is the dataset we are going to use for training our LayoutLM model.

Now let's jump to the LayoutLM architecture; this is a brief introduction to it. You can see the architecture of the LayoutLM model: here is a document image with information laid out on it, and we want to get the relevant information from this page. The model takes this document and we pre-process it, applying OCR on it, and with OCR we also get the position of each word in the document. Suppose I extract a word from this document; I also want to record where that word appears in the image, so its bounding-box information is stored as well. This is the information LayoutLM generally takes: it extracts the words from the page, and it also takes the position of those words in the image. This information is passed on, and the OCR output becomes text embeddings and positional embeddings; the positional embeddings are nothing but the location information of a particular word, and the text is the word itself as it appears in the document.
05:02

Take the word "date", for example: it is a piece of text, and it also has a particular position where it appears in the image. LayoutLM takes that positional information along with the text embedding. The model prepares an embedding that considers both of these, the text and its position; positional embeddings are nothing but the position of a particular word in the image. These two pieces of information are passed into the LayoutLM model and a LayoutLM embedding is generated, containing the text information and the corresponding position of that word in the image. That is how the LayoutLM embedding is produced. In parallel, the image is also forwarded to a Faster R-CNN model, which helps us detect the regions of interest where the words lie. You can see the word "date" and the corresponding image region of that word being captured by the Faster R-CNN model: this is the layout embedding of the word "date", and this is the cropped image of the word "date"; an embedding of that image region is prepared, and the image embeddings and the layout embeddings are added up and then used for the downstream tasks. So this is how information flows in a LayoutLM model: it takes three pieces of information, the text, the position of that text in the image, and the image embedding itself, and with those three inputs the model is trained. That is the general architecture of the LayoutLM model, and this extra layout information about where the text sits in the image helps us understand the structure of the page and extract the relevant entities from it with good accuracy.

Now, to demonstrate how this works, we are going to use the FUNSD dataset and fine-tune the LayoutLM model on it. For that I am going to use the model available on Hugging Face, so we can use it directly. Before that, we have to make sure GPUs are available in our environment, and then we import the libraries required for installing the LayoutLM model; these are the dependencies we need to install. Once all the libraries are installed we can proceed with the data. Data extraction is a process in itself, but here I am simply using the data available through Hugging Face, so you can download it with the link given here. Let me run this, and once it is downloaded we can use this dataset for fine-tuning the LayoutLM model. We will then be able to extract the entities from the image directly, without a two-step or three-step pipeline: the model takes the structure information, the text information, and the image together, trains on them, and gives us the embeddings. That is the flow. Now we fetch the data, and once that is done we can move on to preparing it. You can see the data has been downloaded, and we can take a look at one of the downloaded images. The data is in a structured format, and we want to extract this name and its value, the date and its value, supervisor, manager, and so on. You can understand that this structure, the way a document is laid out, might change, and accordingly we have to make sure the model also learns the structure of the document.
10:04

That is how the layout model helps us understand the structure as well as the information present within it. Let's look at the dataset, which is already annotated. In the dataset I get a piece of text, "R&D", along with its bounding-box information, that is, its location in the image: you can see where "R&D" appears on the page and its coordinates in the image. You can also see the label it has been given, "other", so this word has been labeled with the "other" tag, as if we were assigning it to the "other" class, and the word information is stored like this. There is no link here; we are not providing any relation between two words. It is just a single token classification model that we are going to build: it takes a word and classifies it into a particular category such as "other", "question", or one of the other classes. That is what we are going to produce, and likewise we will build the whole model.

Now let's look at a particular image and draw this information over it, the labels we have just seen in text form. This is the same image I showed above, but now with the annotations: this word has been classified with the "other" tag, these are the headers, this is a question, and this is the answer; similarly, this date field has been annotated as a question and this as its answer. These are the classes annotated with a labeling tool, so we have to use a tool to annotate this kind of information and prepare the dataset accordingly. You might now understand what we are going to do: we pass in an image and get these kinds of labels over the words in the document, and that is how we extract information. To annotate such documents you can use Label Studio; here is the link. It is a free tool: you can go through the documentation, install Label Studio, write a small script, and prepare the dataset accordingly. I will provide the link in the description; you can go through the documentation and prepare the dataset in the same manner as shown here. This is how it generally works: we take the document, pick the image or word regions, and give each one a class; that is how we annotate. Once the annotations are done, we prepare the dataset and pre-process it.

For pre-processing, there is code provided with LayoutLM, so we use it directly to convert the annotations into the format the model accepts. We run this cell so that the annotations end up in the required format. Once that is done, we take the annotations and identify the unique labels available. Let me run this, and we will see what these unique labels mean. You can see they have been saved into labels.txt, so let's go through it: these are the unique labels, or classes, you could say. "Answer" is a class, "header" is a class, "question" is a class, and "other" is a class. These are the unique labels, and that is exactly the information we need; these are the unique classes we extracted through the pre-processing. Once this setup is done, we can process the dataset into PyTorch format and then start training. Before we process the data we have to make sure this unique-label file is prepared, and then we run this cell to prepare the dataset in PyTorch format.
15:05

This step takes the unique-label file we prepared and builds a map that assigns each label an ID code, meaning it gives a number to each label; we cannot pass a class name as text directly, we have to convert it into a number. That is what happens here: it loads the data, takes each label, and maps it to a number, which is what this simple function helps us do. We run this code to convert the labels to IDs, and then we can check the labels. You can see these are the unique labels available, and if you want to see the label map we can check it as well: each label has been converted to its respective ID number, and likewise all the other labels are represented.

Once this is done we have to prepare the PyTorch dataset. For this we import the LayoutLM tokenizer, which is available in Transformers, some classes from the LayoutLM source code, and the data loaders from the torch library to convert the data into DataLoader format. These are the arguments the model takes, and this is the class that takes the dictionary we prepared for mapping the labels and turns it into an arguments object. The arguments are prepared, then we use the pre-trained LayoutLM model for the tokenizer, and once that is done we use the FUNSD dataset class. We pass it the arguments given here, the tokenizer, the labels we prepared at the top, the pad token, and the train mode, and likewise we prepare the dataset for training. This whole step is for preparing the dataset so that it can be loaded in PyTorch; this is for training, and the same is done for the test split. So we prepare the training and test datasets using the PyTorch loader, and once that is done we can inspect the dataset: its length, and one example produced by this data loader. You can see that this image's information has been OCR'd and tokenized, with unknown tokens and padding applied; this is the input we are going to pass to the model.

Once all that setup is done, we finally come to training the model. For that we import the token classification class from Transformers and load the model from the Transformers library. Once that is done, we run the provided training code and start training the model on the prepared dataset. Right now I am training for five epochs, but if you want more accurate predictions you can train for more epochs; for this tutorial I am just using five. Let's run this cell to train the model and wait a few minutes for it to finish. Okay, the model has been trained; now we will evaluate it on the test dataset. This is the code written to evaluate the trained model on the test set, so we run this cell to get the predictions and look at the evaluation metrics. You can see the loss is 0.74, precision is 71%, recall is 78%, and the F1 score is 75%. So with just five epochs we are able to reach about 75%.
20:06

If I want better results, I just have to increase the number of epochs and continue training. Once that is done, we can save the model with torch.save, storing the model state as a dictionary. Then we can move on to inferencing: we take a new image, pass it to our trained model, and look at the predictions it makes. First we have to clone a particular GitHub repository and install pytesseract. Why pytesseract? Because whatever processing was done for the training dataset has to be done for the new image as well. During training we did not see those pre-processing steps explicitly, but internally, at annotation time, OCR is applied: remember the architecture, we take the image, pass it to OCR, and get the text and the respective word positions from the image. Those processing steps had already been done for the FUNSD dataset, so it was readily available and training our model was easy, but while inferencing we have to repeat the same steps used for training. That is why we use pytesseract to process the document page: each new image is processed, passed to OCR, and we get the text from the image and the respective bounding-box information for where the text appears.

Once that is done, we restart the runtime and load the model we saved. Let me view the image we are going to process; it is not available yet because we have not processed it, so let me go back to the GitHub repository we cloned, which contains the file that performs all the pre-processing we have to apply to a new image. If you go through that code, all the required pre-processing is given in that .py file, so we are simply going to use it; that is why I cloned the repository and got this file, to process a new image the same way as during training. We import it, and once it is imported we can see the new image, an image the model was not trained on; we are going to pass this new image to the model and extract the information from it. To do this we load the model we trained; it has been saved in the current directory as layoutlm.pt, so I load that model together with the labels we trained on. I realize we have not defined the number of labels, so when I run this cell it gives an error; let me go back and run the cell that computes the number of labels. Here it is, I run that code to get the number of labels, and now we return to the same step to run inference on the image. The model is loaded, and now we pass the new image through pre-processing. This preprocess function comes from the pre-processing file we imported, and we pass it an image. What does it return? It returns the image, the words, and the boxes.
25:07

The boxes it returns are the processed ones, that is, scaled: we are not using the raw coordinates from the image but scaling them to a common range, a kind of standardization, and it also returns the actual box information. That is the information I was talking about: when we do the pre-processing, we get the word's image region, the text information, the scaled bounding box, and the actual bounding box. That is what the processing gives us, and then we pass these pre-processed words and convert them into features, meaning we do the encoding using the LayoutLM tokenizer. Whatever we did during training, the encoding and everything done in that step, where you can see we used this tokenizer to encode the text information, the same steps are provided inside this LayoutLM preprocess .py file. We use its convert-to-features function; if you go into the LayoutLM preprocessor you can see the preprocess function and the convert-to-features function, which tokenizes the input and returns the tokenized information. So I pass the processed information to the tokenizer and it produces the encoding the model needs to make predictions. Once that processing is done, the predictions are made, and once the encoding is done and the predictions have come out, we visualize them on the image.

These are the predicted labels drawn on the image: you can see the model is able to say this is a question, this is the answer, this is a question, this is the answer, this is a header, and this is "other". The results come out pretty nicely; there are some mispredictions as well, but they can be improved if we train for a longer time, since we trained for only five epochs. We can train longer to get better predictions, and then save the extracted information in the required format, for example as JSON. That is how we can train a LayoutLM model and get the information out of a structured document while also making use of the document's structure. It helps in understanding documents and extracting information from any kind of document: structured documents, tabular documents, invoices, or any other document from which you want to extract information. That is how powerful the LayoutLM model is, and this is how we can use it and train it. The whole code will be linked in the description, so you can go through it and train your own model, and let me know in the comments if you have any doubts. Thank you, this is all for this particular video, and if you like my channel please subscribe. Thank you.


Related Tags
LayoutLM, Document Understanding, Entity Extraction, OCR, NLP, Machine Learning, Data Annotation, Python, Hugging Face, Transformers