Invoice Extraction: Extract PDF Invoice to Excel with UiPath

Thomas Janssen | Tom's Tech Academy
3 Jun 202316:29

Summary

TLDRIn this instructional video, Thomas from Tombstack Academy demonstrates a straightforward method to extract invoice data using UiPath, without the need for regular expressions. He guides viewers through the process of reading a PDF invoice, extracting metadata such as invoice number and date, and processing invoice lines. The tutorial includes creating directories for input and processed invoices, utilizing UiPath's PDF activities, and exporting the extracted data into an Excel file for further analysis, ensuring a user-friendly approach suitable for beginners and efficient for processing multiple invoices.

Takeaways

  • 😀 The video is a tutorial on extracting invoice data using UiPath without regular expressions.
  • 📄 The example invoice has an invoice number 100, a date, and various invoice lines including decorative clay potteries.
  • 🔍 The presenter, Thomas from tombstack Academy, demonstrates how to read and process a PDF invoice in UiPath.
  • 📁 The video includes instructions to create directories for input and processed invoices to manage files effectively.
  • 🛠️ The tutorial requires installing the 'UiPath.PDF.Activities' package for PDF text extraction functionalities.
  • 📝 It explains how to use 'Read PDF Text' activity to extract text from the PDF and write it to a text file for analysis.
  • 🔑 The method relies on the consistent use of double spaces between important data elements like quantity, product description, and price.
  • 📑 The script details how to extract specific data like invoice number and date using 'Text to Left/Right' activities in UiPath.
  • 📊 The presenter shows how to convert extracted text into a structured data table using 'Generate DataTable from Text' activity.
  • 🔄 The process includes dynamically processing multiple PDF files by setting up a 'For Each File' activity.
  • 📊 The final steps involve moving processed files and saving extracted data into an Excel file with appropriate sheet names.

Q & A

  • What is the main topic of the video?

    -The video is about extracting invoice lines and metadata from an invoice using UiPath without regular expressions.

  • Who is the presenter of the video?

    -The presenter is Thomas from Tombstack Academy.

  • What is the invoice number processed in the video?

    -The invoice number processed in the video is 100.

  • What is the date on the invoice shown in the video?

    -The invoice date is the first of January 2023.

  • What product was ordered in the invoice example?

    -The customer ordered 100 decorative clay potteries.

  • How much does each decorative clay pottery cost according to the invoice?

    -Each decorative clay pottery costs 13 units of currency per piece.

  • What is the method used in the video to extract the text from the PDF?

    -The method used is the 'Read PDF Text' activity in UiPath, which reads the entire text of the PDF into a variable.

  • How does the video suggest to handle multiple spaces in the PDF text?

    -The video suggests using a simple method where the number of spaces between words is consistent, and replacing these spaces with a unique symbol like a pipeline for easier extraction.

  • What activity in UiPath is used to split text based on a separator?

    -The 'Text to Left/Right' activity is used to split text based on a custom separator.

  • How can the extracted data be saved into an Excel file in the video?

    -The video demonstrates using the 'Write Cell' and 'Write DataTable to Excel' activities to save the extracted data into an Excel file.

  • What is the purpose of moving the processed PDF to a different folder?

    -The purpose is to prevent the processed file from being touched again and to keep the workflow organized.

  • How does the video ensure the workflow processes multiple files?

    -The video shows how to make the file selection dynamic by using 'For Each File' and 'For Each Folder' activities, allowing the workflow to process all PDF files in a specific folder.

  • What is the condition for using the simple extraction method shown in the video?

    -The condition is that the spaces between different elements of the invoice lines, such as quantity and product description, must be consistent, typically two spaces.

  • How does the video demonstrate extracting invoice number and date?

    -The video uses the 'Text to Left/Right' activity with specific separators to extract the invoice number and date from the PDF output.

  • What is the final output of the workflow shown in the video?

    -The final output includes an Excel file with two sheets: 'invoice lines' containing the details of the invoice items and 'invoice info' containing the invoice number and other metadata.

Outlines

00:00

📑 Introduction to Invoice Processing with UiPath

In this video, Thomas from Tombstack Academy demonstrates how to extract invoice lines and metadata from a PDF invoice using UiPath without regular expressions. The video begins with an introduction to the sample invoice, which includes an invoice number, date, and various invoice lines. Thomas guides viewers on how to download the invoice, set up directories for input and processed files, and navigate to UiPath to begin the automation process. The focus is on using the 'Read PDF Text' activity from the UiPath PDF package to extract text, which is then written to a text file for analysis.

05:01

🔍 Extracting Text and Data from PDF Invoices

This paragraph details the process of extracting specific information from the PDF invoice text. Thomas explains the importance of consistent spacing between words to facilitate the extraction method used. He demonstrates how to use the 'Text to Left/Right' activity to extract the invoice number and date by specifying separators and saving the results to new variables. The method relies on identifying patterns in the text, such as two spaces between quantities and descriptions, to accurately split the text into usable data.

10:02

📊 Converting Extracted Text into a Data Table

The speaker proceeds to show how to convert the extracted invoice lines into a structured data table. He uses the 'Generate DataTable from Text' activity, adjusting the column separator to a unique symbol to ensure correct data parsing. After replacing double spaces with a pipeline symbol for clarity, the invoice lines are successfully split into quantity, product name, price, and total line price. The resulting data table is then saved for further use in the automation process.

15:02

🔄 Automating File Processing and Excel Output

The final paragraph outlines the automation of processing multiple PDF invoices and dynamically writing the extracted data to an Excel file. Thomas demonstrates how to use a 'For Each File' activity to loop through PDF files in the input directory, read their content, and save the extracted invoice number and data table to an Excel sheet named 'Invoice Info' and 'Invoice Lines,' respectively. The automation also includes moving processed files to a 'Processed' folder and ensuring that the Excel file path is dynamic, allowing for the processing of any number of invoices.

👋 Conclusion and Future Tutorials

Thomas concludes the tutorial by summarizing the steps taken to extract and process invoice data using UiPath. He confirms that the method demonstrated is the simplest but may not work for all invoice formats. He mentions an alternative tutorial on his channel that covers extracting invoices with regular expressions for viewers whose documents do not fit the criteria for the method shown in this video. Thomas expresses hope that the tutorial was helpful and looks forward to engaging with viewers in his next video.

Mindmap

Keywords

💡UIPath

UIPath is a software platform that enables the automation of repetitive tasks. In the context of the video, it is the main tool used to demonstrate the process of extracting information from a PDF invoice without the use of regular expressions. The script mentions installing the 'UI bit dot pdf.activities' package, which is a set of activities for working with PDF files within the UIPath environment.

💡PDF Invoice Extraction

PDF Invoice Extraction refers to the process of extracting data from a PDF document, which in this case is an invoice. The video script focuses on this concept, showing how to use UIPath to automate the extraction of invoice details such as the invoice number, date, and line items.

💡Metadata

Metadata in the context of this video refers to the data about the invoice that is not part of the line items, such as the invoice number and date. The script mentions extracting 'the invoice lines and as well, the metadata from an invoice', indicating that both the detailed line items and the general information about the invoice are important.

💡Regular Expressions

Regular expressions are a powerful tool used in programming for pattern matching and text manipulation. The video script specifically mentions not using regular expressions for the task, suggesting an alternative, simpler method for extracting text from PDF invoices.

💡Text File

A text file is a plain text document that is used in the video to store the output of the PDF text extraction process. The script describes writing the entire text of the PDF to a text file named 'pdf.txt', which is then used for further processing and analysis.

💡Variable

In programming, a variable is a storage location paired with a name, which can be used to hold data. The script mentions creating variables such as 'PDF outputs' and 'invoice no' to store the extracted data from the PDF invoice.

💡Data Table

A data table is a structured representation of data, often used in software for organizing and manipulating information. The video script describes the process of generating a data table from the extracted text, which is then used to organize the invoice line items into a more readable format.

💡Excel

Excel is a widely used spreadsheet program that allows for data organization and analysis. The script discusses using Excel to save the extracted invoice data, mentioning writing to an Excel file and creating specific sheets for the invoice information and line items.

💡Dynamic

In the context of the video, 'dynamic' refers to the ability of the UIPath workflow to adapt and process multiple files or varying data without hardcoding specific values. The script mentions making the process dynamic by selecting files based on their extension and location rather than specifying a single file.

💡For Each

The 'For Each' activity in UIPath is used to iterate over a collection of items, such as files in a folder. The script describes using a 'For Each File' activity to process all PDF files in a specific folder, which is a key part of automating the invoice extraction process.

💡Message Box

A message box is a graphical user interface element used to display information to the user. In the script, a message box is used to display the extracted invoice number, serving as a simple way to verify that the data has been correctly extracted and processed.

Highlights

The video demonstrates a method to extract invoice lines and metadata from a PDF using UiPath without regular expressions.

Introduction to the channel 'tombstack Academy' by Thomas, the presenter.

Overview of the sample invoice with an example of 100 decorative clay potteries, their price, and tax.

Instructions on downloading the invoice and accessing the template for creating custom invoices.

Step-by-step guide to set up the workflow in UiPath, starting with adding the PDF package.

Explanation of creating directories for input and processed invoices within UiPath.

Demonstration of reading PDF text using UiPath's 'Read PDF Text' activity.

How to write the output text to a file for further analysis.

The importance of consistent spacing in the PDF for the extraction method to work effectively.

Technique to extract the invoice number using 'Text to Left/Right' activity in UiPath.

Using message boxes to display the extracted invoice number for verification.

Method to refine the extraction process to avoid capturing unwanted text.

Process of extracting invoice line items by identifying specific separators in the text.

Conversion of extracted text into a structured data table using 'Generate DataTable from Text'.

Adjusting the column separator to a unique symbol for accurate data table generation.

Automating the workflow to process all PDF files in a folder and move them post-processing.

Integrating Excel activities to save extracted data into an organized spreadsheet.

Final walkthrough of the complete workflow from PDF extraction to Excel output.

Alternative method mention for invoices that do not fit the criteria for the demonstrated extraction technique.

Conclusion and invitation to the next video in the series.

Transcripts

play00:00

PDF invoice extraction with uipad

play00:03

doesn't have to be difficult in this

play00:05

video I will show you how you can

play00:07

extract the invoice lines and as well

play00:08

the metadata from an invoice which

play00:10

uipath without using regular Expressions

play00:13

if you're new to the channel my name is

play00:15

Thomas and you're watching tombstack

play00:17

Academy let's start right away

play00:19

the invoice that we're going to process

play00:20

today is this one

play00:23

so you see that we have an invoice

play00:25

number 100 we have an invoice a date

play00:28

first of January 23 and then we have all

play00:31

kind of invoice lines so in this case

play00:33

the customer ordered a 100 decorative

play00:35

clay potteries for 13 per piece

play00:40

I see here as well the subtotal the

play00:42

total price and the sales tax there is a

play00:45

link in the description of this video

play00:46

where you can download this invoice and

play00:48

there is also a link to the page where

play00:50

you can find the invoice template so you

play00:51

can make the invoice yourself in word

play00:53

and then extract it to a PDF so let's

play00:56

download the invoice and I already have

play00:59

it here so I'm going to copy it

play01:01

and I'm going to navigate to uipath

play01:03

click on the main.xml open file location

play01:07

and then I'm going to create a new

play01:09

directory here new folder

play01:12

let's call this one invoices

play01:18

and let's make an input folder

play01:23

and a folder for invoices which have

play01:25

already been processed

play01:28

like this

play01:30

I'm going to move this invoice to the

play01:32

input folder and as soon as your advert

play01:34

has processed it you have it will move

play01:36

it then to the process folder so it

play01:38

won't be touched again so let's open UI

play01:41

pads and let's start with adding the PDF

play01:43

package

play01:45

manage packages make sure that you have

play01:47

all packages here and then let's search

play01:49

for PDF

play01:51

the one that we need is this one UI bit

play01:53

dot pdf.activities so click your

play01:57

uninstall and just install the latest

play01:58

version

play02:00

then navigate to activities search for

play02:03

PDF

play02:05

and the one that you're going to need is

play02:06

read PDF text

play02:08

so they get put in the main sequence

play02:11

and then let's also take the file that

play02:14

we want to process for now I'm not going

play02:16

to make it Dynamic I'm just going to

play02:18

select this file and later I'm going to

play02:20

show you how to make everything Dynamic

play02:23

so we can process multiple files instead

play02:25

of just one

play02:28

um if you click on read PDF text and you

play02:30

navigate to the properties panel so

play02:31

that's this one here

play02:33

then you see as well output text here

play02:35

I'm going to press Ctrl K to create a

play02:37

new variable

play02:38

and let's call this one PDF

play02:41

outputs

play02:43

like this

play02:44

and then I'm going to write to a text

play02:47

file

play02:49

my text file

play02:51

I'm going to write the entire text of

play02:52

this PDF a to a text file

play02:55

so use a variable PDF output

play02:59

right to file name let's open the

play03:02

advanced editor

play03:03

and let's call this one PDF oh sorry

play03:06

double quotation marks pdf.txt

play03:12

and now run your robot

play03:15

and if all went well when you navigate

play03:18

to project you should now see a text

play03:21

file here

play03:23

pdf.txt open it and now you will see

play03:26

what your iPad sees when it reads a PDF

play03:28

file and or test for today is to see how

play03:31

we can extract this text and how to make

play03:33

sense of this data

play03:35

what's really important is that the

play03:36

method that we're using today is the

play03:38

simplest method to extract PDFs but it

play03:41

only works

play03:43

so you see that this product consists of

play03:46

multiple words and between every word

play03:48

there is one space

play03:52

but between between for example the

play03:54

quantity and the products there are two

play03:57

spaces right

play03:59

I see also that here between the price

play04:01

and the product are also two spaces and

play04:04

between the product price and the Total

play04:06

Line price are also two spaces

play04:08

and this is a precondition if you want

play04:10

to use this method so let's say that you

play04:12

have for example a decorative clay

play04:14

pottery here but here you have one space

play04:17

and then here you have also one space

play04:19

then it's not possible to use this

play04:21

method

play04:22

if you have for example two or three or

play04:24

four spaces here it doesn't matter how

play04:26

many but it should be equal all the time

play04:28

then you can use this very simple method

play04:30

okay let's start with extracting the

play04:33

invoice number and you can do that you

play04:34

see that the invoice number is 100 so

play04:37

I'm just going to extract the text in

play04:38

front of it

play04:40

this one is Ctrl C I'm going to go to

play04:43

uipath

play04:45

and the activity that you're going to

play04:47

need is

play04:48

a text to left right so just search for

play04:51

left

play04:51

text to left slash right take it here

play04:55

text to split

play04:58

is variable that's the PDF output

play05:00

and then the separator I'm going to say

play05:03

custom and the separator in this case

play05:07

is going to be this text that was in

play05:09

front of the invoice number

play05:11

and then I'm going to say that I want to

play05:13

save the text to the right

play05:17

and I want to save that in a new

play05:18

variable let's call this one invoice no

play05:24

okay let's try this out let's use a

play05:25

message box

play05:29

and let's put invoice number in a

play05:31

message box

play05:36

so you will see that we get the exact

play05:39

invoice number that looks good but the

play05:41

robot doesn't stop there it also takes

play05:43

everything underneath the invoice number

play05:45

and we're gonna change that

play05:47

so let's use another text to left right

play05:54

and now we're not going to start with

play05:56

PDF output but now we're going to start

play05:58

with invoice number

play05:59

so text to split is now equal to invoice

play06:03

number

play06:04

and the separator

play06:06

is a new line and we're going to take

play06:08

everything on the left side of that new

play06:10

line so his variable inverse number

play06:14

and now you will see that it will work

play06:18

and now we only have the invoice number

play06:19

100 of course if you want to extract the

play06:22

date you just take the date instead of

play06:25

the invoice number

play06:27

just take this part and you can use the

play06:29

text left slash right to extract this

play06:31

information okay I think you're most

play06:33

interested in extracting the exact line

play06:35

items of this invoice so I'm going to

play06:37

show you how to do that so go back to

play06:39

pdf.txt

play06:41

and then take this line

play06:44

and copies

play06:47

and then we're gonna do another text to

play06:49

left right

play06:53

X to left right

play06:55

text split that is PDF outputs and the

play06:59

separator that's now going to be open

play07:01

advanced editor the line that I just

play07:03

added so basically the headers of the

play07:05

underneath table

play07:07

click ok

play07:08

and I'm going to say save text to in the

play07:12

right

play07:13

create variable

play07:15

let's call those ones um

play07:18

invoice

play07:21

lines

play07:23

and I want to stop extracting those

play07:25

inverse lines as soon as as soon as I

play07:27

see subtotal

play07:29

so let's take that one as well

play07:31

let's add another text to left right

play07:34

text to split

play07:36

invoice lines

play07:38

separator

play07:41

subtotal

play07:43

and I want to save the text on the left

play07:45

side

play07:47

and invoice lines

play07:49

okay let's just make sure that what we

play07:51

have done until now makes sense

play07:53

let's add a message box

play07:58

and use variable invoice lines

play08:01

and of course this invoice only has one

play08:03

invoice line but this method works just

play08:05

as good for invoices with notable

play08:06

invoice lines

play08:08

so you see it works for now okay so

play08:10

let's now see how we can extract this

play08:12

one to a data table okay let's just

play08:14

search for table

play08:16

any activity that you need is generate

play08:18

data table from text

play08:20

so take it put it below

play08:24

and now select inputs and input is of

play08:28

course invoice lines and navigate to the

play08:31

options

play08:33

and there let's take the actual invoice

play08:36

line or lines if you have a PDF with

play08:38

multiple invoice lines

play08:40

put them here in Sample inputs and now

play08:43

you can see that we can provide a format

play08:46

so we can say that we want to split the

play08:48

columns based on a column A separator

play08:50

and that that is for example one or two

play08:52

spaces but if you use two spaces and you

play08:55

provide preview you still see it's not

play08:57

working because it's now splitting every

play08:59

word instead of splitting the quantity

play09:01

and then the entire description of the

play09:04

products and then the price Etc so we're

play09:06

going to do something different and just

play09:08

make sure that you copy this one again

play09:09

click unlock

play09:12

add an assign activity

play09:15

here

play09:16

and then we're going to say that invoice

play09:19

lines

play09:21

is equal to invoice lines

play09:24

but I'm going to replace

play09:29

two spaces I'm going to replace them for

play09:31

a pipeline like this

play09:34

I'm using a pipeline but you can use any

play09:36

symbol as long as it's Unique

play09:38

so click here on Arc

play09:41

go back to options

play09:45

and now I'm gonna change all the two

play09:47

spaces for a pipeline

play09:53

pipeline

play09:56

pipeline I'm going to say that the

play09:58

column separator is a pipeline like this

play10:02

click preview and now you see that it

play10:04

works because your web is able to split

play10:05

the quantity it's able to split the

play10:08

product's name

play10:09

um the price and the total price of this

play10:12

line

play10:14

so now click on Ock and now we can also

play10:16

determine where we want to save the data

play10:18

table so I'm going to save this data

play10:19

table

play10:20

and let's call this one DT

play10:23

invoice lines

play10:26

press enter

play10:28

and others also build the rest of the

play10:30

flow

play10:32

so let's start with a four each

play10:36

for each file and folder

play10:39

so we're going to process all the files

play10:40

in a specific folder

play10:42

and this is my project

play10:45

invoices that's going to be input and I

play10:48

only want to process invoices that have

play10:51

a PDF extension

play10:54

so let's say star dot PDF

play10:58

look

play11:00

and what do we want to do with all of

play11:02

those just click here on read PDF text

play11:04

now press Ctrl and select all of them

play11:12

control

play11:18

and now make sure that you output them

play11:19

in this do box

play11:23

and now let's also change 100 so let's

play11:26

change this one

play11:27

to current file full name includes full

play11:30

pads and that way we make this Dynamic

play11:32

and we can pick up any PDF that is

play11:34

dropped in this folder okay so we're

play11:37

reading from current file full name this

play11:40

one we can remove because we don't need

play11:41

it anymore

play11:45

so we're done saving the invoice number

play11:49

and of course we're also saving the data

play11:51

table

play11:52

let's as well add a move file activity

play11:55

because we want to move to file

play11:57

to the other folder as soon as it has

play11:59

been processed move file

play12:01

and just put it here at the end

play12:05

and then I'm going to say that I want to

play12:07

move the file current file full name

play12:10

inclusive pads

play12:12

and I want to move it

play12:15

to invoices processed click select

play12:18

folder

play12:19

and make sure to also enable override so

play12:22

that the file is overwritten as soon as

play12:23

a new file with the same file name

play12:25

appears so click here

play12:27

again now there is one thing that we

play12:29

still need to do and let's save

play12:30

everything in an Excel because of course

play12:32

we want to process further what we have

play12:34

now extracted so

play12:37

search for use Excel file activity

play12:40

and this one

play12:42

and I'm just gonna add an Excel file

play12:45

here into invoices processed

play12:48

and let's call this one Excel

play12:53

I'm going to add two sheets

play12:56

I'm gonna let's call this one um invoice

play12:59

on lines

play13:03

and let's call this one invoice

play13:06

info

play13:11

let's select the file

play13:14

invoices processed Excel I'm later going

play13:17

to change this to make it dynamic as

play13:18

well but for now let's just keep it like

play13:20

this

play13:21

and then I'm going to say write cell

play13:26

this one the first thing I want to do is

play13:28

I want to write then the number of the

play13:31

invoice to the invoice info tab so what

play13:34

to write

play13:35

let's use variable and that's invoice no

play13:38

invoice number

play13:40

and where to write it

play13:41

let's say that we want to indicate that

play13:43

in Excel

play13:45

so let's write to invoice info and then

play13:48

B1 confirm

play13:51

and let's also write a description

play13:55

so what to write

play13:57

um fence editor

play14:01

invoice

play14:03

a number

play14:07

and we have to write

play14:09

and again in Excel A1 confirm

play14:13

okay so now we have written the invoice

play14:15

number and also a description of the

play14:16

invoice number

play14:18

to this invoice info sheet

play14:21

and now let's also write the data table

play14:27

write data table to excel that's exactly

play14:30

what we need

play14:31

and the content of this invoice we want

play14:34

to write it what to write

play14:36

so let's use variable

play14:41

DT invoice lines that's a data table

play14:44

and destination let's Excel and invoice

play14:48

lines

play14:49

you can exclude the headers

play14:52

I'm going to close the Excel file and

play14:54

I'm also going to make sure that this

play14:57

file is dynamic

play14:59

so I'm not going to use inverse is

play15:02

processed Excel

play15:04

but I'm going to use the invoice number

play15:06

here

play15:09

so add to double quotation marks here to

play15:12

plus signs

play15:13

and then I'm going to call this one

play15:15

invoice no

play15:17

click on Arc

play15:19

and of course you can do the same for

play15:21

the date that is on the inverse and

play15:22

automata data you can also write it to

play15:24

the same sheet invoice info but for now

play15:26

I'm just going to do the invoice number

play15:28

to keep this video really lean okay

play15:30

let's just go back

play15:32

let's remove this one let's make sure

play15:34

that the invoice is in the input folder

play15:36

which is the case

play15:38

and let's run about

play15:42

then let's go to the input folder you

play15:44

see that the input folder is now empty

play15:46

and let's go one folder up let's go to

play15:48

process I see in process that we have

play15:51

the PDF which has been moved and we also

play15:53

have the Excel file which has been

play15:54

generated

play15:56

you see that we have invoice info which

play15:58

contains invoice number and it also

play16:01

contains the number 100 and invoice

play16:03

lines which contains all the lines so

play16:05

you see the quantity see the name of the

play16:08

product you see the price of products

play16:10

and you also see the total price of the

play16:12

line This is the easiest way to extract

play16:15

information from an invoice it doesn't

play16:16

work for every format so if it doesn't

play16:18

work for your invoice format there is

play16:20

another video on my channel that

play16:21

explains how to extract invoices with

play16:24

regular expressions

play16:25

I hope this video was useful for you and

play16:27

I hope to see you in my next video

Rate This

5.0 / 5 (0 votes)

相关标签
UiPathInvoice ExtractionAutomationPDF ProcessingData ParsingWorkflow DesignRPA ToolsScript TutorialNo RegexEfficiency Tips
您是否需要英文摘要?