Apache NiFi Anti-Patterns, Part 1

NiFi Notes
2 Apr 2020 · 14:34

Summary

TL;DR: In this video, Mark Payne begins a series on Apache NiFi anti-patterns. He walks through common mistakes in dataflow designs that undermine NiFi's efficiency. Using an example flow, he explains how users can process CSV data more efficiently by replacing regular expressions with a proper parser, avoiding unnecessary splitting of the data, and keeping attributes separate from content. He then shows how the improved dataflow design is not only easier to understand and maintain but also cuts processing time dramatically, from about 47 seconds to under one second.

Takeaways

  • 🔪 Apache NiFi offers many ways to process data, but there are recurring mistakes, known as anti-patterns.
  • 📈 Mark Payne starts a series on Apache NiFi anti-patterns to examine common but inefficient dataflows and improve them.
  • 📋 The discussion centers on a dataflow that reads CSV data from the file system using ListFile and FetchFile processors.
  • 🔎 The original dataflow uses regular expressions to extract data, which leads to performance bottlenecks and maintenance headaches.
  • 📝 One anti-pattern is using regular expressions instead of a proper CSV parser for structured or semi-structured data.
  • 🔑 Another problem is using FlowFile attributes to route and process the data instead of keeping the data in the FlowFile content.
  • 🔄 Splitting and re-merging the data is another inefficient anti-pattern that makes processing and storage more expensive.
  • 📉 These inefficient dataflows increase processing time and make the data lineage harder to trace.
  • 🛠 An improved version of the dataflow uses QueryRecord processors to simplify and speed up the processing.
  • 📈 With QueryRecord processors and proper record handling, processing time drops from 47.5 seconds to 885 milliseconds.
  • 📚 The optimized dataflow is not only faster but also easier to understand and maintain, and handles the data more correctly.

Q & A

  • What is the main topic of the video presented by Mark Payne?

    -The main topic is an introduction to Apache NiFi anti-patterns and how to fix them, specifically in the context of processing CSV data with Apache NiFi.

  • Which anti-patterns does Mark Payne identify in the video?

    -The anti-patterns identified are using regular expressions instead of a proper CSV parser, blurring the line between FlowFile content and FlowFile attributes, and splitting and re-merging data.

  • What is the problem with using regular expressions to parse CSV data?

    -Regular expressions are difficult to write and maintain, and they may ignore quoted or escaped commas inside the fields of the CSV data, which leads to incorrect or inefficient parsing.
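
    For illustration (this sample row is hypothetical, not taken from the video), consider a CSV record whose customer-name field contains a quoted comma:

        101,"Doe, Jane",1250.00,04/02/2020 10:15:00

    A plain comma split or a hand-written regular expression sees five values here instead of four, while a real CSV parser honors the quotes and returns the intended four fields (customer ID, customer name, purchase total, purchase time).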

  • What role do the QueryRecord processors play in the improved dataflow?

    -The QueryRecord processors make it possible to query and process the records contained in a single FlowFile without splitting them apart and merging them back together, which results in far more efficient processing.

  • What is the main advantage of using QueryRecord processors?

    -The main advantage is that they simplify and speed up processing: they remove the need to split and re-merge the data and operate efficiently on the records within a FlowFile directly.

  • How is the performance of the improved dataflow measured against the original one?

    -Performance is measured by processing time, which drops from roughly 47 seconds in the original flow to under one second in the improved flow, a significant improvement.
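
    For reference, the concrete figures quoted elsewhere on this page work out to 47.54 s ÷ 0.885 s ≈ 54, which lines up with the "about 50 times faster" claim made in the video.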

  • What is the purpose of the PutEmail processor in the dataflow?

    -The PutEmail processor is used to send an email whenever records with a purchase total of more than $1,000 are detected.

  • What roles do the Record Reader and Record Writer play in the QueryRecord processor configuration?

    -The Record Reader is responsible for parsing the incoming data, while the Record Writer writes the processed records out in the desired output format (e.g., CSV or JSON).
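
    As a rough sketch of how the first QueryRecord processor might be configured, based on the properties described in the video (the dynamic property name "over.1000" is an assumption; in QueryRecord, each user-added property holds a SQL query and the property name becomes an outbound relationship):

        Record Reader : CSVReader             (parses the incoming CSV records)
        Record Writer : CSVRecordSetWriter    (writes matching records back out as CSV)
        over.1000     : a SQL query selecting records with a purchase total over 1,000

    The second QueryRecord processor is configured the same way, except that its Record Writer is a JSONRecordSetWriter so that the final output is JSON.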

  • What is the difference between FlowFile content and FlowFile attributes in Apache NiFi?

    -The FlowFile content holds the actual data, while FlowFile attributes are key-value pairs that carry metadata and processing information and give the data context.

  • How can processing time be improved in Apache NiFi?

    -Processing time can be improved by avoiding the anti-patterns, using record-oriented processors such as QueryRecord, and designing the flow so that the data is parsed and processed correctly and efficiently.

Outlines

00:00

😀 Introduction to Apache NiFi Anti-Patterns

Mark Payne kicks off a series on Apache NiFi anti-patterns, flows that users build over and over again but that make inefficient use of NiFi. He walks through a commonly seen flow, discusses its weaknesses, and shows how it can be redesigned to play to NiFi's strengths. The example flow reads CSV data from the file system, routes any record with a purchase total over $1,000 to send an email, enriches the data, converts it to JSON, and writes it back out. However, it relies on regular expressions and text splitting, which leads to bottlenecks and an inefficient dataflow.

05:01

🤔 Analyzing the Problems in the Original Flow

The second section covers the problems in the original flow. The ExtractText processor becomes a bottleneck, and its regular expressions are hard to maintain and struggle to parse CSV data correctly. The flow also extracts every field from the FlowFile content into attributes and later reassembles them, which uses resources inefficiently, blurs the line between content and attributes, and makes the data lineage hard to follow.

10:02

🛠 Reworking the Flow for Better Efficiency

The third section describes the reworked flow. Two QueryRecord processors parse and filter the CSV data without splitting it into smaller FlowFiles, and a CSV Reader together with a JSON RecordSet Writer handles the structured data correctly. A processing-time field is added in line as part of the enrichment. The revised flow processes the data far faster and produces a data lineage that is much simpler to follow.

Keywords

💡Apache NiFi

Apache NiFi is an open-source platform for automating and managing dataflows. It is the main subject of the video, which aims to identify inefficient dataflow patterns and show how they can be improved to make better use of NiFi's strengths.

💡Anti-Pattern

An anti-pattern is a practice or pattern that is repeated frequently in certain situations even though it is inefficient or inadequate. In the context of the video, anti-patterns describe specific habits in NiFi dataflow design that should be corrected.

💡CSV data

CSV stands for Comma-Separated Values, a simple file format for tabular data. The video uses CSV data to demonstrate flows that are inefficient because they apply regular expressions and simple text processing instead of a proper CSV parser.

💡Dataflow

A dataflow describes the movement of data through different processes or systems. The video analyzes a flow that reads data from a CSV file, processes it, and writes it out in JSON format, and focuses on optimizing that flow for performance and efficiency.

💡Data processing

Data processing covers the collection, analysis, transformation, and storage of data. In the video, CSV files are processed through NiFi steps that include selecting the records exceeding a certain purchase total and adding a timestamp field to enrich the data.

💡JSON

JSON (JavaScript Object Notation) is a lightweight data format that is widely used for exchanging data on the web. In the video, JSON is the target format for the transformed data after the CSV data has been processed and enriched.

💡Regular expressions

Regular expressions are a tool for describing, searching, and manipulating text patterns. The video criticizes them as an unreliable way to parse CSV data because they do not account for quoted or escaped characters within fields.

💡FlowFile attributes

In NiFi, FlowFile attributes store metadata and processing information that give the data context. The video criticizes the practice of extracting the data itself out of the FlowFile content into attributes, which blurs the line between content and attributes.

💡Data provenance

Data provenance refers to the origin of data and the traceability of the changes made to it. The video discusses how an inefficient flow complicates provenance: the split and merge steps make the data lineage much harder to follow.

💡Performance optimization

Performance optimization means designing dataflows so that they run faster and more efficiently. The video shows how certain processing steps and patterns in NiFi can be optimized to cut processing time from seconds to milliseconds.

Highlights

Mark Payne introduces a series on Apache NiFi anti-patterns, focusing on flows that make inefficient use of NiFi's design and architecture.

A common flow that ingests CSV data from the file system using the ListFile and FetchFile processors is discussed.

The process of routing CSV records with a purchase total over $1,000 to send an email is explained.

Enriching CSV data before converting it to JSON and writing it to the local file system is highlighted.

A bottleneck at the ExtractText processor due to inefficient regular expressions is identified.

The limitations of regular expressions in handling CSV data with quoted or escaped commas are pointed out.

The anti-pattern of treating structured data as unstructured text using simple patterns is criticized.

The importance of using proper CSV parsers instead of regular expressions is emphasized.

Blurring the lines between flow file content and attributes by extracting all fields into attributes is critiqued.

The inefficiency of creating many attributes, especially large ones, for NiFi's performance is discussed.

The anti-pattern of splitting and merging data, increasing system workload and storage requirements, is analyzed.

The difficulty in tracking data lineage and understanding data flow due to excessive splitting is highlighted.

The latency issues and inefficiency of the original data flow are demonstrated with an example.

A revised approach that processes the data more efficiently using QueryRecord processors is proposed.

The configuration of the QueryRecord processors with a CSV Reader and a Record Writer is detailed.

The use of SQL within the QueryRecord processors to filter and enrich data is explained.

The efficiency of the revised flow, reducing processing time from 47 seconds to 885 milliseconds, is showcased.

The simplification of the data flow, avoiding anti-patterns, and the use of proper parsers is advocated for better efficiency and maintainability.

Transcripts

00:00

Hey everybody, I'm Mark Payne. Today I want to begin a series on Apache NiFi anti-patterns. These are flows I see users creating over and over again, but they don't really make very efficient use of NiFi's design and architecture. So I want to start by looking at some of these different flows, and then we'll discuss some of the weaknesses that they have and how we can redesign these flows to really better utilize NiFi and play to its strengths. So let's get started.

00:26

The flow that I want to look at today I've probably seen a dozen times or more. The intent of this flow is to ingest some CSV data; in this case we're pulling it in from the file system using the ListFile and FetchFile processors. The CSV data has four fields in it: customer ID, customer name, purchase total, and purchase time. Once we've ingested that CSV data, we want to do some routing on it. In this case we're going to pull out any CSV record that has a purchase total of more than $1,000, and if we find any of those, we're going to send an email to someone. Once we've made that routing decision, we're going to enrich the CSV data, then convert it into JSON, and finally write that data out somewhere; in this case we're using a PutFile processor just to put it back to the local file system.

01:29

So we can go ahead and start this flow and we'll see the data moving through the system, but if we refresh we're going to see right away that we have a bottleneck right here with the ExtractText processor. If we look at how this processor is configured, we can see there's a handful of different regular expressions set up here. I'd really like to avoid using regular expressions if I don't have to, because they're really difficult to write and even more difficult to read and maintain. These regular expressions are also ignoring the fact that our CSV data may actually have quoted or escaped commas in its fields. So it's kind of a quick and dirty way to parse CSV data, but it's not necessarily correct, and it's pretty expensive to evaluate, which is why we have the bottleneck here. Very much related to this, the SplitText processor is splitting on any newline that it finds, and that's going to cause a lot of problems if we have CSV data with quoted newlines within a field.

02:31

So the first anti-pattern that we have here deals with treating structured or semi-structured data as if it were unstructured text data and trying to use simple patterns to parse it. We really should be using a proper CSV parser. Now, the reason this flow is using regular expressions is to promote certain fields from that data into FlowFile attributes. This allows the user to then route on those attributes and update the data using the UpdateAttribute processor, and then all of those attributes are combined together again into JSON using the AttributesToJSON processor.

03:09

Now, it definitely makes sense to sometimes extract information from a FlowFile's content into an attribute, but what this flow is doing is really blurring the lines between FlowFile content and attributes: it's extracting all of the fields from the content into attributes and back again. We really want to keep the notions of FlowFile content and FlowFile attributes separate. Attributes are designed to hold key-value pairs such as metadata and processing information; they give context to the data. The data itself, though, should remain in the content of the FlowFile. Creating a bunch of attributes, especially really big attributes, becomes very expensive for NiFi. So blurring the line between FlowFile content and FlowFile attributes is our second anti-pattern.

03:53

The third anti-pattern that we have is this notion of SplitText and MergeContent, or splitting and re-merging the data. Again, there are definitely reasons we might want to do this in a dataflow, but we really want to avoid it if we can. To fully understand why, you have to understand a lot about NiFi's architecture and how it stores data in the FlowFile, content, and provenance repositories, and that would be a good topic for another whole video. For now, just consider that in order for the data to flow through the system, the NiFi framework has to do some level of work. So if we take a FlowFile that's, say, a megabyte and contains 10,000 lines of text, and we break it apart into 10,000 different FlowFiles, we've effectively taken the work that the NiFi framework has to do and multiplied it by about 10,000. This also means that the amount of storage space required to store the provenance data is multiplied by a factor of 10,000, so we end up not being able to hold nearly as much of that provenance data and lineage information. It also makes the lineage much harder to follow.

05:10

For example, if we take a look at the data provenance of the PutFile processor and look at the lineage for this particular FlowFile, we'll see that some data was joined together, and if we choose to find its parents, we see a lot of different things going on in this data lineage. It really makes it difficult to understand exactly what's happening throughout the entire flow. And if we look at the event information for this drop event, we can see a lineage duration of 47.54 seconds. That means the latency through the system is about forty-seven and a half seconds for this particular piece of data, so it's not a particularly efficient dataflow at all; we should be able to process this data dramatically faster than that.

06:04

So what can we do differently? Ideally we'd like to build a flow that's easier to understand and maintain, we want to process the data much more efficiently, and most importantly we want to ensure that we're processing the data correctly, using parsers for our data rather than regular expressions that we've come up with and splitting on newlines. So let's take a look at what that would look like if we move up one level and go to our revised approach.

06:29

All right, so now we get to the fun part of talking about how we can completely redesign this flow. We're going to start with the same two processors, ListFile and FetchFile, to bring data in, but then all of the processing is going to be dramatically simpler. What we had before was, I think, a combination of about six different processors that were responsible for breaking that data up, enriching it, doing some routing on it, converting the CSV data into JSON, and merging it all back together. In this version of the flow we're able to get rid of all six of those processors and replace them with just these two processors of type QueryRecord.

07:11

This processor, out of all the processors in the entire NiFi distribution, has really got to be my favorite, and yes, I do have favorite processors, and you probably should too. This processor is an absolute beast, and we'll talk about how we can configure it and what it's doing in just a second. But notice that we've got the same output here: if any purchase is more than $1,000, we're going to go ahead and use a PutEmail processor just like we did before, and the final result is going to be put into the file system using PutFile, just like we did before. What we're not doing in this flow, though, is using a SplitText processor to break the data apart and re-merge it back together, and that's because these QueryRecord processors, and any of the record-based processors for that matter, are designed to take in a stream of records in a single FlowFile and operate on each of those records independently instead of breaking them apart.

08:10

So if we look at QueryRecord and its configuration, we'll see that it's actually really simple to configure. It has three key properties. The first is a Record Reader, which allows us to configure how we actually parse the incoming data. In this case we're using a CSV Reader: the incoming data is going to be CSV, and we want to parse it using a proper parser rather than splitting on newlines and using regular expressions to find all the delimiters and pick out the little fields we care about. Instead of doing all of that, we'll just choose a CSV Reader and it'll handle all of it for us, and it will do it according to the CSV specification, so we don't have to worry about our patterns being incorrect.

09:01

The next thing we're going to configure is a Record Writer. After we've parsed the data, we're going to run some SQL over it and then write the results back out; in this case we're saying that we want to write the result out as CSV as well. Then we define what it is that we actually want to do with that data once we've parsed it and before we write it out. In this case, all we really want to do is pick out any records that have a purchase total of more than 1,000, and that's what this SQL says: select asterisk, meaning all of the fields from the FlowFile, where the purchase total field is more than 1,000 dollars. If there are any records that match that criteria, they're going to be written out to this relationship.
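
A minimal sketch of the query being described (the column spelling purchase_total is an assumption, since the exact field name isn't shown in the transcript; QueryRecord exposes the incoming records as a table named FLOWFILE):

    -- route every record whose purchase total exceeds $1,000
    SELECT *
    FROM FLOWFILE
    WHERE purchase_total > 1000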

play09:54

relationship and we can see that this

play09:58

critical relationship is again routed to

play10:00

put email and then we're going to send

play10:02

the original flow file so regardless of

play10:04

what matched or didn't match in that

play10:05

query record processor we're gonna go

play10:07

ahead and send the original full file

play10:10

down to the second query record

play10:13

processor the second query record

play10:16

processor is going to be responsible for

play10:18

again parsing CSV data because that was

play10:21

the original relationship and the data

play10:24

as it came in from the file system was

play10:26

in csv data was in csv format but this

play10:30

time we want the output to be in JSON

play10:32

because that's what we actually want to

play10:35

go out to our file system that's what we

play10:38

want the end result to be so we can just

play10:41

choose a record writer and the JSON

play10:43

record set writer and that's all we have

play10:45

to really configure and then we get to

play10:49

choose what sequel we actually want to

play10:51

run over this data in order to transform

play10:53

the data or to choose which fields we

play10:56

care about or which records we care

play10:58

about in this case we want to do our

play11:00

enrichment actually right here in line

play11:02

so we're gonna say select asterisk so

play11:05

select all of the field

play11:06

and we also want to select this

play11:09

additional field from flow file this

play11:12

additional field is going to use the

play11:14

expression language to pick out the

play11:16

current date and time and then we're

play11:18

going to format it using the

play11:20

month/day/year hour minute and second

play11:22

just like we did in the previous flow

play11:25

and then we're going to call that new

play11:27

field process time and that's all we

play11:30

have to really do to select all the

play11:32

existing fields and then add in this new

play11:35

enrichment field into our flow file and

play11:39

then we're going to write that data out

play11:41

like I said using the JSON record set
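
A sketch of the enrichment query being described here, assuming the new column is named process_time; the exact Expression Language quoting may differ from what is shown in the video:

    -- keep every existing field and add a formatted processing timestamp
    SELECT *,
           '${now():format("MM/dd/yyyy HH:mm:ss")}' AS process_time
    FROM FLOWFILE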

11:44

So this is why I love this processor: it's extremely powerful, it's easy to use, it's easy to configure, and it's extremely efficient. We talked about how inefficient the previous flow was, and I promised you that this version of the flow was going to be a lot more efficient, so let's take a look and see how much more efficient it actually is. I'm going to click start and then immediately refresh the results, and we can see that we've already processed almost all of the data; we've now finished processing all of that data within that amount of time.

12:18

So if we come down here and look at the provenance data for the PutFile processor, just like we did before, and look at the event details for this particular drop event, what we saw before was about 47.5 seconds. With this version of the flow, that time has been reduced to 0.885 seconds, or 885 milliseconds. So we went from about 47 or 48 seconds to less than one second, which means this flow performed about 50 times faster than the previous version.

12:55

It's also worth noting that if we then come over here and look at the lineage for this particular FlowFile, we see a fork event; we can expand that fork event, find its parents, and see everything that happened to that FlowFile. We don't see a hundred different children getting forked off and then joined back together, so it's much easier to look at this lineage and understand exactly what happened to the data all the way through the system.

13:24

And so there you have it. With this version of the flow we've eliminated the splitting and re-merging of the data, we've avoided blurring the line between FlowFile content and FlowFile attributes, and we've treated our structured and semi-structured data as actual data that can be parsed with a legitimate CSV parser and written with a JSONRecordSetWriter. This lets us avoid all of those common mistakes we're going to make if we try to treat it as unstructured textual data. So we've really avoided all three of those anti-patterns with this version of the flow, and as we saw, it's not only a lot simpler to look at, read, and understand, but it's dramatically more efficient: the processing time went from about 47 seconds to about 800 milliseconds, which is well over an order of magnitude faster.

14:21

I hope you liked the video and I hope you learned a lot. If so, please do like the video and share it with anybody else you know who may be interested. Thanks a lot for taking the time to watch it.
