Apache NiFi Anti-Patterns, Part 1
Summary
TL;DR: In this video, Marc Payne kicks off a series on Apache NiFi anti-patterns. He demonstrates common dataflow design mistakes that undermine NiFi's efficiency. Using an example flow, he explains how users can process CSV data more efficiently by replacing regular expressions with proper parsers, avoiding unnecessary data splitting, and keeping attributes separate from content. Marc shows that the improved dataflow architecture is not only easier to understand and maintain, but also cuts processing time dramatically: from 47 seconds to under one second.
Takeaways
- 🔪 Apache NiFi offers many ways to process data, but there are recurring mistakes, known as anti-patterns.
- 📈 Marc Payne begins a series on Apache NiFi anti-patterns to examine and improve common but inefficient dataflows.
- 📋 The discussion centers on a dataflow that reads CSV data from the file system using the ListFile and FetchFile processors.
- 🔎 The original dataflow uses regular expressions for data extraction, which causes performance bottlenecks and maintenance headaches.
- 📝 One anti-pattern is using regular expressions instead of a proper CSV parser for structured or semi-structured data.
- 🔑 Another problem is using FlowFile attributes to route and process data instead of keeping the data in the FlowFile content.
- 🔄 Splitting and re-merging data is another inefficient anti-pattern that complicates processing and storage.
- 📉 These inefficient dataflows lead to increased processing time and degraded data traceability.
- 🛠 An improved version of the dataflow uses QueryRecord processors to simplify and speed up data processing.
- 📈 With QueryRecord processors and proper record handling, processing time drops from 47.5 seconds to 885 milliseconds.
- 📚 The optimized dataflow is not only faster, but also easier to understand and maintain, and handles the data more correctly.
Q & A
What is the main topic of the video presented by Marc Payne?
-The main topic is an introduction to Apache NiFi anti-patterns and how to fix them, specifically in the context of processing CSV data with Apache NiFi.
Which anti-patterns does Marc Payne identify in the video?
-The identified anti-patterns are using regular expressions instead of a proper CSV parser, blurring the line between FlowFile content and FlowFile attributes, and splitting and re-merging data.
What is the problem with using regular expressions to parse CSV data?
-Regular expressions are difficult to write and maintain, and they may ignore quoted or escaped commas within CSV fields, leading to incorrect or inefficient parsing, as illustrated below.
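As a hedged illustration (the video's exact expressions are not shown, and the sample values are invented), consider a record whose name field contains a quoted comma; a naive comma-counting regex mis-parses it, while a real CSV parser would not:

```
Input record (customer name contains a quoted comma):
1001,"Smith, John",1250.00,03/15/2020 10:30:00

Naive regex meant to capture the third field (purchase total):
^(?:[^,]*,){2}([^,]*),

Actual capture:   John"    <- the comma inside the quotes is
                              counted as a delimiter
Expected value:  1250.00
```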
What role do the QueryRecord processors play in the improved dataflow?
-The QueryRecord processors parse and process the records within a single FlowFile, without splitting it apart and merging it back together, which makes the processing far more efficient.
What is the main advantage of using QueryRecord processors?
-The main advantage is that they simplify and speed up processing: they remove the need to split and re-merge data and operate efficiently on all records within a single FlowFile, as sketched below.
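In QueryRecord, the FlowFile's records are queried as a table named FLOWFILE, and each user-added property holds a SQL statement whose results route to a relationship named after that property. A minimal sketch of the routing query described in the video (the column name purchase_total is an assumption, since the exact schema is not shown):

```sql
-- Records matching this query are routed to the relationship
-- named after the dynamic property that holds the query.
SELECT *
FROM FLOWFILE
WHERE purchase_total > 1000
```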
How is the performance of the improved dataflow measured against the original?
-Performance is measured by processing time (the lineage duration shown in the provenance data), which the improved flow cut from about 47 seconds to under one second, a dramatic improvement.
What is the purpose of the PutEmail processor in the dataflow?
-The PutEmail processor sends an email whenever records with a purchase total of more than $1,000 are detected.
What roles do the RecordReader and RecordWriter components play in the QueryRecord processor configuration?
-The RecordReader is responsible for parsing the incoming data, while the RecordWriter writes the processed records out in the desired output format (e.g., CSV or JSON); see the configuration sketch below.
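A sketch of how the two QueryRecord processors in the revised flow could be configured, based on the video's description (the relationship name over-1000 and the purchase_total column are assumptions; CSVReader, CSVRecordSetWriter, and JsonRecordSetWriter are standard NiFi controller services):

```
QueryRecord #1 -- filter and route, CSV in / CSV out
  Record Reader : CSVReader
  Record Writer : CSVRecordSetWriter
  over-1000     : SELECT * FROM FLOWFILE WHERE purchase_total > 1000

QueryRecord #2 -- enrich and convert, CSV in / JSON out
  Record Reader : CSVReader
  Record Writer : JsonRecordSetWriter
```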
What is the difference between FlowFile content and FlowFile attributes in Apache NiFi?
-FlowFile content holds the actual data, while FlowFile attributes are key-value pairs carrying metadata and processing information that give the data context.
How can processing time be improved in Apache NiFi?
-Processing time can be improved by avoiding the anti-patterns, using efficient record-based processors such as QueryRecord, and designing flows so the data is processed both faster and more correctly.
Outlines
😀 Introduction to Apache NiFi Anti-Patterns
Marc Payne opens a series on Apache NiFi anti-patterns, which highlights inefficient ways of using NiFi. He examines commonly seen flows that fail to play to NiFi's strengths, discusses their weaknesses, and shows how to optimize them. The example flow reads CSV data from the file system, routes records with a purchase total over $1,000 to send an email, enriches the data, converts it to JSON, and writes it back out. However, it relies on regular expressions and text splitting, which leads to bottlenecks and inefficient dataflows.
🤔 Analyzing the Problems in the Original Workflow
The second section covers the problems in the original workflow. The ExtractText processor creates a bottleneck, and its regular expressions are criticized because they struggle to parse CSV correctly and are hard to maintain. The flow is further criticized for extracting the data out of the FlowFile content into attributes and then reassembling it, which wastes resources and muddies the data lineage, as sketched below.
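For context, ExtractText takes dynamic properties that map attribute names to regular expressions, with each first capture group promoted to a FlowFile attribute. A hypothetical configuration in the spirit of the original flow might look like this (the actual expressions from the video are not reproduced):

```
customer_id    : ^([^,]+),
customer_name  : ^(?:[^,]+,){1}([^,]+),
purchase_total : ^(?:[^,]+,){2}([^,]+),
purchase_time  : ^(?:[^,]+,){3}(.+)$
```

Every field ends up as an attribute, which is exactly the content/attribute blurring and the fragile parsing the video calls out.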
🛠 Reworking the Workflow for Efficiency
The third section describes reworking the flow to improve efficiency. Two QueryRecord processors parse and filter the CSV data without splitting it into smaller FlowFiles. Using a CSV reader and a JSON record-set writer handles the structured data correctly. In addition, a processing-time field is added so the data enrichment happens directly in the flow, as shown in the sketch below. The rework yields dramatically faster processing and a simplified data lineage that is far easier to follow.
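A minimal sketch of the enrichment query in the second QueryRecord processor, assuming the new field is named process_time as described; NiFi evaluates Expression Language on the SQL property before the query runs, so the formatted timestamp is embedded as a string literal (the exact query text is an assumption):

```sql
-- Keep every existing field and append a formatted processing timestamp.
SELECT *,
       '${now():format("MM/dd/yyyy HH:mm:ss")}' AS process_time
FROM FLOWFILE
```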
Keywords
💡Apache NiFi
💡Anti-Pattern
💡CSV data
💡Dataflow
💡Data processing
💡JSON
💡Regular expressions
💡FlowFile attributes
💡Data provenance
💡Performance optimization
Highlights
Marc Payne introduces a series on Apache NiFi anti-patterns, focusing on inefficient use of NiFi's design and architecture.
A common flow that ingests CSV data from the file system using the ListFile and FetchFile processors is discussed.
The process of routing CSV records with a purchase total over $1,000 to send an email is explained.
Enriching the CSV data before converting it to JSON and writing it to the local file system is highlighted.
A bottleneck at the ExtractText processor caused by expensive regular expressions is identified.
The limitations of regular expressions in handling CSV data with quoted or escaped commas are pointed out.
The anti-pattern of treating structured data as unstructured text and parsing it with simple patterns is criticized.
The importance of using a proper CSV parser instead of regular expressions is emphasized.
Blurring the lines between FlowFile content and attributes by extracting all fields into attributes is critiqued.
The cost to NiFi's performance of creating many attributes, especially large ones, is discussed.
The anti-pattern of splitting and merging data, which multiplies system workload and storage requirements, is analyzed.
The difficulty of tracking data lineage and understanding the flow caused by excessive splitting is highlighted.
The latency and inefficiency of the original dataflow are demonstrated with an example.
A revised approach that processes the data more efficiently using QueryRecord processors is proposed.
The configuration of the QueryRecord processors with a CSV reader and writer is detailed.
The use of SQL within the QueryRecord processors to filter and enrich data is explained.
The efficiency of the revised flow, which reduces processing time from 47 seconds to 885 milliseconds, is showcased.
Avoiding the anti-patterns and using proper parsers is advocated for a simpler, more maintainable, and more efficient flow.
Transcripts
Hey everybody, I'm Marc Payne. Today I want to begin a series on Apache NiFi anti-patterns. These are flows I keep seeing users create over and over again, but they don't really make very efficient use of NiFi's design and architecture. So I want to start by looking at some of these different flows, and then we'll discuss some of the weaknesses that they have and how we can redesign these flows to better utilize NiFi and play to its strengths. So let's get started.

The flow that I want to look at today I've probably seen a dozen times or more. The intent of this flow is that we want to ingest some CSV data; in this case we're pulling it in from a file system using the ListFile and FetchFile processors. The CSV data has four fields in it: customer ID, customer name, purchase total, and purchase time.
Once we've ingested that CSV data, we want to do some routing on it. In this case we're going to pull out any CSV record that has a purchase total of more than $1,000, and if we find any of those, we're going to send an email to someone. Once we've made that routing decision, we're going to enrich the CSV data. Once we've enriched that data, we're going to convert it into JSON, and finally we're going to write that data out somewhere; in this case we're using a PutFile processor just to put it back to the local file system.

So we can go ahead and start this flow, and we'll see the data moving through the system. But if we refresh, we're going to see right away that we have a bottleneck right here with the ExtractText processor. If we look at how this processor works in the configuration, we can see there's a handful of different regular expressions configured here. I'd really like to avoid using regular expressions if I don't have to, because they're really difficult to write and even more difficult to read and maintain. These regular expressions are also ignoring the fact that our CSV data may actually have quoted or escaped commas in its fields. So it's kind of a quick and dirty way to parse CSV data, but it's not necessarily correct, and it's pretty expensive to evaluate, which is why we have the bottleneck here.
Very much related to this, the SplitText processor is splitting on any new line that it finds, and this is going to cause a lot of problems if we have CSV data that has quoted new lines within a field. So the first anti-pattern that we have here deals with treating structured or semi-structured data as if it were unstructured text data and trying to use simple patterns to parse it. We really should be using a proper CSV parser.

Now, the reason this flow is using regular expressions is to promote certain fields from that data into flow file attributes. This allows the user to then route on those attributes and update the data using the UpdateAttribute processor; then all those attributes are combined together again into JSON using this AttributesToJSON processor. It definitely makes sense to sometimes extract information from a flow file's content into an attribute, but what this flow is doing is really blurring the lines between flow file content and attributes: it's extracting all of the fields from the content into attributes and back again. We really want to keep the notions of flow file content and flow file attributes separate. Attributes are designed to hold key-value pairs, such as metadata and processing information; they give context to the data. The data itself, though, should remain in the content of the flow file. Creating a bunch of attributes, especially really big attributes, becomes very expensive for NiFi. So blurring the line between flow file content and flow file attributes is our second anti-pattern.
The third anti-pattern that we have is this notion of SplitText and MergeContent, or splitting and re-merging the data. Again, there are definitely reasons that we might want to do this in a data flow, but we really want to avoid it if we can. To fully understand why, you really have to understand a lot about NiFi's architecture and how it stores the data in the flow file, content, and provenance repositories, and that would be a good topic for another whole video. But for now, just consider that in order for the data to flow through the system, the NiFi framework has to do some level of work. So if we take a flow file that's a megabyte and contains 10,000 lines of text, and we break that apart into 10,000 different flow files, we've now effectively taken the work that the NiFi framework has to do and multiplied it by about 10,000. This also means that the amount of storage space required to store the provenance data is multiplied by a factor of 10,000, so we end up not being able to hold nearly as much of that provenance data and lineage information.

It also makes the lineage much harder to follow. For example, if we take a look at the data provenance of the PutFile processor and we look at the lineage for this particular flow file, we'll see that some data was joined together, and if we choose to find its parents, we see a lot of different things going on in this data lineage. It really makes it difficult to understand exactly what's happening throughout this entire flow. And if we look at the event information for this drop event, we can see a lineage duration of 47.54 seconds. That means the latency through the system is about forty-seven and a half seconds for this particular piece of data, so it's not a particularly efficient data flow at all. We should be able to process this data dramatically faster than that.

So what can we do differently? Ideally, we'd like to build a flow that's easier to understand and maintain. We want to process the data much more efficiently, and most importantly, we want to ensure that we're processing the data correctly, using parsers for our data rather than regular expressions that we've come up with and splitting on new lines. So let's take a look at what that would look like: we'll move up one level here and go to our revised approach.
All right, so now we get to the fun part: talking about how we can completely redesign this flow. We're going to start with the same two processors, ListFile and FetchFile, to bring data in, but then all of the processing is going to be dramatically simpler. What we had before was a combination of about six different processors that were responsible for breaking that data up, enriching it, doing some routing on it, converting the CSV data into JSON, and merging it all back together. In this version of the flow, we're able to get rid of all six of those processors and replace them with just these two processors of type QueryRecord. Out of all the processors in the entire NiFi distribution, this one has really got to be my favorite, and yes, I do have favorite processors, and you probably should too. This processor is an absolute beast, and we'll talk about how we can configure it and what it's doing in just a second. But we'll notice that we've got the same output here: if any purchase is more than $1,000, we're going to go ahead and use a PutEmail processor just like we did before, and the final result is going to be put into the file system using PutFile, just like we did before.
What we're not doing in this particular flow, though, is using a SplitText processor to break the data apart and re-merge it back together. That's because these QueryRecord processors, and any of the record-based processors for that matter, are really designed to take in a stream of records in a single flow file and operate on each of those records independently, instead of breaking them apart. And if we look at QueryRecord and its configuration, we'll see that it's actually really simple to configure. It has three key properties here. First, a record reader, which allows us to configure how we actually parse that incoming data. In this case we're using a CSV reader, so the incoming data is going to be CSV, and we want to parse it using a proper parser, rather than splitting on new lines and using regular expressions to find all the delimiters and pick out the little fields we care about. Instead of doing all of that, we'll just choose a CSV reader and it'll handle all of that for us, and it will do it according to the CSV specification, so we don't have to worry about our patterns being incorrect. The next thing we're going to configure here is a record writer: after we've parsed that data, we're going to run some SQL over it, and then write the results back out. In this case we're saying that we want to write the result out in CSV as well. And then we're going to define what it is that we actually want to do with that data
once we've parsed it and before we write it out. In this case, all we really want to do is pick out any records that have a purchase total of more than 1,000, and that's what this SQL says: select asterisk, which means all of the fields from the flow file, where the purchase total field is more than 1,000 dollars. If there are any records that match that criteria, they're going to be written out to this relationship, and we can see that this relationship is again routed to PutEmail. Then we're going to send the original flow file: regardless of what matched or didn't match in that QueryRecord processor, we're going to go ahead and send the original flow file down to the second QueryRecord processor. The second QueryRecord processor is again going to be responsible for parsing CSV data, because that was the original relationship and the data, as it came in from the file system, was in CSV format. But this time we want the output to be in JSON, because that's what we actually want to go out to our file system; that's what we want the end result to be. So we can just choose a record writer, the JSON record set writer, and that's all we have
to really configure. Then we get to choose what SQL we actually want to run over this data in order to transform it, or to choose which fields or which records we care about. In this case, we want to do our enrichment right here, inline. So we're going to say select asterisk, to select all of the fields, and we also want to select this additional field from the flow file. This additional field is going to use the Expression Language to pick out the current date and time, and then we're going to format it using the month/day/year hour, minute, and second, just like we did in the previous flow, and then we're going to call that new field process time. That's all we have to do to select all the existing fields and then add this new enrichment field into our flow file. Then we're going to write that data out, like I said, using the JSON record set writer. So this is why I love this processor: it's extremely powerful, it's easy to use, it's easy to configure, and it's extremely efficient.

We talked about how inefficient the previous flow was, and I promised you that this version of the flow was going to be a lot more efficient, so let's take a look and see how much more efficient it actually is. I'm going to click start and then immediately refresh the results, and we can see that we've already processed almost all of the data; we've now finished processing all of that data within that amount of time. So if we come down here and look at the provenance data for the PutFile processor, just like we did before, and we look at the event details for this particular drop event: what we saw before was about 47.5 seconds, and with this version of the flow, that time has now been reduced to 0.885 seconds, or 885 milliseconds. So we went from about 47 or 48 seconds to less than one second, so we
saw this flow performing about 50 times faster than the previous version of the flow. It's also worth noting that if we come over here and look at the lineage for this particular flow file, we see a fork event; we can expand that fork event, find its parents, and see everything that happened to that flow file. We don't see a hundred different children getting forked and then joined back together, so it's much easier to look at this lineage and understand exactly what happened to the data all the way through the system.

And so there you have it. With this version of the flow, we've eliminated the splitting and re-merging of the data, we've avoided blurring the line between flow file content and flow file attributes, and we've treated our structured and semi-structured data as actual data that can be parsed using a legitimate CSV parser and JSON record set writer. This allows us to avoid all of those common mistakes that we're going to make if we try to treat it as unstructured textual data. So we've really avoided all three of those anti-patterns with this version of the flow, and as we saw, it's not only a lot simpler to look at, read, and understand, but it's dramatically more efficient: the processing time went from about 47 seconds to about 800 milliseconds, which is well over an order of magnitude faster.

I hope you liked the video and I hope you learned a lot. If so, please do like the video and share it with anybody else you know who may be interested. Thanks a lot for taking the time to watch it.