Azure Stream Analytics with Event Hubs
Summary
TL;DR: Dustin Vannoy provides a comprehensive overview of Azure Stream Analytics, covering its setup, features, and capabilities. He explains how to create stream analytics jobs, emphasizing the platform's ease of use and seamless integration with Azure. Vannoy demonstrates how to set up Event Hubs and utilize them for data streaming, highlighting key features like partitioning and retention. He also shows how to use Stream Analytics Query Language for data processing and aggregation, while touching on advanced features like joining reference data and connecting to Power BI for real-time reporting.
Takeaways
- 📈 **Azure Stream Analytics Overview**: Dustin Vannoy introduces Azure Stream Analytics, emphasizing its ease of use for streaming within Azure due to its tight integration but noting its limited inputs and outputs.
- 🚀 **Serverless Auto Scaling**: Highlights the serverless nature of Stream Analytics, which auto scales based on the workload, making it a good fit for certain use cases within the Azure ecosystem.
- 🔒 **Data Security Considerations**: Mentions the importance of considering data security and protection, especially for production workloads, and references a helpful article in the Azure documentation.
- 📦 **Setting Up a Stream Analytics Job**: Provides a step-by-step guide on creating a new Stream Analytics job, including selecting a resource group and configuring streaming units.
- 🌐 **Event Hubs Integration**: Details the process of setting up Event Hubs for Azure streaming, including choosing a resource group, location, and pricing tier.
- 📝 **Input and Output Configuration**: Explains how to configure inputs and outputs for a Stream Analytics job, including selecting the serialization format and encoding.
- 🔍 **Query Language and Transformation**: Discusses the use of Stream Analytics query language for data transformation, including the ability to filter, aggregate, and join data streams.
- 🔗 **Joining with Reference Data**: Demonstrates how to join a Stream Analytics job with reference data, such as a SQL database, to enrich the data stream with additional context.
- ⏰ **Tumbling Window Function**: Introduces the concept of a tumbling window in Stream Analytics, which is used for time-based data aggregation.
- 📊 **Real-Time Reporting with Power BI**: Suggests the possibility of using Stream Analytics to create live reports in Power BI, showcasing the service's capability for real-time analytics.
- 🔧 **Monitoring and Testing**: Emphasizes the importance of monitoring the Stream Analytics job and testing the query with sample data to ensure it produces the expected results.
Q & A
What is Azure Stream Analytics?
-Azure Stream Analytics is a serverless, auto-scaling service that makes it easy to perform real-time analytics on streaming data within Azure. It is tightly integrated with other Azure services and offers features like easy setup, limited inputs and outputs, and the ability to scale with the workload.
What are the limitations of Azure Stream Analytics in terms of data sources?
-Azure Stream Analytics has limited support for data sources. While it works well with Azure Event Hubs, it does not natively support other sources like Apache Kafka or Confluent Cloud unless the data is also streamed into Event Hubs.
How does Azure Stream Analytics handle scalability?
-Azure Stream Analytics automatically scales based on the workload, which means it can handle varying amounts of data without requiring manual intervention for scaling.
What is an Event Hub in Azure and how does it relate to Stream Analytics?
-An Event Hub in Azure is a big data streaming platform and event ingestion service that can receive and process millions of events per second. It is used as an input or output for Azure Stream Analytics jobs, allowing for the streaming of large amounts of telemetry data from various devices or applications.
What is a tumbling window in the context of Stream Analytics queries?
-A tumbling window in Stream Analytics is a type of windowing function that is used to divide the incoming data stream into a series of non-overlapping time frames. Data within each window frame is grouped and processed separately, often used for aggregations like sum, average, or count.
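As a rough illustration only, a tumbling-window aggregation in Stream Analytics Query Language might look like the sketch below; the input and output aliases and the column alias are placeholders, not taken from the video.

```sql
-- Minimal sketch: count events per non-overlapping 1-minute window.
-- [eh1-input] and [eh-out1] are assumed input/output alias names.
SELECT COUNT(*) AS events_per_minute
INTO [eh-out1]
FROM [eh1-input]
GROUP BY TumblingWindow(minute, 1)
```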
How can Azure Stream Analytics be used with Power BI for real-time reporting?
-Azure Stream Analytics can output data directly to Power BI, allowing for the creation of live reports that reflect real-time data streams. This integration is useful for data engineers and analysts who need to provide up-to-date insights and visualizations to end-users.
What is the significance of partitioning in Event Hubs?
-Partitioning in Event Hubs allows for scaling of the consumer workload. Each partition holds a portion of the data, and consumers can read from these partitions independently. This enables parallel processing and can improve throughput and performance.
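As a hedged sketch, a Stream Analytics query can also be written to process each Event Hub partition independently; the example below assumes input/output aliases [eh1-input] and [eh-out1] and uses the built-in PartitionId column that Event Hub inputs expose.

```sql
-- Process each Event Hub partition on its own so the job can parallelize
-- across partitions (alias names are assumptions for illustration).
SELECT PartitionId, COUNT(*) AS events_in_window
INTO [eh-out1]
FROM [eh1-input] PARTITION BY PartitionId
GROUP BY PartitionId, TumblingWindow(minute, 1)
```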
How does Azure Stream Analytics ensure data security?
-Azure Stream Analytics provides options for securing data, including access policies that control permissions and roles, network settings to limit access to specific virtual networks, and encryption options for data at rest.
What is the role of a reference data set in Azure Stream Analytics?
-A reference data set in Azure Stream Analytics is used to join with the streaming data to enrich it. It is typically static or slowly changing data that provides additional context or information to the real-time data stream, such as mapping vendor IDs to taxi zones.
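A minimal sketch of such an enrichment join is shown below; the [zone-sql] reference input and all column names are assumptions chosen to mirror the taxi example, not an exact copy of the demo.

```sql
-- Enrich each streaming event with a zone name from a reference input.
-- [eh1-input] is the streaming input, [zone-sql] the reference data set.
SELECT
    t1.vendor_id,
    t1.trip_distance,
    t2.zone
INTO [eh-out1]
FROM [eh1-input] t1
JOIN [zone-sql] t2
    ON t1.pu_location_id = t2.location_id
```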
How can one test a query in Azure Stream Analytics?
-In Azure Stream Analytics, one can test a query using the 'Test Query' feature, which allows users to see the results of the query based on the most recent data received from the input sources. This helps in validating the query logic before running the job at full scale.
What are streaming units in the context of Azure Stream Analytics?
-Streaming units in Azure Stream Analytics represent the computational resources allocated to a Stream Analytics job. They determine the job's processing capacity and can be adjusted according to the workload requirements.
Outlines
😀 Introduction to Azure Stream Analytics
Dustin Vannoy introduces Azure Stream Analytics, discussing its ease of use for streaming within Azure and its tight integration with the platform. He notes the limitations regarding inputs and outputs, suggesting that it's a good fit for Azure-centric use cases or when data is being brought into Azure from other sources like Kafka clusters. Dustin then demonstrates setting up a Stream Analytics job, touching on aspects like resource group selection, streaming units, and data protection considerations. He also mentions the quick deployment time and the availability of APIs for programmatic interaction.
📚 Setting Up Event Hubs for Azure Streaming
The paragraph covers the process of setting up Event Hubs for Azure streaming. It explains choosing a resource group, location, and the pricing tier, emphasizing the differences between Basic and Standard tiers, especially for Apache Kafka compatibility. Dustin also discusses throughput units, auto-inflate options, and the creation of an Event Hub namespace. Access control and access policies are highlighted as important considerations. The paragraph further details creating an Event Hub, including partition count for scalability, message retention options, and the capture feature for long-term data storage in Azure Storage.
🔌 Configuring Inputs and Outputs for Stream Analytics
This section details the configuration of inputs and outputs for a Stream Analytics job. Dustin adds an Event Hub as an input source, selecting a namespace and an existing Event Hub. He talks about creating a consumer group, the use of connection strings, and the selection of serialization format and encoding. For outputs, he chooses to write back to Event Hubs, creating a new Event Hub for output and discussing the importance of partition keys and data serialization format. The paragraph also touches on the Stream Analytics query language and the ability to use functions like Azure ML or JavaScript within the job.
🔍 Querying and Aggregating Data with Stream Analytics
Dustin dives into writing queries for Stream Analytics, starting with a basic 'select star' query that sends all incoming data to an output Event Hub. He demonstrates how to select specific fields and use filters, as well as how to perform aggregations like summing and averaging. The paragraph explains the use of a tumbling window for time-based data aggregation and the importance of defining the window duration. Dustin also discusses the use of reference data, such as joining with a SQL database to enrich the stream with additional information, and testing the query to ensure accuracy.
🚀 Starting the Stream Analytics Job and Viewing Results
The final paragraph outlines the steps to start a Stream Analytics job, including checking the number of streaming units and monitoring the output in the Event Hub. Dustin explains how to view the results using the 'Process data' feature, which allows for real-time monitoring of the output data. He also briefly mentions the capability to output data directly to Power BI for live reporting, highlighting the versatility of Stream Analytics for various use cases. The paragraph concludes with an invitation to follow for more content and a prompt to check out Dustin's website.
Keywords
💡Azure Stream Analytics
💡Event Hubs
💡Streaming Units
💡Data Protection
💡Auto-inflate
💡Partition Count
💡Serialization Format
💡Stream Analytics Query Language
💡Tumbling Window
💡Reference Data
💡Power BI
Highlights
Azure Stream Analytics is a serverless, auto-scaling service that allows for easy streaming within Azure.
It is tightly integrated within Azure but has limited input and output options, making it suitable for specific use cases.
Stream Analytics is particularly useful for Azure-based data streaming, or when bringing data into Azure from other sources like Kafka clusters.
The service can be quickly set up through the Azure portal, with a user-friendly interface and available REST APIs for automation.
Data security and protection options are available and should be considered for production workloads.
Event Hubs can be set up for Azure streaming, with options for Apache Kafka compatibility.
The choice of pricing tier in Event Hubs is crucial, with options ranging from Basic to Premium, affecting features like consumer limits and connections.
Stream Analytics jobs can be created and customized with inputs, outputs, and queries directly in the Azure portal.
The query language used in Stream Analytics is similar to SQL, with support for aggregations, filters, and windowing functions.
Reference data can be integrated into Stream Analytics queries, allowing for complex joins and transformations.
Data from Stream Analytics can be outputted to various sinks, including another Event Hub, for further processing or consumption.
The system supports real-time data processing and can be used to feed live reports in services like Power BI.
The input preview feature in Stream Analytics allows users to test queries with real-time data before full deployment.
Stream Analytics jobs can be monitored and managed through the Azure portal, with insights into message throughput and system performance.
Security features like access policies and encryption options are available to protect data within Event Hubs.
The ability to auto-inflate throughput units in Event Hubs ensures scalability to handle varying data loads.
Stream Analytics supports a range of serialization formats and encoding options to accommodate different data types and sources.
The demonstration provided a practical walkthrough of setting up and using Azure Stream Analytics for real-time data streaming and processing.
Transcripts
hey dustin vannoy here i'm going to
share with you a bit about azure stream
analytics and show you how we get that
set up talk about some of the features
and capabilities and then we'll look at
creating a stream analytics job or two
so stream analytics is a really easy way
to do streaming within azure it's very
tightly integrated within azure but it
also has limited inputs and outputs so
what that means is that if you're
working with azure and you find a use
case that fits well it's an easy
solution it's serverless auto scale some
really good features
but if you're working with a variety of
sources like apache kafka not event
hubs or event hubs for apache kafka but
a true apache kafka or confluent cloud
setup it's not going to work for you
unless you start
also streaming that data into event hubs
which is all very possible and
reasonable but basically the point is
that if you're within azure if you're
looking for an easy way to do stream
analytics or you're bringing data into
azure from another you know kafka
cluster or something like that it might
be a good fit for you so without further
ado let's take a quick look and
see what we think of this
let's first set up our stream analytics
job and that'll help us get a feel for
what capabilities exist there
if we find stream analytics in the
portal and create a brand new job
demo stream analytics
very creative name i know and then we
will select a resource group i have a
streaming resource group that it'll fit
well in
hopefully that's the right location
we'll just keep going and then streaming
units at defaults to three i'm okay with
that for this
uh scenario
you may want to secure private data
assets you may want to look into some of
your options around data protection
there's a nice article in the docs that
i think will give you a lot of good info
if you have an edge case where you're
not hosting in the cloud you're actually
hosting elsewhere on edge devices
definitely look into that but most
cases at least for uh getting started
testing it out we don't need to add all
this uh data security quite yet
definitely for production workloads
within your company take a look at that
and make the right decisions for your
use case
so as it's deploying we actually have
some interesting things we can see one
is that deployed really quickly so
that's nice like i said easy to get
started easy to use especially if you're
trying to do this from the ui there are
uh apis so that you can
do some things with stream analytics
jobs from
you know using some kind of client that
you write that's calling rest apis
but we'll just stick with the ui for
this demo
that deployed pretty quickly and now we
can take a look at
uh the next steps for our job
so if we want to set up event hubs for
our azure streaming we can jump to the
event hubs page choose create
pick a
valid resource group and give it some
sort of name
okay i have that created and then you
need to choose a location usually you're
going to put it near your other
resources and then the pricing tier is
interesting if i'm doing a demo there is
this basic uh very you know very strict
number of consumers and broker
connections type of
option but i'm going to use event hubs
for apache kafka in some of my examples
so let me go and choose standard you
cannot do event hubs for apache kafka
with basic and then of course if you're
setting this up for your company and
production environment you might want to
take a look at this preview premium
option
and other options that are available
throughput units this is actually for
the whole namespace so if i create a lot
of event hubs then
this could become
you know something i need to spin up
more for a demo one is fine
auto inflate is a good thing to know
exists i'll go and say one to two which i
probably will never even inflate with
the example i'll do but just to make
sure we have a little bit of flexibility
in how much throughput we can handle
you have the option to choose tags and
then you'll review the options you've
selected
and choose create
so once your event hub namespace is
created you can go and set some various
settings the access control is something
you'll always need to think about within
azure
and we can
add some access policies which are
typically used in this case i'll start
by doing some stream analytics example
which will add one and then i might just
use this
default root policy which probably
isn't what you'll do in production
you'll probably want to limit it to just
produce or just consume send or listen
but for now we'll just stick with that
for a demo i can go change my throughput
units after i've done it i can go think
about
geo recovery i have network settings
which is pretty typical to limit it to
some internal azure resources
some internal virtual networks
encryption is something you certainly
want to think about i often am pretty
comfortable not using a customer managed
key but
if it's you know at a company i let
them make that decision of course
and then
we can always go down and view our
properties
now we get to the good stuff under
entities we have a schema registry which
is an option that it's not going to work
exactly like a
confluent schema registry but it's
something you can work with within event
hubs maybe i can come back and talk
about that another time for now though
let's take a look at
the event hubs page
so here we are on the event hub section
and this is where i can actually create
an event hub
so when you go to create an event hub
basically what you're defining is here's
my event hub name and then the partition
count and so partition count is going to
let you
decide how much you can scale your
consumers if i only have one partition
then all of my data is going to one
single partition and i can really only
consume from that partition whereas if i
have let's say three partitions i could
have three separate consumers all with
the same consumer group id
and those will then each read only the
messages that they should right so
it's going to kind of split our messages
into three
groups if you will and they'll each grab
a piece of that and that way you're
running
in parallel with three different
consumers
so we have the option of one to seven
days here for message
retention
and uh you know you may only need it for
a day or two just in case you need to
catch up or replay some data i could
also choose to turn on capture which is
going to store that data in azure
storage and so then i could store it
indefinitely instead of only for seven
days
you go ahead and give it a name click
create and now you've got an event hub
you can start to work with and again you
can either take that root policy with
all of the permissions it could have or
you could create one specific for an
event hub if you're really trying to
lock this down to
only those that need to use it
so here i'm in a stream analytics job i
created and it's blank let me go ahead
and add the inputs and we'll talk about
that as we go
so we only have the three options i'll
go ahead and add event hub here
and then we can give it a name and this
is really just for stream analytics so
i'll call it eh1 input sure
and then we need to pick an event hub's
namespace i'll go and choose
demo eh2
we can use an existing event hub that i
already set up i just find it's easier
to set up the event hubs in advance and
then
we'll go ahead and let it create a new
consumer group i don't normally have
problems with that and then i have
trouble getting managed identity to work
you'd have to go set that yourself
typically at least in my environment it
fails quite a bit
so i just go ahead and use a connection
string for demos at least and then
you can either create a new event hub
policy for this or you can use existing
i like to create new if i have the
permissions it just it gives it a
default name that's very obvious to me
where it's coming from
and that's all taken care of
you would want to think about partition
key if you're using event hubs
especially if you have a lot of data
coming through we won't get into that
right now let's focus on sort of just
setting up this technology for a basic
scenario
the serialization format you have a few
options to choose from we will go ahead
and stick with json for ours this is
really what the input data is going to
be coming from event hubs so you may not
have the choice in the real world
someone else may be producing the data
for you and you just need to find out
that
that format that they're using to
serialize that data
encoding is always going to be utf-8 at
this point event compression type you
have a few options if you're going to
deal with compressed data
we'll click save that should be able to
create that policy i think i have all
the right permissions here and then we
have an input that we can use in our
stream analytics job
let's go and decide what our output will
be so we have an input and an output and
then we get to define the query and
potentially functions and lookup data if
we choose to we'll go ahead and just
write it back to event hubs
a data engineering practice that i think i
want to show you is if we are doing some
processing maybe some aggregation
we would often want to keep that data
streaming for multiple consumers and so
if everything's going to be done by us
in stream analytics then we could have
multiple outputs here but really let's
pretend that we've got additional
consumers maybe even
some kind of reporting micro service
that goes back to the customers and we
want to make sure they can get this
exact same data so for for this piece of
the stream analytics job we're going to
just write it back to event hubs
we'll call that eh out i'll use eventhub
namespace and
create a new eventhub topic
demo out
there we go
okay so we have event hub name demo out
i'll go ahead and do a connection string
we'll let it create a brand new one for
us
partition key i'm not really going to
worry about this for the example
i'm not going to work with custom
property columns
and i'll go ahead and have it write the
data out as json that way it's the most
portable for the different consumers i
expect to have
typically with analytics and streaming
systems we'll do line separated json so
you'll have multiple json objects
within a file maybe within a batch that gets
sent from stream analytics
and then the
encoding is going to still be utf-8
within stream analytics
i'm mostly going to be working on adding
a query that will transform the data i
do want to point out very quickly that
there's this functions feature where you
can have an azure ml service or ml
studio or a javascript function
so the really important piece of the
stream analytics job is the query itself
and this is where that stream analytics
query language comes into play notice
you can jump here and take a look at
the docs right from
stream analytics job which is
really handy because if you're new to
this you'll probably need to check those
out for the exact functionality you're
looking for
the select star means select all of the
fields that will be in the message
and where do we want that to go we're
going to put it into our output event
hub eh out one
and from is going to be
eh1 input
so if i were to run this all i'd be
doing is taking data from one event hub
and writing it to another event hub
there's a chance we'd want to do that
probably not with event hubs though i
don't think that's very likely to be our
case let's start looking at some of the
other capabilities we have
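In other words, the pass-through version of the job described above boils down to a query along these lines; the aliases only approximate the names used in the demo.

```sql
-- Pass-through: forward every field of every event from the input event hub
-- to the output event hub (aliases approximate the demo's eh1 input / eh out 1).
SELECT *
INTO [eh-out1]
FROM [eh1-input]
```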
first before we can really do much here
i need to have data flowing and then i
have some handy features like input
preview that we can use
so i'll go kick off my producer and once
we have some data we can start to build
out this query a little more
for the time being i'll go ahead and
save it
let's take a look at azure stream
analytics hands-on and i'll do this with
the stream analytics job
so here we are in azure stream analytics
in the query pane
and i've already set up an
output topic and an input topic these
are both event hub topics at first
and what's going to happen is if i kick
this off is we will select every record
from my input data and send it along to
the output topic
if we take a look at our
sample data towards the bottom
we can see that i've got the new york
taxi trip data we can see our column
names and get a glimpse of the data
types and from there i could actually
instead of doing select star i could
start to choose
just a few of the fields to work with so
for example if i do
trip distance
and passenger count
and vendor id
and do a test query
it's going to show me a much you know
smaller number of columns and it's going
to show me the first
50 rows from the recent data that's
streamed in
in addition i could do
filters so a lot of the things that
you'd expect to do with sql especially
things that would work in t-sql
are available here um the more you get
into functions and the more advanced
things that the less likely it is to
match up but that's why the query
language docs are so helpful and they're
right here to open up and take a look at
what you care about
so for example if i wanted to only get
vendor 2
i could easily add vendor id
equals 2
so here i'm querying only for
vendor id equals two just by adding that
where clause then you can see that it's
narrowed down those results so that's
pretty good
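A query along the lines of what is described here might look like the following sketch; the column names follow the NYC taxi sample data and the aliases are approximations, not an exact copy of the demo.

```sql
-- Select a few fields and filter to a single vendor (names approximate the demo).
SELECT trip_distance, passenger_count, vendor_id
INTO [eh-out1]
FROM [eh1-input]
WHERE vendor_id = 2
```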
i can also do aggregation so let's go
and take a look at an aggregate
as you probably know from running sql
queries we would need to use aggregate
functions so i can
sum passenger count
for a total number of passengers
i can sum
trip distance
and then maybe we'll do a pretty simple
calculation so we could do average of
the tip amount and total amount
let's go ahead and continue on this
group by
so if i hover over i'll see that it's
telling me that i need to have a group
by statement or i need to
use the over clause let's go and add a
group by vendor id
and then what it's looking for as well
is a window
basically data is going to continue to
stream so we need a way to define
what what window of time what section of
time do we want to aggregate this data
for
so let's go ahead and use a tumbling
window
give it a duration
and that will take care of the warnings
we're getting
what it's going to do is there's an event
time that comes in with this data from
event hubs and it's going to by default
use that for my window we really would
probably go back and take you may have
seen there's a pickup time that we might
use instead and basically you would just
define that on the from clause we'll
just let it use the default event
timestamp and run this test query and
see what we get
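Putting those pieces together, the aggregation described here would look roughly like the sketch below; column names approximate the taxi data set, and the TIMESTAMP BY line shows how a pickup time could be used instead of the default enqueue time.

```sql
-- Aggregate per vendor over a 1-minute tumbling window.
-- TIMESTAMP BY is optional: it switches the window from the default event
-- enqueue time to an application timestamp such as a pickup time (assumed column).
SELECT
    vendor_id,
    SUM(passenger_count) AS total_passengers,
    SUM(trip_distance)   AS total_distance,
    AVG(tip_amount)      AS avg_tip,
    AVG(total_amount)    AS avg_total
INTO [eh-out1]
FROM [eh1-input] TIMESTAMP BY pickup_datetime
GROUP BY vendor_id, TumblingWindow(minute, 1)
```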
now i'll go ahead and produce some more
data and we'll see how this changes
things
so if we go back and refresh our input
preview that'll change the data that
we're looking at
now when i run my test query
i have numbers that are different than
before
likewise if i keep refreshing the input
then those numbers will change what that
means is i could go ahead and save this
and run this and we should see every
minute we're getting results
now this is okay but vendor id there's
only two vendors right now and the id
itself doesn't give me a lot of
information what i'd rather do is use
the taxi zone so let's see how we can
take some reference data as input and
find the right value and use that for
the grouping
so i have my query based on vendor id
with the group by and the select
clauses using it but there's only two
vendors and so i really want to go and
join that into the taxi zone data so i
can see what zone the pickup happened in
in order to do that we need to add a
reference data set and then we'll add a
join
to get my zone data and change from
vendor to zone i need to add a reference
input that reference input the easiest
option i've seen is the sql database so
we'll go ahead and get that set up
and
i have a serverless sql instance going i
can go ahead and
enter my
connection information here
and then i need to update the sql
statement to be valid for my database
okay and now i can save
now i have reference data and i can
actually join that in
so to get zone sql joined in i'll alias
my first input
i'll go ahead and
add in zone sql set up the join clause
and now i need to take this t1 alias and
use it in front of each of the fields
that comes from that initial input
and then rather than use vendor id i'm
actually going to replace it with t2
zone which is
the name of my taxi zone
now i can test out this query and see if
i got all of that syntax correct
and there we go we now have uh many
more records each time this tumbling
window runs and we can see our data by
zone which i think would be a little
more realistic than just the vendor id
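The joined version of the query described above would look roughly like this sketch; the aliases and column names approximate what appears in the demo rather than reproduce it exactly.

```sql
-- Same aggregation, but grouped by taxi zone from the SQL reference input
-- instead of vendor id (t1 = streaming input, t2 = zone-sql reference data).
SELECT
    t2.zone,
    SUM(t1.passenger_count) AS total_passengers,
    SUM(t1.trip_distance)   AS total_distance
INTO [eh-out1]
FROM [eh1-input] t1
JOIN [zone-sql] t2
    ON t1.pu_location_id = t2.location_id
GROUP BY t2.zone, TumblingWindow(minute, 1)
```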
then you can go ahead and save that
query and kick off your job and you'll
see data in the output event hub
to kick off your job you can go to the
overview page and click on start it'll
take just a little bit of time to spin
up these resources for you i've got the
three streaming units check this before
you run it because this is where it
starts to cost you a little bit of money
kick off my job that'll take a little
bit of time and then i'll check and make
sure that i'm getting output in my event
hub
but now that it's run for a little bit
it's produced some events it's hit a
tumbling window
of one minute and it's output 23 events
let's go take a look at our output
topic and see those results
so from my output event hub i can scroll
down to the monitoring and this will
show me that some messages have been
received and then i can go and
use process data which is actually going
to take me to this other way of
getting a stream analytics job so this
is like a stream analytics query built
into
viewing the events coming out of
event hub so we get to that same view
same
stream analytics query language will
work we'll keep it simple and just kind
of run to see the test results
and like before
there we go
and so there might be a little lag but it
should come in pretty soon feel free to
hit it once or twice
and then if i'm looking through here i
have this event enqueued time and i
should see that the same
zone might show up multiple times for
different tumbling windows and the event
enqueued time is what will help me catch
that
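When inspecting the output topic through process data, that enqueue time can be selected explicitly; a minimal sketch, assuming the output events carry zone and total_passengers fields as in the earlier query, would be:

```sql
-- EventEnqueuedUtcTime is populated automatically for Event Hub inputs, so the
-- same zone appearing with different enqueue times indicates different windows.
SELECT EventEnqueuedUtcTime, zone, total_passengers
FROM [demo-out]
```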
now in our example i was outputting to
another event hub topic which i think is
fairly typical for data engineers to
have some parts of their processing that
are streaming read from event hubs or a
similar broker and write back out to
that broker to a different
topic and the reason for that is so that
multiple consumers can read the data
however you may have other use cases
where you need to write directly to
storage with stream analytics and so
there's quite a few outputs we can
choose from one of the ones i
find really interesting is power bi i
can actually set this up to do a live
report within my power bi service
so obviously you'll need power bi set up
you'll need to authorize and all that
but
just so you know that capability is
out there if you're trying to do
real-time reports this might be a way
you pull that off
so that's our hands-on look at azure
stream analytics i hope that you learned
something along the way don't forget to
subscribe to this channel or check out
dustinvannoy.com for more content i'll
see you next time