Azure Stream Analytics with Event Hubs

Dustin Vannoy
5 Nov 2021 · 21:15

Summary

TL;DR: Dustin Vannoy provides a comprehensive overview of Azure Stream Analytics, covering its setup, features, and capabilities. He explains how to create stream analytics jobs, emphasizing the platform's ease of use and seamless integration with Azure. Vannoy demonstrates how to set up Event Hubs and utilize them for data streaming, highlighting key features like partitioning and retention. He also shows how to use Stream Analytics Query Language for data processing and aggregation, while touching on advanced features like joining reference data and connecting to Power BI for real-time reporting.

Takeaways

  • πŸ“ˆ **Azure Stream Analytics Overview**: Dustin Vannoy introduces Azure Stream Analytics, emphasizing its ease of use for streaming within Azure due to its tight integration but noting its limited inputs and outputs.
  • πŸš€ **Serverless Auto Scaling**: Highlights the serverless nature of Stream Analytics, which auto scales based on the workload, making it a good fit for certain use cases within the Azure ecosystem.
  • πŸ”’ **Data Security Considerations**: Mentions the importance of considering data security and protection, especially for production workloads, and references a helpful article in the Azure documentation.
  • πŸ“¦ **Setting Up a Stream Analytics Job**: Provides a step-by-step guide on creating a new Stream Analytics job, including selecting a resource group and configuring streaming units.
  • 🌐 **Event Hubs Integration**: Details the process of setting up Event Hubs for Azure streaming, including choosing a resource group, location, and pricing tier.
  • πŸ“ **Input and Output Configuration**: Explains how to configure inputs and outputs for a Stream Analytics job, including selecting the serialization format and encoding.
  • πŸ” **Query Language and Transformation**: Discusses the use of Stream Analytics query language for data transformation, including the ability to filter, aggregate, and join data streams.
  • πŸ”— **Joining with Reference Data**: Demonstrates how to join a Stream Analytics job with reference data, such as a SQL database, to enrich the data stream with additional context.
  • ⏰ **Tumbling Window Function**: Introduces the concept of a tumbling window in Stream Analytics, which is used for time-based data aggregation.
  • πŸ“Š **Real-Time Reporting with Power BI**: Suggests the possibility of using Stream Analytics to create live reports in Power BI, showcasing the service's capability for real-time analytics.
  • πŸ”§ **Monitoring and Testing**: Emphasizes the importance of monitoring the Stream Analytics job and testing the query with sample data to ensure it produces the expected results.

Q & A

  • What is Azure Stream Analytics?

    - Azure Stream Analytics is a serverless, auto-scaling service that makes it easy to perform real-time analytics on streaming data within Azure. It is tightly integrated with other Azure services, is easy to set up, and scales with the workload, though it supports only a limited set of inputs and outputs.

  • What are the limitations of Azure Stream Analytics in terms of data sources?

    - Azure Stream Analytics has limited support for data sources. While it works well with Azure Event Hubs, it does not natively support other sources like Apache Kafka or Confluent Cloud unless the data is also streamed into Event Hubs.

  • How does Azure Stream Analytics handle scalability?

    - Azure Stream Analytics automatically scales based on the workload, which means it can handle varying amounts of data without requiring manual intervention for scaling.

  • What is an Event Hub in Azure and how does it relate to Stream Analytics?

    - An Event Hub in Azure is a big data streaming platform and event ingestion service that can receive and process millions of events per second. It is used as an input or output for Azure Stream Analytics jobs, allowing for the streaming of large amounts of telemetry data from various devices or applications.

  • What is a tumbling window in the context of Stream Analytics queries?

    - A tumbling window in Stream Analytics is a type of windowing function that is used to divide the incoming data stream into a series of non-overlapping time frames. Data within each window frame is grouped and processed separately, often used for aggregations like sum, average, or count.

  • How can Azure Stream Analytics be used with Power BI for real-time reporting?

    - Azure Stream Analytics can output data directly to Power BI, allowing for the creation of live reports that reflect real-time data streams. This integration is useful for data engineers and analysts who need to provide up-to-date insights and visualizations to end-users.

  • What is the significance of partitioning in Event Hubs?

    - Partitioning in Event Hubs allows for scaling of the consumer workload. Each partition holds a portion of the data, and consumers can read from these partitions independently. This enables parallel processing and can improve throughput and performance.

  • How does Azure Stream Analytics ensure data security?

    - Azure Stream Analytics provides options for securing data, including access policies that control permissions and roles, network settings to limit access to specific virtual networks, and encryption options for data at rest.

  • What is the role of a reference data set in Azure Stream Analytics?

    - A reference data set in Azure Stream Analytics is used to join with the streaming data to enrich it. It is typically static or slowly changing data that provides additional context or information to the real-time data stream, such as mapping vendor IDs to taxi zones.

  • How can one test a query in Azure Stream Analytics?

    - In Azure Stream Analytics, one can test a query using the 'Test Query' feature, which allows users to see the results of the query based on the most recent data received from the input sources. This helps in validating the query logic before running the job at full scale.

  • What are streaming units in the context of Azure Stream Analytics?

    - Streaming units in Azure Stream Analytics represent the computational resources allocated to a Stream Analytics job. They determine the job's processing capacity and can be adjusted according to the workload requirements.

Outlines

00:00

πŸ˜€ Introduction to Azure Stream Analytics

Dustin Vannoy introduces Azure Stream Analytics, discussing its ease of use for streaming within Azure and its tight integration with the platform. He notes the limitations regarding inputs and outputs, suggesting that it's a good fit for Azure-centric use cases or when data is being brought into Azure from other sources like Kafka clusters. Dustin then demonstrates setting up a Stream Analytics job, touching on aspects like resource group selection, streaming units, and data protection considerations. He also mentions the quick deployment time and the availability of APIs for programmatic interaction.

05:01

πŸ“š Setting Up Event Hubs for Azure Streaming

The paragraph covers the process of setting up Event Hubs for Azure streaming. It explains choosing a resource group, location, and the pricing tier, emphasizing the differences between Basic and Standard tiers, especially for Apache Kafka compatibility. Dustin also discusses throughput units, auto-inflate options, and the creation of an Event Hub namespace. Access control and access policies are highlighted as important considerations. The paragraph further details creating an Event Hub, including partition count for scalability, message retention options, and the capture feature for long-term data storage in Azure Storage.

10:02

πŸ”Œ Configuring Inputs and Outputs for Stream Analytics

This section details the configuration of inputs and outputs for a Stream Analytics job. Dustin adds an Event Hub as an input source, selecting a namespace and an existing Event Hub. He talks about creating a consumer group, the use of connection strings, and the selection of serialization format and encoding. For outputs, he chooses to write back to Event Hubs, creating a new Event Hub for output and discussing the importance of partition keys and data serialization format. The paragraph also touches on the Stream Analytics query language and the ability to use functions like Azure ML or JavaScript within the job.

15:05

πŸ” Querying and Aggregating Data with Stream Analytics

Dustin dives into writing queries for Stream Analytics, starting with a basic 'select star' query that sends all incoming data to an output Event Hub. He demonstrates how to select specific fields and use filters, as well as how to perform aggregations like summing and averaging. The paragraph explains the use of a tumbling window for time-based data aggregation and the importance of defining the window duration. Dustin also discusses the use of reference data, such as joining with a SQL database to enrich the stream with additional information, and testing the query to ensure accuracy.
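
As a rough illustration of those first steps, selecting a few fields with a filter in the Stream Analytics query language might look like the sketch below; the input and output aliases and the column names are assumptions for illustration, not the literal names from the demo.

```sql
-- Sketch only: project a few columns from the taxi stream and keep one vendor,
-- routing the result to the output event hub (aliases and column names assumed).
SELECT
    trip_distance,
    passenger_count,
    VendorID
INTO [eh-out-1]
FROM [eh1-input]
WHERE VendorID = 2
```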

20:06

πŸš€ Starting the Stream Analytics Job and Viewing Results

The final paragraph outlines the steps to start a Stream Analytics job, including checking the number of streaming units and monitoring the output in the Event Hub. Dustin explains how to view the results using the 'Process data' feature, which allows for real-time monitoring of the output data. He also briefly mentions the capability to output data directly to Power BI for live reporting, highlighting the versatility of Stream Analytics for various use cases. The paragraph concludes with an invitation to follow for more content and a prompt to check out Dustin's website.

Keywords

πŸ’‘Azure Stream Analytics

Azure Stream Analytics is a cloud-based service designed for real-time data stream processing. It is tightly integrated within the Azure ecosystem and offers serverless auto-scaling capabilities. In the video, it is used to demonstrate how to set up and use the service for streaming analytics within Azure, highlighting its ease of use and serverless features.

πŸ’‘Event Hubs

Event Hubs is a real-time data ingestion service that can process millions of events per second. It is used in Azure Stream Analytics as a source of streaming data. In the script, the presenter discusses setting up Event Hubs for streaming data and its integration with Stream Analytics for processing and outputting data.

πŸ’‘Streaming Units

Streaming Units in Azure Stream Analytics represent the computational capacity allocated to a streaming job. They determine the throughput of the job. The script mentions that the default is set to three, which can be adjusted based on the scenario, affecting the performance and cost of the job.

πŸ’‘Data Protection

Data protection in the context of the video refers to the security measures taken to secure private data assets within Azure services. The presenter briefly touches on the topic, suggesting looking into data protection options for production workloads, which is crucial for maintaining data integrity and privacy.

πŸ’‘Auto-inflate

Auto-inflate is a feature in Azure Event Hubs that allows the system to automatically scale the throughput capacity based on the load. The presenter mentions setting the auto-inflate option to ensure flexibility in handling varying levels of data throughput during the demo.

πŸ’‘Partition Count

Partition count in Event Hubs refers to the number of shards that partition the data stream. This allows for scaling of consumers and parallel processing. The script uses partition count as an example to explain how data can be divided and processed by multiple consumers simultaneously.
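
On the consuming side, a Stream Analytics query can also be scoped to each Event Hub partition so work runs in parallel; the sketch below uses assumed aliases, and on newer compatibility levels this alignment largely happens automatically, so treat it as optional syntax rather than a requirement.

```sql
-- Sketch: process each Event Hub partition independently so partitions are
-- handled in parallel (aliases assumed; newer compatibility levels can
-- parallelize without an explicit PARTITION BY).
SELECT
    PartitionId,
    COUNT(*) AS events_in_window
INTO [eh-out-1]
FROM [eh1-input] PARTITION BY PartitionId
GROUP BY PartitionId, TumblingWindow(minute, 1)
```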

πŸ’‘Serialization Format

The serialization format is the way data is structured when it is converted into a format that can be easily stored or transmitted. In the context of the video, JSON is chosen as the serialization format for the input data coming from Event Hubs, which is a common format for structured data.

πŸ’‘Stream Analytics Query Language

Stream Analytics Query Language (SAQL) is used to write queries that process streaming data within Azure Stream Analytics. The presenter discusses the use of SAQL to create queries that filter, aggregate, and transform data as it streams through the service.

πŸ’‘Tumbling Window

A tumbling window in the context of stream processing is a type of window that defines a specific duration of time during which data is aggregated. The video demonstrates using a tumbling window to aggregate data on a per-minute basis, which is a common pattern for time-based data analysis.
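
A per-minute tumbling-window aggregation in the query language looks roughly like the following sketch; the input and output aliases and the column names are assumptions for illustration.

```sql
-- Sketch: one row per vendor per non-overlapping one-minute window
-- (input/output aliases and column names assumed).
SELECT
    VendorID,
    SUM(passenger_count) AS total_passengers,
    AVG(total_amount)    AS avg_total,
    System.Timestamp()   AS window_end   -- closing time of each window
INTO [eh-out-1]
FROM [eh1-input]
GROUP BY VendorID, TumblingWindow(minute, 1)
```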

πŸ’‘Reference Data

Reference data in Azure Stream Analytics is used to enrich the streaming data with additional information. The presenter shows how to join streaming data with reference data from a SQL database to add context, such as mapping vendor IDs to taxi zones for more insightful analytics.
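
A join against a reference input looks roughly like this sketch; the reference alias and the join columns (a pickup location ID matched to a zone lookup) are assumptions rather than the exact names from the video.

```sql
-- Sketch: enrich the stream with a taxi-zone lookup held as reference data.
-- [zone-sql] is an assumed reference input alias; join columns are assumed.
SELECT
    t2.Zone,
    SUM(t1.passenger_count) AS total_passengers
INTO [eh-out-1]
FROM [eh1-input] t1
JOIN [zone-sql] t2
    ON t1.PULocationID = t2.LocationID
GROUP BY t2.Zone, TumblingWindow(minute, 1)
```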

πŸ’‘Power BI

Power BI is a business analytics service that delivers insights via rich visualizations and interactive reports. The video mentions the capability to output data from Azure Stream Analytics directly to Power BI for creating live reports, showcasing the potential for real-time analytics and business intelligence.
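
One way this tends to be wired up, sketched here with assumed output aliases, is to keep two statements in the same job: one passing the stream on to another event hub and one sending per-minute aggregates to a Power BI output configured in the portal.

```sql
-- Sketch only: a single job can hold multiple statements, each routed to a
-- different configured output ([eh-out-1] and [powerbi-report] are assumed names).

-- Pass the raw stream through for other downstream consumers.
SELECT *
INTO [eh-out-1]
FROM [eh1-input]

-- Push per-minute aggregates to a Power BI streaming dataset for a live report.
SELECT
    VendorID,
    AVG(total_amount)  AS avg_total,
    System.Timestamp() AS window_end
INTO [powerbi-report]
FROM [eh1-input]
GROUP BY VendorID, TumblingWindow(minute, 1)
```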

Highlights

Azure Stream Analytics is a serverless, auto-scaling service that allows for easy streaming within Azure.

It is tightly integrated within Azure but has limited input and output options, making it suitable for specific use cases.

Stream Analytics is particularly useful for Azure-based data streaming, or when bringing data into Azure from other sources like Kafka clusters.

The service can be quickly set up through the Azure portal, with a user-friendly interface and available REST APIs for automation.

Data security and protection options are available and should be considered for production workloads.

Event Hubs can be set up for Azure streaming, with options for Apache Kafka compatibility.

The choice of pricing tier in Event Hubs is crucial, with options ranging from Basic to Premium, affecting features like consumer limits and connections.

Stream Analytics jobs can be created and customized with inputs, outputs, and queries directly in the Azure portal.

The query language used in Stream Analytics is similar to SQL, with support for aggregations, filters, and windowing functions.

Reference data can be integrated into Stream Analytics queries, allowing for complex joins and transformations.

Data from Stream Analytics can be outputted to various sinks, including another Event Hub, for further processing or consumption.

The system supports real-time data processing and can be used to feed live reports in services like Power BI.

The input preview feature in Stream Analytics allows users to test queries with real-time data before full deployment.

Stream Analytics jobs can be monitored and managed through the Azure portal, with insights into message throughput and system performance.

Security features like access policies and encryption options are available to protect data within Event Hubs.

The ability to auto-inflate throughput units in Event Hubs ensures scalability to handle varying data loads.

Stream Analytics supports a range of serialization formats and encoding options to accommodate different data types and sources.

The demonstration provided a practical walkthrough of setting up and using Azure Stream Analytics for real-time data streaming and processing.

Transcripts

00:00

Hey, Dustin Vannoy here. I'm going to share with you a bit about Azure Stream Analytics, show you how we get that set up, talk about some of the features and capabilities, and then we'll look at creating a Stream Analytics job or two.

00:14

So Stream Analytics is a really easy way to do streaming within Azure. It's very tightly integrated within Azure, but it also has limited inputs and outputs. What that means is that if you're working with Azure and you find a use case that fits well, it's an easy solution: it's serverless, it auto-scales, some really good features. But if you're working with a variety of sources like Apache Kafka — not Event Hubs or Event Hubs for Apache Kafka, but a true Apache Kafka or Confluent Cloud setup — it's not going to work for you unless you also start streaming that data into Event Hubs, which is all very possible and reasonable. Basically the point is that if you're within Azure and you're looking for an easy way to do stream analytics, or you're bringing data into Azure from another Kafka cluster or something like that, it might be a good fit for you. So without further ado, let's take a quick look and see what we think of this.

01:04

Let's first set up our Stream Analytics job, and that'll help us get a feel for what capabilities exist there. We find Stream Analytics in the portal and create a brand new job: "demo stream analytics", very creative name, I know. Then we will select a resource group; I have a streaming resource group that it'll fit well in. Hopefully that's the right location, we'll just keep going. Streaming units defaults to three; I'm okay with that for this scenario.

01:35

You may want to secure private data assets, so you may want to look into some of your options around data protection; there's a nice article in the docs that I think will give you a lot of good info. If you have an edge case where you're not hosting in the cloud but actually hosting elsewhere on edge devices, definitely look into that. In most cases, at least for getting started and testing it out, we don't need to add all this data security quite yet. Definitely for production workloads within your company, take a look at that and make the right decisions for your use case.

02:04

As it's deploying we actually have some interesting things we can see. One is that it deployed really quickly, so that's nice; like I said, easy to get started and easy to use, especially if you're trying to do this from the UI. There are APIs so that you can do some things with Stream Analytics jobs from some kind of client that you write that's calling REST APIs, but we'll just stick with the UI for this demo. That deployed pretty quickly, and now we can take a look at the next steps for our job.

02:35

If we want to set up Event Hubs for our Azure streaming, we can jump to the Event Hubs page, choose Create, pick a valid resource group, and give it some sort of name. Okay, I have that created, and then you need to choose a location; usually you're going to put it near your other resources. The pricing tier is interesting: if I'm doing a demo, there is this Basic option with a very strict number of consumers and broker connections, but I'm going to use Event Hubs for Apache Kafka in some of my examples, so let me go and choose Standard — you cannot do Event Hubs for Apache Kafka with Basic. And of course, if you're setting this up for your company in a production environment, you might want to take a look at this preview Premium option and the other options that are available.

03:28

Throughput units are actually for the whole namespace, so if I create a lot of event hubs this could become something I need to spin up more of; for a demo, one is fine. Auto-inflate is a good thing to know exists; I'll go and say one to two, which I probably will never even inflate with the example I'll do, but just to make sure we have a little bit of flexibility in how much throughput we can handle. You have the option to choose tags, then you'll review the options you've selected and choose Create.

04:01

Once your Event Hubs namespace is created, you can go and set various settings. Access control is something you'll always need to think about within Azure, and we can add some access policies, which are typically used. In this case I'll start by doing a Stream Analytics example, which will add one, and then I might just use this default root policy, which probably isn't what you'll do in production; you'll probably want to limit it to just produce or just consume (send or listen), but for now we'll just stick with that for a demo. I can go change my throughput units after I've done it, I can go think about geo-recovery, and I have network settings, which is pretty typical, to limit it to some internal Azure resources, some internal virtual networks. Encryption is something you certainly want to think about; I often am pretty comfortable not using a customer-managed key, but if it's at a company I let them make that decision, of course. And then we can always go down and view our properties.

05:01

Now we get to the good stuff. Under Entities we have a schema registry, which is an option; it's not going to work exactly like a Confluent schema registry, but it's something you can work with within Event Hubs. Maybe I can come back and talk about that another time. For now, though, let's take a look at the Event Hubs page.

05:20

So here we are on the Event Hubs section, and this is where I can actually create an event hub. When you go to create an event hub, basically what you're defining is the event hub name and then the partition count. Partition count is going to let you decide how much you can scale your consumers. If I only have one partition, then all of my data is going to one single partition and I can really only consume from that partition, whereas if I have, let's say, three partitions, I could have three separate consumers all with the same consumer group ID, and those will then each read only the messages that they should. It's going to split our messages into three groups, if you will, and they'll each grab a piece of that, and that way you're running in parallel with three different consumers.

06:06

We have the option of one to seven days here for message retention, and you may only need it for a day or two, just in case you need to catch up or replay some data. I could also choose to turn on Capture, which is going to store that data in Azure Storage, so then I could store it indefinitely instead of only for seven days. You go ahead and give it a name, click Create, and now you've got an event hub you can start to work with. And again, you can either take that root policy with all of the permissions it could have, or you could create one specific to an event hub if you're really trying to lock this down to only those that need to use it.

06:41

So here I'm in a Stream Analytics job I created, and it's blank. Let me go ahead and add the inputs, and we'll talk about that as we go. We only have the three options; I'll go ahead and add Event Hub here. Then we can give it a name — this is really just for Stream Analytics — so I'll call it "eh1 input". Then we need to pick an Event Hubs namespace; I'll go and choose "demo eh2". We can use an existing event hub that I already set up; I just find it's easier to set up the event hubs in advance. I'll go ahead and let it create a new consumer group, I don't normally have problems with that. I do have trouble getting managed identity to work — you'd have to go set that up yourself, and typically, at least in my environment, it fails quite a bit — so I just go ahead and use a connection string, for demos at least. Then you can either create a new event hub policy for this or use an existing one. I like to create new if I have the permissions; it gives it a default name that makes it very obvious to me where it's coming from. And that's all taken care of.

07:47

You would want to think about the partition key if you're using Event Hubs, especially if you have a lot of data coming through; we won't get into that right now. Let's focus on just setting up this technology for a basic scenario. For the serialization format you have a few options to choose from; we will stick with JSON for ours. This is really what the input data coming from Event Hubs is going to be, so you may not have the choice in the real world: someone else may be producing the data for you, and you just need to find out the format they're using to serialize that data. Encoding is always going to be UTF-8 at this point. For event compression type you have a few options if you're going to deal with compressed data. We'll click Save; that should be able to create that policy — I think I have all the right permissions here — and then we have an input that we can use in our Stream Analytics job.

08:36

Let's go and decide what our output will be. So we have an input and an output, and then we get to define the query, and potentially functions and lookup data if we choose to. We'll go ahead and just write it back to Event Hubs. A data engineering practice that I want to show you is that if we are doing some processing, maybe some aggregation, we would often want to keep that data streaming for multiple consumers. If everything's going to be done by us in Stream Analytics, then we could have multiple outputs here, but let's pretend that we've got additional consumers, maybe even some kind of reporting microservice that goes back to the customers, and we want to make sure they can get this exact same data. So for this piece of the Stream Analytics job, we're going to just write it back to Event Hubs.

09:26

We'll call that "eh out", I'll use the event hub namespace, and create a new event hub topic, "demo out". There we go. Okay, so we have event hub name "demo out". I'll go ahead and do a connection string; we'll let it create a brand new one for us. Partition key I'm not really going to worry about for this example, and I'm not going to work with custom property columns. I'll go ahead and have it write the data out as JSON; that way it's the most portable for the different consumers I expect to have. Typically with analytics and streaming systems we'll do line-separated JSON, so you'll have multiple JSON objects within a file, maybe within a batch that gets sent from Stream Analytics. And then the encoding is still going to be UTF-8.

10:31

Within Stream Analytics I'm mostly going to be working on adding a query that will transform the data. I do want to point out very quickly that there's this Functions feature where you can have an Azure ML service, ML Studio, or a JavaScript function. The really important piece of the Stream Analytics job is the query itself, and this is where that Stream Analytics query language comes into play. Notice you can jump over and take a look at the docs right from the Stream Analytics job, which is really handy, because if you're new to this you'll probably need to check those out for the exact functionality you're looking for.

11:05

The SELECT * means select all of the fields that will be in the message. Where do we want that to go? We're going to put it into our output event hub, "eh out 1", and FROM is going to be "eh1 input". So if I were to run this, all I'd be doing is taking data from one event hub and writing it to another event hub. There's a chance we'd want to do that, though probably not with event hubs; I don't think that's very likely to be our case, so let's start looking at some of the other capabilities we have. First, before we can really do much here, I need to have data flowing, and then I have some handy features like input preview that we can use. So I'll go kick off my producer, and once we have some data we can start to build out this query a little more. For the time being I'll go ahead and save it.

11:59

Let's take a look at Azure Stream Analytics hands-on, and I'll do this with the Stream Analytics job. So here we are in Azure Stream Analytics, in the query pane, and I've already set up an output topic and an input topic; these are both event hub topics at first. What's going to happen if I kick this off is we will select every record from my input data and send it along to the output topic.

12:27

If we take a look at our sample data towards the bottom, we can see that I've got the New York taxi trip data; we can see our column names and get a glimpse of the data types. From there, instead of doing SELECT *, I could start to choose just a few of the fields to work with. For example, if I select trip distance, passenger count, and vendor ID, and do a test query, it's going to show me a much smaller number of columns, and it's going to show me the first 50 rows from the recent data that's streamed in.

13:15

In addition, I could do filters. A lot of the things that you'd expect to do with SQL, especially things that would work in T-SQL, are available here. The more you get into functions and the more advanced things, the less likely it is to match up, but that's why the query language docs are so helpful, and they're right here to open up and look at what you care about. So for example, if I wanted to only get vendor 2, I could easily add "vendor id = 2". Here I'm querying only for vendor ID equals two just by adding that WHERE clause, and then you can see that it's narrowed down those results. So that's pretty good.

13:56

I can also do aggregation, so let's go and take a look at an aggregate. As you probably know from running SQL queries, we would need to use aggregate functions. I can sum passenger count for a total number of passengers, I can sum trip distance, and then maybe we'll do a pretty simple calculation, so we could do the average of the tip amount and the total amount. Let's go ahead and continue on to the GROUP BY. If I hover over, I'll see that it's telling me that I need to have a GROUP BY statement or I need to use the OVER clause, so let's go and add a GROUP BY on vendor ID.

14:53

What it's looking for as well is a window. Basically, data is going to continue to stream, so we need a way to define what window of time, what section of time, we want to aggregate this data for. So let's go ahead and use a tumbling window and give it a duration, and that will take care of the warnings we're getting. What it's going to do is use the event time that comes in with this data from Event Hubs for my window by default. We really would probably go back and — you may have seen there's a pickup time — use that instead, and basically you would just define that on the FROM clause. We'll just let it use the default event timestamp, run this test query, and see what we get.

15:44

Now I'll go ahead and produce some more data, and we'll see how this changes things. If we go back and refresh our input preview, that'll change the data that we're looking at; now when I run my test query, I have numbers that are different than before. Likewise, if I keep refreshing the input, those numbers will change. What that means is I could go ahead and save this and run it, and we should see results every minute.

16:16

Now this is okay, but vendor ID — there are only two vendors right now, and the ID itself doesn't give me a lot of information. What I'd rather do is use the taxi zone, so let's see how we can take some reference data as input, find the right value, and use that for the grouping. I have my query based on vendor ID, with the GROUP BY and the SELECT clause using it, but there are only two vendors, so I really want to join that to the taxi zone data so I can see what zone the pickup happened in. In order to do that, we need to add a reference data set and then add a join to get my zone data and change from vendor to zone. To add a reference input, the easiest option I've seen is the SQL database, so we'll go ahead and get that set up. I have a serverless SQL instance going; I can go ahead and enter my connection information here, and then I need to update the SQL statement to be valid for my database. Okay, and now I can save.

17:37

Now I have reference data and I can actually join that in. To get "zone sql" joined in, I'll alias my first input, add in "zone sql", and set up the join clause, and now I need to take this t1 alias and use it in front of each of the fields that comes from that initial input. Then, rather than use vendor ID, I'm actually going to replace it with t2.Zone, which is the name of my taxi zone. Now I can test out this query and see if I got all of that syntax correct. And there we go: we now have many more records each time this tumbling window runs, and we can see our data by zone, which I think is a little more realistic than just the vendor ID. Then you can go ahead and save that query, kick off your job, and you'll see data in the output event hub.
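
Assembled, the query described above would look roughly like the following sketch; the input, output, and reference aliases, the timestamp column, and the join columns are assumptions rather than the exact names used in the demo.

```sql
-- Sketch of the finished walkthrough query: per-minute aggregates grouped by
-- taxi zone via a reference-data join (all names assumed for illustration).
SELECT
    t2.Zone,
    SUM(t1.passenger_count) AS total_passengers,
    SUM(t1.trip_distance)   AS total_distance,
    AVG(t1.tip_amount)      AS avg_tip,
    AVG(t1.total_amount)    AS avg_total,
    System.Timestamp()      AS window_end
INTO [eh-out-1]
FROM [eh1-input] t1
    TIMESTAMP BY tpep_pickup_datetime   -- optional: window on pickup time instead of enqueue time
JOIN [zone-sql] t2
    ON t1.PULocationID = t2.LocationID
GROUP BY t2.Zone, TumblingWindow(minute, 1)
```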

18:45

To kick off your job, you can go to the overview page and click Start. It'll take just a little bit of time to spin up these resources for you. I've got the three streaming units; check this before you run it, because this is where it starts to cost you a little bit of money. I kick off my job, that'll take a little bit of time, and then I'll check and make sure that I'm getting output in my event hub. Now that it's run for a little bit, it's produced some events: it's hit a one-minute tumbling window and output 23 events. Let's go take a look at our output topic and see those results.

19:22

From my output event hub I can scroll down to the monitoring, and this will show me that some messages have been received. Then I can go and use Process Data, which is actually going to take me to this other way of getting a Stream Analytics job; this is like a Stream Analytics query built into viewing the events coming out of the event hub. We get to that same view, and the same Stream Analytics query language will work. We'll keep it simple and just run to see the test results, and like before, there we go. It might lag a little, but it should come in pretty soon; feel free to hit it once or twice. And if I'm looking through here, I have this event enqueued time, and I should see that the same zone might show up multiple times for different tumbling windows; the event enqueued time is what will help me catch that.

20:17

Now, in our example I was outputting to another event hub topic, which I think is fairly typical for data engineers: some parts of their processing are streaming, reading from Event Hubs or a similar broker and writing back out to that broker on a different topic. The reason for that is so that multiple consumers can read the data. However, you may have other use cases where you need to write directly to storage with Stream Analytics, and there are quite a few outputs we can choose from. One of the ones I find really interesting is Power BI: I can actually set this up to do a live report within my Power BI service. Obviously you'll need Power BI set up, and you'll need to authorize and all that, but just so you know that capability is out there; if you're trying to do real-time reports, this might be a way you pull that off.

21:04

So that's our hands-on look at Azure Stream Analytics. I hope that you learned something along the way. Don't forget to subscribe to this channel or check out dustinvannoy.com for more content. I'll see you next time.

Related Tags
Azure Stream Analytics, Data Streaming, Event Hubs, Serverless, Auto Scaling, Real-Time Analytics, Cloud Integration, Data Processing, Dustin Vannoy, Tech Tutorial