Azure Data Factory Part 5 - Types of Data Pipeline Activities

databag
5 Mar 2022 · 09:14

Summary

TLDR: This video from the Azure Data Factory series delves into the concept of pipelines and activities, explaining their roles in data processing. It clarifies the distinction between a pipeline (a logical grouping of activities) and an activity (a single processing step within a pipeline). The video outlines three main types of activities: data movement, data transformation, and control flow, providing examples and emphasizing their importance in data factory operations. It also directs viewers to Microsoft's detailed documentation for further understanding.

Takeaways

  • 📚 The video is part of a series on Azure Data Factory, focusing on pipelines and activities, and their types.
  • 🔍 Pipelines are logical groupings of activities that perform a unit of work in Azure Data Factory.
  • 🔄 Activities represent individual processing steps within a pipeline, such as copying data or performing transformations.
  • 🔑 Understanding different types of activities is crucial for determining which to use based on specific requirements.
  • 🔗 The script revisits the concept of pipelines and activities, emphasizing their roles in data movement and transformation.
  • 🌐 The video mentions Azure Data Factory's integration runtimes, which are essential for data movement activities.
  • 📈 Data movement activities in Azure Data Factory primarily involve the Copy Activity, which supports various data stores.
  • 🔧 Data transformation activities include Data Flows, Azure Functions, Hive, Pig, and MapReduce for big data processing.
  • 📝 Control flow activities are used for managing the flow of execution in a pipeline, such as conditionals and iterations.
  • 📚 The video references Microsoft's detailed documentation on pipelines and activities in Azure Data Factory and Azure Synapse Analytics.
  • 👍 The presenter encourages viewers to subscribe to the channel for more educational content, emphasizing continuous learning and sharing.

Q & A

  • What is the main focus of the fifth part of the Azure Data Factory video series?

    -The main focus of the fifth part is to explore the concept of pipelines and activities, including the different types of activities available in Azure Data Factory.

  • What are the top-level concepts discussed in section one of the video series?

    -In section one, the top-level concepts discussed include pipelines, activities, datasets, linked services, integration runtimes, and triggers.

  • What did the audience learn about in section three of the video series?

    -In section three, the audience learned about creating their first pipeline and got an introduction to different types of activities.

  • What is a pipeline in Azure Data Factory?

    -A pipeline in Azure Data Factory is a logical grouping of activities that perform a unit of work, such as copying data from one location to another or performing data transformations.

  • What is an activity in the context of Azure Data Factory?

    -An activity in Azure Data Factory represents a processing step within a pipeline, such as a copy activity that moves data from one data store to another.

  • What is the difference between a dataset and a linked service in Azure Data Factory?

    -A dataset in Azure Data Factory refers to a table or file, whereas a linked service defines the connection to a data source or a cloud service.

  • What are the three main types of activities in Azure Data Factory?

    -The three main types of activities in Azure Data Factory are data movement activities, data transformation activities, and control flow activities.

  • What is a data movement activity in Azure Data Factory?

    -A data movement activity, such as the copy activity, is used for moving data from various sources to various destinations within Azure Data Factory.

  • What is a data transformation activity in Azure Data Factory?

    -Data transformation activities in Azure Data Factory, such as data flows, Azure Functions, Hive, Pig, and MapReduce, are used to transform data based on specific requirements.

  • What are control flow activities in Azure Data Factory?

    -Control flow activities in Azure Data Factory include ForEach, If Condition, Execute Pipeline, Lookup, Append Variable, Switch, Until, and Validation activities, which are used to control the flow of execution within a pipeline.

  • Where can one find detailed documentation on pipelines and activities in Azure Data Factory?

    -One can find detailed documentation on pipelines and activities in Azure Data Factory on the Microsoft documentation website, specifically in the section about Azure Data Factory and Azure Synapse Analytics.

Outlines

00:00

🔍 Exploring Azure Data Factory Pipelines and Activities

In this video, the focus is on understanding pipelines and activities in Azure Data Factory (ADF). The introduction briefly revisits concepts from previous sections, emphasizing the importance of pipelines, which are logical groupings of activities that perform various data operations such as copying, transforming, and cleaning data. The video aims to delve deeper into the different types of activities and their specific uses, building on the foundational knowledge established in earlier parts of the series.

05:00

📊 Detailed Explanation of Pipeline and Activities

The video reiterates the definition of pipelines in ADF, describing them as logical groupings of activities that perform units of work. Activities are the individual steps within a pipeline, such as copying data between locations or transforming data. The explanation emphasizes understanding the distinction between pipelines and activities, and how they interact with datasets to perform tasks. The discussion includes examples of how activities communicate with datasets to produce or consume data for various operations.

📂 Types of Activities in Azure Data Factory

The video categorizes activities in ADF into three main types: data movement activities, data transformation activities, and control flow activities. Data movement activities, like the copy activity, are used for transferring data between sources and sinks. Data transformation activities involve manipulating data using tools like data flows, Azure Functions, and other big data processing techniques. Control flow activities include operations like loops, conditions, and variables that control the execution flow within a pipeline. The video highlights the importance of selecting the appropriate type of activity based on the specific requirements of the task at hand.

📘 Resources and Documentation for ADF Activities

The video references detailed Microsoft documentation that provides comprehensive information about pipelines and activities in ADF and Azure Synapse Analytics. It clarifies that Synapse Analytics is an integrated service combining data transformation and storage capabilities. The documentation includes extensive lists of supported data stores and the types of activities that can be performed on them, categorized by various criteria such as source, sink, and integration runtime support. The importance of utilizing these resources to understand the full capabilities and configurations of activities in ADF is emphasized.

🛠 Practical Application: Creating Pipelines and Activities

In a practical demonstration, the video shows how to create a new pipeline in ADF and explore the available activities. It categorizes activities under 'Move and Transform' for data operations and 'General' for control flow operations. The demonstration highlights how to navigate the ADF interface to find and utilize different activities for specific tasks. The video concludes by encouraging viewers to practice creating and using different types of activities, promising more detailed tutorials on each type of activity in future videos.

👍 Encouragement and Call to Action

The video ends with a call to action, encouraging viewers to subscribe to the channel for more tutorials and updates. The speaker expresses the hope that viewers found the content informative and helpful, and reiterates the channel's motto of 'keep learning and sharing.' The video aims to motivate viewers to engage with the content and stay tuned for future videos that will explore ADF activities in greater detail.

Keywords

💡Azure Data Factory

Azure Data Factory is a cloud-based data integration service offered by Microsoft that allows users to create data-driven workflows for orchestrating and automating data movement and data transformation. In the context of the video, it is the main platform being discussed, with the script focusing on how to use its features for creating pipelines and activities to handle data workflows.

💡Pipeline

A pipeline in Azure Data Factory is a logical grouping of activities that perform a unit of work. The script defines a pipeline as a collection of activities that work together to achieve a specific data-related task, such as copying data from one location to another or performing data transformations. It is a core concept in the video, with the script exploring different types of activities that can be included within a pipeline.
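
As a minimal sketch of what this grouping looks like in JSON form, the definition below shows a pipeline whose properties.activities array holds the processing steps; all names are hypothetical and the activity bodies are elided (a fuller Copy activity example appears under Data Movement Activities below).

    {
      "name": "CopyAndTransformPipeline",
      "properties": {
        "description": "One unit of work: copy raw data, then transform it",
        "activities": [
          { "name": "CopyRawData", "type": "Copy", "typeProperties": {} },
          { "name": "TransformData", "type": "ExecuteDataFlow", "typeProperties": {} }
        ]
      }
    }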

💡Activity

Activities in the script are individual processing steps within a pipeline in Azure Data Factory. They represent the actions that are performed on the data, such as copying data from one location to another using a 'copy activity' or transforming data using 'data flow'. The video discusses the different types of activities and their uses within the context of data factory pipelines.

💡Data Movement Activities

Data movement activities, as mentioned in the script, are a type of activity in Azure Data Factory that focuses on transferring data from one location to another. The 'copy activity' is highlighted as a primary example of this, showcasing its capability to move data across various supported data stores. The script emphasizes the importance of understanding these activities for data transfer requirements.
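
To make this concrete, here is a hedged sketch of a Copy activity definition as it might appear in a pipeline's JSON; the dataset names are hypothetical, and the source and sink types must match the connectors actually used.

    {
      "name": "CopyBlobToSql",
      "type": "Copy",
      "inputs": [ { "referenceName": "SourceBlobDataset", "type": "DatasetReference" } ],
      "outputs": [ { "referenceName": "SinkSqlDataset", "type": "DatasetReference" } ],
      "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "AzureSqlSink" }
      }
    }

The activity consumes the dataset referenced in inputs and produces to the one in outputs, which is exactly the produce/consume relationship between activities and datasets described above.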

💡Data Transformation Activities

Data transformation activities are those that modify or manipulate the structure of the data. The script introduces 'data flow' as a method for performing such transformations within Azure Data Factory. It also mentions other technologies like Azure Functions, Hive, Pig, and MapReduce that can be used for complex data transformation tasks, indicating the breadth of options available for different use cases.
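
As one illustration, a sketch of an Execute Data Flow activity as it might appear inside a pipeline's activities array; the data flow name and compute sizing here are hypothetical assumptions, not values from the video.

    {
      "name": "TransformCustomers",
      "type": "ExecuteDataFlow",
      "typeProperties": {
        "dataflow": { "referenceName": "CleanCustomersDataFlow", "type": "DataFlowReference" },
        "compute": { "coreCount": 8, "computeType": "General" }
      }
    }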

💡Control Flow Activities

Control flow activities, as explained in the script, are used to control the sequence and execution of activities within a pipeline. They include conditional statements like 'If Condition', loops like 'ForEach', and other control structures that help manage the workflow logic. The video script uses these terms to illustrate how to direct the flow of data operations based on certain conditions or sequences.
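
For example, a hedged sketch of a ForEach activity that loops over a pipeline parameter; the parameter name is hypothetical, and the inner Wait activity is only a placeholder for real per-item work.

    {
      "name": "ForEachFileName",
      "type": "ForEach",
      "typeProperties": {
        "items": { "value": "@pipeline().parameters.fileNames", "type": "Expression" },
        "isSequential": true,
        "activities": [
          { "name": "WaitPerItem", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 5 } }
        ]
      }
    }

With isSequential set to true the items are processed one at a time; setting it to false lets the inner activities run in parallel.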

💡Integration Runtimes

Integration Runtimes in the script refer to the compute resources used by Azure Data Factory to perform data movement and transformation activities. The video mentions three different types of integration runtimes (Azure, self-hosted, and Azure-SSIS), which are essential for understanding how data operations are executed in various environments, either cloud-based or on-premises.
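
A linked service selects the runtime that carries out its data movement through a connectVia reference; below is a sketch assuming a self-hosted integration runtime named MySelfHostedIR (a hypothetical name), as would be needed to reach an on-premises SQL Server.

    {
      "name": "OnPremSqlLinkedService",
      "properties": {
        "type": "SqlServer",
        "typeProperties": { "connectionString": "<on-premises connection string>" },
        "connectVia": { "referenceName": "MySelfHostedIR", "type": "IntegrationRuntimeReference" }
      }
    }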

💡Data Flow

Data Flow is a specific type of data transformation activity in Azure Data Factory that allows users to visually design data transformation logic without writing code. The script positions Data Flow as a key feature for performing complex data transformations using a visual interface, executed within Apache Spark clusters.

💡Dataset

A dataset in the script is a definition in Azure Data Factory that refers to a table or a file in a data store. It is used to represent the structure of the data and is essential for activities within a pipeline to consume or produce data. The video script explains the difference between a dataset and a linked service, emphasizing the role of datasets in data operations.

💡Linked Service

A linked service in the script is a connection point in Azure Data Factory that connects to data stores or compute resources. It is used in conjunction with datasets to enable activities within a pipeline to access the necessary data sources or compute targets. The video script discusses the relationship between linked services and datasets, highlighting their importance in the data integration process.
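
To show how the two definitions fit together, here is a hedged sketch of a delimited-text dataset that points at a blob-storage linked service; all names, containers, and file names are hypothetical.

    {
      "name": "SourceBlobDataset",
      "properties": {
        "type": "DelimitedText",
        "linkedServiceName": { "referenceName": "MyBlobStorageLS", "type": "LinkedServiceReference" },
        "typeProperties": {
          "location": { "type": "AzureBlobStorageLocation", "container": "input", "fileName": "customers.csv" }
        }
      }
    }

    {
      "name": "MyBlobStorageLS",
      "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": { "connectionString": "<connection string or Key Vault reference>" }
      }
    }

The dataset describes what the data looks like and where it sits; the linked service holds the connection details needed to reach it.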

💡ADF Instance

ADF Instance in the script refers to an instance of Azure Data Factory, which is a specific setup or environment where users can create and manage their data integration workflows. The video script guides viewers through the process of navigating the ADF instance to create pipelines and activities, providing practical insights into using the service.

Highlights

Introduction to Azure Data Factory video series part five focusing on pipelines and activities.

Exploration of the concept of a pipeline as a logical grouping of activities in Azure Data Factory.

Activities defined as individual processing steps within a pipeline.

Explanation of the difference between a pipeline and an activity in the context of data processing.

Discussion on the creation of the first pipeline and the inclusion of various types of activities.

Overview of the three main types of activities: data movement, data transformation, and control flow.

Detailed look at data movement activities, specifically the Copy Activity in Azure Data Factory.

List of supported data stores for the Copy Activity as both source and destination.

Introduction to data transformation activities, including Data Flow and big data technologies like Hive, Pig, and MapReduce.

Highlight of control flow activities for managing the workflow within a pipeline.

Description of the ForEach, If Condition, and other control flow constructs available in Azure Data Factory.

Mention of Azure Functions, HDInsight, Databricks, and Data Lake as part of the data transformation capabilities.

The importance of understanding different types of activities to select the appropriate one for specific requirements.

Link to Microsoft's detailed documentation on pipelines and activities in Azure Data Factory.

Clarification of Azure Synapse Analytics as a service within Azure, combining transformation and storage.

Encouragement for viewers to subscribe to the channel for more learning and sharing of knowledge.

Final thoughts on the importance of continuous learning in the field of Azure Data Factory.

Transcripts

play00:00

hi everyone uh welcome back to azure

play00:02

data factory video series part five so

play00:05

in this section we will be exploring

play00:07

more on what is a pipeline and what is

play00:10

activity what are the types of

play00:11

activities so the main stretch would be

play00:13

on the different types of

play00:15

the pipeline activities we have i know

play00:18

that in section one we have discussed in

play00:20

a very high level about pipelines

play00:21

activities data sets linked services

play00:24

integration runtimes and also triggers

play00:26

because they are like the top level

play00:27

concepts and top level components i

play00:29

think it's in section two yeah and in

play00:31

section three we've also created our

play00:33

first pipeline and we've seen uh

play00:36

pipeline and also like different types

play00:37

of different activities of course but we

play00:39

haven't discussed about the different

play00:41

types of activity so when it comes to

play00:42

azure adf you have activities and you

play00:45

also have types of activities it's

play00:46

similar to you have integration run

play00:48

times and you also have three different

play00:49

types of integration runtimes so mainly

play00:52

this is

play00:53

the theory which you need to understand

play00:55

because

play00:56

if you if you know what the different

play00:57

types of activities you have then you

play00:59

understand okay which type of activity i

play01:00

need to use for my requirements right

play01:03

okay so without any further ado let's

play01:06

get into the concept so basically

play01:07

pipeline and activities so uh i know

play01:10

that you've already seen this definition

play01:11

but again just for the folks who haven't

play01:14

gone through previous videos i'm just

play01:16

repeating the same thing here so data

play01:18

factory might have one or more pipeline

play01:20

so basically a data factory is nothing

play01:22

but you create data pipeline so

play01:24

basically this is what we are talking

play01:25

here a pipeline is a data factory could

play01:27

have one or more pipelines right so

play01:29

basically you create the data pipeline

play01:31

and the pipeline is a logical grouping

play01:33

of activities

play01:35

so then what is pipeline a pipeline is

play01:37

basically the group of activities it's a

play01:39

logical group of activities that

play01:41

performs the unit of work

play01:43

unit of work could be anything it could

play01:45

be copy data from one location to other

play01:47

location or do some data transformation

play01:49

do do some data cleaning apply some

play01:53

expressions apply some formulas

play01:55

add some business rules

play01:57

unit of work could be anything which you

play01:59

perform on the data basically you use

play02:02

pipelines you create pipelines pipeline

play02:04

is a logical

play02:05

group of activities basically they

play02:07

perform the work

play02:08

so together the activities in a pipeline

play02:10

perform a task basically

play02:12

together the activities in a pipeline

play02:13

perform a task okay so there is a basic

play02:16

example uh we will not see that example

play02:18

it's a from the documentation

play02:20

and let's try to understand activity so

play02:22

activities represent a processing step

play02:24

in a pipeline so this is important so

play02:27

basically the processing step in a

play02:28

pipeline is represented by the

play02:30

activities we will look into that for

play02:32

example you might use a copy activity to

play02:34

copy data from one data store to another

play02:36

data store so basically the copy

play02:38

activity is doing the copy job

play02:41

right

play02:42

the copy functionality so you're copying

play02:44

something from one location to another

play02:46

location using a copy activity

play02:48

so you create a pipeline maybe copy data

play02:51

pipeline and within that pipeline use

play02:52

activity called copy activity basically

play02:54

it does the job so you need to

play02:56

understand the main difference between a

play02:58

pipeline and an activity okay

play03:01

then so basically this

play03:03

this flow should help you to understand

play03:05

that okay so you create a pipeline which

play03:08

is a logical group of an activity and

play03:11

basically this activity actually

play03:13

communicates with the data set so either

play03:15

you produce a data set right to the sink

play03:18

or to the destination or you consume the

play03:20

data set from the source to make some

play03:23

transformations okay so we've seen how

play03:25

to create a data set like basically data

play03:27

set is

play03:29

is referring either to a table or to the

play03:32

file and you also know what is the

play03:33

difference between data set and linked

play03:35

service if you do not know that i would

play03:37

highly recommend you to go and look into

play03:40

video series part two where i have

play03:42

explained the top components and there

play03:44

you have data set and link service so if

play03:45

you try to understand the flow so

play03:47

basically what you do is you create a

play03:48

pipeline and then within the pipeline

play03:50

you bring activity and basically that

play03:52

activity does the actual action here

play03:54

that is very important here to

play03:55

understand

play03:56

okay

play03:57

so then

play03:58

what are the different types of

play03:59

activities we have so basically this is

play04:01

what i wanted to explain in this section

play04:03

so there are mainly three types of

play04:06

activities data movement activities

play04:09

data transformation activities and

play04:11

control flow activities okay so we will

play04:14

try to see

play04:16

where we have them in the azure adf

play04:18

instance

play04:19

before going there i have a very nice

play04:21

documentation um

play04:25

not i have a documentation it's

play04:26

basically provided by the microsoft so

play04:28

they have

play04:29

very nice documentation when i say nice

play04:32

all the information is in detail

play04:34

few of the information are also in

play04:35

simple terms for everyone to understand

play04:38

right so if you come to this section

play04:40

here

play04:42

so this is the url

play04:44

okay and within this url you have

play04:46

pipelines and activities in azure data

play04:48

factory and azure synapse analytics so

play04:50

don't get confused with what is the

play04:51

synapse analytics basically this is one

play04:53

of

play04:54

the service within azure uh basically

play04:56

it's a combination of both

play04:57

transformation and the storage so it's

play05:00

it's like a cloud data warehouse i could

play05:02

say but it has more features and

play05:03

functionalities we will try to explore

play05:05

that in coming sessions but let's try to

play05:07

only focus on azure data factory okay so

play05:10

pipelines and activities in azure data

play05:12

factory if i scroll up basically there

play05:15

is this diagram which we've seen and

play05:17

then you have data movement activities

play05:20

so what is data movement activities or

play05:23

what are data movement activities so

play05:25

here

play05:27

if you see copy activity in azure data

play05:29

factory copies data from various sources

play05:32

and also to various things so basically

play05:35

here you have a big list of data stores

play05:38

within azure where the copy activity is

play05:40

supported okay so basically the copy

play05:42

activity within the adf is mainly used

play05:44

for data movement activities

play05:47

and here you can see

play05:49

whether it is supported as source or not

play05:51

you have a very big list of data stores

play05:52

and then whether it is supported as sink

play05:54

or not and it is also supported by azure

play05:56

integration runtime we have seen what is

play05:58

integration runtime and the different

play05:59

types of integration runtimes in the

play06:01

previous section which is part

play06:03

four yeah and also you have

play06:06

whether it is supported by self-hosted

play06:08

integration runtime or not okay so you

play06:10

have a very big list here you can just

play06:12

go through them you can also see these

play06:13

are categorized by azure and then by

play06:16

databases

play06:17

and also by nosql file and everything uh

play06:20

important key point here is data

play06:22

movement activity mainly the copy

play06:24

activity is used for the data moment

play06:26

activities then you have data

play06:27

transformation activities so data

play06:29

transformation activities so basically

play06:31

you transform the data so to do that you

play06:34

have a different list here so you have a

play06:37

data flow we will see what is data flow

play06:40

and then you have

play06:41

here you can see what is a compute

play06:42

environment basically the data flow will

play06:44

get executed in the apache spark

play06:46

clusters we will get into that so i have

play06:49

very big playlist on a data flow and

play06:51

mapping data flow where we will try to

play06:53

understand them and use them for data

play06:56

transformation related works then you

play06:57

have azure functions hive pig you know

play07:00

map reduce all these are related to big

play07:02

data

play07:06

right and you can also see the compute

play07:08

environment underlying there and then

play07:09

you have some custom activity data

play07:11

breaks so basically these are all used

play07:13

based on the requirement based on the

play07:15

use case

play07:17

for data transformation and then you

play07:18

have control flows okay so control flow

play07:21

activities are

play07:23

something like for each

play07:24

filter and if else condition pending or

play07:28

executing a pipeline and look up adding

play07:30

a variable and like a switch until

play07:33

activity validation activity wait or

play07:35

webhook so basically these are control

play07:37

flows okay

play07:40

three types data movement data

play07:42

transformation and control flow and we

play07:44

have seen those three different types

play07:46

there and if i go back to our adf

play07:48

instance just to see here so when you

play07:50

try to click on create a pipeline new

play07:52

pipeline okay for example and then if i

play07:55

close here then you have a big list of

play07:57

activities here so the move and

play07:59

transform here you see data copy is

play08:01

under the move

play08:03

okay and transform you also have a data

play08:05

flow and then you also have data

play08:07

explorer azure function basically few of

play08:09

them are used for data transformation

play08:11

and then when you come to general

play08:13

then you have a few options here few

play08:15

activities here basically which are used

play08:16

for

play08:17

control flow activities right we have

play08:19

seen them control flow activities and

play08:21

then you also have iteration and

play08:23

conditionals these are also used for

play08:24

control flow each for each if condition

play08:26

switch until and filter and then

play08:28

basically you have hdinsight azure

play08:30

function databricks data lakes these

play08:32

are used for data transformations and

play08:33

move and transform you have copy

play08:35

data right so this is all about

play08:38

different types of

play08:41

different types of activities and what

play08:43

are they

play08:44

and when to use what

play08:47

again moving forward we will look into

play08:49

them in detail how to use them when to

play08:51

use them but i hope you like this

play08:54

section

play08:55

so uh if you think if you've gained any

play08:57

knowledge uh out of this video i would

play09:00

kindly request you to subscribe our

play09:02

channel which gives a lot of motivation

play09:04

and encouragement to make more of these kind

play09:06

of videos

play09:07

motto is very simple

play09:08

keep learning and sharing and i hope you

play09:10

have a good day thank you so much


Related Tags
Azure ADF, Data Pipelines, Activities, Data Movement, Data Transformation, Control Flow, Cloud Data, Microsoft Azure, Data Integration, Tutorial