GCP Data Engineer Mock Interview

Grow With Google Cloud
31 Mar 2024 · 15:21

Summary

TL;DR: The speaker draws on four years of experience with Google Cloud Platform (GCP), focusing on BigQuery for data warehousing. They walk through BigQuery's architecture, highlighting its columnar storage and the separation of storage and compute. The discussion covers performance optimization strategies such as partitioning, clustering, and SQL query tuning, as well as cost-saving practices like purchasing slots in advance and archiving unused datasets to Cloud Storage. The conversation extends to managing datasets across multiple projects, using materialized views for repetitive queries, and scheduling queries from the BigQuery console. The speaker also addresses team access management, building data pipelines with Airflow, and the difference between tasks and DAGs, and concludes with a hypothetical scenario of transferring data from an FTP server to GCS and then into BigQuery.

Takeaways

  • 💡 The speaker has four years of experience, primarily with Google Cloud Platform (GCP) and BigQuery for data warehousing.
  • 🔍 BigQuery is a columnar storage data warehousing tool that separates storage and compute, optimizing analytical functions.
  • 💻 The speaker is proficient in SQL and data structures and algorithms (DSA) in Python, and is exploring PySpark on their own.
  • 📈 To optimize performance in BigQuery, techniques such as partitioning, clustering, and using the WHERE clause effectively are discussed.
  • 💼 For cost optimization in BigQuery, the speaker suggests purchasing slots in advance for known workloads to potentially get a discount.
  • 🗄️ Unused tables in BigQuery can be archived to Cloud Storage to save on storage costs; the speaker suggests using INFORMATION_SCHEMA or Cloud Logging to identify tables that are no longer accessed.
  • 🔑 The speaker discusses managing multiple BigQuery datasets across different projects with appropriate permissions, highlighting the BigQuery reader role for limited access.
  • 📊 Materialized views in BigQuery can save on costs by caching query results, reducing the need to re-execute queries for unchanged data.
  • ⏰ Scheduling queries in BigQuery can be done through the console, with the speaker outlining the process for setting up schedules based on specific parameters.
  • 🛠️ The speaker has experience integrating BigQuery with other GCP services like Cloud Storage and Apache Airflow for data pipelines, but hasn't set up streaming pipelines.
  • 📝 When creating a data pipeline with Airflow, the difference between a DAG (Directed Acyclic Graph) and tasks is highlighted, with tasks being part of the larger DAG workflow.

Q & A

  • What is the primary function of BigQuery as described in the transcript?

    -BigQuery is a data warehousing tool used for analytical purposes, offering a columnar storage system that separates storage and compute, allowing for efficient data analysis.

  • How does BigQuery's columnar storage differ from traditional row-based storage?

    -BigQuery's columnar storage allows for faster analytical processing by accessing only the specific columns needed for a query, rather than scanning entire rows as in traditional row-based storage.
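
As a rough illustration of this columnar effect, the dry-run feature of the google-cloud-bigquery Python client reports how many bytes a query would scan without running it; the project, dataset, and table names below are hypothetical.

```python
from google.cloud import bigquery

# Hypothetical project/dataset/table names, for illustration only.
client = bigquery.Client(project="my-project")

# Dry runs report the bytes a query would scan without executing or billing it,
# which makes the columnar-storage effect visible.
dry_run = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

narrow = client.query(
    "SELECT SUM(amount) FROM `my-project.sales.orders`", job_config=dry_run
)
wide = client.query(
    "SELECT * FROM `my-project.sales.orders`", job_config=dry_run
)

# Only the `amount` column is read in the first case, so far fewer bytes
# would be billed than for the full-row scan.
print(narrow.total_bytes_processed, wide.total_bytes_processed)
```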

  • What are some methods mentioned to optimize performance in BigQuery?

    -Performance in BigQuery can be optimized through partitioning, clustering, and using the WHERE clause effectively to filter data, thus reducing the amount of data scanned.
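
A minimal sketch of these techniques with the google-cloud-bigquery Python client: a day-partitioned, clustered table plus a query whose WHERE clause prunes partitions. All names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id

# Partition by day on event_date and cluster by customer_id so queries that
# filter on those columns scan less data.
client.query(
    """
    CREATE TABLE `my-project.analytics.events`
    (
      event_date  DATE,
      customer_id STRING,
      amount      NUMERIC
    )
    PARTITION BY event_date
    CLUSTER BY customer_id
    """
).result()

# The WHERE clause on the partition column prunes partitions, so only one day
# of data is scanned instead of the whole table.
job = client.query(
    """
    SELECT customer_id, SUM(amount) AS total
    FROM `my-project.analytics.events`
    WHERE event_date = DATE '2024-03-01'
    GROUP BY customer_id
    """
)
rows = job.result()
print(f"Bytes processed: {job.total_bytes_processed}")
```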

  • How can one save costs when using BigQuery, as discussed in the transcript?

    -Costs in BigQuery can be saved by purchasing slots in advance for known workloads, implementing partitioning and clustering to reduce data scanned, and archiving unused datasets to cheaper storage options like Cloud Storage.
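
One hedged way to implement the archiving idea is to export the table to a Cloud Storage bucket (created beforehand with an Archive, Nearline, or Coldline storage class) and then drop it from BigQuery; the names below are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical names throughout

table_id = "my-project.analytics.old_events"
# The bucket would be created in advance with an Archive (or Nearline/Coldline)
# storage class so the exported files land in the cheaper tier.
destination_uri = "gs://my-archive-bucket/old_events/*.avro"

extract_job = client.extract_table(
    table_id,
    destination_uri,
    job_config=bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.AVRO
    ),
)
extract_job.result()  # wait for the export to finish

# Once the export is verified, dropping the table stops BigQuery storage billing.
client.delete_table(table_id)
```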

  • What is the role of the 'information schema' in managing BigQuery datasets?

    -The 'information schema' can be used to track when tables were last accessed, which can help in identifying unused datasets that could be archived to save costs.
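
The speaker is unsure which view holds this information; one common approach is to derive last-access times from the INFORMATION_SCHEMA job-history views, which record the tables each query referenced (the region qualifier and the retention window vary by project). A sketch with hypothetical names:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id

# JOBS_BY_PROJECT keeps roughly the last 180 days of job history and lists the
# tables each job referenced, which is enough to spot tables with no recent reads.
sql = """
SELECT
  rt.project_id,
  rt.dataset_id,
  rt.table_id,
  MAX(creation_time) AS last_referenced
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
     UNNEST(referenced_tables) AS rt
GROUP BY 1, 2, 3
HAVING last_referenced < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
ORDER BY last_referenced
"""

for row in client.query(sql).result():
    print(row.project_id, row.dataset_id, row.table_id, row.last_referenced)
```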

  • How can one manage multiple BigQuery datasets across different projects?

    -Managing multiple BigQuery datasets across projects can be done through proper permission settings, allowing users with the required permissions to access and query data across different environments like development, staging, and production.
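
A hedged sketch of what this looks like from the Python client: a single query can join datasets in different projects as long as the caller holds suitable roles (for example roles/bigquery.dataViewer on each dataset and roles/bigquery.jobUser on the project that runs the job). All project and dataset names here are made up.

```python
from google.cloud import bigquery

# One client (or one console session) can reference datasets in any project the
# caller has permission on, using fully qualified `project.dataset.table` names.
client = bigquery.Client(project="analytics-prod")

query = """
SELECT d.customer_id, p.segment
FROM `analytics-dev.staging_ds.customers` AS d
JOIN `analytics-prod.core_ds.segments`    AS p
USING (customer_id)
"""

# Requires read access on both datasets plus permission to run jobs in the
# billing project.
for row in client.query(query).result():
    print(row.customer_id, row.segment)
```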

  • What is a materialized view in BigQuery and how can it help in cost optimization?

    -A materialized view in BigQuery is a pre-computed, saved result of a query that can be accessed quickly without re-executing the query. It can help in cost optimization by reducing computation for frequently run queries with unchanged data.
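
A small sketch, assuming a hypothetical events table: the materialized view stores the aggregated result, and BigQuery serves repeat reads from it, reprocessing only data that has changed in the base table.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical names

# The materialized view precomputes the aggregation; BigQuery keeps it up to
# date incrementally as new rows arrive in the base table.
client.query(
    """
    CREATE MATERIALIZED VIEW `my-project.analytics.daily_sales_mv` AS
    SELECT event_date, SUM(amount) AS total_amount
    FROM `my-project.analytics.events`
    GROUP BY event_date
    """
).result()

# Subsequent queries read the stored result instead of rescanning the base table.
rows = client.query(
    "SELECT * FROM `my-project.analytics.daily_sales_mv` ORDER BY event_date"
).result()
```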

  • How can one schedule queries in BigQuery?

    -Queries in BigQuery can be scheduled through the BigQuery console by specifying the query and the schedule parameters, such as frequency (daily, monthly), to automate the execution.
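
Beyond the console flow described above, the same scheduling can be set up programmatically through the BigQuery Data Transfer Service; the sketch below is an approximation with hypothetical project, dataset, and query names.

```python
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Hypothetical names; the schedule string follows the scheduled-query syntax
# ("every 24 hours", "every monday 09:00", ...).
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="analytics",
    display_name="daily_sales_rollup",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT CURRENT_DATE() AS run_date, SUM(amount) AS total "
                 "FROM `my-project.analytics.events`",
        "destination_table_name_template": "daily_sales_{run_date}",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

created = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print(f"Created scheduled query: {created.name}")
```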

  • What is the difference between a task and a workflow (DAG) in Apache Airflow?

    -In Apache Airflow, a task is an individual unit of work, while a workflow (DAG) is a collection of tasks arranged in a directed acyclic graph, where tasks can be dependent on each other.
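
A minimal Airflow 2.x sketch of the distinction: the DAG object is the workflow, each operator instance inside it is a task, and the >> operator declares the dependency between tasks. Task IDs and callables here are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# The DAG is the whole workflow; each operator instance inside it is a task.
with DAG(
    dag_id="example_workflow",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extract"))
    load = PythonOperator(task_id="load", python_callable=lambda: print("load"))

    # Task-level dependency: `load` runs only after `extract` succeeds.
    extract >> load
```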

  • Can you provide a basic example of how to write an Airflow DAG for transferring data from an FTP server to a GCS bucket and then to BigQuery?

    -An Airflow DAG for this scenario would involve defining two tasks: one using the PythonOperator to fetch data from the FTP server and store it in a GCS bucket, and another using a transfer-specific operator to load the data from GCS into BigQuery.
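
A hedged sketch of such a DAG, roughly following the steps the speaker outlines: a PythonOperator task pulls a file from the FTP server into a GCS bucket, then the Google provider's GCSToBigQueryOperator loads it into BigQuery with a chosen write disposition. Hostnames, credentials, bucket, and table names are all placeholders.

```python
from datetime import datetime
import ftplib

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)
from google.cloud import storage


def ftp_to_gcs() -> None:
    """Download a file from the FTP server and upload it to the landing bucket.
    Host, credentials, and object names are placeholders."""
    ftp = ftplib.FTP("ftp.example.com")
    ftp.login("user", "password")
    with open("/tmp/orders.csv", "wb") as local_file:
        ftp.retrbinary("RETR orders.csv", local_file.write)
    ftp.quit()

    bucket = storage.Client().bucket("my-landing-bucket")
    bucket.blob("orders/orders.csv").upload_from_filename("/tmp/orders.csv")


with DAG(
    dag_id="ftp_to_gcs_to_bq",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Task 1: custom Python logic via PythonOperator.
    t1 = PythonOperator(task_id="ftp_to_gcs", python_callable=ftp_to_gcs)

    # Task 2: provider operator that loads GCS files into BigQuery;
    # write_disposition controls append vs. overwrite behaviour.
    t2 = GCSToBigQueryOperator(
        task_id="gcs_to_bq",
        bucket="my-landing-bucket",
        source_objects=["orders/orders.csv"],
        destination_project_dataset_table="my-project.analytics.orders",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_APPEND",  # or WRITE_TRUNCATE / WRITE_EMPTY
        autodetect=True,
    )

    t1 >> t2
```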

  • What is the approximate size of the tables the speaker has worked with in their current BigQuery project?

    -The speaker has worked with tables ranging from 1–2 GB up to around 1 TB in size, with about 70 to 80 tables in total in their current project.

Outlines

00:00

💻 Experience and BigQuery Architecture

The speaker has four years of experience, primarily with Google Cloud Platform (GCP), focusing on data warehousing using BigQuery. They are proficient in SQL and DSA in Python and are currently exploring Apache Spark. The speaker explains BigQuery's high-level architecture, highlighting its separation of storage and compute, its columnar storage, and the benefits of this for analytical functions. They discuss performance optimization techniques such as partitioning, clustering, and query optimization, as well as cost-saving strategies like purchasing slots in advance and using partitioning and clustering to reduce data scanning and computation.

05:02

💼 Managing BigQuery Costs and Data Sets

The discussion continues with strategies to save costs in BigQuery, including buying slots in advance for predictable workloads and optimizing performance to reduce costs. The speaker suggests archiving unused tables in cloud storage to save on storage costs and mentions the use of the information schema to identify tables not in use. They also touch on managing multiple BigQuery datasets across different projects and the concept of materialized views for query optimization, which can save on computation costs by caching query results.

10:04

📈 Scheduling Queries and Data Pipelines

The speaker outlines the process for scheduling queries in BigQuery through the console, emphasizing the importance of setting the correct schedule parameters. They also discuss the importance of proper authentication and authorization within a team, suggesting the use of the BigQuery reader role for members who should only view and query existing tables. The speaker shares their experience with data volumes, mentioning working with tables ranging from 1-2 GB to 500-1000 GB, and the use of Airflow for creating data pipelines that integrate with BigQuery. They explain the difference between a DAG (Directed Acyclic Graph) and tasks within Airflow, providing an example of a data pipeline that moves data from an FTP server to a GCS bucket and then to BigQuery.

15:05

🔄 Data Transfer and Task Operations in Airflow

The speaker concludes by discussing the process of writing an Airflow task to transfer data from an FTP server to a GCS bucket and then to BigQuery. They mention defining DAG parameters and creating tasks using Python operators. The speaker provides a high-level overview of the process, including the use of specific operators for transferring data from GCS to BigQuery and the consideration of different disposition options for these operations.

Keywords

💡BigQuery

BigQuery is a data warehouse service provided by Google Cloud Platform (GCP). It is designed for large-scale data analytics. In the script, the speaker discusses their experience with BigQuery, focusing on its use for data warehousing and analytical purposes. BigQuery's columnar storage and separation of storage and compute resources are highlighted as key features that enable efficient data analysis.

💡Columnar Storage

Columnar storage is a data storage format where data is stored column-wise rather than row-wise, as in traditional row-based storage. This format is particularly efficient for analytical queries, as it allows for faster aggregation and filtering operations. The script mentions that BigQuery uses columnar storage, which optimizes performance when executing analytical functions by accessing only the necessary columns.

💡OLAP (Online Analytical Processing)

OLAP refers to the process of performing analysis on data to support decision-making. It involves complex queries on large data sets and is typically used for business intelligence. The script describes BigQuery as an OLAP tool, emphasizing its suitability for analytical processing and its columnar storage as an advantage for such tasks.

💡SQL

SQL (Structured Query Language) is a standard language for managing and manipulating relational databases. The speaker mentions being comfortable with SQL, indicating its importance in working with BigQuery for data warehousing tasks. SQL is used to write queries that extract, transform, and load data within BigQuery.

💡Data Partitioning

Data partitioning is a technique used to improve query performance by dividing a large table into smaller, more manageable pieces based on certain criteria, such as date ranges or key values. The script discusses partitioning as a method to optimize performance in BigQuery, as it reduces the amount of data scanned during query execution.

💡Clustering

Clustering in BigQuery is a feature that groups data based on the contents of one or more columns, which can improve query performance by reducing the amount of data scanned. The script mentions clustering as another optimization technique, suggesting that it can be used in conjunction with partitioning to further enhance performance.

💡Cost Optimization

Cost optimization in the context of BigQuery involves strategies to reduce the expenses associated with data storage and query execution. The script discusses various methods for cost optimization, such as buying slots in advance, using partitioning and clustering to reduce data scanned, and archiving unused data to cheaper storage options.

💡Cloud Storage

Google Cloud Storage is a service for storing and accessing data on Google Cloud Platform. The script suggests archiving unused BigQuery data to Cloud Storage as a cost-saving measure, highlighting the importance of managing data storage costs effectively.

💡Information Schema

INFORMATION_SCHEMA in BigQuery is a set of read-only views that expose metadata about datasets, tables, views, and jobs. The script mentions the information schema as a potential source for tracking table usage, which can help identify tables that are no longer in use and are candidates for archiving to Cloud Storage.

💡Airflow

Apache Airflow is a platform for creating, scheduling, and monitoring workflows. The script refers to using Airflow to manage data pipelines involving BigQuery, indicating its role in automating data processing tasks and orchestrating data movement between different services within GCP.

💡Materialized View

A materialized view in BigQuery is a precomputed, stored query result that BigQuery keeps incrementally up to date as the base tables change. The script discusses the use of materialized views to optimize query performance and cost, since they store the result of a query and avoid re-executing the same query over unchanged data.

Highlights

Four years of experience with GCP, primarily in data warehousing with BigQuery.

Proficiency in SQL and DSA in Python, currently exploring PySpark.

BigQuery is a data warehousing tool used for analytical purposes.

BigQuery features separated storage and computing resources.

Columnar storage in BigQuery optimizes analytical functions.

Performance optimization in BigQuery through partitioning and clustering.

Cost optimization in BigQuery by purchasing slots in advance.

Reducing costs by scanning fewer records with partitioning and clustering.

Archiving unused tables in cloud storage to save costs.

Using the information schema to track table usage for cost management.

Managing multiple BigQuery datasets in one console with proper permissions.

Materialized views in BigQuery to optimize query execution and reduce costs.

Scheduling queries in BigQuery through the console.

Assigning roles like BigQuery Reader for team members to manage access.

Working with data volumes up to 1 TB per table in BigQuery.

Integration of GCP services like Cloud Storage and Airflow with BigQuery.

Difference between tasks and workflows in Airflow DAGs.

Example of writing a Python script for an Airflow task to transfer data from FTP to GCS and then to BigQuery.

Transcripts

00:00
Candidate: I have four years of experience; most of it is in GCP, majorly on the data side. In GCP I mainly work with BigQuery for the data warehousing part. I am quite comfortable with SQL and DSA in Python, and currently I'm exploring PySpark by myself.

00:28
Interviewer: Okay, all right. So could you please explain the high-level architecture of BigQuery?

00:33
Candidate: Sure. BigQuery, as I mentioned, is a data warehousing tool which we use for analytical purposes. In BigQuery, storage and compute are separated; they are not dependent on each other. Also, BigQuery uses columnar storage, unlike other relational (OLTP) databases, which are row-based; BigQuery is OLAP, which is columnar storage. So whenever we want to do any kind of analytical function, say summing the values stored in a particular column, BigQuery will only access that specific column instead of reading data from all the other columns as well. So it's columnar storage, it's OLAP processing, yeah.

01:31
Interviewer: So how can we optimize performance in BigQuery?

01:34
Candidate: We have multiple options. We have partitioning, we have clustering, and in the SQL query itself we can do a couple of optimizations, like including the filter (WHERE condition) wherever required, and so on.

01:52
Interviewer: And in terms of cost optimization, what can we apply to save cost in BigQuery?

02:01
Candidate: Sorry, I didn't get that.

02:02
Interviewer: Okay, so I'm asking how we can save cost in BigQuery. One aspect is how we can improve performance; another aspect is how we can save cost. Which areas do we need to focus on to save cost?

02:17
Candidate: Correct. So if we have a very specific demand, in BigQuery we have the option to buy slots beforehand. I'm not sure if you are referring to this part for saving cost, but if we know how much our workload will be, we can buy the slots in advance, and whatever we are billed will be against that slot commitment, and we may get a discount. On the other hand, if we improve our query performance, our cost will anyway be saved. If we use partitioning or clustering, then instead of scanning the complete table the query only scans a particular partition of the data, so it scans fewer records, there is less computation, and hence the cost is saved.

03:17
Interviewer: So, for example, for the last four years you have been working in BigQuery. Suppose we have thousands of datasets and thousands of tables, but people are not using those datasets and tables; once created, they just sit there, and ultimately we are paying the cost for them. How can we find out which datasets and tables are not being used right now? How can we filter them out, what is your action plan, and what is your point of view in this area? How can we work on this to save a lot of cost on the things that are no longer used after they were created?

04:08
Candidate: Okay. One thing I would consider: for tables which are not in use but whose data we might need later, I would store that data in Cloud Storage Archive. I believe BigQuery storage cost is similar to Cloud Storage cost when compared with the Standard tier, as far as I remember, but if we are not using a table at all for a very long period of time, then we can save that table's data in Cloud Storage Archive, or Nearline or Coldline, depending on the use case. That way we will be saving the storage cost. I'm not sure how we would check which tables are not used, because I haven't worked on that use case, but there is a view known as INFORMATION_SCHEMA. I'm not sure whether it has the detail about when a table was last used; if that information is in the information schema, I can get it from there. Otherwise I can parse through the Cloud Logging logs: I can go through each of the logs and check when that particular table was last accessed, and if it is beyond a specific threshold, then I can start moving the data from BigQuery to Cloud Storage.

05:49
Interviewer: For example, we have multiple projects, but we want to work with the BigQuery data across all of those projects. How is it possible to manage multiple BigQuery datasets in one single console?

06:15
Candidate: I believe it depends on the permissions. Say I have three projects: one for the development environment, one for testing, and another for production. If a user has access to all three projects with the required permissions, whereby he can read or execute queries, then on the basis of those permissions the user can access any of the tables and datasets from any of the projects in the BigQuery console.

06:58
Interviewer: And what is a materialized view?

07:01
Candidate: With a normal view, whenever I execute the query, the query is executed again and again, even if the returned data is the same. In the case of a materialized view, if I execute a query, I have the data as of that particular time, and when I execute it again, if the data has not changed, I directly get that data and the query is not executed again. If there is some newer data, then only that particular data is processed; the remaining data is not recomputed. So basically we save that much computation.

07:51
Interviewer: Okay. As you know, every query execution has some cost, right? So suppose one query is required to execute every day. In this case, how can we optimize cost?

08:10
Candidate: We can use the cache, so we can cache the table data, and as mentioned we can also use a materialized view. If we have a specific query running on top of a table or multiple tables on a daily basis, then we can instead create a materialized view. We will already have the previous day's data, and the query will not run again for that part of the data; on a daily basis, if new data is added to the source table, only that particular part will be computed. So a materialized view will be a very good fit for this particular scenario.

09:02
Interviewer: And can you tell me, suppose I want to schedule some queries, what is the process and how do I schedule a query? Can you tell me the steps?

09:13
Candidate: In the BigQuery console itself we have an option with which we can schedule a query. We should have the query, and we have to specify the schedule time, whether we want it on a daily basis or a monthly basis. Whatever our schedule parameter is, we have to specify that; we should have the query, and we can schedule it using the console itself.

09:42
Interviewer: Okay. In terms of authentication and authorization, suppose there are 10 members in your team, but we don't want to give admin access to everyone. In this case, which role would you assign so they can work with BigQuery datasets and tables from a developer's point of view?

10:06
Candidate: Oh, so if someone should not have access to modify or create a table, but can view and query the existing tables we already have, then I believe we have a BigQuery reader role. I'm not sure of the exact role names, but I think there is a BigQuery reader role, and that will do.

10:30
Interviewer: How much volume of data have you worked with in BigQuery?

10:35
Candidate: It depends. In the project I'm currently working on, the individual tables are around 500 GB to 1,000 GB, which is 1 TB, and we have 70 to 80 tables. The sizes are quite variable: in some we only have 1–2 GB of data, and in some we have hundreds of GBs.

11:06
Interviewer: Is there any limitation on storing data in a particular table?

11:15
Candidate: I'm not aware of that. I don't think that should be the case, but I'm not aware of it.

11:21
Interviewer: Okay. Apart from BigQuery, when you are creating a data pipeline, which GCP services have you integrated with BigQuery? Can you explain?

11:35
Candidate: Yeah, so we are using Airflow. We already have the source tables in BigQuery; there is another team which loads the data directly from SCP into BigQuery without doing any modification, and from there we create our pipeline. So our source is BigQuery itself: we take the data from BigQuery, we have created a couple of tasks in Airflow, and after completing all those tasks we store that data in BigQuery again. The source data we can consider a data lake in BigQuery itself, and after transforming or processing it via the Airflow tasks we store it again in BigQuery, which we can term the data warehouse, where we have the transformed data. So the main services we are using are Cloud Storage, Airflow, and BigQuery.

12:31
Interviewer: Have you ever set up a streaming pipeline?

12:36
Candidate: No.
Interviewer: Only batch pipelines you have set up, right?
Candidate: Yeah.

12:40
Interviewer: So when you write a DAG, a workflow, and tasks, what is the difference between a task and a workflow?

12:52
Candidate: In a DAG we can have multiple tasks; one task can be dependent on another task, and the entire collection of tasks forms a directed acyclic graph, in which we cannot have a circular formation. That entire sequence or pipeline of tasks is what we term an individual DAG.

13:19
Interviewer: Okay, can you write a DAG task? I'll just give you one scenario. Suppose you fetch the data from an FTP server and load it into a GCS bucket, and then, task number two, you have to fetch the data from the GCS bucket and load it into BigQuery. Can you write a program, a Python script?

13:46
Candidate: Yeah. I may not recall the exact syntax, but I will write the basic pseudocode. Initially we have to define all the DAG parameters, where we might have the DAG ID, the schedule time, catchup equal to true or false, or whatever individual parameters we have. After we have defined that DAG instance, we have to define the tasks. The first task, which we can name T1, gets the data and stores it in the GCS bucket; we can write a Python script for that, so we can use the PythonOperator, and in the PythonOperator we will call the respective Python function which has all our Python code for getting the data and storing it in the GCS bucket. Then for task two, we have a specific operator in Airflow which loads data from GCS directly into BigQuery, so we can make use of that operator. For the GCS-to-BigQuery operator we have multiple options, such as write-append and overwrite dispositions, so on the basis of the scenario we can choose any of those particular formats. So if I...

Related Tags

BigQuery, Data Warehousing, GCP Services, SQL, Python, Cost Optimization, Cloud Storage, Airflow, Data Pipeline, Materialized Views