GCP Data Engineer Mock Interview
Summary
TL;DR: The speaker shares four years of experience with Google Cloud Platform (GCP), focusing on BigQuery for data warehousing. They delve into BigQuery's architecture, highlighting its columnar storage and separation of storage and compute. The discussion covers performance optimization strategies like partitioning, clustering, and SQL query enhancements. Cost-effective practices are also explored, including slot purchasing and archiving unused datasets in Cloud Storage. The conversation extends to managing multiple datasets across projects, the utility of materialized views for repetitive queries, and scheduling queries within the BigQuery console. The speaker addresses team access management, data pipeline creation with Airflow, and the difference between tasks and workflows in DAGs. They conclude with a hypothetical scenario of transferring data from an FTP server to GCS and then to BigQuery.
Takeaways
- The speaker has four years of experience, primarily with Google Cloud Platform (GCP) and BigQuery for data warehousing.
- BigQuery is a columnar-storage data warehousing tool that separates storage and compute, optimizing analytical functions.
- The speaker is proficient in SQL and DSA (data structures and algorithms) in Python, and is exploring PySpark on their own.
- To optimize performance in BigQuery, techniques such as partitioning, clustering, and using the WHERE clause effectively are discussed.
- For cost optimization in BigQuery, the speaker suggests purchasing slots in advance for known workloads to potentially get a discount.
- Unused tables in BigQuery can be archived in Cloud Storage to save on storage costs; the speaker considers using INFORMATION_SCHEMA or Cloud Logging to identify them.
- The speaker discusses managing multiple BigQuery datasets across different projects with appropriate permissions, highlighting the BigQuery Reader role for limited access.
- Materialized views in BigQuery can save costs by caching query results, reducing the need to re-execute queries over unchanged data.
- Scheduling queries in BigQuery can be done through the console, with the speaker outlining the process for setting up schedules based on specific parameters.
- The speaker has experience integrating BigQuery with other GCP services like Cloud Storage and Apache Airflow for data pipelines, but hasn't set up streaming pipelines.
- When creating a data pipeline with Airflow, the difference between a DAG (Directed Acyclic Graph) and tasks is highlighted, with tasks being parts of the larger DAG workflow.
Q & A
What is the primary function of BigQuery as described in the transcript?
- BigQuery is a data warehousing tool used for analytical purposes, offering a columnar storage system that separates storage and compute, allowing for efficient data analysis.
How does BigQuery's columnar storage differ from traditional row-based storage?
- BigQuery's columnar storage allows for faster analytical processing by accessing only the specific columns needed for a query, rather than scanning entire rows as in traditional row-based storage.
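One way to see this in practice: a BigQuery dry run reports how many bytes a query would scan without executing it. Below is a minimal sketch using the Python client; the `project.dataset.sales` table and its `amount` column are hypothetical, but summing one column should report far fewer bytes processed than `SELECT *`.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# `project.dataset.sales` and its `amount` column are placeholders.
for sql in (
    "SELECT SUM(amount) FROM `project.dataset.sales`",  # touches one column
    "SELECT * FROM `project.dataset.sales`",            # touches every column
):
    job = client.query(sql, job_config=config)  # dry run: returns immediately
    print(f"{sql} -> {job.total_bytes_processed} bytes would be scanned")
```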
What are some methods mentioned to optimize performance in BigQuery?
- Performance in BigQuery can be optimized through partitioning, clustering, and using the WHERE clause effectively to filter data, thus reducing the amount of data scanned.
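As a rough illustration of the partitioning and clustering the answer mentions, here is a hedged sketch using the BigQuery Python client and standard DDL. The `project.dataset.events` table, its `event_ts` and `customer_id` columns, and the source table are all hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# PARTITION BY lets BigQuery prune whole partitions when a query filters
# on the partitioning column; CLUSTER BY co-locates rows so filters on
# customer_id read fewer storage blocks. Table names are placeholders.
client.query("""
    CREATE TABLE `project.dataset.events`
    PARTITION BY DATE(event_ts)
    CLUSTER BY customer_id
    AS SELECT * FROM `project.dataset.events_raw`
""").result()

# The WHERE clause on the partitioning column means only one day's
# partition is scanned instead of the whole table.
client.query("""
    SELECT customer_id, SUM(amount) AS total
    FROM `project.dataset.events`
    WHERE DATE(event_ts) = '2024-06-01'
    GROUP BY customer_id
""").result()
```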
How can one save costs when using BigQuery, as discussed in the transcript?
- Costs in BigQuery can be saved by purchasing slots in advance for known workloads, implementing partitioning and clustering to reduce data scanned, and archiving unused datasets to cheaper storage options like Cloud Storage.
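A minimal sketch of the archiving idea, assuming a hypothetical table and bucket: export the table to Cloud Storage (the bucket's Nearline, Coldline, or Archive storage class sets the at-rest price), then drop the BigQuery table.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Bucket and table names are hypothetical; the destination bucket's
# storage class determines how cheap the archived copy is.
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.AVRO,
)
client.extract_table(
    "project.dataset.stale_table",
    "gs://my-archive-bucket/stale_table/part-*.avro",  # wildcard: sharded output
    job_config=job_config,
).result()

# Once the export succeeds, drop the table to stop paying BigQuery storage.
client.delete_table("project.dataset.stale_table")
```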
What is the role of the 'information schema' in managing BigQuery datasets?
- The INFORMATION_SCHEMA views (or, failing that, Cloud Logging audit logs) can be used to track when tables were last accessed, which helps identify unused datasets that could be archived to save costs.
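The speaker hedges on exactly where this information lives; one common approach (an assumption here, not the speaker's confirmed method) is to join the `INFORMATION_SCHEMA.JOBS_BY_PROJECT` view's `referenced_tables` against job creation times. The region qualifier and the 90-day threshold below are placeholders to adjust.

```python
from google.cloud import bigquery

client = bigquery.Client()

# JOBS_BY_PROJECT retains roughly 180 days of job history, so "not
# referenced in this window" is only an approximation of "unused".
sql = """
    SELECT ref.dataset_id, ref.table_id,
           MAX(j.creation_time) AS last_referenced
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT AS j,
         UNNEST(j.referenced_tables) AS ref
    GROUP BY ref.dataset_id, ref.table_id
    HAVING MAX(j.creation_time)
           < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
    ORDER BY last_referenced
"""
for row in client.query(sql).result():
    print(row.dataset_id, row.table_id, row.last_referenced)
```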
How can one manage multiple BigQuery datasets across different projects?
- Managing multiple BigQuery datasets across projects can be done through proper permission settings, allowing users with the required permissions to access and query data across different environments like development, staging, and production.
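A small sketch of what that looks like in practice, with hypothetical project IDs: a single client, bound to one billing project, can reference tables in any project where the caller holds a read role such as `roles/bigquery.dataViewer`, simply by using fully qualified `project.dataset.table` paths.

```python
from google.cloud import bigquery

# Queries are billed to "analytics-prod" (a placeholder), but the tables
# referenced below live in other projects the caller can read.
client = bigquery.Client(project="analytics-prod")

sql = """
    SELECT o.order_id, o.amount, s.status
    FROM `analytics-dev.sales.orders` AS o
    JOIN `analytics-staging.sales.order_status` AS s USING (order_id)
"""
for row in client.query(sql).result():
    print(row.order_id, row.status)
```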
What is a materialized view in BigQuery and how can it help in cost optimization?
- A materialized view in BigQuery is a pre-computed, saved result of a query that can be accessed quickly without re-executing the query. It can help with cost optimization by reducing computation for frequently run queries over unchanged data.
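A hedged sketch of the materialized-view pattern with hypothetical table names: BigQuery maintains the view incrementally, so repeat queries read the precomputed aggregate and only newly arrived base-table rows are processed.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Table and view names are placeholders. The view must be an aggregation
# BigQuery can maintain incrementally (a simple GROUP BY qualifies).
client.query("""
    CREATE MATERIALIZED VIEW `project.dataset.daily_revenue` AS
    SELECT DATE(order_ts) AS day, SUM(amount) AS revenue
    FROM `project.dataset.orders`
    GROUP BY day
""").result()

# Subsequent reads hit the materialized result instead of re-aggregating
# the whole orders table.
rows = client.query(
    "SELECT * FROM `project.dataset.daily_revenue` ORDER BY day DESC LIMIT 7"
).result()
```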
How can one schedule queries in BigQuery?
- Queries in BigQuery can be scheduled through the BigQuery console by specifying the query and the schedule parameters, such as frequency (daily, monthly), to automate execution.
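The interview describes the console flow; for completeness, scheduled queries are backed by the BigQuery Data Transfer Service, so a programmatic sketch looks roughly like the following. Project, dataset, table, and query here are all placeholders.

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# "scheduled_query" is the data source id that marks this transfer config
# as a scheduled query; everything else below is a placeholder.
transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="daily_revenue_rollup",
    data_source_id="scheduled_query",
    destination_dataset_id="reporting",
    schedule="every 24 hours",
    params={
        "query": "SELECT CURRENT_DATE() AS day, COUNT(*) AS orders "
                 "FROM `my-project.sales.orders`",
        "destination_table_name_template": "daily_orders",
        "write_disposition": "WRITE_APPEND",
    },
)
client.create_transfer_config(
    parent=client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
```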
What is the difference between a task and a workflow (DAG) in Apache Airflow?
- In Apache Airflow, a task is an individual unit of work, while a workflow (DAG) is a collection of tasks arranged in a directed acyclic graph, where tasks can depend on each other.
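A minimal sketch of the distinction: the `DAG` object below is the workflow, each `PythonOperator` instance is a task, and the `>>` edge encodes the dependency between them. Identifiers are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# The DAG is the whole workflow; each operator instance inside it is one task.
with DAG(
    dag_id="two_task_example",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract",
        python_callable=lambda: print("extracting"),
    )
    load = PythonOperator(
        task_id="load",
        python_callable=lambda: print("loading"),
    )

    extract >> load  # the DAG edge: load runs only after extract succeeds
```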
Can you provide a basic example of how to write an Airflow DAG for transferring data from an FTP server to a GCS bucket and then to BigQuery?
- An Airflow DAG for this scenario would define two tasks: one using the PythonOperator to fetch data from the FTP server and store it in a GCS bucket, and another using a transfer-specific operator to load the data from GCS into BigQuery.
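Since the interview only reaches pseudocode, here is a hedged sketch of such a DAG. The FTP host, credentials, bucket, and table names are all placeholders, and real code would pull credentials from an Airflow Connection or secret backend rather than hardcoding them.

```python
from datetime import datetime
from ftplib import FTP

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)
from google.cloud import storage


def ftp_to_gcs() -> None:
    """Download one file from the FTP server and upload it to GCS."""
    ftp = FTP("ftp.example.com")      # hypothetical host
    ftp.login("user", "password")     # placeholder: use a Connection in practice
    local_path = "/tmp/export.csv"
    with open(local_path, "wb") as fh:
        ftp.retrbinary("RETR export.csv", fh.write)
    ftp.quit()
    storage.Client().bucket("landing-bucket").blob("export.csv") \
        .upload_from_filename(local_path)


with DAG(
    dag_id="ftp_to_gcs_to_bq",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Task 1: fetch from FTP and land the file in the GCS bucket.
    t1 = PythonOperator(task_id="ftp_to_gcs", python_callable=ftp_to_gcs)

    # Task 2: load the landed file from GCS into BigQuery.
    t2 = GCSToBigQueryOperator(
        task_id="gcs_to_bq",
        bucket="landing-bucket",
        source_objects=["export.csv"],
        destination_project_dataset_table="my-project.staging.ftp_export",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_APPEND",  # or WRITE_TRUNCATE to overwrite
    )

    t1 >> t2
```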
What is the approximate size of the tables the speaker has worked with in their current BigQuery project?
- The speaker's current project has around 70 to 80 tables, with individual tables ranging from 1-2 GB up to roughly 1 TB (500-1,000 GB).
Outlines
Experience and BigQuery Architecture
The speaker has four years of experience, primarily with Google Cloud Platform (GCP), focusing on data warehousing using BigQuery. They are proficient in SQL and DSA in Python and are currently exploring PySpark. The speaker explains BigQuery's high-level architecture, highlighting its separation of storage and compute, its columnar storage, and the benefits of this for analytical functions. They discuss performance optimization techniques such as partitioning, clustering, and query optimization, as well as cost-saving strategies like purchasing slots in advance and using partitioning and clustering to reduce data scanning and computation.
Managing BigQuery Costs and Datasets
The discussion continues with strategies to save costs in BigQuery, including buying slots in advance for predictable workloads and optimizing performance to reduce costs. The speaker suggests archiving unused tables in cloud storage to save on storage costs and mentions the use of the information schema to identify tables not in use. They also touch on managing multiple BigQuery datasets across different projects and the concept of materialized views for query optimization, which can save on computation costs by caching query results.
Scheduling Queries and Data Pipelines
The speaker outlines the process for scheduling queries in BigQuery through the console, emphasizing the importance of setting the correct schedule parameters. They also discuss the importance of proper authentication and authorization within a team, suggesting the use of the BigQuery reader role for members who should only view and query existing tables. The speaker shares their experience with data volumes, mentioning working with tables ranging from 1-2 GB to 500-1000 GB, and the use of Airflow for creating data pipelines that integrate with BigQuery. They explain the difference between a DAG (Directed Acyclic Graph) and tasks within Airflow, providing an example of a data pipeline that moves data from an FTP server to a GCS bucket and then to BigQuery.
Data Transfer and Task Operations in Airflow
The speaker concludes by discussing the process of writing an Airflow task to transfer data from an FTP server to a GCS bucket and then to BigQuery. They mention defining DAG parameters and creating tasks using Python operators. The speaker provides a high-level overview of the process, including the use of specific operators for transferring data from GCS to BigQuery and the consideration of different disposition options for these operations.
Keywords
BigQuery
Columnar Storage
OLAP (Online Analytical Processing)
SQL
Data Partitioning
Clustering
Cost Optimization
Cloud Storage
Information Schema
Airflow
Materialized View
Highlights
Four years of experience with GCP, primarily in data warehousing with BigQuery.
Proficiency in SQL and DSA in Python, currently exploring PySpark.
BigQuery is a data warehousing tool used for analytical purposes.
BigQuery features separated storage and computing resources.
Columnar storage in BigQuery optimizes analytical functions.
Performance optimization in BigQuery through partitioning and clustering.
Cost optimization in BigQuery by purchasing slots in advance.
Reducing costs by scanning fewer records with partitioning and clustering.
Archiving unused tables in cloud storage to save costs.
Using the information schema to track table usage for cost management.
Managing multiple BigQuery datasets in one console with proper permissions.
Materialized views in BigQuery to optimize query execution and reduce costs.
Scheduling queries in BigQuery through the console.
Assigning roles like BigQuery Reader for team members to manage access.
Working with data volumes up to 1 TB per table in BigQuery.
Integration of GCP services like Cloud Storage and Airflow with BigQuery.
Difference between tasks and workflows in Airflow DAGs.
Example of writing a Python script for an Airflow task to transfer data from FTP to GCS and then to BigQuery.
Transcripts
Candidate: I have four years of experience. Most of my experience is in GCP, mainly on the data side. In GCP I work mostly with BigQuery for the data warehousing part. I am quite comfortable with SQL and DSA in Python, and currently I'm exploring PySpark by myself.

Interviewer: Okay, all right. Could you please explain the high-level architecture of BigQuery?

Candidate: Sure. BigQuery, as I mentioned, is a data warehousing tool which we use for analytical purposes. In BigQuery, storage and compute are separated; they are not dependent on each other. Also, BigQuery uses columnar storage: unlike other relational databases, which are OLTP and row-based, BigQuery is OLAP, which is columnar. So whenever we want to do any kind of analytical function, say summing the values stored in a particular column, BigQuery will only access that specific column instead of reading data from all the other columns as well. So it's columnar storage and OLAP processing.

Interviewer: So how can we optimize performance in BigQuery?

Candidate: We have multiple options. We have partitioning, we have clustering, and in the SQL query itself we can do a couple of optimizations, like including a filter or WHERE condition wherever required, all of that.

Interviewer: And in terms of cost optimization, what can we apply to save cost in BigQuery?
Candidate: Sorry, I didn't get that.

Interviewer: Okay, so I'm asking how we can save cost in BigQuery. One aspect is how we can improve performance; another aspect is how we can save cost. Which areas do we need to focus on to save cost?

Candidate: Correct. So if we have a very specific demand, in BigQuery we have the option to buy slots beforehand; I'm not sure if you are referring to this part for saving cost. If we know how large our workload will be, we can buy the slots in advance, and whatever we are billed will be against that slot commitment, and we may get a discount. On the other hand, if we improve our query performance, our cost will be saved anyway. If we use partitioning or clustering, then instead of scanning the complete table, our query only scans a particular partition of the data; it scans fewer records, so there is less computation and hence the cost is saved.

Interviewer: So, for example, you have been working in BigQuery for the last four years. Suppose we have thousands of datasets and thousands of tables, but people are not using them; once created, they just sit there, and ultimately we are paying the cost for them. How can we find out which datasets and tables are no longer being used? What is your action plan, and what is your point of view on how we can work on this to save a lot of cost on things that were created once and are not being used?
Candidate: Okay. One thing I would consider, for tables which are not in use but whose data we might need later, is to store that data in Cloud Storage Archive. I believe BigQuery storage cost is similar to Cloud Storage cost when compared with the Standard tier, as far as I remember, but if we are not using a table for a very long period of time, we can save that table's data in Cloud Storage Archive storage, or Nearline or Coldline, depending on the use case. That way we save the storage cost. I'm not sure how we would check which tables are not used, because I haven't worked on that use case, but there is a view known as INFORMATION_SCHEMA. I'm not sure whether it has the detail about when a table was last used; if that information is in the information schema, I can get it from there. Otherwise I can parse through the Cloud Logging logs: I can go through each of the logs and check when a particular table was last accessed, and if that is beyond a specific threshold, I can start moving the data from BigQuery to Cloud Storage.

Interviewer: For example, we have multiple projects, but we want to work with the BigQuery data across all of them. How is it possible to manage multiple BigQuery datasets in one single console?
Candidate: I believe it depends on the permissions. Assuming I have three projects, one for the development environment, one for testing, and another for production: if a user has access to all three projects with the required permissions, wherein he can read or execute commands, then on the basis of those permissions the user can access any of the tables and datasets from any of the projects in the BigQuery console.

Interviewer: And what is a materialized view?

Candidate: With a normal view, whenever I execute the query, the query is executed again and again, even if the data that will be returned is the same. In the case of a materialized view, if I execute a query, I have the data as of that particular time, and whenever I execute it again, if the data has not changed, I directly get that data back and the query is not executed again. If there is some newer data, only that particular part is processed; the remaining data is not recomputed. So basically we save that much computation.

Interviewer: Okay. As you know, every query execution has some cost, right? So suppose one query is required to execute every day. In this case, how can we optimize the cost?

Candidate: We can use caching, so we can cache the table data, and as mentioned we can also use a materialized view. If we have a specific query which runs on top of one or more tables on a daily basis, we can instead create a materialized view. We will already have the previous day's data, so the query won't run again for that part; on a daily basis, if new data is added to the source table, only that particular part is computed. So a materialized view would be a very good fit for this scenario.
Interviewer: Can you tell me, suppose I want to schedule some queries, what is the process? How do I schedule a query? Can you tell me the steps?

Candidate: In the BigQuery console itself we have an option with which we can schedule a query. We should have the query, and we have to specify the schedule parameters: whether we want it on a daily basis or a monthly basis, whatever our schedule parameter is, we have to specify that, and we can schedule it using the console itself.

Interviewer: Okay. In terms of authentication and authorization, suppose there are ten members in your team, but we don't want to give admin access to everyone. In this case, which role would you assign so they can work with BigQuery datasets and tables from a developer's point of view?

Candidate: Oh, so if someone should not have access to modify or create tables, but can view and query the existing tables we already have, then I believe we have a BigQuery Reader role. I'm not sure about the exact role names, but I think there is a BigQuery Reader role, and that will do.

Interviewer: How much volume of data have you worked with in BigQuery?

Candidate: It depends. In the project I'm currently working on, the individual tables are around 500 GB to 1,000 GB, which is 1 TB, and we currently have 70 to 80 tables. The sizes are quite variable: in some tables we only have 1-2 GB of data, and in some we have hundreds of GBs.

Interviewer: Is there any limitation on storing data in a particular table?

Candidate: I'm not aware of that. I don't think it should be the case, but I'm not aware of it.

Interviewer: Okay. Apart from BigQuery, when you create a data pipeline, which GCP services have you integrated with BigQuery? Can you explain?
Candidate: Yeah, we are using Airflow. We already have the source tables in BigQuery: there is another team which loads the data directly from SCP into BigQuery without any modification, and from there we create our pipeline. So our source is BigQuery itself. We take the data from BigQuery, we have created a couple of tasks in Airflow, and after all those tasks complete we store the data in BigQuery again. The source data we can consider a data lake in BigQuery itself, and after transforming or processing it via the Airflow tasks we store it again in BigQuery, which we can term a data warehouse containing the transformed data. So the main services we are using are Cloud Storage, Airflow, and BigQuery.

Interviewer: Have you ever set up a streaming pipeline?

Candidate: No.

Interviewer: Only batch pipelines you have set up, right?

Candidate: Yeah.

Interviewer: So when you write a DAG, what is the difference between a task and a workflow?

Candidate: In a DAG we can have multiple tasks; one task can be dependent on another, and the entire collection of tasks forms a directed acyclic graph, wherein we cannot have a circular formation. This entire sequence, or pipeline, of tasks is what we term an individual DAG.

Interviewer: Okay, can you write a DAG task? I'll give you one scenario: suppose you fetch data from an FTP server and load it into a GCS bucket, and then, as task number two, you fetch the data from the GCS bucket and load it into BigQuery. Can you write a program, a Python script?

Candidate: Yeah, I may not recall the exact syntax, but I'll write the basic pseudocode. Initially we have to define all the DAG parameters, wherein we might have the DAG ID, the schedule time, catchup equal to true or false, whatever individual parameters we have. After defining that DAG instance, we have to define the tasks. The first task, which we can name T1, gets the data and stores it in the GCS bucket; we can write a Python script for that, so we use the PythonOperator, and in the PythonOperator we call the respective Python function containing all our Python code that fetches the data and stores it in the GCS bucket. Then for task two, we have a specific operator in Airflow which loads data from GCS directly into BigQuery, so we can make use of that operator. In the GCS-to-BigQuery operator we have multiple options, write-append or overwrite dispositions, so based on our use case or scenario we can choose any of those. So if I...