A data engineer is someone who builds data pipelines. The goal is to get data into a place, such as a dashboard, report, or machine learning model, from which the business can make data-driven decisions.
Transactional Databases vs. Data Warehouse
Data engineers may be responsible for both:
- The back-end transactional database systems that support a company's applications. Cloud SQL is Google Cloud's fully managed relational database solution.
- The data warehouses that support their analytic workloads. BigQuery is Google Cloud's petabyte-scale serverless data warehouse.
Be aware that Cloud SQL is optimized for transactional workloads (mostly writes), while BigQuery is a data warehouse optimized for reporting workloads (mostly reads).
| Product | Storage model | Implication |
| --- | --- | --- |
| Cloud SQL | RECORD-based (row-oriented) storage | The entire record must be read from disk, even if your query selects only a single column. |
| BigQuery | COLUMN-based (column-oriented) storage | Allows very wide reporting schemas, since individual columns can be read from disk independently. |
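To make the difference concrete, here is a minimal sketch (the project, dataset, table, and column names are assumptions for illustration):

```sql
-- On BigQuery's columnar storage, this query reads only the `status`
-- column from disk, so bytes scanned are roughly the size of that one
-- column rather than the whole table.
SELECT
  status,
  COUNT(*) AS order_count
FROM `my-project.my_dataset.orders`  -- hypothetical table
GROUP BY status;
```

On a row-oriented store such as Cloud SQL, the same query would have to read every full record from disk.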
Data Lake, ETL, Data Warehouse
Data in many businesses is siloed by department, and each department creates its own transactional systems to support its own business processes. A data lake brings together raw data from across the enterprise into a single location. Cloud Storage from Google Cloud is a popular choice to serve as a data lake.
The two main entities in Cloud Storage are buckets and objects.
- Buckets are containers for objects. Bucket names occupy a single, globally unique namespace. A bucket is associated with a particular region or with multiple regions; choosing a region close to where the data will be processed reduces latency.
- Objects contain the actual data and metadata, which is information about the objects. Many Cloud Storage features, such as access control, compression, encryption, and lifecycle management, use the metadata.
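One common way to make lake data in Cloud Storage queryable in place is a BigQuery external table. A minimal sketch, assuming CSV files in a hypothetical bucket (all names and the schema are assumptions):

```sql
-- Expose raw CSV files in a Cloud Storage bucket (the data lake)
-- as a queryable external table in BigQuery.
CREATE EXTERNAL TABLE `my-project.lake.raw_sales`
(
  sale_id   INT64,
  sale_date DATE,
  amount    NUMERIC
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-data-lake-bucket/sales/*.csv'],
  skip_leading_rows = 1
);
```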
The raw data can come from relational databases, spreadsheets, offline files, or other systems and apps. There are some considerations as you build a data lake:
- Can your data lake handle all the types of data you have?
- Can it scale to meet the demand?
- Can it support high-throughput ingestion?
- Is there fine-grained access control?
- Can other tools connect easily?
Data becomes useful only after you clean it up. You may need to extract the data from its original location, transform it, and then load it into its destination. ETL (extract, transform, and load) pipelines are usually necessary to ensure data accuracy and quality. The cleaned and transformed data is typically stored in a data warehouse, not in a data lake.
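As a minimal sketch of the transform-and-load steps in SQL (the table names and cleaning rules are assumptions for illustration):

```sql
-- Clean raw lake data and load the result into a warehouse table.
CREATE OR REPLACE TABLE `my-project.warehouse.sales` AS
SELECT
  sale_id,
  sale_date,
  ROUND(amount, 2) AS amount        -- normalize precision
FROM `my-project.lake.raw_sales`
WHERE sale_id IS NOT NULL           -- drop malformed rows
  AND amount >= 0;                  -- drop obviously bad values
```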
A data warehouse is a consolidated place to store data, where all the data is easily joinable and queryable. Unlike a data lake, where the data is kept in its raw format, a data warehouse stores the data in a way that makes it efficient to query. Once your data is in the warehouse, you need to optimize the queries your users are running to make the most efficient use of your compute resources.
There are some considerations as you choose a data warehouse:
- Can it serve as a sink for both batch and streaming data pipelines?
- Can it scale well to meet your needs?
- How is the data organized, cataloged, and access controlled?
- Is the warehouse designed for performance?
- What level of maintenance is required?
Get Value Out of Data
When data is in a usable condition, we can add new value to it through analytics and machine learning. Many teams rely on the data warehouse to build and maintain new data pipelines. The three most common consumers are:
- Machine learning engineer
- Business intelligence and data analysts
- Other data engineers
What’s the complete picture?
Various sources (Raw data)
↓ Replicate
Data lake
↓ ETL
Data warehouse (Data efficient to query)
↓ Pipeline
Other teams (ML, BI, etc.)
Build a Data Warehouse Using BigQuery
Recall that both a data lake and a data warehouse consolidate data, but a data warehouse imposes a schema. To use a data warehouse, a data analyst needs to know the schema of the data.
- The purpose of a Data Lake is to store data only.
- The purpose of a Data Warehouse is NOT merely to store data, but to make sure:
  - The data used for querying is clean, accurate, and consistent, and
  - Queries are fast.
A modern enterprise data warehouse brings the data together and makes it available for querying and data processing. It needs to:
- Scale from gigabytes to petabytes
- Be serverless and no-ops
- Offer an ecosystem of visualization and reporting tools
- Offer an ecosystem of ETL and data processing tools
- Provide up-to-the-minute data
- Support machine learning
- Provide enterprise-grade security and collaboration
BigQuery organizes data tables into units called datasets, which are scoped to your Google Cloud project. When you reference a table from the command line, in SQL queries, or in code, you refer to it by using the construct `project.dataset.table`. You store data in separate tables based on logical schema considerations. The cost of a query is always assigned to the active project from which the query is executed.
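For example (hypothetical names; the backticks are needed because the project ID contains a hyphen):

```sql
-- Fully qualified reference using the project.dataset.table construct.
SELECT
  sale_date,
  SUM(amount) AS total_amount
FROM `my-project.warehouse.sales`
GROUP BY sale_date
ORDER BY sale_date;
```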
Running a query means that you must be able to submit a query job to the service. To run a query in a project, you need the Identity and Access Management (IAM) permission to submit a job. Access control is through IAM and operates at the dataset, table, view, or column level. To query data in a table or view, you need at least read permission on it. BigQuery provides predefined roles for controlling access to resources.
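Dataset-level access can also be managed with BigQuery's SQL DCL statements. A hedged sketch, with an assumed role and principal:

```sql
-- Grant read-only access to every table in the dataset.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `my-project.warehouse`
TO "user:analyst@example.com";
```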
Partitioning can improve query cost and performance by reducing the amount of data scanned. When the data is stored, BigQuery ensures that all the data in a block belongs to a single partition. BigQuery supports three ways of partitioning tables, each with its own bq command-line flag (a DDL sketch follows the list):
- Ingestion time (`--time_partitioning_type`)
- Any column of type DATETIME, DATE, or TIMESTAMP (`--time_partitioning_field`)
- An integer-typed column (`--range_partitioning`)
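A minimal DDL sketch of the column-based and integer-range options (all table and column names are assumptions; ingestion-time partitioning instead relies on the `_PARTITIONTIME` pseudo-column):

```sql
-- Partition on a DATE column: date-filtered queries scan only
-- the matching daily partitions.
CREATE TABLE `my-project.warehouse.sales_by_day`
(
  sale_id   INT64,
  sale_date DATE,
  amount    NUMERIC
)
PARTITION BY sale_date;

-- Integer-range partitioning: buckets of 10,000 customer IDs each.
CREATE TABLE `my-project.warehouse.customers_by_id`
(
  customer_id INT64,
  name        STRING
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 1000000, 10000));
```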
Clustering can improve the performance of certain types of queries. When you cluster a table using multiple columns, the order in which you specify the columns is important: it determines the sort order of the data. BigQuery automatically sorts the data based on the values in the clustering columns, and periodically re-clusters the table for you.
If you don't have a column to partition on but still want the benefits of clustering, you can create a placeholder date column (for example, one named `_date`) of type DATE and have all its values be NULL, as in the sketch below.
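A minimal clustering sketch under the same assumed names, including the placeholder-column trick from the paragraph above:

```sql
-- Within each daily partition, rows are sorted by customer_id, then
-- product_id, so queries filtering on those columns scan fewer blocks.
CREATE TABLE `my-project.warehouse.orders_clustered`
(
  order_date  DATE,
  customer_id INT64,
  product_id  INT64,
  amount      NUMERIC
)
PARTITION BY order_date
CLUSTER BY customer_id, product_id;

-- The placeholder-column trick: partition on a DATE column whose
-- values are always NULL, purely to unlock clustering.
CREATE TABLE `my-project.warehouse.events_clustered`
(
  _date   DATE,    -- always NULL; exists only to satisfy PARTITION BY
  user_id INT64,
  event   STRING
)
PARTITION BY _date
CLUSTER BY user_id;
```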
Schema Design
Normalizing data means organizing it into relational form. This stores the data efficiently and makes query processing a clear and direct task. Normalization usually happens when a schema is designed for a database.
Denormalizing is the strategy of allowing duplicate field values in a table to gain processing performance. Data is repeated rather than kept relational. Flattened data takes more storage, but makes queries more efficient. Specifically, denormalizing data enables BigQuery to distribute processing among slots more efficiently, resulting in more parallel processing and better query performance. You would usually denormalize data before loading it into BigQuery.
However, there are cases where denormalizing data is bad for performance, specifically when you have to GROUP BY a column with a one-to-many relationship. To group the data, it must be shuffled, which is slow and often happens by transferring the data over a network between servers. Fortunately, this can be solved with nested and repeated fields:
| Field type | Schema setting | Purpose |
| --- | --- | --- |
| ARRAY fields | mode = REPEATED | Allow a given field to go deep into granularity. |
| STRUCT fields | type = RECORD | Allow you to organize related fields into logical containers instead of separate tables. |
The ARRAY and STRUCT data types in SQL can be used entirely independently of each other.
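A minimal sketch of both types together (the schema is an assumption): each order row carries a repeated ARRAY of item STRUCTs instead of a separate items table, and UNNEST flattens it back out at query time.

```sql
-- Inline sample data: one order with two nested line items.
WITH orders AS (
  SELECT
    1 AS order_id,
    [STRUCT('widget' AS name, 2 AS qty),
     STRUCT('gadget' AS name, 1 AS qty)] AS items
)
SELECT
  o.order_id,
  item.name,
  item.qty
FROM orders AS o, UNNEST(o.items) AS item;  -- flatten the repeated field
```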
General Guidelines
- Instead of joins, take advantage of Nested and Repeated Fields in denormalized tables.
- Keep a dimension table smaller than 10 gigabytes normalized, unless the table rarely goes through UPDATE and DELETE operations.
- Denormalize a dimension table larger than 10 gigabytes, unless data manipulation or costs outweigh benefits of optimal queries.
My Certificate
For more on Data Lakes and Data Warehouses in Google Cloud, please refer to the wonderful course here https://www.coursera.org/learn/data-lakes-data-warehouses-gcp
I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai