Google Cloud Big Data and Machine Learning Fundamentals

This course provides an introduction to the tools and technologies Google Cloud offers to work with large data sets and then integrate that data into the artificial intelligence and machine learning lifecycle.

Data is the foundation of every application integrated with artificial intelligence. Without data, there is nothing for AI to learn from, no pattern to recognize, and no insight to glean. Conversely, without artificial intelligence, large amounts of data can be unmanageable or underutilized.

Google Cloud Services

Google Cloud offerings can be broadly categorized as compute, storage, big data, and machine learning services for web, mobile, analytics, and backend solutions. The main focus of this course is on big data and machine learning.

Compute

Organizations with growing data needs often require lots of compute power to run big data jobs. Google offers a range of computing services:

Compute Engine	IaaS offering. Raw compute, storage, and network capabilities. Maximum flexibility.
Kubernetes Engine (GKE)	Run containerized applications in a cloud environment.
App Engine	Fully managed PaaS offering. Bind code to libraries. Focus on application logic.
Cloud Functions	Execute code in response to events. FaaS (functions as a service) offering.

In recent years, CPUs and GPUs have no longer been able to scale fast enough to meet the rapidly growing demand for machine learning. To help overcome this challenge, Google introduced the Tensor Processing Unit (TPU) in 2016. TPUs are Google’s custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. According to the course, TPUs can increase computing speed by more than 200 times.

Storage

In desktop computing, compute and storage are coupled. In cloud computing, they are decoupled so that each can scale independently: processing limitations are not attached to storage disks.

You could install a database on a virtual machine just as you would do in a datacenter. Alternatively, Google Cloud offers fully managed database and storage services:

	Unstructured data (documents, images, audio, etc.)
Cloud Storage	Four storage classes: 1. Standard storage (hot data) 2. Nearline storage (accessed about once per month) 3. Coldline storage (about once every 90 days) 4. Archive storage (about once a year)
	Structured data (tables)
Cloud SQL	Transactional workloads, SQL, local / regional scalability
Cloud Spanner	Transactional workloads, SQL, global scalability
Firestore	Transactional workloads, NoSQL
BigQuery	Analytical workloads, SQL (BigQuery is both a storage and an analytics product)
Cloud Bigtable	Analytical workloads, NoSQL
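
The table above reads like a decision tree, which can be sketched in a few lines of Python. This is only an illustration of the notes: the function names are hypothetical, and the day thresholds (30, 90, 365) are my own assumption derived from the "once per month / 90 days / year" hints, not official limits.

```python
def pick_storage_class(days_between_accesses: int) -> str:
    """Map an expected access interval (in days) to a Cloud Storage class.

    The cut-off values are assumptions based on the table, not official rules.
    """
    if days_between_accesses < 30:
        return "Standard"   # hot data, accessed frequently
    if days_between_accesses < 90:
        return "Nearline"   # roughly once per month
    if days_between_accesses < 365:
        return "Coldline"   # roughly once every 90 days
    return "Archive"        # roughly once a year


def pick_database(workload: str, sql: bool, global_scale: bool = False) -> str:
    """Follow the structured-data rows of the table."""
    if workload == "transactional":
        if not sql:
            return "Firestore"
        return "Cloud Spanner" if global_scale else "Cloud SQL"
    if workload == "analytical":
        return "BigQuery" if sql else "Cloud Bigtable"
    raise ValueError(f"unknown workload: {workload!r}")
```

For example, `pick_database("transactional", sql=True, global_scale=True)` follows the table to Cloud Spanner.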

Big Data and Machine Learning Products

All of these products can be divided into four general categories along the data-to-AI workflow.

Categories:	Products
Ingestion & process	Digest both real-time and batch data: Pub/Sub, Dataflow, Dataproc, Cloud Data Fusion
Storage	Cloud SQL, Cloud Spanner, Firestore, Cloud Bigtable
Analytics	BigQuery, Google Data Studio, Looker
Machine learning	ML development platforms: Vertex AI (AutoML, Workbench, TensorFlow). AI solutions: Document AI, Contact Center AI, Retail Product Discovery, Healthcare Data Engine, etc.

Streaming Data

Batch processing is when processing and analysis happen on a set of stored data. Streaming data is different: it is a flow of data records generated by various data sources, and processing happens as the data flows through the system. This results in the analysis and reporting of events as they happen.

Streaming data processing means that the data is analyzed in near real time and that actions are taken on the data as quickly as possible. Modern data processing has progressed from legacy batch processing toward working with real-time data streams.
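
To make the difference concrete, here is a minimal sketch in plain Python (not a Google Cloud API). The function names and sample readings are made up for illustration: the batch function needs the whole stored data set, while the streaming function emits an updated result for every record that arrives.

```python
from typing import Iterable, Iterator


def batch_average(stored: list[float]) -> float:
    """Batch: the full data set is already stored, so process it in one pass."""
    return sum(stored) / len(stored)


def streaming_average(records: Iterable[float]) -> Iterator[float]:
    """Streaming: emit an updated result as each record flows through."""
    count, total = 0, 0.0
    for value in records:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0]                       # made-up sample data
batch_result = batch_average(readings)              # one answer after all data is stored
stream_results = list(streaming_average(readings))  # one answer per arriving record
```

Here `batch_result` is `20.0`, while `stream_results` is `[10.0, 15.0, 20.0]`: the streaming version reports a usable answer after every event instead of waiting for the full set.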

In modern organizations, data engineers and data scientists are facing four major challenges:

Variety	Data can come from a variety of different sources and in various formats.
Volume	The volume of data can vary from gigabytes to petabytes.
Velocity	Data often needs to be processed in near real time, as soon as it reaches the system.
Veracity	Data quality varies: data gathered from many sources may come with inconsistencies and uncertainties.

Data Ingestion (Pub/Sub)

One of the early stages in a data pipeline is data ingestion, which is where large amounts of streaming data are received. The data might stream from as many as a million different events that are all happening asynchronously, for example data from Internet of Things (IoT) applications.

This presents new challenges for data ingestion, which can be summarized in four points:

Data can be streamed from many different methods and devices
It can be hard to distribute event messages to the right subscribers
Data can arrive quickly and in high volume
Services must be reliable and secure

Pub/Sub is a tool for handling distributed, message-oriented architectures at scale. It is a distributed messaging service that can receive messages from a variety of device streams. It ensures at-least-once delivery of received messages to subscribing applications with no provisioning required, and offers end-to-end encryption.

A central element of Pub/Sub is the topic. You can think of a topic as a radio antenna: there can be zero, one, or more publishers and zero, one, or more subscribers related to a topic. Publishers and subscribers are completely decoupled, so either side is free to break without affecting its counterparts.

Topic	Radio antenna
Topic is always there.	The antenna itself is always there, whether your radio is playing music or it’s turned off.
A publisher can send data to a topic that has no subscriber to receive it.	If music is being broadcast on a frequency that nobody’s listening to, the stream of music still exists.
A subscriber can be waiting for data from a topic that isn’t getting data sent to it.	Your radio plays static from a bad radio frequency.
A fully operational pipeline where the publisher is sending data to a topic that an application is subscribed to.	Your radio plays music from a radio frequency.
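
The decoupling in the analogy can be sketched with a toy in-memory broker. This is not the Pub/Sub client library: `ToyBroker`, its methods, and the topic name are hypothetical, and the sketch ignores delivery guarantees, retention, and acknowledgements.

```python
from collections import defaultdict
from typing import Callable


class ToyBroker:
    """In-memory stand-in for a message broker; not the Pub/Sub client API."""

    def __init__(self) -> None:
        # topic name -> list of subscriber callbacks
        self._subscribers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        """Deliver to every current subscriber; return how many received it."""
        callbacks = self._subscribers[topic]
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = ToyBroker()
received: list[str] = []

# A publisher with no subscriber: publishing still works, nobody hears it.
delivered_before = broker.publish("sensor-readings", "temp=21")

# Once an application subscribes, later messages reach it.
broker.subscribe("sensor-readings", received.append)
delivered_after = broker.publish("sensor-readings", "temp=22")
```

The publisher never references the subscriber and vice versa; both only know the topic name, which is the property the radio antenna analogy describes.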

Next we’ll need a pipeline that can match Pub/Sub scale and elasticity to get these messages reliably into our data warehouse.

Dataflow

Dataflow creates a pipeline to process both streaming data and batch data. The word “process” in this case refers to the steps to extract, transform, and load data, or ETL. A popular way to design such pipelines is Apache Beam, an open source, unified programming model for defining and executing data processing pipelines, including ETL, batch, and stream processing.
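
As a rough illustration of the three ETL steps, here is a sketch in plain Python using generators. It deliberately does not use Apache Beam or Dataflow; the sample CSV, the function names, and the list standing in for a data warehouse are all made up.

```python
import csv
import io
from typing import Iterable, Iterator

RAW_CSV = "city,temp_f\nParis,68\nOslo,50\nCairo,95\n"  # made-up sample data


def extract(text: str) -> Iterator[dict]:
    """Extract: read raw records from a source (here, an in-memory CSV)."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows: Iterable[dict]) -> Iterator[dict]:
    """Transform: convert Fahrenheit to Celsius, rounded to one decimal."""
    for row in rows:
        celsius = (float(row["temp_f"]) - 32) * 5 / 9
        yield {"city": row["city"], "temp_c": round(celsius, 1)}


def load(rows: Iterable[dict], sink: list) -> None:
    """Load: write the transformed records to a destination (here, a list)."""
    sink.extend(rows)


warehouse: list[dict] = []  # stands in for a real sink such as a warehouse table
load(transform(extract(RAW_CSV)), warehouse)
```

Because each stage is a generator, records flow through one at a time, which loosely mirrors how a pipeline model can treat batch and streaming inputs with the same steps.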

The next step is to identify an execution engine to implement those pipelines. Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud ecosystem. Dataflow is designed to be low-maintenance, serverless and NoOps.

Serverless	Serverless computing is a cloud computing execution model in which the provider (here, Google Cloud) manages infrastructure tasks on behalf of users.
NoOps	A NoOps environment is one that doesn’t require management from an operations team, because maintenance, monitoring, and scaling are automated.

Developers can benefit from Dataflow templates, which cover common use cases across Google Cloud products. BigQuery is one of many destinations that Dataflow can write data to.

Visualization (Looker or Data Studio)

Data that is difficult to interpret or draw insights from might be useless. After data is in BigQuery, a lot of skill and effort can still be required to uncover insights.

Looker supports BigQuery as well as more than 60 different types of SQL database products, commonly referred to as dialects. It allows developers to define a semantic modeling layer on top of databases using the Looker Modeling Language, or LookML. LookML defines logic and permissions independently of a specific database or SQL language, which frees the data engineer from interacting with individual databases to focus more on business logic across an organization.

Another popular data visualization tool offered by Google is Data Studio. Data Studio is integrated into BigQuery, which makes data visualization possible with just a few clicks.

BigQuery

BigQuery is a fully managed data warehouse. A data warehouse is a large store, containing terabytes to petabytes of data gathered from a wide range of sources within an organization, that is used to guide management decisions. Being fully managed means that BigQuery takes care of the underlying infrastructure, so you can focus on using SQL queries to answer business questions without worrying about deployment, scalability, and security.

It provides two services: storage and analytics. The two services are connected by Google’s high-speed internal network.
Fully managed and serverless.
Flexible pay-as-you-go pricing model.
Data encrypted at rest by default.
Built-in machine learning features.

BigQuery is like a common staging area for data analytics workloads. When your data is there, business analysts, BI developers, data scientists, and machine learning engineers can be granted access to your data for their own insights. BigQuery outputs usually feed into two buckets: business intelligence tools, and AI and ML tools.

Note that inconsistency might result from saving and processing data separately. To avoid that risk, consider using Dataflow to build a streaming data pipeline into BigQuery.

BigQuery is optimized for running analytical queries over large data sets. By default, BigQuery runs interactive queries, which means that the queries are executed as needed. BigQuery also offers batch queries where each query is queued on your behalf and the query starts when idle resources are available.

BigQuery ML

Now you can create and execute machine learning models on your structured data sets in BigQuery in just a few minutes using SQL queries. BigQuery ML was designed to be simple, like building a model in two steps:

Step 1: Create a model with a SQL statement

CREATE MODEL some.model
OPTIONS( model_type = 'logistic_reg',
         input_label_cols = ['a_col']
) AS ...

Step 2: Write a SQL prediction query and invoke ML.PREDICT:

SELECT *
FROM ML.PREDICT( MODEL `some.model`, ...

You now have a model and can view the results.

BigQuery ML supports both supervised and unsupervised models.

Supervised models	Task-driven: they work toward an identified goal. Logistic regression (classification), linear regression, …
Unsupervised models	Data-driven: they identify a pattern. Cluster analysis, association, dimensionality reduction, …

In addition to providing different types of machine learning models, BigQuery ML supports features to deploy, monitor, and manage ML in production, known as MLOps. These include importing and exporting models, hyperparameter tuning, and more.

There are a few key phases for a machine learning project:

Extract, transform and load data into BigQuery
Select and preprocess features
Create the model inside BigQuery
Evaluate the performance of the trained model
Use the model to make predictions
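
The five phases can be walked through end to end with a toy example. The sketch below is plain Python, not BigQuery ML: the data is made up, the model is a hand-written logistic regression fitted by gradient descent, and the learning rate and iteration count are arbitrary assumptions.

```python
import math

# 1. Extract, transform, and load: a made-up table of (x, label) rows.
rows = [(x, 1 if x >= 5 else 0) for x in range(10)]

# 2. Select and preprocess features: center the single feature around zero.
mean = sum(x for x, _ in rows) / len(rows)
features = [x - mean for x, _ in rows]
labels = [y for _, y in rows]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# 3. Create the model: logistic regression fitted by plain gradient descent.
weight, bias, learning_rate = 0.0, 0.0, 0.1
for _ in range(1000):
    errors = [sigmoid(weight * f + bias) - y for f, y in zip(features, labels)]
    weight -= learning_rate * sum(e * f for e, f in zip(errors, features)) / len(rows)
    bias -= learning_rate * sum(errors) / len(rows)


def predict(x: float) -> int:
    """Return the predicted class (0 or 1) for a raw, uncentered input."""
    return 1 if sigmoid(weight * (x - mean) + bias) >= 0.5 else 0


# 4. Evaluate the trained model (on the training rows, for brevity only).
accuracy = sum(predict(x) == y for x, y in rows) / len(rows)

# 5. Use the model to make predictions on new inputs.
new_predictions = [predict(2), predict(8)]
```

In a real project the evaluation in phase 4 would use held-out data rather than the training rows; the shortcut here only keeps the sketch short.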

Machine Learning Options

Google Cloud offers four options for building machine learning models:

BigQuery ML	Use SQL queries to create and execute machine learning models in BigQuery.
Pre-built APIs	Leverage machine learning models that have already been built and trained by Google.
AutoML	A no-code solution. You can build your own machine learning models on Vertex AI through a point-and-click interface.
Custom training	Code your own machine learning environment, training, and deployment, which gives you flexibility and full control over the entire process.

Pre-built APIs

Pre-built APIs are offered as services. In many cases, they can act as building blocks to create the application you want without the expense or complexity of creating your own models. They save the time and effort of building and curating a new dataset and training a model on it, so you can quickly move to predictions. A shortlist is:

Speech-to-Text API
Cloud Natural Language API
Cloud Translation API
Text-to-Speech API
Vision API
Video Intelligence API

AutoML

AutoML is short for automated machine learning. Training and deploying ML models can be extremely time-consuming, because you need to repeatedly add new data and features, try different models, and tune parameters to achieve the best result. The goal of AutoML is to automate machine learning so that data scientists don’t have to start the process from scratch.

For AutoML, two technologies are vital:

Transfer learning	A powerful technique that lets people with smaller datasets or less computational power achieve state-of-the-art results by taking advantage of models that have been pre-trained on similar, larger datasets.
Neural architecture search	The goal of neural architecture search is to find the optimal model for the relevant project. The AutoML platform trains and evaluates multiple models and compares them to each other. This search produces an ensemble of ML models and chooses the best one.
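
The "train several candidates, compare, keep the best" idea can be shown with a heavily simplified stand-in. This is not neural architecture search and not how AutoML is implemented: the candidates here are hypothetical threshold classifiers and the validation data is made up. Only the compare-and-select step is illustrated.

```python
# Made-up validation set of (feature, label) pairs.
validation = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)]

# Hypothetical candidate "models": each classifies by a different threshold.
candidates = {
    "threshold_2": lambda x: int(x >= 2),
    "threshold_4": lambda x: int(x >= 4),
    "threshold_6": lambda x: int(x >= 6),
}


def accuracy(model, data) -> float:
    """Fraction of validation examples the candidate classifies correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


# Evaluate every candidate on the same data, then keep the best scorer.
scores = {name: accuracy(model, validation) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
```

With this data, `best_name` is `"threshold_4"`, the only candidate that classifies every validation example correctly.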

Leveraging these technologies has produced a tool that can significantly benefit data scientists, who can train custom ML models with minimal effort and little machine learning expertise. It allows data scientists to focus on defining business problems and evaluating / improving model results.

Others might find AutoML useful as a tool to quickly prototype models and explore new datasets before investing in development. This might mean using it to identify the best features in a dataset.

Custom Training

If you want to code your machine learning model, you can build a custom training solution with Vertex AI Workbench. Workbench is a single development environment for the entire data science workflow, from exploring data to training and then deploying a machine learning model with code. There are two options, a pre-built container or a custom container:

Pre-built container	Use this if your ML training needs a platform like TensorFlow, PyTorch, scikit-learn, or XGBoost, and Python code to work with the platform.
Custom container	You define the exact tools you need to complete the job.

Vertex AI

Vertex AI is a unified platform that brings all the components of the machine learning ecosystem and workflow together. “A unified platform” means having one digital experience to create, deploy, and manage models over time and at scale.

Vertex AI allows users to build machine learning models with either

AutoML – a codeless solution, or
Custom training – a code-based solution.

With traditional programming, a computer can only follow the algorithms that a human has set up. With machine learning, you feed a machine a large amount of data along with answers that you would expect a model to conclude from the data. From there, you expect the machine to learn from the provided data and examples to solve the puzzle on its own.

There are three key stages to this learning process:

Data preparation
- Data uploading (image, tabular, text, or video)
- Feature engineering (a feature is a factor that contributes to the prediction; it corresponds to an independent variable in statistics or a column in a table)
Model training and evaluation
- Evaluation metrics (confusion matrix, recall, precision, feature importance …)
Model serving
- Deployment (endpoint, batch prediction, offline prediction)
- Monitoring
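
The evaluation metrics named above are simple to compute by hand for a binary classifier. The sketch below uses made-up labels and predictions, and the helper name is hypothetical; Vertex AI reports these metrics for you, so this only shows what the numbers mean.

```python
from collections import Counter


def confusion_counts(actual: list[int], predicted: list[int]) -> Counter:
    """Count true/false positives/negatives for binary labels (1 = positive)."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1   # true positive
        elif a == 0 and p == 1:
            counts["fp"] += 1   # false positive
        elif a == 1 and p == 0:
            counts["fn"] += 1   # false negative
        else:
            counts["tn"] += 1   # true negative
    return counts


actual = [1, 1, 1, 0, 0, 0, 1, 0]      # made-up ground truth
predicted = [1, 0, 1, 0, 1, 0, 1, 0]   # made-up model output
c = confusion_counts(actual, predicted)

# Precision: of everything predicted positive, how much was really positive?
precision = c["tp"] / (c["tp"] + c["fp"])
# Recall: of everything really positive, how much did the model find?
recall = c["tp"] / (c["tp"] + c["fn"])
```

For this sample the confusion matrix holds 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so precision and recall are both 0.75.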

Vertex AI provides many features to support the ML workflow. Examples include:

Feature Store	A centralized repository for organizing, storing, and serving features to feed your training models. It aggregates all the different features from different sources.
Vizier	Helps you tune hyperparameters in complex machine learning models.
Explainable AI	Helps you interpret training performance.
Pipelines	Helps you monitor the ML production line.

My Certificate

For more on Google Cloud Big Data and Machine Learning Fundamentals, please refer to the wonderful course here https://www.coursera.org/learn/gcp-big-data-ml-fundamentals/

My #108 certificate from Coursera

I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai