Google Cloud Dataproc

Google Cloud Dataproc is a fully managed Hadoop and Spark cluster service that lets you process large datasets using the open tools of the Apache big data ecosystem.

Features

Automatic Cluster Management
Resizable Clusters
Integrated with other GCP services
Image Versioning
Highly Available
Developer Tools
Initialization Actions
Automatic or Manual Configuration
Flexible Virtual Machines
Custom Images for clusters
Dataproc workflows
Autoscaling

Hadoop Cluster

A cluster can be standard (1 master and N workers), single node (1 master and 0 workers), or highly available (3 masters and N workers).
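The three layouts differ only in their instance counts. A minimal sketch in Python, assuming the cluster config uses the field names `master_config`, `worker_config` and `num_instances`; the helper `node_counts` and the mode names are hypothetical and exist only for illustration:

```python
# Instance counts for the three cluster layouts.
# `node_counts` and the mode strings are hypothetical, not part of any Dataproc library.

def node_counts(mode: str, workers: int = 2) -> dict:
    """Return master and worker instance counts for a cluster layout."""
    if mode == "single-node":
        # 1 master, 0 workers: everything runs on one VM.
        masters, workers = 1, 0
    elif mode == "standard":
        masters = 1
    elif mode == "high-availability":
        masters = 3
    else:
        raise ValueError(f"unknown cluster mode: {mode}")

    if mode != "single-node" and workers < 1:
        raise ValueError("standard and highly available clusters need workers")

    return {
        "master_config": {"num_instances": masters},
        "worker_config": {"num_instances": workers},
    }


for mode in ("single-node", "standard", "high-availability"):
    print(mode, node_counts(mode, workers=4))
```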

Master nodes

Each master node runs the YARN ResourceManager, the HDFS NameNode and all job drivers.

Worker nodes

Each worker node runs a YARN NodeManager and an HDFS DataNode. The default HDFS replication factor is 2.

Additionally, you can set up secondary worker nodes, which are nodes that do not run HDFS. Secondary worker VMs are preemptible by default.
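The replication factor and the absence of HDFS on secondary workers both affect how much data a cluster can hold. A rough back-of-the-envelope sketch, assuming every primary worker dedicates the same disk size to HDFS and ignoring operating system and other overhead; `usable_hdfs_gb` is a hypothetical helper:

```python
# Rough HDFS capacity estimate. This is a simplification: real capacity is
# lower because disks also hold the OS, logs and shuffle data.

HDFS_REPLICATION = 2  # replication factor stated in the text


def usable_hdfs_gb(primary_workers: int, disk_gb_per_worker: float,
                   secondary_workers: int = 0) -> float:
    """Estimate usable HDFS capacity in GB."""
    # Secondary workers do not run an HDFS DataNode, so they add compute
    # capacity only and are deliberately left out of the calculation.
    raw_gb = primary_workers * disk_gb_per_worker
    return raw_gb / HDFS_REPLICATION


# 4 primary workers with 500 GB each, plus 10 secondary workers.
print(usable_hdfs_gb(4, 500, secondary_workers=10))  # 1000.0
```

Adding secondary workers leaves the result unchanged, which is one reason to keep long-lived data in Cloud Storage rather than in HDFS.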

Creating a cluster

After creating an autoscaling policy, you can create or update a cluster that uses that policy.

When creating the cluster, you can configure the master and worker nodes (as in GCE), customize the cluster and manage security. You can also schedule the deletion of the cluster.

You can also select optional components and choose a specific image version, or a custom image with pre-installed packages.
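These options map onto flags of `gcloud dataproc clusters create`. The sketch below only assembles the command line and does not run it; the function `create_cluster_command` is hypothetical, and the cluster name, region, image version, component and idle timeout are placeholder values that should be checked against the current gcloud reference:

```python
import shlex

# Builds (but does not execute) a `gcloud dataproc clusters create` command.
# All argument values used below are placeholders.


def create_cluster_command(name, region, num_workers, image_version=None,
                           optional_components=(), autoscaling_policy=None,
                           max_idle=None):
    """Return the argv list for creating a cluster with the given options."""
    cmd = [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        f"--num-workers={num_workers}",
    ]
    if image_version:
        cmd.append(f"--image-version={image_version}")
    if optional_components:
        cmd.append("--optional-components=" + ",".join(optional_components))
    if autoscaling_policy:
        # The policy must already exist before the cluster references it.
        cmd.append(f"--autoscaling-policy={autoscaling_policy}")
    if max_idle:
        # Scheduled deletion: delete the cluster after this much idle time.
        cmd.append(f"--max-idle={max_idle}")
    return cmd


command = create_cluster_command(
    "example-cluster", "us-central1", 2,
    image_version="2.1-debian11",
    optional_components=["JUPYTER"],
    max_idle="30m",
)
print(shlex.join(command))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.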

Jobs

Google Cloud Dataproc runs jobs of several types:

Hadoop
Spark
SparkR
PySpark
Hive
Spark SQL
Pig

When creating a job, you specify the region, the cluster, the job type, the jar files, the main class to run (or the jar containing the main class), the arguments and the maximum number of restarts per hour.
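The fields listed above can be pictured as a job submission request. A minimal sketch that builds the request as a plain dictionary, assuming the field names of the Dataproc v1 API as commonly documented (`placement`, `spark_job`, `scheduling`, `max_failures_per_hour`); the helper `spark_job_request` and every value passed to it are hypothetical, and nothing is sent to any service:

```python
# Builds a Spark job submission request as a plain dict. No API call is made.
# Field names are assumed from the Dataproc v1 API; verify before use.


def spark_job_request(project_id, region, cluster_name, main_class,
                      jar_uris, args, max_restarts_per_hour):
    """Return a request body describing one Spark job."""
    return {
        "project_id": project_id,
        "region": region,
        "job": {
            # Which cluster runs the job.
            "placement": {"cluster_name": cluster_name},
            # Job type and its payload: main class, jars and arguments.
            "spark_job": {
                "main_class": main_class,
                "jar_file_uris": list(jar_uris),
                "args": list(args),
            },
            # How many times the driver may be restarted per hour.
            "scheduling": {"max_failures_per_hour": max_restarts_per_hour},
        },
    }


request = spark_job_request(
    project_id="example-project",
    region="us-central1",
    cluster_name="example-cluster",
    main_class="com.example.WordCount",
    jar_uris=["gs://example-bucket/wordcount.jar"],
    args=["gs://example-bucket/input/", "gs://example-bucket/output/"],
    max_restarts_per_hour=3,
)
print(request["job"]["spark_job"]["main_class"])
```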

Workflows

A workflow is an operation that runs a Directed Acyclic Graph (DAG) of jobs on a cluster.

Workflows are ideal for complex job flows.

Workflows are managed and executed with the Cloud Dataproc Workflow Templates API.

A workflow can create a cluster, run jobs on it and delete the cluster.
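To make the DAG idea concrete, the sketch below derives a valid execution order from job dependencies using Python's standard `graphlib` module. The keys `step_id` and `prerequisite_step_ids` follow the naming assumed for workflow template jobs; the step names and the helper `execution_order` are hypothetical, and the ordering is computed locally rather than by Dataproc:

```python
from graphlib import TopologicalSorter

# Four hypothetical jobs: "aggregate" and "report" both depend on "clean",
# which in turn depends on "ingest".
jobs = [
    {"step_id": "ingest", "prerequisite_step_ids": []},
    {"step_id": "clean", "prerequisite_step_ids": ["ingest"]},
    {"step_id": "aggregate", "prerequisite_step_ids": ["clean"]},
    {"step_id": "report", "prerequisite_step_ids": ["clean"]},
]


def execution_order(jobs):
    """Return step ids in an order that respects every prerequisite.

    Raises graphlib.CycleError if the dependencies are not acyclic.
    """
    graph = {job["step_id"]: set(job["prerequisite_step_ids"]) for job in jobs}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(jobs)
print(order)
```

Any order in which each step appears after its prerequisites is valid, so the relative position of `aggregate` and `report` is not fixed.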

Workflow Templates

A workflow template is a reusable workflow definition. There are different types of workflow templates:

Managed cluster: the cluster is created, the jobs are run and then the cluster is deleted.
Cluster selector: the jobs are run on an existing cluster.
Parameterized: the template declares parameters whose values are supplied at run time, so the same template can be run repeatedly with different values.
Inline: the template is defined in a file or request and instantiated directly, for example from the CLI, without being stored first.
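The difference between the managed cluster and cluster selector types comes down to the template's placement section. A minimal sketch using plain dictionaries, assuming the field names `managed_cluster`, `cluster_selector` and `cluster_labels` from the Dataproc v1 API; the helper functions, template ids, label and bucket path are hypothetical:

```python
# Two placements for the same job list. Nothing here calls the Dataproc API.


def managed_cluster_placement(cluster_name, num_workers):
    """Placement that creates a cluster for the run and deletes it afterwards."""
    return {
        "managed_cluster": {
            "cluster_name": cluster_name,
            "config": {"worker_config": {"num_instances": num_workers}},
        }
    }


def cluster_selector_placement(labels):
    """Placement that picks an existing cluster by its labels."""
    return {"cluster_selector": {"cluster_labels": dict(labels)}}


def workflow_template(template_id, placement, jobs):
    """Assemble a template body from a placement and a list of jobs."""
    return {"id": template_id, "placement": placement, "jobs": list(jobs)}


jobs = [{
    "step_id": "count",
    "pyspark_job": {"main_python_file_uri": "gs://example-bucket/count.py"},
}]

ephemeral = workflow_template(
    "nightly-ephemeral", managed_cluster_placement("nightly", 2), jobs)
existing = workflow_template(
    "nightly-existing", cluster_selector_placement({"env": "prod"}), jobs)

print(list(ephemeral["placement"]), list(existing["placement"]))
```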

Use Cases

Automation
Transactional, fire-and-forget API interaction model.
Ephemeral and long-lived clusters.
Granular IAM security.

Storage Options

For input and output data:

HDFS File Systems
Google Cloud Storage

For internal data shuffle:

Persistent Disks.
Local SSDs: their data is deleted when nodes are recreated.

Best Practices

Separate data and compute.
Save costs by deleting clusters when you are not running jobs. Clusters can be ephemeral (created to run some jobs) or long-lived.
Create and delete clusters often.
Use the Jobs API.

IAM

There are roles and permissions for Cloud Dataproc API User (End User identity), Cloud Dataproc Service Agent (Control Plane identity) and Cloud Dataproc VM Service Account (Data Plane identity).

Use Cases

ETL jobs (Extract Transform Load).
Batch jobs.
Analytics jobs, including Machine Learning.
