Google Cloud Dataproc

From Luis Gallego Hurtado - Not Another IT guy

Google Cloud Dataproc is a fully managed service for running Apache Hadoop and Apache Spark clusters. It lets you process large datasets with the open tools of the Apache big data ecosystem.


  • Automatic Cluster Management
  • Resizable Clusters
  • Integrated with other GCP services
  • Image Versioning
  • Highly Available
  • Developer Tools
  • Initialization Actions
  • Automatic or Manual Configuration
  • Flexible Virtual Machines
  • Custom Images for cluster
  • Dataproc workflows
  • Autoscaling

Hadoop Cluster

A cluster can be standard (1 master and N workers), single node (1 master and 0 workers) or highly available (3 masters and N workers).
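The three cluster modes above can be sketched as instance counts. This is a minimal illustration, not a client call: the function name `cluster_topology` is hypothetical, the field names are assumed to follow the Dataproc cluster resource (`master_config`, `worker_config`, `num_instances`), and the single-node property is assumed to be `dataproc:dataproc.allow.zero.workers`.

```python
def cluster_topology(mode: str, workers: int = 2) -> dict:
    """Return master/worker instance counts for a cluster mode (plain dict, no API call)."""
    if mode == "standard":
        masters, primary_workers = 1, workers
    elif mode == "single-node":
        masters, primary_workers = 1, 0
    elif mode == "high-availability":
        masters, primary_workers = 3, workers
    else:
        raise ValueError(f"unknown cluster mode: {mode!r}")

    config = {
        "master_config": {"num_instances": masters},
        "worker_config": {"num_instances": primary_workers},
    }
    if mode == "single-node":
        # Assumption: a single-node cluster is requested through this cluster property.
        config["software_config"] = {
            "properties": {"dataproc:dataproc.allow.zero.workers": "true"}
        }
    return config


for mode in ("standard", "single-node", "high-availability"):
    print(mode, cluster_topology(mode, workers=4))
```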

Master nodes

Each master node contains the YARN Resource Manager, HDFS NameNode and all job drivers.

Worker nodes

Each worker node contains a YARN NodeManager and an HDFS DataNode. The HDFS replication factor is 2 by default.

Additionally, you can set up secondary worker nodes, which are nodes that do not run HDFS. Secondary worker VMs are preemptible by default.
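As a rough worked example of what the replication factor and secondary workers mean for storage, the sketch below estimates usable HDFS capacity. The function name `usable_hdfs_gb` and the disk sizes are hypothetical, and the estimate ignores space reserved for the operating system and for non-HDFS use.

```python
def usable_hdfs_gb(
    primary_workers: int,
    disk_gb_per_worker: float,
    replication: int = 2,
    secondary_workers: int = 0,
) -> float:
    """Rough usable HDFS capacity in GB."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    # secondary_workers is intentionally unused: those nodes do not run HDFS,
    # so they add compute capacity but no HDFS storage.
    raw_gb = primary_workers * disk_gb_per_worker
    # Every block is stored `replication` times, so usable space shrinks accordingly.
    return raw_gb / replication


print(usable_hdfs_gb(4, 500))                        # 1000.0
print(usable_hdfs_gb(4, 500, secondary_workers=10))  # 1000.0
```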

Creating a cluster

After creating an autoscaling policy, you can create or update a cluster that uses that policy.

When creating the cluster, you can configure the master and worker nodes (as in GCE), customize the cluster and manage security. You can also schedule deletion of the cluster.

You can also select optional components and choose a specific image version, or a custom image with pre-installed packages.
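The creation options described above can be sketched as a plain dictionary. Nothing here is sent to GCP: the helper `build_cluster`, the project ID, the cluster name and the policy path are hypothetical, and the field names (`autoscaling_config.policy_uri`, `lifecycle_config.idle_delete_ttl`, `software_config.optional_components`) are assumed to match the Dataproc cluster resource.

```python
from typing import Optional, Sequence


def build_cluster(
    project_id: str,
    cluster_name: str,
    autoscaling_policy_uri: Optional[str] = None,
    idle_delete_seconds: Optional[int] = None,
    image_version: Optional[str] = None,
    optional_components: Sequence[str] = (),
) -> dict:
    """Assemble a cluster description as a plain dict (no API call is made)."""
    config: dict = {}
    if autoscaling_policy_uri:
        # Attach a previously created autoscaling policy.
        config["autoscaling_config"] = {"policy_uri": autoscaling_policy_uri}
    if idle_delete_seconds:
        # Scheduled deletion: remove the cluster after this much idle time.
        config["lifecycle_config"] = {"idle_delete_ttl": {"seconds": idle_delete_seconds}}
    software: dict = {}
    if image_version:
        software["image_version"] = image_version
    if optional_components:
        software["optional_components"] = list(optional_components)
    if software:
        config["software_config"] = software
    return {"project_id": project_id, "cluster_name": cluster_name, "config": config}


cluster = build_cluster(
    "my-project",   # hypothetical project ID
    "etl-cluster",  # hypothetical cluster name
    autoscaling_policy_uri=(
        "projects/my-project/locations/europe-west1/autoscalingPolicies/my-policy"
    ),
    idle_delete_seconds=3600,
    optional_components=["JUPYTER"],
)
print(cluster)
```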


Google Cloud Dataproc runs jobs of several types:

  • Hadoop
  • Spark
  • SparkR
  • PySpark
  • Hive
  • Spark SQL
  • Pig

When creating a job, you specify the region, the cluster, the job type, the jar files, the main class to run (or the jar that contains the main class), the arguments and the maximum number of restarts per hour.
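The job parameters listed above can be sketched as a request dictionary for a Spark job. This is an illustration only: the helper `build_spark_job_request`, the bucket path and the class name are hypothetical, and the field names (`placement.cluster_name`, `spark_job.main_class`, `spark_job.jar_file_uris`, `scheduling.max_failures_per_hour`) are assumed to match the Dataproc job resource.

```python
from typing import Sequence


def build_spark_job_request(
    project_id: str,
    region: str,
    cluster_name: str,
    main_class: str,
    jar_file_uris: Sequence[str],
    args: Sequence[str] = (),
    max_failures_per_hour: int = 0,
) -> dict:
    """Assemble a Spark job submission as a plain dict (no API call is made)."""
    job = {
        "placement": {"cluster_name": cluster_name},
        "spark_job": {
            "main_class": main_class,
            "jar_file_uris": list(jar_file_uris),
            "args": list(args),
        },
    }
    if max_failures_per_hour:
        # Allows the driver to be restarted up to this many times per hour.
        job["scheduling"] = {"max_failures_per_hour": max_failures_per_hour}
    return {"project_id": project_id, "region": region, "job": job}


request = build_spark_job_request(
    "my-project",                        # hypothetical project ID
    "europe-west1",
    "etl-cluster",                       # hypothetical cluster name
    "com.example.WordCount",             # hypothetical main class
    ["gs://my-bucket/jars/wordcount.jar"],  # hypothetical jar location
    args=["gs://my-bucket/input", "gs://my-bucket/output"],
    max_failures_per_hour=2,
)
print(request)
```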


A workflow is an operation that runs a Directed Acyclic Graph (DAG) of jobs on a cluster.

Workflows are ideal for complex job flows.

Workflows can be managed and executed with the Cloud Dataproc Workflow Templates API.

A workflow can create a cluster, run jobs on it and then delete the cluster.
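To illustrate what running a DAG of jobs implies, the sketch below derives a valid execution order from each job's prerequisites using the standard library `graphlib` module (Python 3.9+). The step names are hypothetical, and the `step_id` / `prerequisite_step_ids` keys are assumed to mirror how workflow jobs declare dependencies. The service does its own scheduling, so this only shows the ordering constraint.

```python
from graphlib import TopologicalSorter


def run_order(jobs: list) -> list:
    """Return step IDs in an order that respects every prerequisite."""
    # Map each step to the set of steps that must finish before it starts.
    graph = {
        job["step_id"]: set(job.get("prerequisite_step_ids", []))
        for job in jobs
    }
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


jobs = [
    {"step_id": "report", "prerequisite_step_ids": ["aggregate"]},
    {"step_id": "aggregate", "prerequisite_step_ids": ["clean"]},
    {"step_id": "clean", "prerequisite_step_ids": ["ingest"]},
    {"step_id": "ingest"},
]
print(run_order(jobs))  # ['ingest', 'clean', 'aggregate', 'report']
```

Jobs with no dependency between them can appear in either order, which is what allows independent branches of a workflow to run concurrently.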

Workflow Templates

Workflows are defined with Workflow Templates. There are different types of workflow templates:

  • Managed cluster: the cluster will be created, the jobs will be run and then the cluster will be deleted.
  • Cluster selector: the jobs run on an existing cluster.
  • Parameterized: the template declares parameters whose values are supplied when it is run, so the jobs depend on those values.
  • Inline: the template is supplied inline at execution time, for example from a CLI.
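The difference between the managed cluster and cluster selector types comes down to where the template places its jobs. The sketch below builds both variants as plain dictionaries. The helper `workflow_template`, the template IDs and the labels are hypothetical, and the `placement.managed_cluster` and `placement.cluster_selector.cluster_labels` field names are assumed to match the workflow template resource.

```python
from typing import Optional


def workflow_template(
    template_id: str,
    jobs: list,
    *,
    managed_cluster_name: Optional[str] = None,
    cluster_labels: Optional[dict] = None,
) -> dict:
    """Build a template dict with exactly one kind of cluster placement."""
    if (managed_cluster_name is None) == (cluster_labels is None):
        raise ValueError("choose exactly one of managed_cluster_name or cluster_labels")
    if managed_cluster_name is not None:
        # Managed cluster: created for the workflow and deleted afterwards.
        placement = {
            "managed_cluster": {"cluster_name": managed_cluster_name, "config": {}}
        }
    else:
        # Cluster selector: run on an existing cluster that carries these labels.
        placement = {"cluster_selector": {"cluster_labels": dict(cluster_labels)}}
    return {"id": template_id, "placement": placement, "jobs": list(jobs)}


steps = [{"step_id": "ingest"}]  # hypothetical single-step workflow
ephemeral = workflow_template("nightly-etl", steps, managed_cluster_name="nightly-etl-cluster")
existing = workflow_template("adhoc-etl", steps, cluster_labels={"env": "staging"})
print(ephemeral["placement"])
print(existing["placement"])
```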

Use Cases

  • Automation
  • Transactional, fire-and-forget API interaction model.
  • Ephemeral and long-lived clusters.
  • Granular IAM security.

Storage Options

For input and output data:

  • HDFS File Systems
  • Google Cloud Storage

For internal data shuffle:

  • Persistent Disks.
  • Local SSDs: their data is lost when nodes are recreated.

Best Practices

  • Separate data and compute.
  • Save cost by deleting clusters when you are not running jobs. Distinguish ephemeral clusters (created to run some jobs) from long-lived clusters.
  • Create and delete clusters often.
  • Use Jobs APIs.


There are roles and permissions for Cloud Dataproc API User (End User identity), Cloud Dataproc Service Agent (Control Plane identity) and Cloud Dataproc VM Service Account (Data Plane identity).

Use Cases

  • ETL jobs (Extract Transform Load).
  • Batch jobs.
  • Analytics jobs, including Machine Learning.