Google Cloud Dataflow

Google Cloud Dataflow is a managed Apache Beam implementation that offers a high-level API for writing data pipelines (ETLs) and performing stream processing.

It ingests data from several sources in streaming or batch mode, processes it, and then delivers the results to analysis tools.
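As an illustration, here is a minimal sketch of a batch pipeline written with the Apache Beam Python SDK, the kind of pipeline Dataflow executes; the bucket paths are placeholders, not real resources.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions()  # defaults to the local DirectRunner
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")    # ingest
            | "Split" >> beam.FlatMap(lambda line: line.split())              # transform
            | "Count" >> beam.combiners.Count.PerElement()                    # aggregate
            | "Format" >> beam.MapTuple(lambda word, n: f"{word},{n}")
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts")  # deliver
        )


if __name__ == "__main__":
    run()
```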

Features

  • Automated resource management.
  • Dynamic Work rebalancing.
  • Reliable and consistent processing.
  • Horizontal autoscaling.
  • Unified Programming Model.
  • Community-driven Innovation.
  • Shuffling of data while it is being processed.
  • Running locally or on GCP (see the sketch after this list).
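
The last point is worth a concrete sketch: with the Beam Python SDK, the same pipeline can target the local DirectRunner or the DataflowRunner on GCP simply by changing pipeline options. The project, region and bucket values below are placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Run locally while developing.
local_options = PipelineOptions(runner="DirectRunner")

# Run the same pipeline on Google Cloud Dataflow (placeholder values).
gcp_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="europe-west1",
    temp_location="gs://my-bucket/temp",
)
```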

Job

To create a job, you must specify a name, region, job template, input file path, output file path, encryption key, and temporary location path.

You can additionally specify the zone, maximum number of worker nodes, network and subnetwork, machine type, and the email of the service account that will run the job.
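As a rough illustration, these job parameters map onto Dataflow pipeline options in the Beam Python SDK roughly as follows; exact option names can vary slightly across SDK versions, all project, bucket, key and service-account names are hypothetical, and the input/output file paths would normally be passed as the pipeline's own parameters.

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    job_name="my-etl-job",                                    # job name
    project="my-gcp-project",
    region="europe-west1",                                    # region
    temp_location="gs://my-bucket/temp",                      # temporary location
    dataflow_kms_key=(
        "projects/my-gcp-project/locations/europe-west1/"
        "keyRings/my-ring/cryptoKeys/my-key"),                # encryption key
    max_num_workers=10,                                       # maximum number of worker nodes
    machine_type="n1-standard-2",                             # machine type
    network="my-vpc",                                         # network
    subnetwork="regions/europe-west1/subnetworks/my-subnet",  # subnetwork
    service_account_email=(
        "worker-sa@my-gcp-project.iam.gserviceaccount.com"),  # service account that runs the job
)
```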

Quotas and Limits

Quotas

  • Organization:
    • Can run a maximum of 125 concurrent jobs.
  • User:
    • Can make a maximum of 3M requests per minute.
    • Can make a maximum of 15K monitoring requests per minute.
  • Project:
    • Can run a maximum of 25 concurrent jobs.
    • Can get a maximum of 160 shuffle slots (enough to shuffle approximately 50 TB of data concurrently).
    • Can get a maximum of 60 GB per minute per cloud region for sending data from Compute Engine to Streaming Engine.
  • Job:
    • Maximum of 1,000 Compute Engine instances.

Limits

  • Maximum number of workers per pipeline is 1K.
  • Maximum size of a job creation request is 10MB.
  • Maximum number of side input shards is 20K.
  • Maximum size for a single element value in Streaming Engine is 100MB.

IAM

Cloud Dataflow uses two service accounts:

  • Dataflow service account: used during job creation and execution to manage the job.
  • Controller service account: used by worker instances to access input and output resources.

Cloud Dataflow roles can be set on projects and organizations.

Pricing

Cloud Dataflow jobs are billed in per-second increments, based on the actual use of batch or streaming workers.

Fees are different for batch and streaming.

Use Cases

  • Fraud detection in financial services.
  • IoT analytics in manufacturing, healthcare and logistics.

Pipeline

An input is read into a PCollection, which can be transformed any number of times before being written to an output. A transformation is either a PTransform (PCollection to PCollection) or an IO transform (reading from an input source or writing to an output sink).
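
A short sketch of this shape with the Beam Python SDK (resource names are placeholders): an IO transform reads the input into a PCollection, user PTransforms reshape it, and a final IO transform writes the output.

```python
import apache_beam as beam


# A composite PTransform: takes a PCollection, returns a PCollection.
class CleanAndFilter(beam.PTransform):
    def expand(self, pcoll):
        return (
            pcoll
            | "Strip" >> beam.Map(str.strip)
            | "DropEmpty" >> beam.Filter(bool)
        )


with beam.Pipeline() as p:
    # IO transform: input -> PCollection.
    records = p | "ReadInput" >> beam.io.ReadFromText("gs://my-bucket/in.txt")
    # PTransforms: PCollection -> PCollection.
    cleaned = records | CleanAndFilter()
    upper = cleaned | "Upper" >> beam.Map(str.upper)
    # IO transform: PCollection -> output.
    upper | "WriteOutput" >> beam.io.WriteToText("gs://my-bucket/out")
```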
