Google Cloud Bigtable

A scalable, fully-managed NoSQL wide-column database that is suitable for both real-time access and analytics workloads.

Good for low-latency read/write access and high-throughput analytics, with native time series support.

Features

  • Ideal for storing very large amounts of single-keyed data.
  • Supports high read and write throughput at very low latency.
  • Ideal data source for MapReduce operations.
  • Compresses your data automatically using an intelligent algorithm.
  • Native time series support.
  • Advantages over a self-managed HBase installation:
    • Incredible scalability
    • Simple administration
    • Cluster resizing without downtime

Architecture

Cloud Bigtable does not support multi-row transactions, but single-row operations are transactional and atomic. Unlike Cloud Firestore, it supports neither transactions nor secondary indexes.
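
As a minimal sketch of single-row atomicity with the Python client (google-cloud-bigtable); the project, instance and table names are placeholders:

from google.cloud import bigtable

# Hypothetical names, for illustration only.
client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

# All mutations to one row are committed as a single atomic operation;
# there is no way to commit mutations across several rows atomically.
row = table.direct_row(b"user#1234")
row.set_cell("stats", b"visits", b"42")
row.set_cell("stats", b"last_seen", b"2020-01-01")
row.commit()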

Instance

An instance is a container for clusters and nodes.

  • Tables belong to instances (not to clusters nor nodes).

Cluster

A cluster represents the actual Bigtable service and is located in a single zone.

  • An instance can have up to 2 clusters.
  • Storage type can be SSD (lower latency for real-time serving) or HDD (higher latency for analytics or batch processing).
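
As a hedged sketch, creating an instance with a single SSD cluster using the Python admin client (all names are placeholders):

from google.cloud import bigtable
from google.cloud.bigtable import enums

client = bigtable.Client(project="my-project", admin=True)

# A production instance with one SSD cluster; the cluster lives in a single zone.
instance = client.instance("my-instance", instance_type=enums.Instance.Type.PRODUCTION)
cluster = instance.cluster(
    "my-cluster",
    location_id="us-central1-b",                 # single zone
    serve_nodes=3,                               # production minimum
    default_storage_type=enums.StorageType.SSD,  # or enums.StorageType.HDD
)
instance.create(clusters=[cluster]).result(timeout=120)  # wait for the long-running operation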

Nodes

  • A client request goes through a front-end server before it is sent to a node; the front-end server acts as a load-balancing proxy for connections.
  • In a production instance, there are 3 or more nodes.
  • Each Cloud Bigtable zone is managed by a master process, which balances workload and data volume between nodes.
  • Nodes are separate from the actual storage, so no data is lost when a node fails; nodes store only metadata.
  • Adding nodes increases storage, throughput and rows read per second. Throughput scales linearly: each additional node adds roughly 10,000 queries per second.
  • Multiple nodes map to data based on the metadata.

Tablets

  • A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries.
  • Rebalancing tablets from one node to another is very fast, because data is not copied.

Replication

Replication is achieved by running 2 clusters in the same Cloud Bigtable instance (see the sketch after the list below).

Cloud Bigtable replicates the following types of changes automatically:

  • Updates to the data in existing tables
  • New and deleted tables
  • Added and removed column families
  • Changes to a column family's garbage-collection policy
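
A sketch of enabling replication by adding a second cluster to an existing instance, assuming the placeholder names used above:

from google.cloud import bigtable
from google.cloud.bigtable import enums

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")

# A second cluster in another zone turns on replication for the instance.
cluster = instance.cluster(
    "my-cluster-2",
    location_id="us-east1-b",
    serve_nodes=3,
    default_storage_type=enums.StorageType.SSD,
)
cluster.create().result(timeout=120)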

Consistency model

3 options:

  1. By default, eventual consistency.
  2. Read-your-writes consistency: groups of applications send their requests with single-cluster routing, leaving the other cluster for other purposes.
  3. Strong consistency: all applications use single-cluster routing to the same cluster, leaving the other cluster only for failover.

If a Cloud Bigtable cluster becomes unresponsive, replication makes it possible for incoming traffic to fail over to the instance's other cluster.

Data Model

NoSQL, no-joins, distributed key-value store.

  • Each table has only one index, the row key.
  • Rows are sorted lexicographically by row key.
  • All operations are atomic at the row level.
  • Keep all information for an entity in a single row.

Columns are grouped in column families.
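
A sketch of the data model in the Python client, with hypothetical names: the row key is the only index, and each cell is addressed by column family plus column qualifier:

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

# Write one cell: family "readings", qualifier "temperature".
row = table.direct_row(b"device#eu#sensor42")
row.set_cell("readings", b"temperature", b"21.5")
row.commit()

# Read the row back through its only index, the row key.
result = table.read_row(b"device#eu#sensor42")
cell = result.cells["readings"][b"temperature"][0]  # newest version first
print(cell.value, cell.timestamp)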

Cell

Unwritten cells do not take up space.

Every cell is versioned with a timestamp by default, and garbage collection retains only the latest version.

Expiration can be set at the column family level.

Periodic compaction reclaims the space used by deleted and expired cells.
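
A sketch of versioning and expiration policies when creating a column family, with placeholder names:

import datetime

from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project="my-project", admin=True)
table = client.instance("my-instance").table("events")

# Keep at most one version per cell, and expire cells older than 30 days.
gc_rule = column_family.GCRuleUnion([
    column_family.MaxVersionsGCRule(1),
    column_family.MaxAgeGCRule(datetime.timedelta(days=30)),
])
table.create(column_families={"metrics": gc_rule})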

Time Series Data

Time series data records a measurement together with the time at which it was taken.

Patterns

  • General patterns: short meaningful names
  • Patterns for row-key design:
    • Tall and narrow tables.
    • Insert new rows instead of column versions.
    • Design row keys with queries in mind.
      • Denormalization: create one table per query, composing the row key according to how the query will be performed, so that full table scans are avoided.
    • Avoid hotspotting on keys: reads and writes should ideally be distributed evenly across the row space of the table (you can check this with the heatmaps in the Key Visualizer tool).
      • Field promotion: move fields from the column data into the row key to make writes non-contiguous.
      • Salting: add an additional calculated value to the row key to make rows non-contiguous.
    • Reverse timestamps (so recent events come first), but only if the most common queries look for the latest values. A row-key sketch combining these patterns follows this list.
  • Patterns for data column design:
    • Keep related data in the same table, keep unrelated data in different tables.
    • Store related entities in adjacent rows.
    • Store items related to an entity in a single row (within cells).
    • Store data you will access in a single query in a single column family.
    • Don't lean on the atomicity of single rows; i.e. insert new rows rather than updating existing ones.
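
The row-key sketch referenced above: a hypothetical time-series key that combines field promotion and reverse timestamps, with one row per measurement (tall and narrow):

import time

LONG_MAX = 2**63 - 1

# Field promotion: the metric name and device id move into the row key.
# Reverse timestamp: recent events sort, and therefore scan, first.
def row_key(metric: str, device_id: str, event_ts: float) -> bytes:
    reverse_ts = LONG_MAX - int(event_ts * 1000)
    return "#".join([metric, device_id, str(reverse_ts)]).encode()

# Tall and narrow: each measurement becomes a new row rather than a new
# cell version in an existing row.
key = row_key("cpu_load", "device-42", time.time())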

Input Output

Data is retrieved by row key or by ranges of row keys.
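
A sketch of both access paths with the Python client (names are placeholders):

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

# Point lookup by row key.
row = table.read_row(b"cpu_load#device-42#0000000000000")

# Scan a contiguous range of row keys (end_key is exclusive).
for r in table.read_rows(start_key=b"cpu_load#device-42#", end_key=b"cpu_load#device-42$"):
    print(r.row_key)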

Application Profile

The application profile defines how a Bigtable instance handles incoming requests.

You can select single-cluster routing (manual failover, read-your-writes consistency and single-row transactions) or multi-cluster routing (automatic failover, no read-your-writes consistency and no single-row transactions).

It can enable or disable single-row transactions.
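
A sketch of both routing policies, created through the Python admin client with placeholder names:

from google.cloud import bigtable
from google.cloud.bigtable import enums

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")

# Single-cluster routing: manual failover, read-your-writes consistency,
# and single-row transactions may be enabled.
single = instance.app_profile(
    "transactional-profile",
    routing_policy_type=enums.RoutingPolicyType.SINGLE,
    cluster_id="my-cluster",
    allow_transactional_writes=True,
)
single.create(ignore_warnings=True)

# Multi-cluster routing: automatic failover, no read-your-writes
# consistency and no single-row transactions.
multi = instance.app_profile("serving-profile", routing_policy_type=enums.RoutingPolicyType.ANY)
multi.create(ignore_warnings=True)

# Data requests are then routed through a profile:
table = instance.table("my-table", app_profile_id="serving-profile")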

Performance

A cluster's performance increases linearly as you add nodes to the cluster.

Cloud Bigtable shards the data into multiple tablets, which can be moved between nodes in your Cloud Bigtable cluster.

Testing performance

  • Use at least 300 GB of data.
  • Stay below the recommended storage utilization per node.
  • Before you test, run a heavy pre-test for several minutes.
  • Run your test for at least 10 minutes.
  • Look at the Key Visualizer scans for your table.
  • Try commenting out the code that performs Cloud Bigtable reads and writes, to check whether the bottleneck is Cloud Bigtable itself or another part of your application.
  • Ensure that you're creating as few clients as possible.
  • Make sure you're reading and writing many different rows in your table.
  • Verify that you see approximately the same performance for reads and writes.

Monitoring: Key Visualizer

Key Visualizer is a tool that helps you analyze your Cloud Bigtable usage patterns.

It provides visual reports for your tables based on the row keys that you access.

  • Check hotspots on specific rows
  • Find rows that contain too much data
  • Look at whether your access patterns are balanced across all of the rows in a table

Key Visualizer automatically generates hourly and daily scans for every table whose size was over 30 GB, or where the average of all reads or all writes was at least 10,000 rows per second, during the last 24 hours.

Each scan includes a large heatmap, which shows access patterns for a group of row keys over time and the aggregate values.

Quota and limits

  • Hard limits:
    • Individual values: 100 MB per cell.
    • Single row: 256 MB.
    • Tables: 1,000 per cluster.
  • Recommended size limits:
    • Rows: below 100 MB.
    • Row keys: below 4 KB.
    • Column values: below 10 MB.
    • Column families per table: below 100.
    • Column qualifiers: less than 16 KB per qualifier.

IAM

There are permissions at the instance, cluster, table and column family levels.

Command Line Interface

cbt createtable <table-name>                                        # create a table
cbt createfamily <table-name> <family-name>                         # add a column family to a table
cbt deletefamily <table-name> <family-name>                         # delete a column family
cbt ls <table-name>                                                 # list a table's column families
cbt setgcpolicy <table-name> <family-name> maxversions=<versions>   # set the garbage-collection policy

Use Cases

  • IoT, finance, adtech
  • Personalization, recommendations
  • Monitoring
  • Geospatial datasets
  • Graphs