Google Cloud BigQuery

From Luis Gallego Hurtado - Not Another IT guy
Jump to: navigation, search


Google Cloud BigQuery is a scalable, fully-managed Enterprise Data Warehouse (EDWH) with SQL and fast response times.

BigQuery is both a storage service and a power analysis tool.

Good for OLAP workloads up to petabyte-scale, Big Data exploration and processing and reporting via Business Intelligence (BI) tools.

Features

  • Serverless, fully-managed.
  • Multicloud capabilities: BigQuery Omni (private alpha) allows you to analyze data across clouds using standard SQL.
  • Natural language processing: Data QnA (private alpha), based on Analyza, makes it easy for anyone to access the data insights they need through NLP.
  • Built-in Machine Learning and Artificial Intelligence integrations: integrations with BigQuery ML, AI Platforma and TensorFlow, through AI Platform and TensorFlow, enable you to train and execute powerful models on structured data in minutes, with just SQL.
  • It accelerates BI workloads with BI Engine, an in-memory analysis service, to achieve sub-second query response time and high concurrency for popular BI tools via standard ODBC/JDBC.
  • Real-time analytics.
  • Automatic high availability, with replicated storage in multiple locations.
  • Standard SQL, providing also ODBC and JDBC drivers.
  • Federated query and logical datawarehousing, integrating with sources in Google Cloud Storage, Google Cloud Bigtable, Google Cloud SQL and spreadsheets in Google Drive.
  • Storage API: it allows to run open source data science workloads (Spark, TensorFlow, Google Cloud Dataflow, Apache Beam, MapReduce, Pandas and scikit-learn) directly in BigQuery, prividing convergence of data wharehouse and data lake.
  • Materialized views: precomputed views that periodically cache the results of a query for increased performance and efficiency.
  • Storage and compute separation.
  • Automatic backup and easy restore: 7-day history of changes.
  • Geospatial data types and functions, with BigQuery GIS.
  • The BigQuery Data Transfer Service automatically transfers data from external data sources.
  • With Dataproc and Dataflow, BigQuery provides integration with Apache big data ecosystem, allowing Hadoop/Spark and Beam workloads to read or write data directly from BigQuery using Storage API.
  • Petabyte scale.
  • Rich monitoring, logging and alerting through Google Cloud Audit Logs.
  • Public datasets.
  • Encryption at rest.

Structure

Dataset

A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to your tables and views. A table or view must belong to a dataset, so you need to create at least one dataset before loading data into BigQuery.

On creating the dataset, you specify:

  • The data location to which it belongs.
  • The default table expiration.
  • The encription key for data.

Dataset Restrictions

  • Dataset names must be unique per project.
  • Data location cannot be changed with Console, only with bq command-line tool, or calling the patch or update API methods.
  • All tables that are referenced in a query must be stored in datasets in the same location.
  • When you copy a table, the datasets that contain the source table and destination table must reside in the same location.

Tables

Tables store data.

Partitions

A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data.

By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.

You can partition BigQuery tables by:

  • Ingestion time.
  • A date, timestamp or datetime column.
  • An integer column.

Tables support column-based partitions, for low cardinality.

We can set expiration at partition level.

Clustering

In clustered tables, the table data is stored colocated, according to specified clustering columns. The order of columns you specify determines the sort order of the data.

When data is written to a clustered table BigQuery sorts the data using the values in the clustering columns.

Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data.

Performance improvement: when you submit a query that contains a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data. Performance is relevant if the table or partition is over 1GB.

Pricing

Pricing is based in analysis pricing (compute, like queries) and storage pricing. Additionally, there is also data ingestion pricing and data extraction pricing (straming inserts and using BigQuery Storage API). There are 2 pricing models: on-demand pricing (storage and compute that you use) and flat-rate pricing with Reservations, for pricing predictability.

Best Practices

  • Do not uses limit: it does not reduces the cost.
  • Use partitions to reduce cost.

Quota Limits

There are quota limits for query jobs, load jobs, export jobs, dataset limits, table limits, UDF limits, data manipulation's statements, BigQueryML limits, streaming inserts and API requests.

Use Cases

  • Enterprise Data Warehouse.
  • Analytical reporting on large data.
  • Syn for all your BI reports.
  • Data Science and advanced analyses
  • Big Data processing using SQL

Command Line Interface

  • Create
bq --location=<location> mk -d --default table expiration <expiracy> --description <dataset-description> <dataset-name>
  • List
bq ls --format=prettyjson
  • Show
bq show --format=prettyjson <project-id>:<dataset-name>
  • Query
bq query --nouse_legacy_sql '<select>'
  • Update
bq update --default table expiration 7200 <dataset-name>
  • Remove
bq rm -r -f -d <project-id>:<dataset-name>