Dataform on GCP Overview

By Venkatesh Thamotharan | @intelia | August 3

Introduction

Dataform is a serverless service that helps data engineers and data analysts to develop and operationalize scalable data transformation pipelines in BigQuery (BQ) which, is Google Cloud’s cost-effective enterprise data warehouse. BQ is fully managed, serverless data analytics platform which has built-in features like machine learning (ML), geospatial analysis, and business intelligence (BI), and ad hoc analysis. Dataform allows you to develop and build complex SQL workflows for data transformation and allows you to schedule these workflows and version control with Git. Dataform presents a single unified UI and API for SQL workflow development and execution. It also provides development and deployment of tables, incremental tables, or views to BigQuery. Dataform lets you manages the T in ELT (Extract, Load, Transform) processes. For more information about ELT how approaches to data extraction and transformation are changing, please refer to THIS ARTICLE. After raw source data is extracted and loaded into BigQuery, Dataform helps you to transform it into a well-defined, tested, and documented suite of data tables, which will be data integration ready.

 

 

 

Uses of Dataform

The key purpose of Dataform is to allow data engineers and data analyst to build SQL pipelines in BigQuery whilst following the best practices in software engineering. Listed below are some of the uses of Dataform on GCP.

  • Develop and operationalise scalable complex data pipelines in BigQuery using SQL from a single environment and without additional dependencies.
  • Define tables, visualise dependencies, commit the changes to Git, and schedule pipelines, from a single interface, without leaving your web browser.
  • Develop curated, up-to-date, trusted, and documented tables in BigQuery.
  • Enable data analysts and data engineers to collaborate on the same central repository which allows them to manage their SQL code.
  • Dataform handles the operational infrastructure, thus fully managed and serverless orchestration, to update your tables following the dependencies between your tables and using the latest version of your code.
  • Integrate with GitHub and GitLab. Dataform allows you to connect your repository with third-party providers such as GitHub and GitLab and commit changes and push or open pull requests from the IDE.

Comparison to alternatives for Dataform

There are a few alternatives to Dataform where you can perform data transformation, which are listed below.

  • BigQuery Stored Procedures: Stored procedures written in BigQuery can be used to perform data transformation while reading from a table in BigQuery and writing it another table. Stored procedures are hard to maintain and hard to orchestrate specifically moving from one environment to another. And, they are not centrally managed, thus only one person can develop a particular stored procedure at a time.
  • dbt (Data Build Tool): dbt is open-source, and it allows anyone to build and manage data transformation workflows in a modular and reusable approach. dbt’s infrastructure need to be manged and we need to dbt on every environment, such as development, staging or production. Furthermore, we need to integrate dbt cloud with GCP.
  • Fivetran: Fivetran is a low-code ETL tool that automates the ETL process and make ETL more accessible and efficient. includes dozens of pre-built connectors for popular data sources such as Microsoft Azure, SalesForce, and Facebook. Fivetran can perform simple transformation as it is heavily focused on bring the data into the platform/cloud.
  • Matillion: Matillion ETL is an ETL/ELT tool built specifically for cloud database platforms such as Snowflake, Delta Lake on Databricks, Amazon Redshift, Google BigQuery, and Azure Synapse Analytics. It is a browser-based UI with powerful push-down ETL/ELT functionality. Matillion has a hard limit on the hardware / EC2 instances it can use, thus does not scale up well. It is difficult to debug and has no restart-from-point-of-failure option for failed loads.

In conclusion, if you are already on GCP, then Dataform is better as it is owned by Google and is well integrated with BigQuery. We will not be able to use BigQuery Stored Procedures as we need to modify the code for all new instances and deployments. If you are using any other cloud data warehouse such as Snowflake and Redshift, dbt is more suited given its flexibility plus you can set up your own server or use their hosted services.

Dataform: Deployment and Maintenance

Deployment of assets to BigQuery via SQL workflows is fully managed and serverless. You can manually trigger an execution of a workflow or schedule executions by using Cloud Composer or Workflows with Cloud Scheduler. Furthermore, by utilising release configurations and workflow configurations you can execute workflows in various environments, such as development, staging or production. Maintenance of Dataform code is made easy. Dataform uses the concept of a centralised repository which stores all the SQLX and JavaScript files that make up a SQL workflow and stores configuration files and packages. Each Dataform repository can link with a GitHub or GitLab repository. Using Git Dataform manages version control and track changes to the code in the repository.

Is Dataform costly to use?

Dataform is a free service on GCP. Yet, there will be other associated costs using Dataform, such as the cost of running queries in BigQuery since Dataform runs queries in BigQuery. Furthermore, Dataform uses other Google Cloud services like Cloud Workflows, Cloud Scheduler and Cloud Logging. They also will bear a price after they reach the free threshold.

Why should we use Dataform?

Dataform is tool built for data professionals who interact with a cloud data warehouse. It helps data engineers and data analyst to adopt best practices and develop workflows to manage tables in their data warehouse. It enables to manage data significantly faster and deliver data that is trusted and understood by the entire organisation. Dataform is specifically designed to work with BigQuery, which means that it can take advantage of some of the unique features and capabilities of that platform. If you’re using BigQuery and have a requirement to transform data (who wouldn’t!), Dataform is the ideal choice and it’s just become Generally Available which means it’s fully supported for production workloads.