by Sally Bo Hatter

These days it is hard to imagine a company that does not rely on data, and the role of data engineering has grown significantly in recent years. At the same time, companies turn to cloud computing to reduce the costs and risks of maintaining on-premises platforms, achieve greater flexibility, and gain access to the variety of tools the cloud provides. Here, let us see what services and tools Google Cloud Platform (GCP) offers for data engineering.

BigQuery

BigQuery is one of the best services on GCP for analyzing large volumes of data. It is an OLAP database, which makes it a natural choice for a cloud data warehouse. It works out of the box and does not require complex configuration or deep database expertise to use efficiently. The BigQuery ecosystem supports ad-hoc queries, BI tools on top of the warehouse, AI/ML applications, data loading and export, and fine-grained access control. BigQuery is also well integrated with other GCP services; for example, logging data from other services can be loaded into BigQuery for further analysis. One aspect to keep an eye on is cost: pricing has two components, compute (on-demand per query, or capacity-based for allocated slots) and storage, and depending on your data strategy and use cases it can become expensive. The other potential issue is latency: BigQuery is optimized for analytical throughput rather than low-latency point lookups, so applications built on top of a BigQuery data warehouse should take this into consideration. More information about BigQuery: https://cloud.google.com/bigquery.
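
To give a sense of how little ceremony an ad-hoc query needs, here is a minimal sketch using the official google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical placeholders.

    from google.cloud import bigquery

    # Uses Application Default Credentials; the project is picked up
    # from the environment. Dataset and table names are made up.
    client = bigquery.Client()

    query = """
        SELECT device_id, COUNT(*) AS event_count
        FROM `my-project.analytics.events`
        GROUP BY device_id
        ORDER BY event_count DESC
        LIMIT 10
    """

    # A query runs as a job; result() blocks until it completes.
    for row in client.query(query).result():
        print(row.device_id, row.event_count)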

Dataflow

Dataflow is a fully managed stream and batch processing engine. It can process large volumes of streaming data with low latency and high consistency. Imagine you have a constant flow of events from numerous devices that you need to aggregate and load into a database for further analysis, or a log of keep-alive events from user devices that you want to convert into user sessions: these are workloads that fit Dataflow perfectly. Dataflow uses Apache Beam as its programming interface. Once written, a pipeline can run on a wide range of runners, including but not limited to Apache Spark, Apache Flink, and Dataflow itself [1]. Google Cloud also provides a large set of templates for Dataflow pipelines with various data sources and sinks. Dataflow can be used in ETL to move and transform data from one service to another, to serve data to an ML model, as part of an application that processes a stream of events (log events, clicks, etc.), or as part of a lambda architecture. One of the benefits of Dataflow is that it requires little configuration and maintenance compared to alternatives like Dataproc, but it may be more expensive for the same workload. More details: https://cloud.google.com/dataflow.

[1] Runners differ in capabilities, which may affect the portability of a specific pipeline. Information about runner capabilities is available in the official Apache Beam documentation.
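
To make the event-aggregation example concrete, here is a minimal Apache Beam sketch that counts keep-alive events per device in fixed one-minute windows. It runs locally on the DirectRunner by default; the same code can target Dataflow by passing --runner=DataflowRunner together with project, region, and temp_location pipeline options. The hard-coded input is a stand-in for a real event source.

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows, TimestampedValue

    # Count keep-alive events per device in fixed one-minute windows.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            # Hard-coded (device_id, event_time_in_seconds) pairs stand in
            # for a real unbounded source such as Pub/Sub.
            | "CreateEvents" >> beam.Create(
                [("device-1", 10), ("device-1", 25), ("device-2", 70)])
            | "AddTimestamps" >> beam.Map(
                lambda e: TimestampedValue(e[0], e[1]))
            | "WindowIntoMinutes" >> beam.WindowInto(FixedWindows(60))
            | "CountPerDevice" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)
        )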

Dataproc

Dataproc is a managed Hadoop service on Google Cloud. Besides Hadoop, it includes Apache Spark, Flink, ZooKeeper, Solr, and other services from the Hadoop ecosystem. It can run either on classic Compute Engine VMs or on Google Kubernetes Engine (GKE), and Dataproc Serverless runs Apache Spark jobs without requiring you to create a cluster at all. Dataproc is a good tool for ad-hoc analytics on unstructured data or for hosting data lake or lakehouse workloads, and it is the service of choice for migrating an existing on-premises data lake to the cloud. As mentioned before, it can serve as an alternative to Dataflow: it costs less but requires more expertise and more maintenance, and it also has a slightly lower SLA. Which one to choose depends on the concrete problem at hand. More information about Dataproc: https://cloud.google.com/dataproc.
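
As an illustration, the following sketch submits a hypothetical PySpark job to an existing Dataproc cluster using the official google-cloud-dataproc Python client; the project, region, cluster name, and Cloud Storage path are all placeholders.

    from google.cloud import dataproc_v1

    project_id = "my-project"   # placeholder
    region = "europe-west1"     # placeholder
    # Dataproc clients must target a regional endpoint.
    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    job = {
        "placement": {"cluster_name": "my-cluster"},
        "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/wordcount.py"},
    }

    # submit_job_as_operation returns a long-running operation;
    # result() blocks until the job itself finishes.
    operation = client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    print("Job finished with state:", operation.result().status.state.name)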

Cloud Storage

Cloud Storage is an object store similar to Amazon S3 and Azure Blob Storage. It is highly scalable, highly available storage for large amounts of unstructured or semi-structured data. It can serve as the foundation of a modern data lake, as a landing zone for other services, or as archive/backup storage. It is well integrated with many Google Cloud services: BigQuery, Dataflow, and Dataproc can all read from and write to Cloud Storage. For example, a Spark or Hadoop job can access Cloud Storage directly through the Cloud Storage connector. Different storage classes and SLAs are available depending on the use case. More information: https://cloud.google.com/storage.
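
Here is a minimal sketch of the landing-zone pattern using the official google-cloud-storage Python client; the bucket and object names are made up.

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-data-lake-bucket")  # placeholder name

    # Upload a local file into a dated "landing zone" prefix...
    blob = bucket.blob("landing/2024-01-01/events.json")
    blob.upload_from_filename("events.json")

    # ...and read it back as text.
    print(blob.download_as_text()[:200])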

Cloud Composer

When loading data into a data warehouse or transforming data in a data lake, we need to consider dependencies between steps and define a schedule, i.e. create and manage workflows. Apache Airflow is one of the most popular tools for workflow orchestration: it organizes tasks into a directed acyclic graph (DAG) and associates a schedule with it. Cloud Composer is a fully managed Apache Airflow service hosted on Google Cloud. It offers integration with other GCP services and allows for flexible cost management. Still, there can be a downside for small companies when it comes to the cost/utilization ratio, since Cloud Composer does not scale to zero: if workloads run only nightly, the Cloud Composer instance sits idle during the day. In that case, a simpler self-managed scheduler running on a VM may prove more economical. More details: https://cloud.google.com/composer.
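
Here is a minimal Airflow DAG sketch for such a nightly workflow, with placeholder Python tasks; in a real Cloud Composer deployment these would typically be service-specific operators (BigQuery, Dataflow, etc.).

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="nightly_warehouse_load",
        start_date=datetime(2024, 1, 1),
        schedule_interval="0 2 * * *",  # every night at 02:00
        catchup=False,
    ) as dag:
        extract = PythonOperator(
            task_id="extract", python_callable=lambda: print("extracting"))
        transform = PythonOperator(
            task_id="transform", python_callable=lambda: print("transforming"))

        # The >> operator defines the edges of the DAG.
        extract >> transform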

Conclusion

Google Cloud provides plenty of opportunities for data engineering and for hosting data-driven applications, in the form of well-integrated, fully managed, on-demand services, along with many partner and third-party services that can run on Google Cloud. Other fully integrated services that were not covered here but are worth mentioning include:

  • Looker, a data visualization/BI tool.
  • Cloud Data Fusion, a fully managed visual data integration tool.
  • Pub/Sub, a distributed serverless message queue.
  • Datastream for BigQuery, which provides seamless replication from relational databases directly into BigQuery.

Choosing the right tool involves many nuances and can be a challenging task. One thing you can be sure of is that no single solution fits all use cases. A tailored solution should be aligned with your use cases, stakeholder requirements, and your company’s data strategy.
