Opinionated Orchestration with Airflow on Kubernetes
As Grand Rounds grew from employing tens of people to hiring hundreds of employees per year in support of our ever-expanding product offerings, our tech stacks and orchestration tools grew to fit those needs. Yes, you read that right: orchestration tools, plural! We had prioritized quick iteration on our applications over deep investment in data or our data platform. This allowed us to build impactful products quickly. In turn, however, we faced the challenge of debt on our data platform: we had multiple technologies where we should’ve had one, and for any problem there were several equally valid solutions. To build an opinionated platform, we architected a unified data platform, which this post describes in detail.
As a tech company, we treat data as vital to our culture and product. But it’s not the data itself that’s crucial; it’s the ability to quickly and effectively manipulate the data to gain actionable insights. To work with data across an organization, orchestrators combine simple tasks into workflows.
Within the context of a data platform, an orchestrator needs to solve the basic problems of authoring, scheduling, and monitoring workflows. The de facto open source tools for this are Airflow and Oozie, each of which solves the orchestration problem for engineers on slightly different tech stacks.
At Grand Rounds, data needs range from data processing through Spark to automating simple tasks, so the general-purpose orchestration available through Airflow was a great choice to explore further. To be successful, we needed to maintain current infrastructure around existing orchestrators and to build the scaffolding required to support each team. In the short term, it is a hard sell to convince an organization to rewrite tens of thousands of lines of Java, Ruby, and Python code to fit a new architecture. To avoid rewriting, containers provide a clean abstraction layer to house various languages. Airflow leverages this abstraction through the KubernetesPodOperator. With the Kubernetes (K8s) operator, we can build a highly opinionated orchestration engine while giving each team and engineer the freedom to develop individualized workflows.
Deeper Dive Into Airflow
Airflow has long had the problem of conflating orchestration with execution, as aptly noted by the Bluecore team. To create a separation of concerns between these two concepts, Airflow at Grand Rounds is restricted to running only one operator, the K8s operator, which is responsible for four key steps:
- Allocate resources
- Spin up a container in the right namespace
- Assign an IAM role to pod
- Monitor status of worker pod
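The four steps above can be sketched as a plain Kubernetes Pod manifest plus a status check. This is a minimal illustration, not the Grand Rounds configuration and not Airflow's operator API: the function names, the namespace, image, and role ARN used in the usage note, and the kube2iam-style annotation key are all assumptions made for the example.

```python
from typing import Dict, List

# Assumption: IAM roles are attached through a kube2iam-style pod annotation.
# The post does not say which mechanism is actually used.
IAM_ANNOTATION = "iam.amazonaws.com/role"


def build_worker_pod(
    name: str,
    namespace: str,
    image: str,
    command: List[str],
    iam_role: str,
    cpu: str,
    memory: str,
) -> Dict:
    """Return a Pod manifest covering steps 1-3 (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "namespace": namespace,  # step 2: run in the right namespace
            "annotations": {IAM_ANNOTATION: iam_role},  # step 3: IAM role for the pod
        },
        "spec": {
            "restartPolicy": "Never",  # a task pod runs once and exits
            "containers": [
                {
                    "name": "task",
                    "image": image,
                    "command": command,
                    # step 1: allocate resources for this task only
                    "resources": {"requests": {"cpu": cpu, "memory": memory}},
                }
            ],
        },
    }


def is_finished(phase: str) -> bool:
    """Step 4: a pod is done once it reaches a terminal phase."""
    return phase in ("Succeeded", "Failed")
```

For example, `build_worker_pod("extract", "data-eng", "ruby:2.6", ["ruby", "job.rb"], "arn:aws:iam::123456789012:role/example", "500m", "1Gi")` describes a Ruby task; the orchestrator would then poll the pod's phase until `is_finished` returns true.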
This paradigm allows us to create a boundary between orchestration required by teams and the task each pod performs, resulting in the following architecture:
Dynamically mounting volumes and requesting large compute resources is a great benefit for our data science teams, as many modeling tasks are resource intensive. For this, we leveraged native Kubernetes resource requests as a way to dynamically request resources to allow tasks of various sizes to run in parallel. This reduces the overhead of needing to upgrade EC2 instances as tasks become more demanding. It also reduces the overall data platform spend since we’re no longer standing up a single large instance for running all data science tasks.
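To make the idea of per-task resource requests concrete, the sketch below parses Kubernetes-style quantities and checks whether a set of task requests could share one node. It is a simplification under stated assumptions: the helper names are hypothetical, only the `m` CPU suffix and binary memory suffixes are handled, and a real scheduler also accounts for allocatable capacity and system overhead.

```python
from typing import Dict, List

_MEM_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as "500m" or "2" to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "4Gi" to bytes."""
    for suffix, factor in _MEM_UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-2]) * factor)
    return int(quantity)  # no suffix: plain bytes


def fits_on_node(requests: List[Dict[str, str]], node_cpu: str, node_memory: str) -> bool:
    """True if the summed requests stay within the node's CPU and memory."""
    total_cpu = sum(parse_cpu(r["cpu"]) for r in requests)
    total_mem = sum(parse_memory(r["memory"]) for r in requests)
    return total_cpu <= parse_cpu(node_cpu) and total_mem <= parse_memory(node_memory)
```

With a small task requesting `500m` CPU and `1Gi` memory next to a modeling task requesting `4` CPUs and `16Gi`, both fit on a hypothetical 8-core, 32Gi node, so they can run in parallel instead of each needing its own large instance.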
Assigning IAM roles to pods has helped the data engineering teams enforce a policy of least access per project by granting each pod access only to the specific resources it requires. IAM roles and access grants are managed through Terraform at Grand Rounds, allowing us to store all infrastructure as code.
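As an illustration of least access, the sketch below builds an IAM policy document that lets a pod read a single prefix of a single S3 bucket and nothing else. The bucket name, prefix, and function name are hypothetical, and the post manages such policies in Terraform; Python is used here only to show the shape of the policy document.

```python
import json
from typing import Dict


def s3_read_policy(bucket: str, prefix: str) -> Dict:
    """Return an IAM policy granting read access to one bucket prefix (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read objects only under the project's prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
            {
                # List the bucket, restricted to the same prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


if __name__ == "__main__":
    # Example bucket and prefix are placeholders.
    print(json.dumps(s3_read_policy("example-data-lake", "team-a"), indent=2))
```

A role carrying only this policy, attached to one project's pods, cannot touch another team's prefix even though both run on the same cluster.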
This model of allocating resources and access is shown in the illustration below:
To effectively manage this environment of multiple teams with varied testing and development lifecycles, code needs to be separated and maintained by each team. By writing our own GitSync, inspired by mumoshu’s version, we’re able to sync multiple repos and branches at runtime, allowing customized testing and software development lifecycles to be maintained by respective teams in their own repos. To maintain dependencies across teams and organizations, Boundary Layer serves as an abstraction layer to attach one team’s DAG to another’s without dependency on external repos. While Boundary Layer is still in a POC phase at Grand Rounds, it shows promise in giving us the ability to abstract out how teams implement and design DAGs.
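The multi-repo sync idea can be sketched as follows. This is not the Grand Rounds GitSync or mumoshu’s implementation; it is an assumed minimal design in which each team’s repo and branch is checked out into its own subdirectory of the DAG folder. The `RepoSpec` type, the directory layout, and the function name are hypothetical, and the sketch only builds the git commands rather than running them.

```python
from typing import List, NamedTuple


class RepoSpec(NamedTuple):
    team: str    # used as the subdirectory name under the DAG folder
    url: str     # git remote URL
    branch: str  # branch this team wants deployed


def sync_commands(spec: RepoSpec, dags_root: str, already_cloned: bool) -> List[List[str]]:
    """Return the git commands that bring one team's checkout up to date."""
    dest = f"{dags_root}/{spec.team}"
    if not already_cloned:
        # First sync: shallow clone of just the requested branch.
        return [[
            "git", "clone", "--branch", spec.branch, "--single-branch",
            "--depth", "1", spec.url, dest,
        ]]
    # Later syncs: fetch the branch tip and hard-reset the checkout to it.
    return [
        ["git", "-C", dest, "fetch", "origin", spec.branch],
        ["git", "-C", dest, "reset", "--hard", "FETCH_HEAD"],
    ]
```

A sync loop would call `sync_commands` for every configured `RepoSpec` on a timer and execute each command (for example with `subprocess.run(cmd, check=True)`), so that one team can track `main` while another tests a feature branch.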
The management of teams and their repositories is shown below:
Unification of Orchestration
While each team can start with their own repos, most teams have existing code bases with their own orchestrators, as mentioned earlier in this post. Our goal with the new architecture is to unify all orchestrators at Grand Rounds. To validate this, we ran native Airflow operators and older code written in Ruby inside K8s Airflow. To run native Airflow, we needed to tweak Puckel’s Airflow image slightly to work in our environment. In a few minutes, we could transform existing Airflow scripts to begin running on K8s. Similarly, with our Ruby code, it was easy to copy existing scripts into a Ruby Docker container and begin orchestration through Airflow.
Sample Workflow at GR
Not only do we manage orchestration within each team’s repo, we also manage Terraform there. Below is a simple DAG showing the capabilities of Airflow on K8s by creating a task to extract tables from an RDS instance into our data lake using Sqoop. The main complexity arose from managing access to different AWS services: KMS for encryption, cross-role assumption to create read replicas, and EMR access to spin up an instance with Sqoop installed. All of these tasks were simplified through the use of Terraform. This localization of code allowed us to develop and test inside one team’s space without affecting any other team’s lifecycle.
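As a rough idea of what the extraction task runs, the sketch below assembles a Sqoop import command for one table. It does not reproduce the DAG from the post: the function name, JDBC URL, table, and target path are hypothetical, and how the command is submitted (for example as a step on the EMR instance) is left out.

```python
from typing import List


def sqoop_import_args(
    jdbc_url: str,
    table: str,
    target_dir: str,
    username: str,
    password_file: str,
    num_mappers: int = 4,
) -> List[str]:
    """Build the argument list for importing one table with Sqoop (hypothetical helper)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the read replica
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,        # destination path in the data lake
        "--num-mappers", str(num_mappers),
        "--as-parquetfile",                # assumption: Parquet output is wanted
    ]
```

In a DAG, one such task per table could be generated in a loop, each running in its own pod with only the IAM role it needs.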
The re-architecture of Airflow allows us to consolidate the resources spent maintaining various orchestrators onto a single platform and to reduce overall spend by scaling our clusters efficiently. The clearer separation of concerns between orchestration, task execution, and code will reduce the overall cognitive load on our data teams, which will enable us to hire engineers across a wider experience band.
Thanks to major investments by the open source community in creating Helm charts and containerizing Airflow, this project could be completed with relatively little resource investment at any company. At Grand Rounds this process was completed by Dhruv Jauhar and me with minimal weekly time investment over a matter of months.
While Airflow is a significant upgrade over our previous infrastructure, we continue to invest in our platform to increase the velocity of all data teams at Grand Rounds. Projects that have been top of mind for us are Spark on Kubernetes, to enable large-scale data processing, and Papermill, as a platform for our analytics teams to enable deeper insights from our data. As we move forward along this vision, we’re growing our engineering organization and are hiring for full-time and internship roles. If you’re interested in learning more about Airflow at Grand Rounds or our experience building the data platform, please reach out to me on LinkedIn.