Survey of Cloud-Native Workflow Orchestration with Apache Airflow | IJCT Volume 9 – Issue 3 | IJCT-V9I3P7

International Journal of Computer Techniques
ISSN 2394-2231
Volume 9, Issue 3  |  Published: May 2022

Author

Jeevan Krishna Paruchuri

Abstract

Workflow orchestration is foundational infrastructure for modern data platforms, and Apache Airflow has become the dominant open-source orchestrator in cloud-native environments. This survey examines deployment patterns for Airflow on Kubernetes, contrasts Airflow with alternative orchestration systems including Apache Oozie, Luigi, Prefect, and Dagster, and reports operational findings from a six-month production deployment of thirty-five core campaign workflows on Google Cloud Platform. The survey is grounded in a migration away from a centralized Google Cloud Data Fusion environment of approximately 178 campaign attribution workflows that had become difficult to maintain. The principal practitioner finding is that introducing GitSync-based deployment shortened the deployment cycle from between three and five days under manual procedures to under five minutes from merge to running DAG, but that this productivity gain is offset by operational complexity that surfaces as a small set of recurring failure modes. The survey contributes a taxonomy of four deployment patterns, a catalog of five production failure modes with root causes and mitigations, and a set of CI/CD recommendations for managing DAG-as-code at scale.

Keywords

Apache Airflow, workflow orchestration, cloud-native, DAG, data pipeline, task scheduling, directed acyclic graph

Conclusion

This survey has examined cloud-native workflow orchestration with Apache Airflow, organized around four deployment patterns, CI/CD and GitOps integration practices, a comparative analysis across ten operational criteria, and five production failure modes drawn from a six-month deployment of thirty-five core campaign workflows on Google Cloud Platform. The deployment migrated workloads from a previous environment of approximately 178 Google Cloud Data Fusion campaign attribution workflows that had become difficult to maintain, and the introduction of GitSync-based deployment shortened the deployment cycle from between three and five days under manual procedures to under five minutes from merge to running DAG. The principal findings are threefold: the KubernetesExecutor pattern combined with GitSync offers the most attractive operational profile for most medium-scale Airflow deployments; static DAG validation in CI prevents the largest single class of production failure; and approximately sixty percent of the failures observed in the empirical deployment were related to DAG code or configuration rather than to infrastructure. Together, these findings suggest that the principal investment for teams adopting Airflow on Kubernetes lies not in cluster operation itself, where Kubernetes-native tooling is mature, but in the discipline of DAG-as-code engineering: validation in CI, observability of scheduler and executor health, vendoring or versioning of shared code, and the operational habits that catch problems before they reach production. Workflow orchestration is a load-bearing component of modern data platforms, and the operational discipline required to run it well repays the investment many times over.
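
As an illustration of the static DAG validation in CI discussed above, the following is a minimal sketch only and not the validation pipeline used in the surveyed deployment. It assumes Python 3.9 or later and uses only the standard library to check two things a CI job might check before a merge: that every DAG file parses, and that a declared task dependency graph is acyclic. The function names `syntax_errors` and `find_cycle`, the file names, and the task ids are hypothetical. In an actual Airflow project a common alternative is to load the DAG folder with `airflow.models.DagBag` and fail the build when its `import_errors` mapping is non-empty.

```python
import ast
from graphlib import CycleError, TopologicalSorter


def syntax_errors(sources):
    """Return {filename: message} for every source file that fails to parse.

    `sources` maps a (hypothetical) DAG file name to its text, so the check
    can run without importing or executing any DAG code.
    """
    errors = {}
    for name, text in sources.items():
        try:
            ast.parse(text, filename=name)
        except SyntaxError as exc:
            errors[name] = f"line {exc.lineno}: {exc.msg}"
    return errors


def find_cycle(dependencies):
    """Return a list of task ids forming a cycle, or None if the graph is acyclic.

    `dependencies` maps each task id to the set of task ids it depends on.
    """
    try:
        TopologicalSorter(dependencies).prepare()
    except CycleError as exc:
        # graphlib stores the detected cycle as the second exception argument;
        # the first and last node in that list are the same.
        return list(exc.args[1])
    return None


if __name__ == "__main__":
    # Hypothetical inputs standing in for files in a DAG repository.
    files = {"good_dag.py": "x = 1\n", "bad_dag.py": "def broken(:\n"}
    graph = {"load": {"transform"}, "transform": {"extract"}, "extract": set()}

    problems = syntax_errors(files)
    cycle = find_cycle(graph)
    # A CI job would exit non-zero here so the merge is blocked.
    raise SystemExit(1 if problems or cycle else 0)
```

Parsing with `ast` rather than importing keeps the check fast and free of side effects, at the cost of missing errors that only appear at import time (for example, a missing provider package); that trade-off is the reason a `DagBag`-based check is often run alongside it.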

How to Cite This Paper

Jeevan Krishna Paruchuri (2022). Survey of Cloud-Native Workflow Orchestration with Apache Airflow. International Journal of Computer Techniques, 9(3). ISSN: 2394-2231.

© 2022 International Journal of Computer Techniques (IJCT). All rights reserved.
