Lakehouse Architecture: Unifying Data Lakes and Data Warehouses

International Journal of Computer Techniques

ISSN 2394-2231

DOI Registered

Volume 8, Issue 1 | Published: February 2021

Author

Jeevan Krishna Paruchuri

Abstract

Data-driven organizations have historically operated two parallel storage systems: a data warehouse for governed analytical workloads and a data lake for raw, semi-structured, and machine-learning-oriented data. This two-system architecture introduces three direct problems: data duplication across stores, brittle extract-transform-load (ETL) pipelines that couple them, and governance discontinuities at the boundary between them. Together, these problems impose significant operational overhead. The lakehouse paradigm, an architectural approach that emerged in academic and industrial literature in the late 2010s and crystallized around 2020, attempts to unify the two systems by layering transactional table semantics, schema enforcement, and governance directly onto inexpensive object storage. This paper presents a systematic survey of lakehouse architecture proposals, the open table formats that enable them (Delta Lake, Apache Iceberg, and Apache Hudi), and early adoption reports from academic and industry literature published through 2020. The survey makes three contributions: (1) a taxonomy of lakehouse design patterns organized along five architectural layers; (2) a structured comparison of the three dominant open table formats against analytical, ML, and streaming workload requirements; and (3) a set of open research challenges informed by the author’s experience deploying a Delta Lake-based lakehouse in a production banking environment. Practitioner observations from that deployment indicate that storage-side optimizations dominated cost reduction, that small-file fragmentation was the primary query-performance bottleneck for streaming workloads, and that the table format alone does not eliminate governance debt. The paper concludes by arguing that lakehouse research must shift from benchmark-scale experiments toward production-scale governance, compaction, and cost-modeling problems.

Keywords

lakehouse, data lake, data warehouse, Delta Lake, Apache Iceberg, Apache Hudi, open table format, cloud object storage

Conclusion

The lakehouse paradigm represents a genuine architectural shift in enterprise data management, enabled by open table formats that bring transactional semantics, schema enforcement, and time travel to inexpensive object storage. This survey has organized the paradigm into five architectural layers, compared the three dominant open table formats across seven technical dimensions, and reported practitioner observations from a production banking deployment that document both concrete benefits and persistent challenges. The principal finding is that the lakehouse paradigm solves a specific and important problem, the storage and compute unification problem that motivated the two-system architecture, but does not solve adjacent problems that organizations often hope to address through architectural change. Governance debt, operational discipline around compaction and snapshot retention, and cost predictability remain open challenges that the table format alone cannot address. The practitioner observations reported in Section 6 show that storage-side optimizations dominated cost reduction in one production environment, that small-file fragmentation was the principal query-performance bottleneck for streaming workloads, and that governance gaps persisted even after the lakehouse migration was complete. These findings suggest that lakehouse research should shift its emphasis from benchmark-scale experiments that compare query latencies on synthetic workloads toward production-scale challenges in adaptive compaction, governance-native table format design, cross-format interoperability, and empirical cost modeling. The lakehouse is no longer a speculative architecture; it is being deployed at scale in regulated industries, and the research questions that matter most are increasingly those that can only be observed at production scale. The contributions of this survey (the layered taxonomy, the structured comparison, and the practitioner observations) are intended to support that shift in research orientation.

How to Cite This Paper

Jeevan Krishna Paruchuri (2021). Lakehouse Architecture: Unifying Data Lakes and Data Warehouses. International Journal of Computer Techniques, 8(1). ISSN: 2394-2231.

Lakehouse Architecture: Unifying Data Lakes and Data Warehouses | IJCT Volume 8 – Issue 1 | IJCT-V8I1P24
