
Trino as a Unified Query Layer for Heterogeneous Data Sources: Survey and Benchmarks | IJCT Volume 9 – Issue 3 | IJCT-V9I3P8

International Journal of Computer Techniques
ISSN 2394-2231
Volume 9, Issue 3 | Published: May – 2022
Author
Kuladeep Sandra
Abstract
Modern enterprises store data across heterogeneous systems (relational databases, data lakes, cloud warehouses, and message brokers), and the cost of moving that data into a single physical store is rarely justifiable. Federated query engines that read each source in place have therefore become a foundational layer of enterprise data platforms. This paper surveys the federated query landscape and presents production operating experience with Trino across 14 heterogeneous data sources in a banking and insurance environment. The paper addresses three questions: how Trino’s architecture enables federation across diverse backends; what the operational realities are when running Trino in production at enterprise scale; and how practitioners should think about its strengths and limits. Two non-obvious findings shape the operational picture. First, Trino’s container cold-start time on Kubernetes is nearly 30 seconds, which makes it unsuitable for sub-second interactive dashboard SLAs without architectural workarounds. Second, network topology dominates performance for data-intensive queries: relocating Trino worker pods onto racks physically closer to the source data improved data-intensive query latency by close to 40 percent. The thesis is that Trino is a powerful federation layer for ad-hoc analytics, self-service discovery, and cross-source joins, but it is not a magical “query everything equally fast” engine; it requires understanding per-connector trade-offs, careful memory and topology tuning, and clear architectural decisions about which use cases it serves and which it does not.
Keywords
Trino, federated queries, query federation, multi-source, SQL analytics, query engine
Conclusion
Returning to the three questions:
RQ1. Trino’s connector pattern is the architectural feature that enables federation. The coordinator-worker execution model is conventional; the connectors are what allow the same SQL to run against 14 different backends. Predicate pushdown is the operational lever that determines whether the federation actually performs, and pushdown quality varies by connector.
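To make the pushdown point concrete, the following is a minimal toy model in Python, not Trino code. `ToySource`, `run_query`, and the sample rows are hypothetical names and data introduced only for illustration. The sketch counts how many rows leave the source when the filter is evaluated at the source (pushdown) versus in the engine after a full scan.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional

Row = dict
Predicate = Callable[[Row], bool]


@dataclass
class ToySource:
    """Hypothetical stand-in for a backend; counts rows sent to the engine."""
    rows: list
    rows_sent: int = 0

    def scan(self, predicate: Optional[Predicate] = None) -> Iterator[Row]:
        # If a predicate is supplied, the source filters before sending.
        for row in self.rows:
            if predicate is None or predicate(row):
                self.rows_sent += 1
                yield row


def run_query(source: ToySource, predicate: Predicate, pushdown: bool) -> list:
    """Return matching rows, filtering at the source or in the engine."""
    if pushdown:
        # The connector translates the filter and the source applies it.
        return list(source.scan(predicate))
    # No pushdown: every row crosses the network, then the engine filters.
    return [row for row in source.scan() if predicate(row)]


def make_rows(n: int = 1000) -> list:
    # Illustrative data: one row in ten belongs to region "EU".
    return [{"id": i, "region": "EU" if i % 10 == 0 else "US"} for i in range(n)]


def is_eu(row: Row) -> bool:
    return row["region"] == "EU"


with_pushdown = ToySource(make_rows())
without_pushdown = ToySource(make_rows())
result_a = run_query(with_pushdown, is_eu, pushdown=True)
result_b = run_query(without_pushdown, is_eu, pushdown=False)

print("rows sent with pushdown:   ", with_pushdown.rows_sent)
print("rows sent without pushdown:", without_pushdown.rows_sent)
```

Both calls return the same rows; only the volume transferred from the source differs, which is why a connector's pushdown quality can matter more than the engine itself for selective queries.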
RQ2. The operational realities are mixed. Cold-start time is ~30 seconds, which rules out auto-scale-to-zero for interactive workloads. Network topology matters: placing workers on racks physically closer to the source data improved our data-intensive query latency by ~40 percent. Memory settings must be tuned away from their defaults, and OOM failures are common until the configuration is right. Connector reliability varies, and treating all sources as equivalent is a mistake.
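The arithmetic behind the first two findings can be sketched as follows. The ~30 second cold start and ~40 percent improvement come from the paper; the function names, the 1-second SLA, and the sample query times are assumptions chosen only for illustration.

```python
COLD_START_SECONDS = 30.0    # approximate cold-start time reported in the paper
TOPOLOGY_IMPROVEMENT = 0.40  # approximate latency reduction reported in the paper


def worst_case_latency(query_seconds: float, scaled_to_zero: bool,
                       cold_start_seconds: float = COLD_START_SECONDS) -> float:
    """First-query latency: add the cold start if no workers are running."""
    return query_seconds + (cold_start_seconds if scaled_to_zero else 0.0)


def meets_sla(query_seconds: float, sla_seconds: float, scaled_to_zero: bool) -> bool:
    """True if the worst-case first query still fits inside the SLA."""
    return worst_case_latency(query_seconds, scaled_to_zero) <= sla_seconds


def latency_after_relocation(latency_seconds: float,
                             improvement: float = TOPOLOGY_IMPROVEMENT) -> float:
    """Latency of a data-intensive query after moving workers closer to the data."""
    return latency_seconds * (1.0 - improvement)


# Hypothetical dashboard query: 0.5 s of execution against a 1 s SLA.
print(meets_sla(0.5, sla_seconds=1.0, scaled_to_zero=False))
print(meets_sla(0.5, sla_seconds=1.0, scaled_to_zero=True))
# Hypothetical data-intensive query that took 100 s before relocation.
print(latency_after_relocation(100.0))
```

Under these assumptions, a query that comfortably meets a sub-second SLA on warm workers misses it by a wide margin after a scale-to-zero event, because the cold start alone exceeds the SLA.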
RQ3. Trino is a federation layer for ad-hoc analytics, self-service discovery, and cross-source joins. It is not a replacement for the underlying systems and it is not a real-time query engine. The teams that get the most from Trino are those that understand its trade-offs and tune for them, not those that expect it to make federation effortless.
The closing observation: federated query systems are powerful enough that it is tempting to treat them as universal solutions, and giving in to that temptation is a mistake. Trino is excellent for ad-hoc federated analytics and unsuitable for sub-second or real-time workloads. The discipline of knowing the difference is what separates a successful Trino deployment from a frustrated one.
How to Cite This Paper
Kuladeep Sandra (2022). Trino as a Unified Query Layer for Heterogeneous Data Sources: Survey and Benchmarks. International Journal of Computer Techniques, 9(3). ISSN: 2394-2231.







