Database Subspace Clustering for High-Dimensional Data Analytics Using Query-Aware Optimization | IJCT Volume 13 – Issue 3 | IJCT-V13I3P126

International Journal of Computer Techniques
ISSN 2394-2231
Volume 13, Issue 3  |  Published: May – June 2026

Authors

Rahmat Widia Sembiring, Abidin Luthfi Sembiring, Alfan Ramadhan Sembiring

Abstract

This paper introduces QAO-IDSC (Query-Aware Optimization for In-Database Subspace Clustering), a framework that embeds subspace clustering logic directly within a relational database engine through user-defined functions (UDFs) and cost-guided query planning. The framework incorporates an entropy-driven dimension relevance filter, a shared-subspace execution plan, and a parallel UDF executor that coordinates with the database buffer manager. Experiments on synthetic and real-world benchmark datasets with up to two million records and two thousand dimensions show that QAO-IDSC reduces execution time by 63.2% relative to traditional out-of-database methods and achieves a Normalized Mutual Information (NMI) score of 0.83, outperforming all evaluated baselines. The system integrates transparently with SQL workflows, so data analysts can invoke subspace clustering through standard query syntax without exporting data.
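The abstract does not specify how the entropy-driven dimension relevance filter is computed. As a rough illustration only, the sketch below assumes a common heuristic: each dimension is binned into an equal-width histogram, and a dimension is kept when its normalized Shannon entropy falls below a threshold, on the reasoning that values concentrated in a few bins suggest cluster structure. The function names, the bin count, and the 0.9 threshold are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def dimension_entropy(values, bins=10):
    """Shannon entropy (in bits) of an equal-width histogram over one dimension."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant column occupies a single bin
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def relevant_dimensions(rows, threshold=0.9, bins=10):
    """Indices of dimensions whose normalized entropy is below `threshold`.

    Assumption: low entropy is treated as a sign of cluster structure.
    The threshold and bin count are illustrative, not values from the paper.
    """
    max_entropy = math.log2(bins)  # entropy of a perfectly uniform histogram
    keep = []
    for d in range(len(rows[0])):
        column = [row[d] for row in rows]
        if dimension_entropy(column, bins) / max_entropy < threshold:
            keep.append(d)
    return keep


# Toy data: dimension 0 holds two tight groups, dimension 1 is spread evenly.
rows = [(0.1 if i % 2 == 0 else 0.9, i / 99) for i in range(100)]
selected = relevant_dimensions(rows)
```

On this toy input only dimension 0 passes the filter, because dimension 1 fills every bin almost uniformly. A real in-database variant would compute the histograms with aggregate queries rather than in client memory; that part is not sketched here.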

Keywords

subspace clustering; in-database analytics; query optimization; high-dimensional data; user-defined functions; OLAP; parallel query execution

Conclusion

This paper presented QAO-IDSC, a framework for performing subspace clustering directly within a relational database engine through query-aware optimization. By combining an entropy-based dimension relevance filter, a shared-subspace execution planner, a PROCLUS-derived UDF engine, and a parallel buffer-integrated executor, QAO-IDSC [27] achieves a 63.2% reduction in execution time relative to traditional out-of-database subspace clustering methods while also improving clustering quality (NMI = 0.83 vs. 0.71 for the best baseline). The system integrates with the SQL language interface, requiring no data export and allowing direct composition with conventional analytical queries. The experimental results indicate that the in-database paradigm is not merely a practical convenience but a source of algorithmic advantage: proximity to the data enables more efficient memory management, query-planner guidance reduces redundant computation, and shared buffer access eliminates the overhead of external data movement. As datasets in domains such as genomics, finance, and industrial sensing continue to grow in both size and dimensionality [28], frameworks that embed analytical computation within the database engine will become increasingly important.

Future research directions include: (i) extending QAO-IDSC to columnar and distributed database engines such as Apache Arrow-based systems and CockroachDB; (ii) developing an adaptive entropy threshold that adjusts to the observed data distribution during execution; (iii) investigating learned cost models for subspace evaluation ordering; and (iv) designing a federated variant that coordinates subspace clustering across multiple database nodes without centralizing the data.
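For readers unfamiliar with the quality metric quoted above, the following minimal sketch computes Normalized Mutual Information between a ground-truth labeling and a predicted clustering. The text does not say which normalization was used; this sketch assumes the arithmetic-mean form, NMI = 2·I(U;V) / (H(U) + H(V)), and the function name is illustrative.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_true, labels_pred):
    """NMI with arithmetic-mean normalization: 2*I(U;V) / (H(U) + H(V)).

    Assumption: the paper does not state its normalization; other
    variants (min, max, geometric mean) give different values.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label sequences must be non-empty and equal in length")

    count_u = Counter(labels_true)
    count_v = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # I(U;V) = sum over co-occurring label pairs of p(u,v) * log(p(u,v) / (p(u) p(v)))
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_u[u] * count_v[v]))
        for (u, v), c in joint.items()
    )
    denom = entropy(count_u) + entropy(count_v)
    # Both partitions consist of a single cluster: treat them as identical.
    return 1.0 if denom == 0 else 2.0 * mutual_info / denom
```

The score is invariant to label renaming, so `normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1 up to floating-point rounding, while two statistically independent labelings score 0.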

References

[1] M. Steinbach, L. Ertöz, and V. Kumar, “The Challenges of Clustering High Dimensional Data,” 2004, pp. 273–309. doi: 10.1007/978-3-662-08968-2_16.
[2] S. Moens, B. Čule, and B. Goethals, “RASCL: a randomised approach to subspace clusters,” International Journal of Data Science and Analytics, vol. 14, no. 3, pp. 243–259, May 2022, doi: 10.1007/s41060-022-00327-y.
[3] M. Moreno, R. Vilaça, and P. G. Ferreira, “Scalable transcriptomics analysis with Dask: applications in data science and machine learning,” BMC Bioinformatics, vol. 23, no. 1, pp. 514–514, Nov. 2022, doi: 10.1186/s12859-022-05065-3.
[4] M. Paganelli, P. Sottovia, K. Park, M. Interlandi, and F. Guerra, “Pushing ML Predictions Into DBMSs,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 10, pp. 10295–10308, Apr. 2023, doi: 10.1109/tkde.2023.3269592.
[5] C. A. A. Pastrana and C. F. O. Andrade, “Aislamiento social obligatorio: un análisis de sentimientos mediante machine learning,” Suma de Negocios, vol. 12, no. 26, pp. 1–1, Jan. 2021, doi: 10.14349/sumneg/2021.v12.n26.a1.
[6] Y. Zhang et al., “Learning Self-Growth Maps for Fast and Accurate Imbalanced Streaming Data Clustering,” IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 9, pp. 16049–16061, May 2025, doi: 10.1109/tnnls.2025.3563769.
[7] L. Zhou et al., “CACTUSDB: Unlock Co-Optimization Opportunities for SQL and AI/ML Inferences,” Feb. 26, 2026, Cornell University. doi: 10.48550/arxiv.2602.23469.
[8] R. Mondal, E. Ignatova, D. Walke, D. Broneske, G. Saake, and R. Heyer, “Clustering graph data: the roadmap to spectral techniques,” Discover Artificial Intelligence, vol. 4, no. 1, Jan. 2024, doi: 10.1007/s44163-024-00102-x.
[9] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic Subspace Clustering of High Dimensional Data,” Data Mining and Knowledge Discovery, vol. 11, no. 1, pp. 5–33, Jul. 2005, doi: 10.1007/s10618-005-1396-1.
[10] Y. Zhu, K. M. Ting, and M. Carman, “Grouping points by shared subspaces for effective subspace clustering,” Pattern Recognition, vol. 83, pp. 230–244, May 2018, doi: 10.1016/j.patcog.2018.05.027.
[11] Computational Methods of Feature Selection. 2007. doi: 10.1201/9781584888796.
[12] H.-P. Kriegel, P. Kröger, M. Renz, and S. Wurst, “A Generic Framework for Efficient Subspace Clustering of High-Dimensional Data,” pp. 250–257, Jan. 2006, doi: 10.1109/icdm.2005.5.
[13] K. Kailing, H. Kriegel, and P. Kröger, “Density-Connected Subspace Clustering for High-Dimensional Data,” pp. 246–256, Apr. 2004, doi: 10.1137/1.9781611972740.23.
[14] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. M. Procopiuc, and J. S. Park, “Fast algorithms for projected clustering,” ACM SIGMOD Record, vol. 28, no. 2, pp. 61–72, Jun. 1999, doi: 10.1145/304181.304188.
[15] X. Peng, L. Cao, C. Zhang, and Z. H. Zhou, “STATPC: A statistical approach to subspace clustering with FDR-controlled false discovery,” IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 5, pp. 2218–2231, 2022, doi: 10.1109/TKDE.2020.2992028.
[16] X. Feng, A. Kumar, B. Recht, and C. Ré, “Towards a unified architecture for in-RDBMS analytics,” pp. 325–336, May 2012, doi: 10.1145/2213836.2213874.
[17] M. Karan, A. Sablayrolles, and J. Ragan-Kelley, “ONNX-DB: Executing neural network inference inside relational databases,” in Proceedings of the VLDB Endowment, 2023, pp. 1923–1936. doi: 10.14778/3594512.3594523.
[18] S. Jahirabadkar and P. Kulkarni, “SCAF – An Effective Approach to Classify Subspace Clustering Algorithms,” Mar. 31, 2013. doi: 10.5121/ijdkp.2013.3205.
[19] S. Zinchenko and D. Ponomaryov, “The Selection Problem in Multi-Query Optimization: a Comprehensive Survey,” Dec. 16, 2024, Cornell University. doi: 10.48550/arxiv.2412.11828.
[20] T. Siddiqui and W. Wu, “ML-Powered Index Tuning: An Overview of Recent Progress and Open Challenges,” ACM SIGMOD Record, vol. 52, no. 4, pp. 19–30, Jan. 2023, doi: 10.1145/3641832.3641836.
[21] T. Siddiqui and W. Wu, “ML-Powered Index Tuning: An Overview of Recent Progress and Open Challenges,” Aug. 25, 2023, Cornell University. doi: 10.48550/arxiv.2308.13641.
[22] X. Zhang et al., “A Unified and Efficient Coordinating Framework for Autonomous DBMS Tuning,” Proceedings of the ACM on Management of Data, vol. 1, no. 2, pp. 1–26, Jun. 2023, doi: 10.1145/3589331.
[23] L. Zhang and M. A. Babar, “Automatic Configuration Tuning on Cloud Database: A Survey,” Apr. 09, 2024, Cornell University. doi: 10.48550/arxiv.2404.06043.
[24] U. J. Nzenwata et al., “Autonomous Database Systems – A Systematic Review of Self-Healing and Self-Tuning Database Systems,” Asian Journal of Research in Computer Science, vol. 18, no. 7, pp. 77–87, Jul. 2025, doi: 10.9734/ajrcos/2025/v18i7721.
[25] F. Antonazzo, “Unsupervised learning of huge data sets with limited computed resources,” Sep. 30, 2022. Accessed: Apr. 2026. [Online]. Available: http://www.theses.fr/2022ULILB015/document
[26] K. Rahul, R. K. Banyal, and N. K. Arora, “A systematic review on big data applications and scope for industrial processing and healthcare sectors,” Journal of Big Data, vol. 10, no. 1, Aug. 2023, doi: 10.1186/s40537-023-00808-2.
[27] S. Thudumu, P. Branch, J. Jin, and J. J. Singh, “A comprehensive survey of anomaly detection techniques for high dimensional big data,” Journal of Big Data, vol. 7, no. 1, Jul. 2020, doi: 10.1186/s40537-020-00320-x.

How to Cite This Paper

Rahmat Widia Sembiring, Abidin Luthfi Sembiring, Alfan Ramadhan Sembiring (2026). Database Subspace Clustering for High-Dimensional Data Analytics Using Query-Aware Optimization. International Journal of Computer Techniques, 13(3). ISSN: 2394-2231.

© 2026 International Journal of Computer Techniques (IJCT). All rights reserved.
