
Distributed Biomedical Text Summarization and Knowledge Discovery Using Generative Language Models and Apache Spark | IJCT Volume 13 – Issue 3 | IJCT-V13I3P19

International Journal of Computer Techniques
ISSN 2394-2231
Volume 13, Issue 2 | Published: March – April 2026
Author
Harshith Sai, Pranav Aditya, Bhanu Prakash, Tanvir H Sardar, Vema Sai Krishna
Abstract
Biomedical literature is growing exponentially, posing a severe challenge for healthcare researchers, clinicians, and data scientists who must extract meaningful insights from large volumes of unstructured text. This paper presents a scalable, end-to-end Artificial Intelligence (AI) pipeline for distributed biomedical text summarization and automated knowledge discovery, using Apache Spark as the distributed data ingestion and processing engine, domain-adapted transformer-based Large Language Models (LLMs) for abstractive summarization, and Latent Dirichlet Allocation (LDA) for unsupervised topic modeling. The system was evaluated on a corpus of more than 5,000 biomedical abstracts from PubMed/MEDLINE. Using Apache Spark Pandas UDFs with the Falconsai/medical_summarization model, the full pipeline completed in 9–12 minutes, a 6x speedup over sequential execution. ROUGE evaluation showed strong semantic retention, with scores of ROUGE-1: 0.58, ROUGE-2: 0.41, and ROUGE-L: 0.52. LDA topic modeling identified five coherent and clinically significant medical research themes, demonstrating that the system can extract knowledge automatically at scale. The architecture also includes SQLite and FAISS vector indexing as a persistence layer, a Retrieval-Augmented Generation (RAG) serving server built on a Flask REST API, and a React-based chat interface. The findings confirm that the proposed system is a sound, scalable, and generalizable approach to accelerating biomedical research, clinical decision support, and healthcare analytics.
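The ROUGE scores above measure n-gram overlap between generated and reference summaries. As a minimal sketch of what such a metric computes, the following is a simplified ROUGE-1 F1 implementation; it is illustrative only and not the evaluation tooling used in the paper, which would typically rely on a standard ROUGE package.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between summaries."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # each unigram counted at most min(ref, cand) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat on the mat", "the cat is on the mat"))
```

A production evaluation would also apply stemming and compute ROUGE-2 (bigrams) and ROUGE-L (longest common subsequence) in the same spirit.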
Keywords
Biomedical Text Mining, Apache Spark, Large Language Models, Abstractive Summarization, LDA Topic Modeling, Knowledge Discovery, Healthcare NLP, Distributed Computing, ROUGE Evaluation, RAG, FAISS.
Conclusion
This article has proposed a scalable five-step AI pipeline for distributed biomedical text summarization and knowledge discovery. It successfully integrates Apache Spark distributed computing, domain-adapted transformer-based Large Language Models, LDA topic modeling, vector-indexed persistence, and a Retrieval-Augmented Generation serving interface into an end-to-end solution spanning raw data ingestion to interactive knowledge querying.
The experiments show that the proposed system performs strongly on every dimension evaluated. The distributed summarization pipeline delivered a 6x improvement over sequential baselines without sacrificing semantic quality (ROUGE-1: 0.58, ROUGE-2: 0.41, ROUGE-L: 0.52), demonstrating that scalability and quality are not opposing goals in the proposed architecture. The LDA topic modeling component uncovered five medically relevant and interpretable biomedical research themes in the summarized corpus, providing automated knowledge discovery that extends to thousands of documents. The complete pipeline runs in 9–12 minutes on off-the-shelf CPU hardware, showing that the system is practical for deployment in healthcare IT environments without dedicated GPU clusters.
Several important extensions are possible within the system's modular architecture. The current LDA-based topic modeling could be replaced with more recent neural topic models, such as BERTopic or CTM, to improve topic coherence and incorporate contextual representations. Given suitable hardware, the summarization step could adopt more powerful biomedical LLMs such as BioGPT or Med-PaLM. The RAG serving layer could be enhanced with multi-hop reasoning and cross-document reference resolution. The system could also be extended to process full articles rather than abstracts alone, greatly increasing the scope and depth of knowledge discovery.
This work has significant clinical implications. Medical systems urgently need tools that help the clinical and research communities cope with the rapidly growing biomedical literature. The proposed system can accelerate evidence-based medicine, automate systematic reviews, surface relevant clinical trials more quickly, and reduce the cognitive load on healthcare personnel by automating the extraction, synthesis, and organization of research findings.
In summary, this paper has demonstrated that combining distributed Big Data computing, domain-adapted generative AI, and unsupervised knowledge discovery algorithms yields a powerful and viable model for addressing the biomedical literature overload problem. The proposed system is a major step toward automated, scalable, and intelligent knowledge management in healthcare.
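The retrieval step of the RAG serving layer reduces to nearest-neighbor search over summary embeddings. The sketch below uses plain NumPy cosine similarity with random vectors as a stand-in for the system's FAISS index and transformer embeddings; the names `index`, `summaries`, and `retrieve` are illustrative placeholders, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in summary embeddings (n_docs x dim), unit-normalized so that
# a dot product equals cosine similarity, mirroring a FAISS inner-product index
summaries = ["summary A", "summary B", "summary C"]
index = rng.normal(size=(3, 8))
index /= np.linalg.norm(index, axis=1, keepdims=True)

def retrieve(query_vec: np.ndarray, k: int = 2) -> list[str]:
    """Return the k summaries whose embeddings are most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q                     # cosine similarity against every document
    top = np.argsort(scores)[::-1][:k]
    return [summaries[i] for i in top]

# Querying with document B's own vector ranks B first
print(retrieve(index[1]))
```

In the full system, the retrieved summaries would be passed as context to the generative model behind the Flask REST API, which is the generation half of the RAG pattern.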
How to Cite This Paper
Harshith Sai, Pranav Aditya, Bhanu Prakash, Tanvir H Sardar, Vema Sai Krishna (2026). Distributed Biomedical Text Summarization and Knowledge Discovery Using Generative Language Models and Apache Spark. International Journal of Computer Techniques, 13(2). ISSN: 2394-2231.







