International Journal of Computer Techniques – Volume 12 Issue 2, 2025
Kushal Swamy, Omkar Salgare, Omdatta Sakhare, Shoaib Tamboli
Guide: Dr. Soumitra Das
Department of Computer Engineering, Indira College of Engineering and Management, Pune
Abstract
Legal documents are dense, complex, and jargon-heavy, and summarizing them by hand is
time-consuming, labor-intensive, prone to human bias, and dependent on the intended
application. This study proposes using T5 (Text-to-Text Transfer Transformer) to summarize
legal documents automatically, with 100% text extraction, semantic preservation, and high
readability.

The primary dataset is the Indian Legal Documents Corpus (ILDC), which pairs case judgments
with expert-written summaries and can therefore serve as a benchmark for training and
evaluating legal document summarization models. The proposed system fine-tunes T5 on this
dataset so that it produces brief, legally accurate, and understandable summaries. The
pipeline combines comprehensive preprocessing, document chunking, and optimized generation
techniques, which addresses the problem of processing very long legal texts.

To ensure the model’s robustness and generalizability, it is further validated on legal
texts beyond ILDC. Testing on other legal datasets, such as international case law, Supreme
Court cases, and regulatory documents from various jurisdictions, evaluates its
adaptability across legal traditions. Cross-domain testing on legal contracts, corporate
policies, and government regulations assesses its performance beyond case law.
Multilingual legal documents are also considered as a way to extend the model’s
applicability to diverse legal settings.

The evaluation uses several metrics: ROUGE for textual overlap, BLEU for n-gram precision
against the reference (used here as a proxy for fluency and coherence), and a semantic
similarity score computed with Sentence-BERT to check meaning retention. The study reports
that T5 fine-tuned on ILDC yields an effective and efficient summarization system for legal
texts.
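Two of the steps named above, document chunking and overlap-based scoring, can be illustrated with a minimal Python sketch. The abstract does not specify the chunking strategy or the ROUGE implementation used, so everything here is an assumption made for illustration: the function names `chunk_words` and `rouge1_f1`, the window sizes, whitespace tokenization, and the choice of ROUGE-1 F1 as the overlap score are all hypothetical, and a real evaluation would normally rely on an established metric package.

```python
from collections import Counter


def chunk_words(text, max_words=400, overlap=50):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the document
    return chunks


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate summary and a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matches = sum((cand & ref).values())  # clipped unigram matches
    if matches == 0:
        return 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

The overlap between consecutive windows is one simple way to avoid cutting a sentence's context at a chunk boundary. Each chunk would then be summarized separately and the partial summaries combined, and the combined summary scored against the reference summary.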
Keywords
T5, ILDC (Indian Legal Documents Corpus), NLP, Semantic Preservation
Conclusion & Future Work
This work shows that T5, when fine-tuned on ILDC, is highly effective at legal document
summarization. The model provides 100% text extraction, preserves meaning, and produces
readable summaries, which makes it suitable for legal practitioners and researchers.
Future Work:
Multilingual Support: The system currently processes only English-language legal
documents, which restricts its use in jurisdictions that operate in more than one language.
Training the T5 model on a multilingual legal corpus would let it process legal documents
in regional languages such as Hindi, Tamil, and Bengali, making it more useful for the
Indian legal system.
Legal Citation Generation: Automatically citing relevant case law in the generated
summaries.
Interactive Summarization: Giving users control over summarization parameters is expected
to improve usability. Users could compress or lengthen particular sections of a summary,
highlight particular sections, or supply a custom dictionary of legal terms to guide
summary generation.
Acknowledgment
We acknowledge the open-source efforts behind the Indian Legal Documents Corpus (ILDC) and
the NLP community for facilitating advances in legal AI research.
How to Cite
Kushal Swamy, Omkar Salgare, Omdatta Sakhare, Shoaib Tamboli, “ValidEase: NLP for Simplification and Summarization of Legal Documents,” International Journal of Computer Techniques, Volume 12, Issue 2, 2025. ISSN 2394-2231