Adaptive Testing of Large Language Model–Enhanced Software Systems: A Comprehensive Framework for Requirements, Test-Case Generation, Prioritization, and Evaluation

Authors

  • Rahul M. Bennett, Department of Computer Science, University of Manchester, United Kingdom

Keywords:

software testing, large language models, test-case generation

Abstract


Background: The rise of large language models (LLMs) and their integration into software development toolchains has introduced new dimensions to software testing, from automated test-case generation to system-level validation, while simultaneously complicating requirements testing and configurable-system evaluation (Wang, 2024; dos Santos, 2020). Existing literature on software product lines, configurable systems, and regression test prioritization offers foundational methods that must be reinterpreted when LLMs participate as both test artifact producers and application components (Agh, 2024; Souto, 2017; Elbaum, 2002).

Objective: This article proposes a unified theoretical and methodological framework for testing LLM-enhanced software systems that spans requirements elicitation and testing, automated unit and integration test generation using LLMs, focal-method mapping, test-case selection and prioritization, and empirical evaluation strategies. The framework emphasizes balancing soundness and efficiency in configurable environments while employing machine learning–based prioritization and leveraging recent advances in LLM tool use and web-agent architectures (Pan, 2022; Tufano, 2022; Schick, 2023).

Methods: We synthesize evidence from systematic literature reviews, empirical studies, and recent preprints to build a layered methodology: (1) requirements-level formalization and traceable test intent extraction; (2) LLM-driven test-case generation templates and focal-method mapping; (3) hybrid selection and prioritization using feature-aware ML models and historical regression data; (4) orchestration for end-to-end testing and evaluation in realistic web environments; and (5) continuous monitoring and adaptive re-prioritization. Each component is described with prescriptive guidelines and evaluative metrics. Literature-based rationale and thought experiments ground the proposed choices (dos Santos, 2020; Wang, 2024; Chandra, 2025).
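
As a concrete, non-normative illustration of component (3), the sketch below ranks candidate tests by blending historical failure signal, change awareness, and requirement traceability, then fills a fixed time budget greedily. The TestCase fields, weights, and scoring heuristic are assumptions made for this example; they stand in for the feature-aware ML model and historical regression data described above.

```python
# Illustrative sketch of component (3): hybrid selection and prioritization.
# TestCase fields, weights, and the scoring heuristic are assumptions for
# this example, standing in for a trained, feature-aware ML model.
from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    covered_requirements: set[str]  # requirement IDs this test traces to
    recent_failure_rate: float      # fraction of recent CI runs in which it failed
    changed_methods_hit: int        # focal methods touched by the current change set
    runtime_s: float                # execution cost in seconds

def priority_score(tc: TestCase, w_fail=0.5, w_change=0.3, w_req=0.2) -> float:
    """Blend historical fault signal, change awareness, and requirement coverage,
    normalized by runtime so cheap, fault-revealing tests are scheduled first."""
    signal = (w_fail * tc.recent_failure_rate
              + w_change * min(tc.changed_methods_hit, 5) / 5
              + w_req * min(len(tc.covered_requirements), 10) / 10)
    return signal / max(tc.runtime_s, 0.1)

def prioritize(tests: list[TestCase], budget_s: float) -> list[TestCase]:
    """Greedy schedule: highest score first, skipping tests that overflow the budget."""
    ordered = sorted(tests, key=priority_score, reverse=True)
    selected, spent = [], 0.0
    for tc in ordered:
        if spent + tc.runtime_s <= budget_s:
            selected.append(tc)
            spent += tc.runtime_s
    return selected
```

Any learned model, for example a classifier trained on past CI outcomes, can replace priority_score without changing the greedy scheduling step; this is what makes the selection stage "hybrid" rather than purely rule-based.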

Results: The conceptual framework yields measurable improvements along three axes in thought experiments and empirical analogues discussed herein: coverage of requirement-derived behaviors, fault-revealing power of generated test suites, and regression test efficiency under budget constraints. Drawing on methods from test-case prioritization research and focal-method mapping, we show how LLM-generated tests can be filtered and ranked to achieve higher early-fault detection compared to naïve generation, while addressing configurable-system explosion through sampling and soundness-efficiency trade-offs (Elbaum, 2000; He, 2024; Souto, 2017).
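
Early-fault detection, the axis on which prioritized suites are compared against naïve generation above, is conventionally measured with APFD (Elbaum, 2000; Elbaum, 2002). The sketch below computes APFD for one test ordering; the fault-matrix representation and the penalty for undetected faults are illustrative assumptions rather than prescriptions from the framework.

```python
def apfd(ordering: list[str], fault_matrix: dict[str, set[str]]) -> float:
    """Average Percentage of Faults Detected (APFD), the classical early-fault-
    detection measure from the regression-prioritization literature:
        APFD = 1 - (sum of first-detection positions) / (n * m) + 1 / (2n)
    ordering:     test-case names in execution order (n tests)
    fault_matrix: fault id -> set of test names that reveal it (m faults);
                  this matrix representation is an illustrative assumption."""
    n, m = len(ordering), len(fault_matrix)
    position = {name: i + 1 for i, name in enumerate(ordering)}  # 1-based ranks
    first_detections = []
    for revealing_tests in fault_matrix.values():
        ranks = [position[t] for t in revealing_tests if t in position]
        first_detections.append(min(ranks) if ranks else n + 1)  # n + 1 penalizes missed faults
    return 1 - sum(first_detections) / (n * m) + 1 / (2 * n)

# Toy example: t2 catches fault f1; t1 and t3 both catch f2.
print(round(apfd(["t1", "t2", "t3"], {"f1": {"t2"}, "f2": {"t1", "t3"}}), 4))  # 0.6667
```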

Conclusions: LLMs offer unprecedented capabilities for automating parts of the testing lifecycle, but their effective integration requires principled pipelines that combine requirements engineering, focal-method guidance, adaptive prioritization, and environment realism. The proposed framework provides a roadmap for researchers and practitioners to construct, evaluate, and iteratively refine robust testing systems for contemporary software that integrates LLMs either as tooling or as functional components. Future work should empirically validate the framework across diverse domains, quantify human–LLM collaboration dynamics in testing, and extend the approach to continual learning settings.



References

Wang, J.; Huang, Y.; Chen, C.; Liu, Z.; Wang, S.; Wang, Q. Software Testing With Large Language Models: Survey, Landscape, and Vision. IEEE Trans. Softw. Eng. 2024, 50, 911–936.

dos Santos, J.; Martins, L.E.G.; de Santiago Júnior, V.A.; Povoa, L.V.; dos Santos, L.B.R. Software requirements testing approaches: A systematic literature review. Requir. Eng. 2020, 25, 317–337.

Agh, H.; Azamnouri, A.; Wagner, S. Software product line testing: A systematic literature review. Empir. Softw. Eng. 2024, 29, 146.

Souto, S.; D’Amorim, M.; Gheyi, R. Balancing Soundness and Efficiency for Practical Testing of Configurable Systems. In Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), Buenos Aires, Argentina, 20–28 May 2017; pp. 632–642.

Tufano, M.; Deng, S.K.; Sundaresan, N.; Svyatkovskiy, A. Methods2Test: A dataset of focal methods mapped to test cases. In Proceedings of the 19th International Conference on Mining Software Repositories, MSR’22, Pittsburgh, PA, USA, 23–24 May 2022; pp. 299–303.

Pan, R.; Bagherzadeh, M.; Ghaleb, T.A.; Briand, L. Test case selection and prioritization using machine learning: A systematic literature review. Empir. Softw. Eng. 2022, 27, 29.

He, Y.; Huang, J.; Yu, H.; Xie, T. An Empirical Study on Focal Methods in Deep-Learning-Based Approaches for Assertion Generation. Proc. ACM Softw. Eng. 2024, 1, 1750–1771.

Elbaum, S.; Malishevsky, A.G.; Rothermel, G. Prioritizing test cases for regression testing. SIGSOFT Softw. Eng. Notes 2000, 25, 102–112.

Elbaum, S.; Malishevsky, A.G.; Rothermel, G. Test Case Prioritization: A Family of Empirical Studies. IEEE Trans. Softw. Eng. 2002, 28, 159–182.

Lops, A.; Narducci, F.; Ragone, A.; Trizio, M.; Bartolini, C. A System for Automated Unit Test Generation Using Large Language Models and Assessment of Generated Test Suites. arXiv 2024, arXiv:2408.07846.

Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. arXiv preprint, 2023.

Nakano, R.; Hilton, J.; Balaji, S.; Wu, J.; Ouyang, L.; Kim, C.; Hesse, C.; Jain, S.; Kosaraju, V.; Saunders, W. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint, 2021.

Chandra, R.; Lulla, K.; Sirigiri, K. Automation frameworks for end-to-end testing of large language models (LLMs). J. Inf. Syst. Eng. Manag. 2025, 10, e464–e472.

Gur, I.; Furuta, H.; Huang, A.; Safdari, M.; Matsuo, Y.; Eck, D.; Faust, A. A real-world WebAgent with planning, long context understanding, and program synthesis. arXiv preprint, 2023.

Zhou, S.; Xu, F.F.; Zhu, H.; Zhou, X.; Lo, R.; Sridhar, A.; Cheng, X.; Bisk, Y.; Fried, D.; Alon, U. WebArena: A realistic web environment for building autonomous agents. arXiv preprint, 2023.

Lu, P.; Peng, B.; Cheng, H.; Galley, M.; Chang, K.-W.; Wu, Y. N.; Zhu, S.-C.; Gao, J. Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint, 2023.

Deng, X.; Gu, Y.; Zheng, B.; Chen, S.; Stevens, S.; Wang, B.; Sun, H.; Su, Y. Mind2Web: Towards a generalist agent for the web. arXiv preprint, 2023.

He, H.; Yao, W.; Ma, K.; Yu, W.; Dai, Y.; Zhang, H.; Lan, Z.; Yu, D. WebVoyager: Building an end-to-end web agent with large multimodal models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, August 2024; pp. 6864–6890. Available online: https://aclanthology.org/2024.acl-long.371


Published

2025-11-30

How to Cite

Rahul M. Bennett. (2025). Adaptive Testing of Large Language Model–Enhanced Software Systems: A Comprehensive Framework for Requirements, Test-Case Generation, Prioritization, and Evaluation. Academic Research Library for International Journal of Computer Science & Information System, 10(11), 85–92. Retrieved from https://colomboscipub.com/index.php/arlijcsis/article/view/55