Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) have shown strong potential for building domain-specific AI assistants. However, evaluating the quality of such systems remains challenging, particularly when comparing different model sizes and architectures under identical retrieval conditions. This project, KTH-GPT, investigates the performance of an agentic RAG-based AI assistant designed to answer questions about university-specific and institutional information.
The focus of the project is to design an agentic RAG system and to run a systematic evaluation pipeline for RAG-based question answering using the RAGAS framework. The evaluation measures multiple aspects of system quality, including answer correctness, faithfulness to retrieved context, context relevance, and retrieval effectiveness, by comparing model-generated answers and retrieved documents against curated ground-truth (golden) answers.
A central research objective is to study how the choice of language model impacts overall system performance. The project evaluates multiple LLMs of varying sizes and capability tiers, including smaller instruction-tuned models and larger, more capable models that require significant GPU memory. By running the same evaluation pipeline across different models, the project aims to identify whether stronger models yield substantially better RAG performance, or whether acceptable results can be achieved using more resource-efficient models.
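The comparison described above can be illustrated with a minimal sketch in which the retrieved contexts are held fixed and only the generating model varies. This is not the project's pipeline and does not call RAGAS: the names `Example`, `token_f1`, and `compare_models` are hypothetical, and the token-overlap F1 is a simple placeholder standing in for the actual evaluation metrics.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple

# A generator takes a question and its retrieved contexts and returns an answer.
Generator = Callable[[str, Tuple[str, ...]], str]


@dataclass(frozen=True)
class Example:
    question: str
    contexts: Tuple[str, ...]  # retrieved once, reused unchanged for every model
    golden_answer: str


def token_f1(prediction: str, reference: str) -> float:
    """Placeholder metric (assumption): bag-of-words F1 against the golden answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def compare_models(models: Dict[str, Generator],
                   dataset: List[Example]) -> Dict[str, float]:
    """Score every model on the same questions and the same retrieved contexts."""
    return {
        name: mean([token_f1(generate(ex.question, ex.contexts), ex.golden_answer)
                    for ex in dataset])
        for name, generate in models.items()
    }
```

Because every model receives identical `contexts`, any difference in the resulting scores can be attributed to the generator rather than to retrieval.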
Due to the high VRAM requirements of modern LLMs, especially when running local inference at scale, access to high-performance computing resources is required to carry out these experiments in a reasonable timeframe. The results of the project are intended to inform system design decisions for KTH-GPT, particularly regarding model selection and resource requirements for deployment.
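As a rough illustration of why VRAM becomes the limiting factor, the memory needed just to hold a model's weights is approximately the parameter count multiplied by the bytes per parameter. The sketch below is a lower bound only: it ignores the KV cache, activations, and framework overhead, and the 7B and 70B sizes are illustrative assumptions, not the models evaluated in the project.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Illustrative sizes (assumptions), with 16-bit weights at 2 bytes per parameter.
small_fp16 = weight_memory_gb(7e9, 2)     # about 14 GB
large_fp16 = weight_memory_gb(70e9, 2)    # about 140 GB
# 4-bit quantization stores roughly 0.5 bytes per parameter.
large_int4 = weight_memory_gb(70e9, 0.5)  # about 35 GB
```

Under these assumptions, a model of the larger size does not fit on a single consumer GPU even when quantized, which is consistent with the need for HPC resources stated above.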