Recent developments in natural language processing and large language models have produced unprecedented tools that are now widely available and valuable to society. This thesis project aims to develop a versatile GraphRAG-based framework for semantic search and discovery in academic contexts. Two students will start the project jointly at KTH and later specialize in different applications while sharing core infrastructure. The applications are a research literature assistant and a course discovery system. The former will help researchers study and enter new scientific disciplines faster by supporting literature surveys and exposing connections between research questions and fields. The latter will help teachers optimize course design, implementation, and merging, thereby reducing costs (in line with recent guidelines from higher education governance). The project structure, applications, technologies, data sources, and timeline are detailed below. There will be three supervisors: Fredrik Heintz (Linköping University), Shiva Sander Tavallaey (ABB/KTH), and myself (KTH).
Project Structure
Phase 1: Core Framework Development (Joint)
- Development of the base GraphRAG architecture
- Implementation of text processing and embedding pipeline
- Creation of graph database schema and relationships
- Development of query processing system
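To make the Phase 1 components concrete, the following is a minimal sketch of how an embedding step, a graph store, and a query step could fit together. It is an illustration under stated assumptions, not the planned architecture: the class `GraphIndex`, the functions `embed` and `cosine`, and the parameters `k` and `hops` are hypothetical names, and the bag-of-words embedding is only a stand-in for a learned embedding model.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stand-in for a learned model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class GraphIndex:
    """Hypothetical in-memory graph: nodes carry text and an embedding."""

    def __init__(self):
        self.nodes = {}  # node_id -> (text, embedding)
        self.edges = {}  # node_id -> set of neighbouring node ids

    def add_node(self, node_id, text):
        self.nodes[node_id] = (text, embed(text))
        self.edges.setdefault(node_id, set())

    def add_edge(self, src, dst):
        # Edges are stored as undirected for simplicity.
        self.edges.setdefault(src, set()).add(dst)
        self.edges.setdefault(dst, set()).add(src)

    def query(self, question, k=1, hops=1):
        """Return the k most similar nodes, then expand along graph edges."""
        q = embed(question)
        ranked = sorted(
            self.nodes,
            key=lambda n: cosine(q, self.nodes[n][1]),
            reverse=True,
        )
        context = ranked[:k]
        frontier = set(context)
        for _ in range(hops):
            # Collect neighbours that are not yet part of the context.
            frontier = {m for n in frontier for m in self.edges[n]} - set(context)
            context.extend(sorted(frontier))
        return context
```

The design choice illustrated here is the one that distinguishes GraphRAG from plain vector retrieval: similarity search selects seed nodes, and graph traversal then pulls in related nodes that share no vocabulary with the query. In a full system the returned context would be passed to a language model.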
Phase 2: Data Pipeline Development (Joint)
- Design of flexible data ingestion pipelines
- Implementation of text extraction and preprocessing
- Development of metadata extraction systems
- Creation of relationship extraction algorithms
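The Phase 2 steps can be pictured as a small chain of functions. The sketch below is only an illustration: the `Record` type, the `MENTIONS` edge label, and the rules used (a regular expression for the publication year, a title-substring match for relationships) are assumptions chosen for brevity, not the extraction methods the project will necessarily use.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical container for one ingested document."""
    doc_id: str
    title: str
    body: str
    metadata: dict = field(default_factory=dict)


YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def preprocess(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def extract_metadata(record):
    """Attach simple metadata: first year mentioned and token count."""
    match = YEAR.search(record.body)
    record.metadata["year"] = int(match.group(0)) if match else None
    record.metadata["n_tokens"] = len(record.body.split())
    return record


def extract_relationships(records):
    """Emit a MENTIONS edge when one body contains another record's title."""
    edges = []
    for src in records:
        for dst in records:
            if src.doc_id == dst.doc_id or not dst.title:
                continue
            if dst.title.lower() in src.body.lower():
                edges.append((src.doc_id, "MENTIONS", dst.doc_id))
    return edges


def ingest(raw_items):
    """Run the pipeline over (doc_id, title, body) tuples."""
    records = [Record(i, preprocess(t), preprocess(b)) for i, t, b in raw_items]
    records = [extract_metadata(r) for r in records]
    return records, extract_relationships(records)
```

Keeping each stage as a separate function is what makes the pipeline flexible: a different source (papers or course syllabi) only needs a different reader that produces the same tuples, and each extraction rule can later be replaced by a model-based one without touching the rest.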
Phase 3: Specialized Applications (Individual)
Student A: Research Literature Assistant
- Focus: Literature review and research paper discovery
- Features:
  - Paper relationship mapping
  - Citation network analysis
  - Research gap identification
  - Semantic search across papers
  - Topic clustering and trend analysis
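As an illustration of what citation network analysis and paper relationship mapping could compute, the sketch below derives two standard bibliometric quantities from a list of (citing, cited) pairs: the citation count of each paper and the bibliographic coupling between pairs of papers (the number of references they share). The function names and the input format are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def citation_counts(citations):
    """Count how often each paper is cited (in-degree in the citation graph)."""
    counts = defaultdict(int)
    for _citing, cited in citations:
        counts[cited] += 1
    return dict(counts)


def bibliographic_coupling(citations):
    """Map each pair of citing papers to the number of references they share."""
    refs = defaultdict(set)
    for citing, cited in citations:
        refs[citing].add(cited)
    coupling = {}
    for a, b in combinations(sorted(refs), 2):
        shared = len(refs[a] & refs[b])
        if shared:
            coupling[(a, b)] = shared
    return coupling
```

Pairs with high coupling are candidates for "related paper" edges in the graph, while citation counts give a first, crude signal of influence within a topic.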
Student B: Course Discovery System
- Focus: University-wide course catalog analysis
- Features:
  - Cross-department course discovery
  - Prerequisite path visualization
  - Topic coverage analysis
  - Course similarity detection
  - Learning path recommendations
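Two of these features have simple graph and set formulations, sketched below under stated assumptions. Prerequisite paths are modelled as a depth-first traversal that lists every course to be taken before a target course, and course similarity as the Jaccard overlap of topic sets. The function names and the course and topic labels in the usage example are hypothetical, and the traversal assumes the prerequisite graph has no cycles.

```python
def prerequisite_path(target, prereqs):
    """Return courses in an order where every prerequisite precedes its dependents.

    `prereqs` maps a course to the list of its direct prerequisites.
    Assumes the prerequisite graph is acyclic.
    """
    order, visited = [], set()

    def visit(course):
        if course in visited:
            return
        visited.add(course)
        for prerequisite in sorted(prereqs.get(course, [])):
            visit(prerequisite)
        order.append(course)  # appended only after all its prerequisites

    visit(target)
    return order


def jaccard(topics_a, topics_b):
    """Jaccard similarity between two collections of topic labels."""
    a, b = set(topics_a), set(topics_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Usage with hypothetical course names.
prereqs = {"ML": ["LinAlg", "Prog"], "LinAlg": ["Calc"], "Prog": []}
path = prerequisite_path("ML", prereqs)  # ["Calc", "LinAlg", "Prog", "ML"]
```

In the full system the topic sets would likely be replaced by embeddings of course descriptions, but a set-based measure of this kind is a useful baseline for detecting overlapping courses that are candidates for merging.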
Expected Outcomes
1. Shared GraphRAG framework
2. Two specialized applications
3. Performance evaluation and comparison
4. Documentation and deployment guidelines
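The proposal does not fix the evaluation metrics. As one possible starting point for comparing the two applications, the sketch below computes two common retrieval metrics, precision at k and reciprocal rank, for a ranked result list against a set of documents judged relevant. The function names and the document identifiers in the usage example are assumptions for illustration.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


# Usage with hypothetical document ids.
ranked = ["d3", "d1", "d7"]
relevant = {"d1", "d7"}
p2 = precision_at_k(ranked, relevant, 2)  # 0.5
rr = reciprocal_rank(ranked, relevant)    # 0.5
```

Averaging these values over a shared set of test queries would allow the graph-based retrieval to be compared against a plain embedding-search baseline on both datasets.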
Timeline
- Months 1-2: Joint core framework development
- Months 3-4: Joint data pipeline development
- Months 5-6: Individual application development