There are several popular large language model (LLM) frameworks, such as llama.cpp, Hugging Face Transformers, and llm.c. They support different features, such as sliding-window attention and speculative decoding. Profiling the compute and memory requirements and analyzing the bottlenecks of LLMs will benefit the efficiency of LLM inference.
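Before any GPU measurement, a rough roofline-style estimate already indicates whether single-batch decoding is compute-bound or memory-bound. The sketch below is a back-of-the-envelope calculation; the 7B FP16 model size and the NVIDIA T4 peak numbers (~65 TFLOPS FP16, ~320 GB/s) are assumptions taken from published specifications, not measurements from this project.

```python
# Back-of-the-envelope check: is single-batch LLM decoding compute- or
# memory-bound on a given GPU? All numbers are assumptions.

params = 7e9          # model parameters (assumed 7B model)
bytes_per_param = 2   # FP16 weights

# Per generated token, decoding reads every weight once and performs
# roughly 2 FLOPs per parameter (one multiply + one add).
bytes_per_token = params * bytes_per_param
flops_per_token = 2 * params

# Published peak numbers for an NVIDIA T4 (assumed):
peak_flops = 65e12        # FP16 tensor-core throughput, FLOP/s
peak_bandwidth = 320e9    # memory bandwidth, B/s

arithmetic_intensity = flops_per_token / bytes_per_token   # FLOP/byte
machine_balance = peak_flops / peak_bandwidth              # FLOP/byte

compute_time = flops_per_token / peak_flops
memory_time = bytes_per_token / peak_bandwidth

print(f"arithmetic intensity: {arithmetic_intensity:.1f} FLOP/B")
print(f"machine balance:      {machine_balance:.1f} FLOP/B")
print(f"upper bound: {1 / max(compute_time, memory_time):.1f} tokens/s")
if arithmetic_intensity < machine_balance:
    print("decode is memory-bandwidth-bound on this GPU")
else:
    print("decode is compute-bound on this GPU")
```

Under these assumptions the arithmetic intensity (~1 FLOP/B) is far below the T4's machine balance (~200 FLOP/B), which is why single-batch decoding is usually memory-bandwidth-bound.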
Planned Tasks
1) Run large language model frameworks such as llama.cpp, ollama, and llm.c.
2) Survey the features of each framework and compare the performance impact of enabling or disabling individual features within the same framework (see the benchmarking sketch after this list).
3) Compare performance across the different frameworks on the same workload.
4) Tasks include inference and fine-tuning, as well as training a large language model on a small dataset.
5) Analyze the bottlenecks of the different stages of an LLM (e.g., prefill and decode) and determine whether each stage is compute-bound or memory-bound (see the profiler sketch after this list).
6) Plan performance improvements for large language models based on the profiling results.
7) State-of-the-art papers propose various optimization techniques; I also plan to profile implementations from these papers.
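For tasks 2) and 3), the comparison can be scripted as a small harness that runs the same prompt through each configuration and records wall-clock throughput. A minimal sketch follows; the llama.cpp binary name, the feature flags, and the model path are assumptions and need to be adapted to the build actually installed on the cluster.

```python
# Minimal benchmarking harness for comparing framework features.
# Binary name, flags, and model path are assumptions; adapt them to
# the actual llama.cpp build and model available on the cluster.
import subprocess
import time

N_TOKENS = 256  # tokens to generate per run

configs = {
    # hypothetical feature toggles passed as extra CLI arguments
    "baseline":      [],
    "flash-attn":    ["--flash-attn"],
    "no-kv-offload": ["--no-kv-offload"],
}

for name, extra_args in configs.items():
    cmd = [
        "./llama-cli",                      # assumed llama.cpp binary
        "-m", "models/llama-7b-q4.gguf",    # assumed model path
        "-p", "Explain the roofline model.",
        "-n", str(N_TOKENS),
    ] + extra_args
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    elapsed = time.perf_counter() - start
    # Coarse end-to-end throughput; llama.cpp's own timing report
    # separates prompt processing from generation more precisely.
    print(f"{name:14s} {N_TOKENS / elapsed:6.1f} tokens/s (wall clock)")
```

Repeating each configuration several times and reporting the median would reduce noise from GPU clock ramp-up and filesystem caching.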
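For task 5), one way to attribute time to the prefill and decode stages separately is PyTorch's built-in profiler. The sketch below assumes a Hugging Face causal LM is available; the model name is a placeholder and the 32-token decode loop is arbitrary.

```python
# Sketch: run prefill and decode as separate labeled regions and
# profile each, assuming a Hugging Face causal LM ("gpt2" is a
# placeholder for the model under study).
import torch
from torch.profiler import profile, record_function, ProfilerActivity
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; swap in the model under study
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).cuda().eval()

inputs = tok("Profiling the bottleneck of", return_tensors="pt").to("cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        with record_function("prefill"):
            out = model(**inputs, use_cache=True)
            past = out.past_key_values
            next_tok = out.logits[:, -1:].argmax(-1)
        with record_function("decode"):
            for _ in range(32):  # generate 32 tokens one at a time
                out = model(input_ids=next_tok, past_key_values=past,
                            use_cache=True)
                past = out.past_key_values
                next_tok = out.logits[:, -1:].argmax(-1)

# The kernel table shows where time goes in each labeled region;
# long memory-copy/GEMV-dominated decode time hints at a memory-bound
# stage, while dense GEMM-dominated prefill time hints at compute.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```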
Resource Usage:
1) Mainly on GPU. An NVIDIA T4 suits the inference workload.
2) For training and fine-tuning, V100 and A100 GPUs might be used.
3) About 200 GPU-hours. This is only an estimate, since I have not used Alvis before; I will be able to provide an accurate number after this project.