TY - GEN
T1 - MonoCoder
T2 - 2024 IEEE High Performance Extreme Computing Conference, HPEC 2024
AU - Kadosh, Tal
AU - Hasabnis, Niranjan
AU - Vo, Vy A.
AU - Schneider, Nadav
AU - Krien, Neva
AU - Capotă, Mihai
AU - Wasay, Abdul
AU - Tamir, Guy
AU - Willke, Ted
AU - Ahmed, Nesreen
AU - Pinter, Yuval
AU - Mattson, Timothy
AU - Oren, Gal
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - With easier access to powerful compute resources, there is a growing trend in AI for software development to develop large language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size and demand expensive compute resources for training. This is partly because LLMs for HPC tasks are obtained by finetuning existing LLMs that support several natural and/or programming languages. We found this design choice confusing - why do we need LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question choices made by existing LLMs by developing smaller language models (LMs) for specific domains - we call them domain-specific LMs. Specifically, we start with HPC as a domain and build an HPC-specific LM, named MonoCoder, which is orders of magnitude smaller than existing LMs but delivers better performance on non-HPC and HPC codes. Specifically, we pretrained MonoCoder on an HPC-specific dataset (named HPCORPUS) of C and C++ programs mined from GitHub. We evaluated the performance of MonoCoder against state-of-the-art multi-lingual LLMs. Results demonstrate that MonoCoder, although much smaller than existing LMs, outperforms other LLMs on normalized-perplexity tests (in relation to model size) while also delivering competing CodeBLEU scores for high-performance and parallel code generations. In other words, results suggest that MonoCoder understands HPC code better than state-of-the-art LLMs. MonoCoder source code is available at our GitHub repository.
AB - With easier access to powerful compute resources, there is a growing trend in AI for software development to develop large language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size and demand expensive compute resources for training. This is partly because LLMs for HPC tasks are obtained by finetuning existing LLMs that support several natural and/or programming languages. We found this design choice confusing - why do we need LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question choices made by existing LLMs by developing smaller language models (LMs) for specific domains - we call them domain-specific LMs. Specifically, we start with HPC as a domain and build an HPC-specific LM, named MonoCoder, which is orders of magnitude smaller than existing LMs but delivers better performance on non-HPC and HPC codes. Specifically, we pretrained MonoCoder on an HPC-specific dataset (named HPCORPUS) of C and C++ programs mined from GitHub. We evaluated the performance of MonoCoder against state-of-the-art multi-lingual LLMs. Results demonstrate that MonoCoder, although much smaller than existing LMs, outperforms other LLMs on normalized-perplexity tests (in relation to model size) while also delivering competing CodeBLEU scores for high-performance and parallel code generations. In other words, results suggest that MonoCoder understands HPC code better than state-of-the-art LLMs. MonoCoder source code is available at our GitHub repository.
UR - http://www.scopus.com/inward/record.url?scp=105002727173&partnerID=8YFLogxK
U2 - 10.1109/HPEC62836.2024.10938441
DO - 10.1109/HPEC62836.2024.10938441
M3 - Conference contribution
T3 - 2024 IEEE High Performance Extreme Computing Conference, HPEC 2024
BT - 2024 IEEE High Performance Extreme Computing Conference, HPEC 2024
Y2 - 23 September 2024 through 27 September 2024
ER -