MonoCoder: Domain-Specific Code Language Model for HPC Codes and Tasks

Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capotǎ, Abdul Wasay, Guy Tamir, Ted Willke, Nesreen Ahmed, Yuval Pinter, Timothy Mattson, Gal Oren

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

With easier access to powerful compute resources, there is a growing trend in AI for software development to develop large language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size and demand expensive compute resources for training. This is partly because LLMs for HPC tasks are obtained by finetuning existing LLMs that support several natural and/or programming languages. We found this design choice confusing - why do we need LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question choices made by existing LLMs by developing smaller language models (LMs) for specific domains - we call them domain-specific LMs. Specifically, we start with HPC as a domain and build an HPC-specific LM, named MonoCoder, which is orders of magnitude smaller than existing LMs but delivers better performance on non-HPC and HPC codes. Specifically, we pretrained MonoCoder on an HPC-specific dataset (named HPCORPUS) of C and C++ programs mined from GitHub. We evaluated the performance of MonoCoder against state-of-the-art multi-lingual LLMs. Results demonstrate that MonoCoder, although much smaller than existing LMs, outperforms other LLMs on normalized-perplexity tests (in relation to model size) while also delivering competing CodeBLEU scores for high-performance and parallel code generations. In other words, results suggest that MonoCoder understands HPC code better than state-of-the-art LLMs. MonoCoder source code is available at our GitHub repository.

Original language: American English
Title of host publication: 2024 IEEE High Performance Extreme Computing Conference, HPEC 2024
ISBN (Electronic): 9798350387131
State: Published - 1 Jan 2024
Event: 2024 IEEE High Performance Extreme Computing Conference, HPEC 2024 - Virtual, Online
Duration: 23 Sep 2024 - 27 Sep 2024

Publication series

Name: 2024 IEEE High Performance Extreme Computing Conference, HPEC 2024

Conference

Conference: 2024 IEEE High Performance Extreme Computing Conference, HPEC 2024
City: Virtual, Online
Period: 23/09/24 - 27/09/24

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture
  • Computational Mathematics
  • Control and Optimization
  • Artificial Intelligence
  • Computational Theory and Mathematics
  • Computer Science Applications
