TY - GEN
T1 - Nona
T2 - 13th IEEE International Conference on Cloud Networking, CloudNet 2024
AU - Pit-Claudel, Benoit
AU - Malak, Derya
AU - Cohen, Alejandro
AU - Medard, Muriel
AU - Ghobadi, Manya
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - This paper proposes a novel queueing-theoretic approach to enable stochastic congestion-aware scheduling for distributed machine learning inference queries. Our proposed framework, called Nona, combines a stochastic scheduler with an offline optimization formulation rooted in queueing-theoretic principles to minimize the average completion time of heterogeneous inference queries. At its core, Nona incorporates the fundamental tradeoffs between compute and network resources to make efficient scheduling decisions. Nona's formulation uses the Pollaczek-Khinchine formula to estimate queueing latency and to predict system congestion. Building upon conventional Jackson networks, it captures the dependency between the computation and communication operations of interfering jobs. From this formulation, we derive an optimization problem and use its results as inputs for the scheduler. We introduce a novel graph contraction procedure to enable cloud providers to solve Nona's optimization formulation in practical settings. We evaluate Nona with real-world machine learning models (AlexNet, ResNet, DenseNet, VGG, and GPT2) and demonstrate that Nona outperforms state-of-the-art schedulers by up to 350×.
AB - This paper proposes a novel queueing-theoretic approach to enable stochastic congestion-aware scheduling for distributed machine learning inference queries. Our proposed framework, called Nona, combines a stochastic scheduler with an offline optimization formulation rooted in queueing-theoretic principles to minimize the average completion time of heterogeneous inference queries. At its core, Nona incorporates the fundamental tradeoffs between compute and network resources to make efficient scheduling decisions. Nona's formulation uses the Pollaczek-Khinchine formula to estimate queueing latency and to predict system congestion. Building upon conventional Jackson networks, it captures the dependency between the computation and communication operations of interfering jobs. From this formulation, we derive an optimization problem and use its results as inputs for the scheduler. We introduce a novel graph contraction procedure to enable cloud providers to solve Nona's optimization formulation in practical settings. We evaluate Nona with real-world machine learning models (AlexNet, ResNet, DenseNet, VGG, and GPT2) and demonstrate that Nona outperforms state-of-the-art schedulers by up to 350×.
UR - http://www.scopus.com/inward/record.url?scp=85217078792&partnerID=8YFLogxK
U2 - 10.1109/CloudNet62863.2024.10815926
DO - 10.1109/CloudNet62863.2024.10815926
M3 - Conference contribution
T3 - 2024 IEEE 13th International Conference on Cloud Networking, CloudNet 2024
BT - 2024 IEEE 13th International Conference on Cloud Networking, CloudNet 2024
A2 - Menezes Ferrazani Mattos, Diogo
A2 - Monteiro Moraes, Igor
A2 - Nguyen, Thi Mai Trang
A2 - de Souza Couto, Rodrigo
A2 - Rubinstein, Marcelo Goncalves
Y2 - 27 November 2024 through 29 November 2024
ER -