Schedule first, manage later: Network-aware load balancing

Amir Nahir, Ariel Orda, Danny Raz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Load balancing in large distributed server systems is a complex optimization problem of critical importance in cloud systems and data centers. Existing schedulers often incur a high overhead in communication when collecting the data required to make the scheduling decision, hence delaying the job request on its way to the executing server. We propose a novel scheme that incurs no communication overhead between the users and the servers upon job arrival, thus removing any scheduling overhead from the job's critical path. Our approach is based on creating several replicas of each job and sending each replica to a different server. Upon the arrival of a replica to the head of the queue at its server, the latter signals the servers holding replicas of that job, so as to remove them from their queues. We show, through analysis and simulations, that this scheme improves the expected queuing overhead over traditional schemes by a factor of 9 (or more) under various load conditions. In addition, we show that our scheme remains efficient even when the inter-server signal propagation delay is significant (relative to the job's execution time). We provide heuristic solutions to the performance degradation that occurs in such cases and show, by simulations, that they efficiently mitigate the detrimental effect of propagation delays. Finally, we demonstrate the efficiency of our proposed scheme in a real-world environment by implementing a load balancing system based on it, deploying the system on the Amazon Elastic Compute Cloud (EC2), and measuring its performance.
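The replicate-then-cancel idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `Cluster`, its methods, and the `targets` parameter are hypothetical. It assumes that removal signals arrive instantly, so it ignores the propagation delay that the paper studies, and that replica targets are picked uniformly at random without consulting server load.

```python
import random
from collections import deque


class Cluster:
    """Toy model: FIFO queues, d replicas per job, sibling removal on start."""

    def __init__(self, num_servers, replicas, seed=0):
        self.queues = [deque() for _ in range(num_servers)]
        self.replicas = replicas
        self.locations = {}  # job_id -> set of server indices holding a replica
        self.rng = random.Random(seed)

    def submit(self, job_id, targets=None):
        """Place replicas on distinct servers without querying their load."""
        if targets is None:
            # Assumption: uniform random choice of `replicas` distinct servers.
            targets = self.rng.sample(range(len(self.queues)), self.replicas)
        for server in targets:
            self.queues[server].append(job_id)
        self.locations[job_id] = set(targets)
        return list(targets)

    def start_next(self, server):
        """Server is free: run the replica at the head of its queue, if any.

        The other servers holding replicas of the same job drop them.
        The signal is modeled as instantaneous (an assumption of this sketch).
        """
        if not self.queues[server]:
            return None
        job_id = self.queues[server].popleft()
        for other in self.locations.pop(job_id) - {server}:
            self.queues[other].remove(job_id)
        return job_id


cluster = Cluster(num_servers=3, replicas=2)
cluster.submit("a", targets=[0, 1])
cluster.submit("b", targets=[1, 2])
cluster.submit("c", targets=[0, 2])
first = cluster.start_next(1)   # server 1 starts "a"; server 0 drops its copy
second = cluster.start_next(0)  # server 0 now has "c" at the head of its queue
third = cluster.start_next(2)   # server 2 starts "b"; server 1 drops its copy
```

In this model `submit` never contacts a server for load information, which corresponds to keeping scheduling communication off the job's critical path. All coordination happens later, in `start_next`.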

Original language: English
Title of host publication: 2013 Proceedings IEEE INFOCOM 2013
Pages: 510-514
Number of pages: 5
State: Published - 2013
Event: 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013 - Turin, Italy
Duration: 14 Apr 2013 → 19 Apr 2013

Publication series

Name: Proceedings - IEEE INFOCOM

Conference

Conference: 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013
Country/Territory: Italy
City: Turin
Period: 14/04/13 → 19/04/13

All Science Journal Classification (ASJC) codes

  • General Computer Science
  • Electrical and Electronic Engineering
