TY - GEN
T1 - Schedule first, manage later
T2 - 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013
AU - Nahir, Amir
AU - Orda, Ariel
AU - Raz, Danny
PY - 2013
Y1 - 2013
N2 - Load balancing in large distributed server systems is a complex optimization problem of critical importance in cloud systems and data centers. Existing schedulers often incur a high overhead in communication when collecting the data required to make the scheduling decision, hence delaying the job request on its way to the executing server. We propose a novel scheme that incurs no communication overhead between the users and the servers upon job arrival, thus removing any scheduling overhead from the job's critical path. Our approach is based on creating several replicas of each job and sending each replica to a different server. Upon the arrival of a replica to the head of the queue at its server, the latter signals the servers holding replicas of that job, so as to remove them from their queues. We show, through analysis and simulations, that this scheme improves the expected queuing overhead over traditional schemes by a factor of 9 (or more) under various load conditions. In addition, we show that our scheme remains efficient even when the inter-server signal propagation delay is significant (relative to the job's execution time). We provide heuristic solutions to the performance degradation that occurs in such cases and show, by simulations, that they efficiently mitigate the detrimental effect of propagation delays. Finally, we demonstrate the efficiency of our proposed scheme in a real-world environment by implementing a load balancing system based on it, deploying the system on the Amazon Elastic Compute Cloud (EC2), and measuring its performance.
UR - http://www.scopus.com/inward/record.url?scp=84883127006&partnerID=8YFLogxK
U2 - https://doi.org/10.1109/INFCOM.2013.6566825
DO - https://doi.org/10.1109/INFCOM.2013.6566825
M3 - Conference contribution
SN - 9781467359467
T3 - Proceedings - IEEE INFOCOM
SP - 510
EP - 514
BT - 2013 Proceedings IEEE INFOCOM 2013
Y2 - 14 April 2013 through 19 April 2013
ER -