
Non-clairvoyant online scheduling of synchronized jobs on virtual clusters

The Journal of Supercomputing

Abstract

Although virtualization technology has recently been applied to next-generation distributed high-performance computing systems, the theoretical aspects of scheduling jobs in these virtualized environments have not been sufficiently studied, especially in the online and non-clairvoyant cases. Virtualizing computing resources introduces interference and virtualization overheads that negatively impact load balancing objectives on commonly used clusters of multi-core physical machines. We present a technique for non-clairvoyant online scheduling of globally synchronized jobs, each of which spawns tasks to execute compute-intensive work. Our technique considers both load balancing of physical cores and minimization of the per-job synchronization cost. We show that, in the presence of arbitrary virtualization overheads, interference effects, and synchronization cost, the problem can be reduced to online unrelated parallel machine scheduling, which is solved using routing of virtual circuits. We present a new opportunity cost model to reduce the problem to the routing of virtual circuits, and we prove the effectiveness of our scheduling technique using mathematical analysis and simulation experiments.
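
To make the reduction mentioned in the abstract more concrete, the following is a minimal sketch of the classic exponential-cost greedy rule from the online load-balancing literature on unrelated machines (the virtual-circuit routing approach of Aspnes et al.). It is not the opportunity cost model of this article, whose details are not given in the abstract. The function names, the fixed `scale` parameter (the classic algorithm instead adjusts an estimate of the optimal load as jobs arrive), and the example numbers are all illustrative assumptions. Each arriving job supplies one load increase per machine, which is where machine-dependent effects such as virtualization overhead or interference could be folded in.

```python
from typing import List, Sequence, Tuple


def marginal_cost(load: float, increase: float, scale: float, base: float = 2.0) -> float:
    """Growth of the exponential cost base**(load/scale) if `increase` is added."""
    return base ** ((load + increase) / scale) - base ** (load / scale)


def assign_online(
    jobs: Sequence[Sequence[float]],
    n_machines: int,
    scale: float,
    base: float = 2.0,
) -> Tuple[List[int], List[float]]:
    """Place each job, in arrival order, on the machine with the smallest marginal cost.

    jobs[j][i] is the (hypothetical) load that job j adds to machine i.
    `scale` is a fixed normalisation constant, an assumption of this sketch.
    """
    loads = [0.0] * n_machines
    placement: List[int] = []
    for increases in jobs:
        # Decide immediately and irrevocably, as an online scheduler must.
        best = min(
            range(n_machines),
            key=lambda i: marginal_cost(loads[i], increases[i], scale, base),
        )
        loads[best] += increases[best]
        placement.append(best)
    return placement, loads


if __name__ == "__main__":
    # Four jobs on two unrelated machines (illustrative numbers only).
    example_jobs = [[1, 4], [1, 4], [3, 1], [2, 2]]
    placement, loads = assign_online(example_jobs, n_machines=2, scale=4.0)
    print(placement)  # [0, 0, 1, 1]
    print(loads)      # [2.0, 3.0]
```

The exponential cost penalizes adding load to an already busy machine much more heavily than adding the same load to an idle one, which is what lets a purely greedy, online decision keep the maximum load under control without knowing future arrivals.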





Author information


Corresponding author

Correspondence to Mohsen Sharifi.


About this article


Cite this article

Khorandi, S.M., Sharifi, M. Non-clairvoyant online scheduling of synchronized jobs on virtual clusters. J Supercomput 74, 2353–2384 (2018). https://doi.org/10.1007/s11227-018-2262-4


  • DOI: https://doi.org/10.1007/s11227-018-2262-4
