Since the end of exponential clock-speed scaling circa 2003, parallelism has become the primary means of obtaining speed-ups on general-purpose platforms, but fragile languages, programming abstractions, and compiler and runtime support have stifled its adoption. I will present the problems that make state-of-the-art parallel languages (and libraries) unsuitable for the general-purpose application programmer. Ideally, the programmer would express all available parallelism using high-level programming constructs and delegate its efficient execution to the compiler and runtime. Besides sidestepping the tedious and fragile process of fine-tuning the granularity of parallelism, exposing maximal parallelism gives the scheduler maximal flexibility during execution, when more information (e.g., core availability) is accessible.
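As a concrete illustration of this style (my own sketch, not an example from the talk), the divide-and-conquer reduction below exposes parallelism at every level of the recursion and contains no sequential cutoff or grain-size parameter; std::async is used only as a stand-in for a lightweight task runtime, such as a work-stealing scheduler, that decides how much of that parallelism to actually exploit.

    #include <cstddef>
    #include <future>

    // Sum an array by exposing a parallel task at every recursive split.
    long long parallel_sum(const long long* data, std::size_t n) {
        if (n == 0) return 0;
        if (n == 1) return data[0];                              // base case: no grain-size knob
        std::size_t half = n / 2;
        auto left = std::async(parallel_sum, data, half);        // fork the left half as a task
        long long right = parallel_sum(data + half, n - half);   // right half runs in this task
        return left.get() + right;                               // join
    }

A real runtime would execute most of these forks sequentially; the point is that this decision is deferred to the scheduler rather than baked into the source code.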
I will propose Lazy Scheduling as one piece of the solution; it can be applied to several popular schedulers, most notably work stealing. It optimizes the common case in which most parallel tasks are executed locally and never need to be made available to other processors. Lazy Scheduling achieves this by using a simple heuristic to infer whether other processors are hungry for work, and by making parallel work available to them only when the heuristic says so. Unlike previous attempts, Lazy Scheduling does not introduce potential deadlocks and does not make irrevocable serialization decisions that could hurt load balancing and performance. Moreover, I will show how Lazy Scheduling greatly outperforms current work stealing when the programmer exposes a lot of parallelism, and how little performance is lost compared to a carefully tuned program.
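To make the mechanism concrete, here is a minimal sketch (my own illustration; the names, types, and the exact check are assumptions, not code from the talk) of a worker that runs spawned tasks locally by default and publishes work for thieves only when a cheap heuristic suggests other processors are hungry, here taken to be the moment the worker's own steal-able deque has been drained.

    #include <deque>
    #include <functional>
    #include <mutex>

    using Task = std::function<void()>;

    struct Worker {
        std::deque<Task> public_deque;   // visible to thieves (guarded by lock)
        std::deque<Task> private_tasks;  // local-only, never stolen
        std::mutex lock;

        // Heuristic: an empty public deque suggests thieves have drained it,
        // i.e., other processors are likely hungry for work.
        bool others_look_hungry() {
            std::lock_guard<std::mutex> g(lock);
            return public_deque.empty();
        }

        void spawn(Task t) {
            if (others_look_hungry()) {
                std::lock_guard<std::mutex> g(lock);
                public_deque.push_back(std::move(t));   // make it steal-able
            } else {
                private_tasks.push_back(std::move(t));  // keep it local, paying
                                                        // no synchronization cost
            }
        }
    };

Because work kept private can still be published the next time the heuristic fires, keeping a task local is not an irrevocable serialization decision, which is the property contrasted above with earlier approaches.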