For full conference details: http://llvm.org/devmtg/2017-10/
Wednesday, October 18 • 3:15pm - 4:00pm
Scalable, Robust, and Regression-Free Loop Optimizations for Scientific Fortran and Modern C++

Modern C++ code and large-scale scientific programs pose unique challenges when
applying high-level loop transformations. These challenges, together with a
push towards robustness and freedom from performance regressions, have been
driving the development of Polly over the last two years. In this presentation
we discuss the transformation of Polly into a scalable, robust, and
"regression-free" loop optimization framework.

Correctness is essential when applying loop optimizations at scale. While
hundreds of bugs and correctness issues have been addressed in Polly over the
last years, the last fundamental correctness problem was resolved only
recently. Even though the issue rarely became visible, Polly was for a long
time inherently incorrect: it simply assumed that the integer types it uses
for all generated expressions are sufficiently large to hold all possible
values. While this assumption was commonly true, rare corner cases, similar to
those that limit most optimizations in LLVM, prevented Polly from creating
code that is correct for all possible inputs. We present a novel framework
that allows Polly to derive, after arbitrary loop transformations, correct
types for each sub-expression or, if requested, preconditions under which the
needed types are smaller than the native integer type. As a result, a wide
range of high-level loop transformations can suddenly be proven correct,
surprisingly often without any need for run-time preconditions, and at very
reasonable compile-time cost.
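
To illustrate the kind of problem such a framework addresses, consider the
following sketch (our own, not taken from the talk): a linearized index
expression that is correct in the source program can wrap once it is
materialized in a too-narrow integer type, and a derived run-time precondition
can guard a narrow-type version. The function names and the concrete bound are
illustrative assumptions.

```cpp
#include <cstdint>

// Hypothetical illustration, not actual Polly output. If the linearized
// index i * m + j were materialized in a 32-bit type, then for
// n = m = 65536 the largest index, 65535 * 65536 + 65535 = 2^32 - 1,
// no longer fits: the signed multiplication overflows (undefined
// behavior), so the "optimized" loop would be wrong for such inputs.
void scale_narrow(float *A, int32_t n, int32_t m) {
  for (int32_t i = 0; i < n; ++i)
    for (int32_t j = 0; j < m; ++j)
      A[i * m + j] *= 2.0f; // i * m may overflow int32_t
}

// A type-derivation framework can instead pick a sufficiently wide type,
// or emit a run-time precondition under which the narrow type is safe.
void scale_guarded(float *A, int64_t n, int64_t m) {
  if (n <= 46340 && m <= 46340) { // hypothetical precondition: n * m < 2^31
    int32_t n32 = (int32_t)n, m32 = (int32_t)m;
    for (int32_t i = 0; i < n32; ++i)
      for (int32_t j = 0; j < m32; ++j)
        A[i * m32 + j] *= 2.0f; // 32-bit index arithmetic, provably safe
  } else {
    for (int64_t i = 0; i < n; ++i) // fallback with wide 64-bit indices
      for (int64_t j = 0; j < m; ++j)
        A[i * m + j] *= 2.0f;
  }
}
```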

Robustness and real-world scalability are the next cornerstones for the
optimization of very large programs. In the second part of this presentation
we first report on our experience with compiling several large-scale programs
with Polly: the Android Open Source Project, the COSMO weather and climate
model (500,000 LoC and 16,000 loops), as well as the Gentoo package
repository. We then discuss a new extension to our internal loop scheduler,
which addresses fundamental scalability limitations in our polyhedral
scheduler. Traditionally, all scheduling choices within a large loop nest have
been made simultaneously, which caused the dimensionality of the underlying
ILP problem to grow without bound and, as a result, limited the scalability of
Polly. We present a novel incremental loop scheduling approach, which ensures
that the size of each scheduling ILP problem is bounded, independently of the
size of the optimized loop nest. As a result, Polly is not only able to
process larger programs; this freedom can also be exploited to schedule loop
programs at sub-basic-block granularity, as the sketch below illustrates.
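
As a hedged illustration of what sub-basic-block (statement-level) granularity
buys, consider two statements sharing one loop body: scheduling them
independently permits, for example, loop distribution, so the parallel
statement is no longer held back by its sequential neighbor. The code below is
our own sketch, not an example from the talk.

```cpp
// Before: both statements share one schedule; the loop-carried dependence
// of S1 forces the whole body to execute sequentially.
void before(double *A, double *B, const double *C, int n) {
  for (int i = 1; i < n; ++i) {
    A[i] = A[i - 1] + C[i]; // S1: sequential (reads the previous iteration)
    B[i] = 2.0 * C[i];      // S2: independent of S1 and fully parallel
  }
}

// After statement-level scheduling: distributing the loop gives S2 its own
// loop, which can now be vectorized or parallelized on its own.
void after(double *A, double *B, const double *C, int n) {
  for (int i = 1; i < n; ++i) // S1 alone: still sequential
    A[i] = A[i - 1] + C[i];
  for (int i = 1; i < n; ++i) // S2 alone: trivially parallel
    B[i] = 2.0 * C[i];
}
```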

Performance is the last cornerstone we care about. Before improving
performance, it is important to ensure we do not regress it. Traditionally,
Polly has been run at the beginning of the pass pipeline, where the additional
canonicalization passes it requires caused arbitrary performance changes even
in the common case where Polly did not propose any loop transformation. Scalar
data dependences introduced by LICM and GVN prevented Polly from running later
in the pass pipeline, a position where no pre-canonicalization is needed and
Polly can leave the IR entirely unchanged in case it cannot suggest a
beneficial performance optimization. With De-LICM we present a fully automatic
approach that removes the unneeded scalar dependences which commonly prevented
advanced loop transformations late in the pass pipeline.
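
The following sketch (ours, simplified) shows the kind of scalar dependence
involved: after scalar promotion, the accumulator of an inner reduction lives
in a register across iterations, and the resulting scalar dependences block
transformations such as interchange or tiling until the scalar is mapped back
onto its array element.

```cpp
// Source form: a matrix-multiply kernel accumulating directly into C.
void matmul(int n, double *C, const double *A, const double *B) {
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
      C[i * n + j] = 0.0;
      for (int k = 0; k < n; ++k)
        C[i * n + j] += A[i * n + k] * B[k * n + j];
    }
}

// After LICM/scalar promotion the inner loop conceptually becomes:
//
//   double s = 0.0;                      // accumulator lives in a register
//   for (int k = 0; k < n; ++k)
//     s += A[i * n + k] * B[k * n + j];  // scalar dependences on s
//   C[i * n + j] = s;
//
// The dependences on s tie every iteration of the k-loop to its neighbors
// through a register rather than through memory. A De-LICM-style analysis
// maps s back to the location C[i * n + j], restoring an array-based view
// of the nest in which tiling and interchange become possible again.
```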

We conclude by presenting two sets of experimental performance results. First,
we used Polly to offload the physics computations of the COSMO weather model,
a large scientific code base, to our modern NVIDIA P100 accelerated compute
cluster. Second, we discuss how executing Polly late in the pass pipeline
enables it to improve the performance of linear algebra kernels written with
modern C++ expression templates to performance levels reached by tuned
libraries such as OpenBLAS.
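
For readers unfamiliar with the technique, below is a minimal
expression-template sketch of our own (the names Vec and Add are illustrative,
not from any particular library): the sum builds a lightweight expression
node, and only the assignment loop touches memory, so after inlining the
compiler, and hence a late-running Polly, sees a plain loop over arrays.

```cpp
#include <cstddef>
#include <vector>

// Expression node: a + b is not evaluated eagerly; it merely records its
// operands and computes elements on demand.
template <class L, class R>
struct Add {
  const L &l;
  const R &r;
  double operator[](std::size_t i) const { return l[i] + r[i]; }
  std::size_t size() const { return l.size(); }
};

struct Vec {
  std::vector<double> data;
  explicit Vec(std::size_t n) : data(n) {}
  double operator[](std::size_t i) const { return data[i]; }
  double &operator[](std::size_t i) { return data[i]; }
  std::size_t size() const { return data.size(); }

  // Assignment evaluates the whole expression tree elementwise. After
  // inlining, this is the plain loop the optimizer ultimately sees.
  template <class E>
  Vec &operator=(const E &e) {
    for (std::size_t i = 0; i < size(); ++i)
      data[i] = e[i];
    return *this;
  }
};

inline Add<Vec, Vec> operator+(const Vec &l, const Vec &r) { return {l, r}; }
template <class L, class R>
Add<Add<L, R>, Vec> operator+(const Add<L, R> &l, const Vec &r) {
  return {l, r};
}

// Usage: d = a + b + c compiles down to a single fused loop over the arrays.
void example() {
  Vec a(1024), b(1024), c(1024), d(1024);
  d = a + b + c;
}
```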

Speakers

Tobias Grosser

ETH Zurich

Michael Kruse

INRIA/ENS


Wednesday October 18, 2017 3:15pm - 4:00pm PDT
2 - Technical Talk (Rm LL21AB)