Piranha: introduction and implementation details
Advanced Concepts Team, European Space Agency (ESTEC)
Course on Differential Equations and Computer Algebra
Estella, Spain, October 29-30, 2010
Outline
1. A Brief Overview
2. Polynomial Multiplication
3. Benchmarks
4. Future Steps
Piranha in a Nutshell
- An algebraic manipulation framework
- Around 12000 SLOC (Source Lines Of Code)
- Written in C++, object-oriented
- Makes extensive use of existing Free Software tools and libraries (Boost, GMP, Python, ...)
- Multiplatform (GNU/Linux, Windows, BSD)
- Free Software itself
Algebraic Structures for Celestial Mechanics

Polynomials:
$$\sum_{\mathbf{i}} C_{\mathbf{i}} \prod_{j} p_j^{i_j}$$

Fourier series:
$$\sum_{\mathbf{i}} C_{\mathbf{i}} \begin{Bmatrix} \cos \\ \sin \end{Bmatrix} \left( \mathbf{i} \cdot \mathbf{t} \right)$$

Poisson series:
$$\sum_{\mathbf{i},\mathbf{j}} C_{\mathbf{i},\mathbf{j}} \prod_{k} p_k^{j_k} \begin{Bmatrix} \cos \\ \sin \end{Bmatrix} \left( \mathbf{i} \cdot \mathbf{t} \right)$$

Echeloned Poisson series:
$$\sum_{\mathbf{i},\mathbf{j},\mathbf{k}} \frac{C_{\mathbf{i},\mathbf{j},\mathbf{k}} \prod_{m} p_m^{k_m}}{\prod_{\mathbf{l}} \left( \mathbf{l} \cdot \mathbf{d} \right)^{\delta_{\mathbf{j},\mathbf{l}}}} \begin{Bmatrix} \cos \\ \sin \end{Bmatrix} \left( \mathbf{i} \cdot \mathbf{t} \right)$$
The Framework 1/2
Q: Can we manipulate these algebraic structures in a general and unified way?
The Basic Ideas
1. Series are collections of terms
2. Terms are coefficient-key pairs
3. Terms are uniquely identified by their keys: t1 ≡ t2 ⟺ t1.key = t2.key
4. A key can appear at most once in a series, i.e., a series is a set
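A minimal C++ sketch of these ideas (the term/series names and the use of a standard hash map are illustrative assumptions, not Piranha's actual classes):

```cpp
#include <string>
#include <unordered_map>

// Illustrative only: a term is a coefficient-key pair, and the key alone
// identifies the term. A series is then a set of terms indexed by key.
struct term {
    double      cf;  // coefficient
    std::string key; // key, e.g. an encoded exponent vector or trig. argument
};

class series {
    // At most one term per key: the series behaves as a set keyed on 'key'.
    std::unordered_map<std::string, double> m_terms;
public:
    void insert(const term &t) {
        // Inserting a term whose key is already present merges coefficients.
        m_terms[t.key] += t.cf;
    }
    std::size_t size() const { return m_terms.size(); }
};
```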
The Framework 2/2 (diagram)
Object-Oriented and Generic Programming
The C++ Language
- High performance and high-level design are not mutually exclusive
- OO: inheritance, polymorphism, encapsulation, modularity
- Generic programming: type-agnostic classes
- Template meta-programming (aka "modern C++", see Alexandrescu [2001]): OO features with zero overhead, efficient compile-time optimizations and checks
The Bottom Line
A substantial portion of the implementation can be shared among the supported algebraic structures, reducing code duplication to a minimum without sacrificing performance
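A hedged illustration of how generic programming lets one series implementation be shared across structures: the class below is templated on the coefficient and key types, so different manipulators are just different instantiations (all names here are hypothetical, not Piranha's API):

```cpp
#include <map>
#include <vector>

// One generic series class, parametrised on the coefficient and key types.
// The insertion/merging machinery is written only once; each supported
// algebraic structure is a different instantiation.
template <typename Cf, typename Key>
class base_series {
    std::map<Key, Cf> m_terms; // at most one term per key
public:
    void insert(const Key &k, const Cf &c) { m_terms[k] += c; }
    std::size_t size() const { return m_terms.size(); }
};

// Hypothetical instantiations:
using polynomial     = base_series<double, std::vector<unsigned>>; // key = exponent vector
using fourier_series = base_series<double, std::vector<int>>;      // key = trig. multipliers
```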
A Quick SLOC Analysis 1/2
Gregoire & Colbert (Chapront [2003]): Fourier and Poisson series manipulators
- Written in Fortran 90
- Feature set comparable with Piranha's
- 4000 SLOC each (Piranha is 12000 SLOC)
Piranha additionally supports:
- polynomials as top-level series
- multiple representations for keys and numerical coefficients (complex, reals, integers, rationals, arbitrary-size, etc.)
- 12 different manipulators are currently implemented within the framework (other combinations can be trivially added)
A Quick SLOC Analysis 2/2
Piranha's SLOC count divided by directory (chart)
Pyranha
- Python bindings for Piranha
- Uses the Boost.Python library
- Compiled-code performance with the flexibility of an interpreted language
- Python is a real computer language (not an obscure ad-hoc language)
- Many possibilities for extensions
- Interactive graphical environment with IPython, matplotlib and PyQt4
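As a rough illustration of the Boost.Python mechanism used by Pyranha, here is how a C++ class can be exposed to Python; the module and class names are hypothetical, not Pyranha's actual interface:

```cpp
#include <boost/python.hpp>

// Hypothetical C++ class to be exposed (stand-in for a Piranha series type).
struct poly {
    double value;
    double evaluate() const { return value; }
};

// Boost.Python generates the Python module from these declarations; after
// compilation into a shared library, Python code can do:
//   import pyranha_example
//   p = pyranha_example.poly(); p.evaluate()
BOOST_PYTHON_MODULE(pyranha_example)
{
    using namespace boost::python;
    class_<poly>("poly")
        .def_readwrite("value", &poly::value)
        .def("evaluate", &poly::evaluate);
}
```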
Schoolbook multiplication
Given $a(x) = a_1 x + a_0$ and $b(x) = b_1 x + b_0$, compute $a(x)\,b(x)$ as
$$a_0 b_0 + a_0 b_1 x + a_1 b_0 x + a_1 b_1 x^2.$$
Complexity: $O(n^2)$.
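A minimal sketch of schoolbook multiplication for dense univariate polynomials stored as coefficient vectors (illustrative code, not Piranha's implementation):

```cpp
#include <cstddef>
#include <vector>

// Multiply two dense polynomials given as coefficient vectors
// (a[i] is the coefficient of x^i). Every coefficient of 'a' is
// multiplied by every coefficient of 'b': O(n^2) operations.
std::vector<double> schoolbook_mul(const std::vector<double> &a,
                                   const std::vector<double> &b)
{
    std::vector<double> retval(a.size() + b.size() - 1, 0.);
    for (std::size_t i = 0; i < a.size(); ++i) {
        for (std::size_t j = 0; j < b.size(); ++j) {
            retval[i + j] += a[i] * b[j];
        }
    }
    return retval;
}
```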
Asymptotically fast multiplication: Karatsuba
Karatsuba's algorithm: given $a(x) = a_1 x + a_0$ and $b(x) = b_1 x + b_0$, express $a(x)\,b(x)$ as
$$a_0 b_0 + \left[ (a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1 \right] x + a_1 b_1 x^2,$$
with 3 multiplications vs 4 of the classical method.
Complexity: $O\!\left(n^{\log_2 3}\right)$.
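A sketch of the recursive Karatsuba scheme on coefficient vectors (illustrative; a production implementation would fall back to the schoolbook method below a size threshold):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using poly = std::vector<long long>; // poly[i] = coefficient of x^i

poly karatsuba(poly a, poly b)
{
    const std::size_t n = std::max(a.size(), b.size());
    a.resize(n, 0);
    b.resize(n, 0);
    if (n == 1) {
        return poly{a[0] * b[0]};
    }
    const std::size_t m = (n + 1) / 2; // size of the low halves
    poly a0(a.begin(), a.begin() + m), a1(a.begin() + m, a.end());
    poly b0(b.begin(), b.begin() + m), b1(b.begin() + m, b.end());
    // Three recursive multiplications instead of four.
    poly z0 = karatsuba(a0, b0); // a0*b0
    poly z2 = karatsuba(a1, b1); // a1*b1
    poly sa(a0), sb(b0);         // a0+a1 and b0+b1
    for (std::size_t i = 0; i < a1.size(); ++i) sa[i] += a1[i];
    for (std::size_t i = 0; i < b1.size(); ++i) sb[i] += b1[i];
    poly z1 = karatsuba(sa, sb); // (a0+a1)*(b0+b1)
    for (std::size_t i = 0; i < z0.size(); ++i) z1[i] -= z0[i];
    for (std::size_t i = 0; i < z2.size(); ++i) z1[i] -= z2[i];
    // Recombine: z0 + z1*x^m + z2*x^(2m).
    poly res(2 * n - 1, 0);
    for (std::size_t i = 0; i < z0.size(); ++i) res[i] += z0[i];
    for (std::size_t i = 0; i < z1.size(); ++i) res[i + m] += z1[i];
    for (std::size_t i = 0; i < z2.size(); ++i) res[i + 2 * m] += z2[i];
    return res;
}
```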
Asymptotically fast multiplication: FFT
- Convert the polynomials to vectors of coefficients
- Compute the FFT of both vectors
- Pointwise multiplication of the FFTed vectors
- Inverse FFT to recover the result of the multiplication
Complexity: $O(n \log n)$.
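A self-contained sketch of FFT-based multiplication using a textbook recursive Cooley-Tukey transform (illustrative only; a real implementation would rely on an optimized FFT library):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Recursive radix-2 FFT; 'a' must have power-of-two length.
void fft(std::vector<std::complex<double>> &a, bool inverse)
{
    const std::size_t n = a.size();
    if (n == 1) return;
    std::vector<std::complex<double>> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft(even, inverse);
    fft(odd, inverse);
    const double pi = std::acos(-1.), sign = inverse ? 1. : -1.;
    for (std::size_t k = 0; k < n / 2; ++k) {
        const auto w = std::polar(1., sign * 2. * pi * k / n) * odd[k];
        a[k]         = even[k] + w;
        a[k + n / 2] = even[k] - w;
    }
}

// Multiply two polynomials: transform, pointwise multiply, inverse transform,
// then normalize by the transform length.
std::vector<double> fft_mul(const std::vector<double> &a, const std::vector<double> &b)
{
    std::size_t n = 1;
    while (n < a.size() + b.size()) n *= 2; // pad to a power of two
    std::vector<std::complex<double>> fa(a.begin(), a.end()), fb(b.begin(), b.end());
    fa.resize(n);
    fb.resize(n);
    fft(fa, false);
    fft(fb, false);
    for (std::size_t i = 0; i < n; ++i) fa[i] *= fb[i];
    fft(fa, true);
    std::vector<double> res(a.size() + b.size() - 1);
    for (std::size_t i = 0; i < res.size(); ++i) res[i] = fa[i].real() / n;
    return res;
}
```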
Alas...
Issues: both Karatsuba and FFT
- have a high constant factor in their complexity, which makes them unsuitable for typical problems in Celestial Mechanics
- rely on the assumption that the polynomials being multiplied are dense
- perform poorly on real-world multivariate polynomials
Bottom line: back to schoolbook multiplication.
Kronecker's trick
Idea: code the sets of exponents into integer values. E.g., with three variables and exponents in [0, 3], the vector (z, y, x) is coded as 16z + 4y + x:

z y x   Code
0 0 0   0
0 0 1   1
0 0 2   2
0 0 3   3
0 1 0   4
0 1 1   5
0 1 2   6
0 1 3   7
0 2 0   8
0 2 1   9
0 2 2   10
0 2 3   11
. . .   ...
3 3 3   63

- Maintains lexicographic order
- Homomorphism between exponent vectors in Z^n and Z which preserves addition and subtraction
- Operations on integer vectors are reduced to O(1) complexity
- Codes can be used as perfect hash values or as indices in an array
- Series are encoded on-the-fly during multiplication
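A minimal sketch of the coding and decoding, assuming exponents bounded as in the table above (function names and the fixed base are assumptions for the example):

```cpp
#include <cstdint>
#include <vector>

// Encode an exponent vector into a single integer, mixed-radix style.
// Provided the base is chosen large enough that sums of exponents never
// overflow a digit, addition of codes corresponds to addition of exponent
// vectors, so a monomial product becomes a single integer addition.
std::int64_t kronecker_encode(const std::vector<int> &expo, int base = 4)
{
    std::int64_t code = 0;
    for (int e : expo) {
        code = code * base + e;
    }
    return code;
}

std::vector<int> kronecker_decode(std::int64_t code, std::size_t n_vars, int base = 4)
{
    std::vector<int> expo(n_vars);
    for (std::size_t i = n_vars; i-- > 0;) {
        expo[i] = static_cast<int>(code % base);
        code /= base;
    }
    return expo;
}
```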
Exploiting modern computer designs
- memory hierarchies (to the whiteboard)
- spatial locality of reference
- temporal locality of reference
- prefetcher
- multi-core CPUs: parallelization (multi-thread)
Memory hierarchy (diagram)
Dense multiplication
- use Kronecker exponents directly as indices in an array
- use cache-blocking to promote temporal locality of reference
- monomial ordering is prefetch-friendly
- when applicable, top performance is achieved
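A hedged sketch of the dense strategy, with Kronecker codes used directly as array indices (the flat-array layout and coefficient type are simplifications; Piranha's actual code adds cache-blocking and a prefetch-friendly ordering):

```cpp
#include <cstdint>
#include <vector>

struct kterm {
    std::int64_t code; // Kronecker-encoded exponent vector
    double       cf;   // coefficient
};

// Dense multiplication: since the coding is additive, the code of a product
// term is just the sum of the input codes, and it indexes the output array
// directly. 'max_code' must bound all possible code sums, i.e. the coding
// range must be chosen to accommodate the result.
std::vector<double> dense_mul(const std::vector<kterm> &a, const std::vector<kterm> &b,
                              std::int64_t max_code)
{
    std::vector<double> out(static_cast<std::size_t>(max_code) + 1, 0.);
    for (const auto &ta : a) {
        for (const auto &tb : b) {
            out[static_cast<std::size_t>(ta.code + tb.code)] += ta.cf * tb.cf;
        }
    }
    return out;
}
```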
Memory access patterns: unoptimized vs optimized (figures)
Sparse multiplication
- use Kronecker exponents directly as hash values
- optimized hash table: items stored in sequential and contiguous buckets
- order the input polynomials according to exponent modulo table size
- cache-blocking
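A minimal sketch of the sparse strategy keyed on the Kronecker code; std::unordered_map is only a stand-in for Piranha's optimized contiguous-bucket hash table:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct kterm {
    std::int64_t code; // Kronecker-encoded exponent vector
    double       cf;   // coefficient
};

// Sparse multiplication: the code of each product term is used directly as
// the hash-map key, so identical monomials produced by different term pairs
// are merged automatically.
std::unordered_map<std::int64_t, double> sparse_mul(const std::vector<kterm> &a,
                                                    const std::vector<kterm> &b)
{
    std::unordered_map<std::int64_t, double> out;
    out.reserve(a.size() + b.size()); // rough initial sizing
    for (const auto &ta : a) {
        for (const auto &tb : b) {
            out[ta.code + tb.code] += ta.cf * tb.cf;
        }
    }
    return out;
}
```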
Parallelization
(diagram: the table of term-by-term multiplications split between two processors P1 and P2)
- cache-blocking provides a natural way to avoid contention
- interval arithmetic on the exponents is used to guarantee that the write-in memory areas are disjoint
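A simplified parallel sketch: here each thread accumulates into its own local map and the partial results are merged at the end, which avoids contention but is not Piranha's scheme (Piranha uses cache-blocking and interval arithmetic on the exponents to make the write areas disjoint):

```cpp
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <vector>

struct kterm {
    std::int64_t code; // Kronecker-encoded exponent vector
    double       cf;   // coefficient
};

std::unordered_map<std::int64_t, double>
parallel_sparse_mul(const std::vector<kterm> &a, const std::vector<kterm> &b,
                    unsigned n_threads)
{
    std::vector<std::unordered_map<std::int64_t, double>> partial(n_threads);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        threads.emplace_back([&, t]() {
            // Each thread processes a contiguous slice of 'a' and writes
            // only into its own map: no contention, no locking.
            const std::size_t begin = a.size() * t / n_threads;
            const std::size_t end   = a.size() * (t + 1) / n_threads;
            for (std::size_t i = begin; i < end; ++i) {
                for (const auto &tb : b) {
                    partial[t][a[i].code + tb.code] += a[i].cf * tb.cf;
                }
            }
        });
    }
    for (auto &th : threads) th.join();
    // Merge the thread-local results.
    std::unordered_map<std::int64_t, double> out;
    for (const auto &p : partial) {
        for (const auto &kv : p) out[kv.first] += kv.second;
    }
    return out;
}
```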
Benchmarks
Fateman's dense benchmark: compute $s \cdot (s + 1)$, with $s = (1 + x + y + z + t)^{30}$.
- 46376 × 46376 = 2 150 733 376 term-by-term multiplications
- Final polynomial length = 635 376
Monagan & Pearce's sparse benchmark: compute $f \cdot g$, with $f = (1 + x + y + 2z^2 + 3t^3 + 5u^5)^{12}$ and $g = (1 + u + t + 2z^2 + 3y^3 + 5x^5)^{12}$.
- 6188 × 6188 = 38 291 344 term-by-term multiplications
- Final polynomial length = 5 821 335
Benchmark results

Test      | Coefficient    | System      | Time   | ccpm
----------|----------------|-------------|--------|------
Fateman   | double         | Core2Quad   | 4.29s  | 4.8
Fateman   | double         | Core2Duo    | 5.62s  | 4.6
Fateman   | double         | PPC64       | 4.96s  | 4.6
Fateman   | double         | Xeon        | 3.73s  | 4.6
Fateman   | double         | Atom        | 20.15s | 15.0
Fateman   | GMP mpz        | Core2Quad   | 67.90s | 75.8
Fateman   | 61-bit integer | SDMP-Core2  | 60.25s | 67.2
Fateman   | 61-bit integer | SDMP-Corei7 | 70.59s | 85.3
ELP       | double         | Core2Quad   | 15.62s | 10.3
MP-sparse | double         | Core2Quad   | 1.71s  | 107.2
MP-sparse | double         | Xeon        | 1.59s  | 110.5
MP-sparse | double         | Corei7      | 1.15s  | 88.0
MP-sparse | 37-bit integer | SDMP-Core2  | 1.86s  | 116.6
MP-sparse | 37-bit integer | SDMP-Corei7 | 1.56s  | 108.4
Benchmark results: parallelization
- dense case: 90% of the maximum theoretical speedup
- sparse case: 70% of the maximum theoretical speedup
- SDMP gets (super)linear speedup in the dense case, but does not scale up in the sparse case
- possible improvements: reduce synchronization barriers, make the algorithm non-deterministic, ...
Future Steps
- Code refactoring and cruft elimination
- Extension of the Python bindings, GUI improvements, etc.
- Documentation
- Create a community