High-Performance Modular Multiplication on the Cell Processor



Similar documents
Fast Implementations of AES on Various Platforms

Fast Software AES Encryption

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

Optimizing Code for Accelerators: The Long Road to High Performance

Mathematics of Internet Security. Keeping Eve The Eavesdropper Away From Your Credit Card Information

Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture.

Table of Contents. Bibliografische Informationen digitalisiert durch

Speeding Up RSA Encryption Using GPU Parallelization

IMPLEMENTATION AND PERFORMANCE ANALYSIS OF ELLIPTIC CURVE DIGITAL SIGNATURE ALGORITHM

The Piranha computer algebra system. introduction and implementation details

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

Cryptanalysis with a cost-optimized FPGA cluster

Ashraf Abusharekh Kris Gaj Department of Electrical & Computer Engineering George Mason University

Principles of Public Key Cryptography. Applications of Public Key Cryptography. Security in Public Key Algorithms

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Dynamic Profiling and Load-balancing of N-body Computational Kernels on Heterogeneous Architectures

Speeding up GPU-based password cracking

An Efficient Hardware Architecture for Factoring Integers with the Elliptic Curve Method

Introduction to Cloud Computing

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

RSA Question 2. Bob thinks that p and q are primes but p isn t. Then, Bob thinks Φ Bob :=(p-1)(q-1) = φ(n). Is this true?

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

RSA Keys with Common Factors

Klaus Hansen, Troels Larsen and Kim Olsen Department of Computer Science University of Copenhagen Copenhagen, Denmark

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Generations of the computer. processors.

Ultra High Performance ECC over NIST Primes on Commercial FPGAs

Binary search tree with SIMD bandwidth optimization using SSE

Introduction to GPU Architecture

GPUs for Scientific Computing

Hardware Implementations of RSA Using Fast Montgomery Multiplications. ECE 645 Prof. Gaj Mike Koontz and Ryon Sumner

Introduction to Cryptography CS 355

Evaluation of Digital Signature Process

Computer and Network Security

Index Calculation Attacks on RSA Signature and Encryption

Timing Attacks on software implementation of RSA

Public Key Cryptography. Performance Comparison and Benchmarking

Efficient and Robust Secure Aggregation of Encrypted Data in Wireless Sensor Networks

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Number Theory. Proof. Suppose otherwise. Then there would be a finite number n of primes, which we may

Crittografia e sicurezza delle reti. Digital signatures- DSA

Faster polynomial multiplication via multipoint Kronecker substitution

Determining the Optimal Combination of Trial Division and Fermat s Factorization Method

Next Generation GPU Architecture Code-named Fermi

Rethinking SIMD Vectorization for In-Memory Databases

Cryptography and Network Security

ECE 842 Report Implementation of Elliptic Curve Cryptography

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

Computer Architecture TDTS10

The Mathematics of the RSA Public-Key Cryptosystem

Library (versus Language) Based Parallelism in Factoring: Experiments in MPI. Dr. Michael Alexander Dr. Sonja Sewera.

Elliptic Curve Hash (and Sign)

International Journal of Information Technology, Modeling and Computing (IJITMC) Vol.1, No.3,August 2013

Cryptography and Network Security Chapter 10

Software implementation of Post-Quantum Cryptography

Adaptive Stable Additive Methods for Linear Algebraic Calculations

Applied Cryptography Public Key Algorithms

An Introduction to Hill Ciphers Using Linear Algebra

Primality Testing and Factorization Methods

Runtime and Implementation of Factoring Algorithms: A Comparison

Outline. Performance comparisons of elliptic curve systems in software. Goals. 1. Elliptic curve digital signature algorithm (ECDSA)

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

CRYPTOGRAPHY IN NETWORK SECURITY

Math 319 Problem Set #3 Solution 21 February 2002

An Efficient RNS to Binary Converter Using the Moduli Set {2n + 1, 2n, 2n 1}

Digital Signatures. Murat Kantarcioglu. Based on Prof. Li s Slides. Digital Signatures: The Problem

MATH 168: FINAL PROJECT Troels Eriksen. 1 Introduction

A PPENDIX H RITERIA FOR AES E VALUATION C RITERIA FOR

High-speed cryptography and DNSCurve. D. J. Bernstein University of Illinois at Chicago

Implementation of Elliptic Curve Digital Signature Algorithm

THE NUMBER OF REPRESENTATIONS OF n OF THE FORM n = x 2 2 y, x > 0, y 0

ALGEBRAIC APPROACH TO COMPOSITE INTEGER FACTORIZATION

A Factoring and Discrete Logarithm based Cryptosystem

AES-GCM software performance on the current high end CPUs as a performance baseline for CAESAR competition

Elements of Applied Cryptography Public key encryption

Figure 1: Application scheme of public key mechanisms. (a) pure RSA approach; (b) pure EC approach; (c) RSA on the infrastructure

Factoring Algorithms

Haswell Cryptographic Performance

A SOFTWARE COMPARISON OF RSA AND ECC

Digital Signature. Raj Jain. Washington University in St. Louis

Introduction to GPU hardware and to CUDA

Study of algorithms for factoring integers and computing discrete logarithms

Implementation and Comparison of Various Digital Signature Algorithms. -Nazia Sarang Boise State University


Introduction to Programming (in C++) Loops. Jordi Cortadella, Ricard Gavaldà, Fernando Orejas Dept. of Computer Science, UPC

Example. Introduction to Programming (in C++) Loops. The while statement. Write the numbers 1 N. Assume the following specification:

Stream Processing on GPUs Using Distributed Multimedia Middleware

Computer Security: Principles and Practice

Transcription:

High-Performance Modular Multiplication on the Cell Processor Joppe W. Bos Laboratory for Cryptologic Algorithms EPFL, Lausanne, Switzerland joppe.bos@epfl.ch 1 / 19

Outline Motivation and previous work Applications for multi-stream modular multiplication Background Fast reduction with special primes The Cell broadband engine Modular multiplications in the Cell Performance results Conclusions 2 / 19

Motivation Modular multiplication is time-critical in cryptography RSA ( 2048 bits) ElGamal ( 2048 bits) ECC ( 224 bits) and in cryptanalysis ECDLP ( 100 200 bits) ECM ( 200 bits) 3 / 19

Motivation Multi-stream modular multiplication is time-critical in cryptography RSA ( 2048 bits) ElGamal ( 2048 bits) 2, ElGamal encryption (ElGamal, CRYP84) 3, Damgård ElGamal (Damgård, CRYP91) 4, Double hybrid Damgård ElGamal (Kiltz et al., EUR09) ECC ( 224 bits), Batch decryption PSEC, ECIES and in cryptanalysis ECDLP ( 100 200 bits), Pollard ρ ECM ( 200 bits), different curves 3 / 19

Previous works Misc. Platforms Lots of performance results for many platforms GNU Multiple Precision (GMP) Arithmetic Library many platforms, no Montgomery multiplication Bernstein et al. (EUR09): NVIDIA GPUs Brown et al. (CT-RSA01): NIST primes on x86 On the Cell Broadband Engine Optimize for one specific bit-size The Multi-Precision Math (MPM) Library by IBM (single stream) Costigan and Schwabe (AFR09): special 255-bit prime (multi-stream) Bernstein et al. (SHARCS09): 195-bit generic moduli (multi-stream) 4 / 19

Contributions Fast algorithms for modular multiplication on platforms with a multiply-and-add. Target a range of moduli: 192 521 bits. Compare generic and special modular multiplication All implementations set new speed records for the Cell Broadband Engine How much faster is using special moduli compared to generic ones? 5 / 19

Special Primes Faster reduction exploiting the structure of the special prime. By US National Institute of Standards Five recommended primes in the FIPS 186-3 (Digital Signature Standard) P 192 = 2 192 2 64 1 P 224 = 2 224 2 96 + 1 P 256 = 2 256 2 224 + 2 192 + 2 96 1 P 384 = 2 384 2 128 2 96 + 2 32 1 P 521 = 2 521 1 Prime used in Curve25519 Proposed by Bernstein at PKC 2006 P 255 = 2 255 19 6 / 19

Example: P 192 = 2 192 2 64 1 x 0 x < P192 2 = x H 2 192 + x L x H 2 192 + x L x H P 192 mod P 192 = x L + x H 2 64 + x H 7 / 19

Example: P 192 = 2 192 2 64 1 x 0 x < P192 2 = x H 2 192 + x L x H 2 192 + x L x H P 192 mod P 192 = x L + x H 2 64 + x H x H 2 64 = xh 2 64 2 192 2 192 + (x H 2 64 mod P 192 ) = x H 2 192 + x L x H 2 192 + x L x H P 192 mod P 192 = x L + x H 2 64 + x H 7 / 19

Example: P 192 = 2 192 2 64 1 x 0 x < P192 2 = x H 2 192 + x L x H 2 192 + x L x H P 192 mod P 192 = x L + x H 2 64 + x H x H 2 64 = xh 2 64 2 192 2 192 + (x H 2 64 mod P 192 ) = x H 2 192 + x L x H 2 192 + x L x H P 192 mod P 192 = x L + x H 2 64 + x H x = (c 11,..., c 0 ) s 1 = (c 5, c 4, c 3, c 2, c 1, c 0 ), s 2 = (c 11, c 10, c 9, c 8, c 7, c 6 ), s 3 = (c 9, c 8, c 7, c 6, 0, 0), s 4 = (0, 0, c 11, c 10, 0, 0), s 5 = (0, 0, 0, 0, c 11, c 10 ) Return s 1 + s 2 + s 3 + s 4 + s 5 7 / 19

Example: P 192 = 2 192 2 64 1 x 0 x < P192 2 = x H 2 192 + x L x H 2 192 + x L x H P 192 mod P 192 = x L + x H 2 64 + x H x H 2 64 = xh 2 64 2 192 2 192 + (x H 2 64 mod P 192 ) = x H 2 192 + x L x H 2 192 + x L x H P 192 mod P 192 = x L + x H 2 64 + x H x = (c 11,..., c 0 ) s 1 = (c 5, c 4, c 3, c 2, c 1, c 0 ), s 2 = (0, 0, c 7, c 6, c 7, c 6 ), s 3 = (c 9, c 8, c 9, c 8, 0, 0), s 4 = (c 11, c 10, c 11, c 10, c 11, c 10 ) Return 0 s 1 + s 2 + s 3 + s 4 < 4 P 192 Solinas, technical report 1999 7 / 19

The Cell Broadband Engine Cell architecture in the PlayStation 3 (@ 3.2 GHz): Broadly available (24.6 million) Relatively cheap (US$ 300) PS3 Slim or the newest firmware disables another OS! Also available in blade servers and PCI express boards. The Cell contains eight Synergistic Processing Elements (SPEs) six (maybe seven) available to the user in the PS3 one Power Processor Element (PPE) the Element Interconnect Bus (EIB) a specialized high-bandwidth circular data bus 8 / 19

Cell architecture, the SPEs The SPEs contain a Synergistic Processing Unit (SPU) Access to 128 registers of 128-bit SIMD operations Dual pipeline (odd and even) Rich instruction set In-order processor 256 KB of fast local memory (Local Store) Memory Flow Controller (MFC) 9 / 19

Programming Challenges Memory The executable and all data should fit in the LS Or perform manual DMA requests to the main memory (max. 214 MB) Branching No smart dynamic branch prediction Instead prepare-to-branch instructions to redirect instruction prefetch to branch targets Instruction set limitations 16 16 32 bit multipliers (4-SIMD) Dual pipeline One odd and one even instruction can be dispatched per clock cycle. 10 / 19

Modular Multiplication on the Cell I Four (16 m)-bit integers in m vectors: x i = m 1 j=0 x i,j 2 16 j x[0] =. 128-bit wide register { }} { } {{ } the 32 (or 16) least significant bits of x 2 are located in this 32-bit word (or in its 16 least significant bits). x[j] = 16-bit 16-bit } {{ } } {{ } high low order order. x[n 1] = } {{ } (x 1, } {{ } x 2,. } {{ } x 3, } {{ } x 4 ) 11 / 19

Modular Multiplication on the Cell II Implementation use the multiply-and-add instruction, if 0 a, b, c, d < 2 16, then a b + c + d < 2 32. try to fill both the odd and even pipelines, are branch-free. Do not fully reduce modulo (m-bits) P, Montgomery and special reduction [0, 2 m, These numbers can be used as input again, Reduce to [0, P at the cost of a single comparison + subtraction. 12 / 19

Modular Multiplication on the Cell III Special reduction [0, t P (t Z and small) How to reduce to [0, 2 m? (2 m 1 < P < 2 m ) Apply special reduction again Repeated subtraction ((t 1) times) For a constant modulus m-bit P Select the four values to subtract simultaneously using select and cmpgt instructions and a look-up table. 13 / 19

Modular Multiplication on the Cell IV For the special primes this can be done even faster. t t P 224 = t (2 224 2 96 + 1 ) = {c 7,..., c 0 } c 7 c 6 c 5 c 4 c 3 c 2 c 1 c 0 0 0 0 0 0 0 0 0 0 1 0 2 32 1 2 32 1 2 32 1 2 32 1 0 0 1 2 1 2 32 1 2 32 1 2 32 1 2 32 2 0 0 2 3 2 2 32 1 2 32 1 2 32 1 2 32 3 0 0 3 4 3 2 32 1 2 32 1 2 32 1 2 32 4 0 0 4 c 0 = t, c 1 = c 2 = 0 and c 3 = (unsigned int) (0 t). If t > 0 then c 4 = c 5 = c 6 = 2 32 1 else c 4 = c 5 = c 6 = 0. Use a single select. 14 / 19

Modular Multiplication on the Cell V P 255 = 2 255 19 Original approach Proposed by Bernstein and implemented on the SPE by Costigan and Schwabe (Africacrypt 2009): 19 Here x F 2 255 19 is represented as x = x i 2 12.75i. Redundant representation Following ideas from Bos, Kaihara and Montgomery (SHARCS 2009), 15 Calculate modulo 2 P 255 = 2 256 38 = x i 2 16, Reduce to [0, 2 256. i=0 i=0 15 / 19

Performance Results Number of cycles 1400 1200 1000 800 600 400 Modular Multiplication Performance Results NIST primes Generic primes Generic primes estimates Curve25519 prime 200 0 150 200 250 300 350 400 450 500 550 Bitsize of the modulus Montgomery multiplication multiplication + fast reduction 1.4 1.7, (512 bits: 2.2 times faster) 16 / 19

Comparison Special Moduli Number of cycles for what? Measurements over millions of multi-stream modular multiplications, Cycles for a single modular multiplication, include benchmark overhead, function call, loading (storing) the input (output), converting from radix-2 32 to radix-2 16. 17 / 19

Comparison Special Moduli Number of cycles for what? Measurements over millions of multi-stream modular multiplications, Cycles for a single modular multiplication, include benchmark overhead, function call, loading (storing) the input (output), converting from radix-2 32 to radix-2 16. Special prime P 255 Costigan and Schwabe (Africacrypt 2009), 255 bit. single-stream: 444 cycles (144 mul, 244 reduction, 56 overhead). multi-stream: 168 cycles. no function call, loading and storing, perfectly scheduled (filled both pipelines) this work, multi-stream: 175 cycles (< 168 + 56), = both approaches are comparable in terms of speed (on the Cell). 17 / 19

Comparison Generic Moduli Generic 195-bit moduli Bernstein et al. (SHARCS 2009): multi-stream, 189 cycles This work: multi-stream, 176 cycles (for 192-bit generic moduli) Scale: ( 195 192 )2 176 = 182 cycles. Generic moduli: single vs. multi stream Bitsize #cycles New MPM umpm 192 176 1,188 877 224 234 1,188 877 256 297 1,188 877 384 665 2,092 1,610 512 1,393 3,275 2,700 18 / 19

Conclusions We presented SIMD algorithms for Montgomery and { schoolbook, Karatsuba } multiplication plus fast reduction. Algorithms are optimized for architectures with a multiply-and-add instruction. Implementation results on the Cell: moduli of size 192 to 521 bits show that special primes are at least 1.4 times faster compared to generic primes. Future work Similair speed-up on other multi-core platforms like GPUs? 19 / 19