Timing Attacks on software implementation of RSA

Project Report
Harshman Singh
903-40-5260
singhha@cs.orst.edu
June 07, 2004

Abstract

Timing attacks enable an attacker to extract secret information from a cryptosystem. They are based on the differences in running time that different inputs cause in an encryption or decryption algorithm. Boneh and Brumley recently demonstrated an adaptive chosen-input attack that guesses the upper half of an RSA prime factor. Dr. Werner Schindler has proposed an improved approach that uses different input values, which are more efficient in terms of signal-to-noise ratio. We implemented attacking clients based on both approaches, using OpenSSL library routines, to obtain the RSA key.

1 Introduction

Timing attacks expose private information, such as RSA keys, by measuring the amount of time required to perform private key operations (decryptions, etc.). Timing attacks belong to a class of attacks called side-channel attacks. Others include power analysis and attacks based on electromagnetic radiation. Unlike the timing attack, these other side-channel attacks require special equipment and physical access to the machine. Here we focus only on the timing attack that targets the implementation of RSA decryption in OpenSSL.

Until now, timing attacks were applied only in the context of hardware security tokens such as smartcards. It is generally believed that timing attacks cannot be used in complex environments such as networks, or against general-purpose servers such as web servers, since decryption times are masked by the many concurrent processes running on the system. It is also believed that common implementations of RSA (using Chinese remaindering and Montgomery reductions) are not vulnerable to timing attacks. These assumptions are challenged by a remote timing attack against OpenSSL [8], an SSL library commonly used in web servers and other SSL applications.

The basic attack works as follows: the attacking client measures the time an OpenSSL server takes to respond to decryption queries. From these measurements the client is able to extract the private key stored on the server.

2 The RSA Cryptosystem

2.1 The RSA Algorithm

The RSA algorithm was invented by Rivest, Shamir, and Adleman. Let p and q be two distinct large random primes. The modulus n is the product of these two primes: n = pq. Euler's totient function of n is given by

phi(n) = (p-1)(q-1)

Now, select a number 1 < e < phi(n) such that gcd(e, phi(n)) = 1, and compute d with

d = e^{-1} mod phi(n)

using the extended Euclidean algorithm. Here, e is the public exponent and d is the private exponent. Usually one selects a small public exponent, e.g., e = 2^{16} + 1. The modulus n and the public exponent e are published. The value of d and the prime numbers p and q are kept secret. Encryption is performed by computing C = M^e (mod n), where M is the plaintext such that 0 <= M < n. The number C is the ciphertext, from which the plaintext M can be computed using M = C^d (mod n).

3 OpenSSL's implementation of RSA

This section briefly reviews how OpenSSL implements RSA decryption, as far as it is relevant to the attack. OpenSSL closely follows algorithms described in the Handbook of Applied Cryptography.

3.1 Montgomery Multiplication

RSA requires high-speed and space-efficient algorithms for modular multiplication. The Montgomery multiplication algorithm is used to speed up the modular multiplications and squarings required during the exponentiation process. The Montgomery algorithm computes

MonPro(a, b) = a * b * r^{-1} mod n

given a, b < n and r such that gcd(n, r) = 1. Even though the algorithm works for any r which is relatively prime to n, it is more useful when r is taken to be a power of 2. In this case, the Montgomery algorithm performs divisions by a power of 2, which is an intrinsically fast operation on general-purpose computers.

The Montgomery reduction algorithm computes the resulting k-bit number u without performing a division by the modulus n. Via an ingenious representation of the residue class modulo n, this algorithm replaces division by n with division by a power of 2. The latter operation is easily accomplished on a computer since the numbers are represented in binary form. Assuming the modulus n is a k-bit number, i.e., 2^{k-1} <= n < 2^k, let r be 2^k. The Montgomery reduction algorithm requires that r and n be relatively prime, i.e., gcd(r, n) = gcd(2^k, n) = 1. This requirement is satisfied if n is odd.

In the following, we summarize the basic idea behind the Montgomery reduction algorithm. Given an integer a < n, we denote its n-residue or Montgomery representation with respect to r as

\bar{a} = a * r (mod n)

Given two n-residues \bar{a} and \bar{b}, the Montgomery product is defined as the scaled product

\bar{u} = \bar{a} * \bar{b} * r^{-1} (mod n)

where r^{-1} is the (multiplicative) inverse of r modulo n, i.e., it is the number with the property

r^{-1} * r = 1 (mod n)

As the notation implies, the resulting number \bar{u} is indeed the n-residue of the product u = a * b (mod n), since

\bar{u} = \bar{a} * \bar{b} * r^{-1} (mod n)
        = (a * r) * (b * r) * r^{-1} (mod n)
        = (a * b) * r (mod n)

In order to describe the Montgomery reduction algorithm, we need an additional quantity, n', which is the integer with the property

r * r^{-1} - n * n' = 1

The integers r^{-1} and n' can both be computed by the extended Euclidean algorithm. The Montgomery product algorithm, which computes \bar{u} = \bar{a} * \bar{b} * r^{-1} (mod n) given \bar{a} and \bar{b}, is given below:

function MonPro(\bar{a}, \bar{b})
  Step 1. t := \bar{a} * \bar{b}
  Step 2. m := t * n' mod r
  Step 3. u := (t + m * n) / r
  Step 4. if u >= n then return u - n, else return u
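The Montgomery product described above can be sketched in a few lines of Python. This is only an illustration with arbitrary-precision integers, not OpenSSL's word-level implementation; the function names montgomery_setup and mon_pro are chosen here for the example, and the second return value (a flag reporting whether the final subtraction was executed) is an addition made for illustration.

```python
def montgomery_setup(n: int):
    """Return (r, r_inv, n_prime) for an odd modulus n, with r = 2^k > n."""
    if n % 2 == 0:
        raise ValueError("Montgomery reduction needs an odd modulus")
    k = n.bit_length()               # 2^(k-1) <= n < 2^k
    r = 1 << k
    r_inv = pow(r, -1, n)            # r * r_inv = 1 (mod n), Python 3.8+
    n_prime = (r * r_inv - 1) // n   # from r * r_inv - n * n_prime = 1
    return r, r_inv, n_prime


def mon_pro(a_bar: int, b_bar: int, n: int, r: int, n_prime: int):
    """Montgomery product a_bar * b_bar * r^{-1} mod n.

    Returns (u, extra), where extra is True when the final
    subtraction of n was needed.
    """
    t = a_bar * b_bar
    m = (t * n_prime) % r
    u = (t + m * n) // r             # exact: t + m*n is a multiple of r
    if u >= n:
        return u - n, True
    return u, False


# Usage: multiply 42 * 77 modulo 97 via Montgomery form.
n = 97
r, r_inv, n_prime = montgomery_setup(n)
a_bar, b_bar = (42 * r) % n, (77 * r) % n           # enter Montgomery form
u_bar, _ = mon_pro(a_bar, b_bar, n, r, n_prime)     # (42 * 77) * r mod n
product, _ = mon_pro(u_bar, 1, n, r, n_prime)       # leave Montgomery form
assert product == (42 * 77) % n
```

Multiplying by 1 in the last step removes the factor r, which is how a value is converted back to standard form.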

In order to use Montgomery reduction, all variables must first be put into Montgomery form. The Montgomery form of a number x is simply xR mod q. To multiply two numbers a and b in Montgomery form we do the following. First, compute their product as integers: aR * bR = cR^2. Then, use the fast Montgomery reduction algorithm to compute cR^2 * R^{-1} = cR mod q. Note that the result cR mod q is in Montgomery form, and thus can be used directly in subsequent Montgomery operations. At the end of the exponentiation algorithm the output is put back into standard (non-Montgomery) form by multiplying it by R^{-1} mod q. For our attack, it is equivalent to use R and R^{-1} mod N, which are public.

Hence, for the small penalty of converting the input g to Montgomery form, a large gain is achieved during modular reduction. With typical RSA parameters the gain from Montgomery reduction outweighs the cost of initially putting numbers in Montgomery form and converting back at the end of the algorithm.

The key relevant fact about a Montgomery reduction is that at the end of the reduction one checks whether the output cR is greater than q. If so, one subtracts q from the output, to ensure that the output cR is in the range [0, q). This extra step is called an extra reduction and causes a timing difference for different inputs. Schindler noticed that the probability of an extra reduction during an exponentiation g^d mod q is proportional to how close g is to q [2]. Schindler showed that the probability of an extra reduction is

Pr[Extra Reduction] = (g mod q) / (2R)    (1)

Consequently, as g approaches either factor p or q from below, the number of extra reductions during the exponentiation algorithm greatly increases. At exact multiples of p or q, the number of extra reductions drops dramatically. Figure 1 shows this relationship, with the discontinuities appearing at multiples of p and q.
By detecting timing differences that result from extra reductions we can tell how close g is to a multiple of one of the factors.
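Equation (1) can be checked empirically with a small Monte Carlo experiment. The sketch below is an illustration under simplifying assumptions: it uses a toy 32-bit modulus q = 2^32 - 5 instead of a 512-bit prime, and it counts extra reductions over Montgomery multiplications of random values by a fixed Montgomery-form value c, rather than over a full OpenSSL exponentiation. The function name extra_reduction_rate is hypothetical.

```python
import random


def extra_reduction_rate(c: int, q: int, trials: int = 4000, seed: int = 1) -> float:
    """Fraction of Montgomery multiplications x * c that need an extra reduction.

    c is the fixed factor (already in Montgomery form), x is drawn
    uniformly from [0, q), and R = 2^k is the smallest power of 2 above q.
    """
    R = 1 << q.bit_length()
    q_prime = (-pow(q, -1, R)) % R       # q * q_prime = -1 (mod R)
    rng = random.Random(seed)
    extra = 0
    for _ in range(trials):
        x = rng.randrange(q)
        t = x * c
        m = (t * q_prime) % R
        u = (t + m * q) // R
        if u >= q:                        # the extra reduction would run here
            extra += 1
    return extra / trials


Q = 2**32 - 5                             # toy modulus, close to R = 2^32
R = 2**32
near = extra_reduction_rate(Q - 1000, Q)  # g just below q: g mod q is large
far = extra_reduction_rate(1000, Q)       # g just above q: g mod q is small
print(f"g just below q: {near:.3f}  (equation (1) predicts {(Q - 1000) / (2 * R):.3f})")
print(f"g just above q: {far:.3f}  (equation (1) predicts {1000 / (2 * R):.7f})")
```

The sharp drop between the two rates is the discontinuity that the attack detects through timing.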

3.2 Chinese Remainder Theorem

Two simultaneous congruences n = n_1 (mod m_1) and n = n_2 (mod m_2) are solvable only when n_1 = n_2 (mod gcd(m_1, m_2)). The solution is unique modulo lcm(m_1, m_2).

OpenSSL uses the Chinese Remainder Theorem (CRT) to perform the RSA private-key exponentiation. With Chinese remaindering, the function m = c^d mod N is computed in two steps. First, evaluate m_1 = c^{d_1} mod p and m_2 = c^{d_2} mod q (here d_1 and d_2 are precomputed from d). Then, combine m_1 and m_2 using CRT to yield m. Both exponentiations are done using the same code.

RSA decryption with CRT gives up to a factor of four speedup, making it essential for competitive RSA implementations. RSA with CRT is not vulnerable to Kocher's original timing attack [7]. Nevertheless, since RSA with CRT uses the factors of N, a timing attack can expose these factors. Once the factorization of N is revealed, it is easy to obtain the decryption key by computing d = e^{-1} mod (p-1)(q-1).

3.3 Sliding window exponentiation

OpenSSL uses an optimization of square-and-multiply called sliding window exponentiation. When using sliding windows, a block of bits (window) of d is processed at each iteration, whereas simple square-and-multiply processes only one bit of d per iteration. Sliding windows requires pre-computing a multiplication table, which takes time proportional to 2^{w-1} + 1 for a window of size w. Hence, there is an optimal window size that balances the time spent during precomputation against the time spent in the actual exponentiation. For a 1024-bit modulus OpenSSL uses a window size of five, so that about five bits of the exponent d are processed in every iteration.

[Algorithm: sliding window exponentiation; k denotes the window size.]
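As an illustration of the technique, here is a minimal Python sketch of left-to-right sliding window exponentiation in the style of the Handbook of Applied Cryptography. It works on ordinary residues with Python integers; OpenSSL's routine operates on values in Montgomery form, so this shows the window logic only. The function name sliding_window_pow is chosen for this example.

```python
def sliding_window_pow(g: int, e: int, n: int, k: int = 5) -> int:
    """Compute g^e mod n using sliding windows of at most k bits."""
    if e == 0:
        return 1 % n
    g %= n
    # Precompute the odd powers g^1, g^3, ..., g^(2^k - 1).
    g2 = (g * g) % n
    table = {1: g}
    for odd in range(3, 1 << k, 2):
        table[odd] = (table[odd - 2] * g2) % n

    result = 1
    i = e.bit_length() - 1               # index of the most significant bit
    while i >= 0:
        if not (e >> i) & 1:
            result = (result * result) % n   # a 0 bit: one squaring
            i -= 1
        else:
            # Longest window e_i ... e_l of at most k bits that ends in a 1.
            l = max(i - k + 1, 0)
            while not (e >> l) & 1:
                l += 1
            width = i - l + 1
            window = (e >> l) & ((1 << width) - 1)
            for _ in range(width):
                result = (result * result) % n
            result = (result * table[window]) % n
            i = l - 1
    return result


assert sliding_window_pow(7, 89, 1009, k=3) == pow(7, 89, 1009)
```

Because every window ends in a 1 bit, only odd powers are ever looked up, which is why the table needs just 2^{k-1} entries.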

3.4 Sliding window exponentiation attacked

Dr. Schindler made the following observation about the implementation of sliding window exponentiation in OpenSSL. Let y be the input for the exponentiation algorithm and R the Montgomery constant (in our case R = 2^{512}). Further, MM_q(a,b) denotes the Montgomery multiplication of a and b, i.e. MM_q(a,b) := a * b * R^{-1} (mod q). If the exponentiation uses CRT with sliding windows (window size = 5), then for the prime q (the same is clearly true for the other prime p) the following values are precomputed and stored:

y_1 := MM_q(y, R^2 (mod q)) = y * R (mod q),
y_2 := MM_q(y_1, y_1) = y^2 * R (mod q),
y_3 := MM_q(y_1, y_2) = y^3 * R (mod q),
y_5 := MM_q(y_3, y_2) = y^5 * R (mod q),
...
y_29 := MM_q(y_27, y_2) = y^29 * R (mod q),
y_31 := MM_q(y_29, y_2) = y^31 * R (mod q).

Boneh and Brumley's algorithm (B&B algorithm for short) exploits the multiplications with y_1 in the exponentiation phase, where y = [z * R^{-1} (mod n)] (mod n) with z ~ q. In particular, y_1 = z. For a 512-bit modulus q one would expect about 5-6 multiplications with y_1. The exact number of multiplications depends on q.

In Schindler's attack the input value is y = [u * (sqrt(R))^{-1} (mod n)] (mod n) where u ~ sqrt(q). Then y_2 = u^2 (mod q), since y^2 * R = u^2 * R^{-1} * R. To compute the values y_3, y_5, ..., y_31 one multiplies 15 times with y_2. If the bit being guessed is 0, then exactly one of the two u^2 values used for a timing difference is smaller than q and the other is larger than q.

4 Attack Implementation details

4.1 Background

Both the B&B and Schindler attacks are chosen-input attacks. The basic idea is to make an initial guess and refine it by learning bits one at a time, from the most significant to the least. For a 1024-bit modulus N, the initial guess g of q lies between 2^{511} and 2^{512}, i.e. between 2^{(log_2 N)/2 - 1} and 2^{(log_2 N)/2}. We therefore try all combinations of the top few bits, time the decryptions, and pick the first peak as the guess for q (after all, we at least know that the first bit is 1).

Suppose we have already recovered the top i-1 bits of q. Let g be an integer that has the same top i-1 bits as q, with the remaining bits of g set to 0. Then g < q. At a high level, we recover the i-th bit of q as follows:

Step 1 - Let g_hi be the same value as g, with the i-th bit set to 1. If bit i of q is 1, then g < g_hi < q. Otherwise, g < q < g_hi.

Step 2 - Compute u_g = g * R^{-1} mod N and u_ghi = g_hi * R^{-1} mod N. This step is needed because RSA decryption with Montgomery reduction will calculate u_g * R = g and u_ghi * R = g_hi to put u_g and u_ghi in Montgomery form before exponentiation during decryption.

Step 3 - We measure the time to decrypt both u_g and u_ghi. Let t1 = DecryptTime(u_g) and t2 = DecryptTime(u_ghi).

Step 4 - We calculate the difference Δ = |t1 - t2|. If g < q < g_hi then, by the discussion of extra reductions in Section 3.1, the difference Δ will be large, and bit i of q is 0. If g < g_hi < q, the difference Δ will be small, and bit i of q is 1. We use previous Δ values to know what to consider large and small. Thus we use the value |t1 - t2| as an indicator for the i-th bit of q.

When the i-th bit is 0, the large difference t1 - t2 can be either negative or positive. In this case, if t1 - t2 is positive then DecryptTime(g) > DecryptTime(g_hi), and the Montgomery reductions dominated the time difference. If t1 - t2 is negative, then DecryptTime(g) < DecryptTime(g_hi), and the multi-precision multiplication dominated the time difference.

To overcome the effect of using sliding windows, we query at a neighborhood of values g, g+1, g+2, ..., g+n, and use the result as the decrypt time for g (and similarly for g_hi). The total decryption time for g or g_hi is then

T_g = \sum_{i=0}^{n} DecryptTime(g + i)

We define T_g as the time to compute g with sliding windows when considering a neighborhood of values. As n grows, |T_g - T_ghi| typically becomes a stronger indicator for a bit of q (at the cost of additional decryption queries).

The B&B attack exploits the multiplications with the base (multiplied with the Montgomery constant R), whereas Schindler's attack exploits the multiplications with the second power of the base (multiplied with R) in the initialization phase of the table. Therefore, different strategies are used.

4.2 Implementation

Presented here is a timing attack that enables the factorization of an RSA modulus when the exponentiation uses CRT and Montgomery's algorithm. Earlier variants assumed that both exponentiations are carried out with a simple square-and-multiply algorithm.
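Steps 1, 2 and 4 of Section 4.1 can be sketched as follows. This is a toy illustration, not the attacking client itself: the moduli are 16- and 17-bit primes (q = 65521, p = 65537) instead of 512-bit ones, the function names chosen_inputs and guess_bit are hypothetical, and the threshold passed to guess_bit is assumed to be calibrated from earlier measurements, as the text describes.

```python
def chosen_inputs(known_top_bits: int, i: int, bits: int, N: int):
    """Build the two decryption queries used to test bit i of q.

    known_top_bits: the i-1 most significant bits of q, as an integer.
    i: 1-based index of the bit to test, counted from the most significant bit.
    bits: bit length of q; R = 2^bits plays the role of the Montgomery constant.
    """
    R = 1 << bits
    R_inv = pow(R, -1, N)                    # exists because N is odd
    g = known_top_bits << (bits - (i - 1))   # remaining bits are 0
    g_hi = g | (1 << (bits - i))             # same value with bit i set to 1
    u_g = (g * R_inv) % N                    # server computes u_g * R = g
    u_ghi = (g_hi * R_inv) % N
    return g, g_hi, u_g, u_ghi


def guess_bit(t1: float, t2: float, threshold: float) -> int:
    """Step 4: a large |t1 - t2| indicates bit 0, a small one bit 1."""
    return 0 if abs(t1 - t2) > threshold else 1


# Toy parameters (assumptions for illustration only).
q, p = 65521, 65537                          # q = 0xFFF1
N = p * q
g, g_hi, u_g, u_ghi = chosen_inputs(0xF, 5, 16, N)
assert (u_g * (1 << 16)) % N == g            # Montgomery conversion yields g
assert g < g_hi < q                          # so bit 5 of q is 1
```

In the real attack the ordering of g, g_hi and q is of course unknown; it is inferred from the timing difference returned by the server.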
Looking at the OpenSSL code, the following functions are called in this order whenever a decryption query is made:

    int RSA_private_decrypt(int flen, unsigned char *from, unsigned char *to, RSA *rsa, int padding);

calls the following function,

    int BN_mod_exp(BIGNUM *r, BIGNUM *a, const BIGNUM *p, const BIGNUM *m, BN_CTX *ctx);

which in turn calls the Montgomery multiplication function if the modulus is odd,

    int BN_mod_mul_montgomery(BIGNUM *r, BIGNUM *a, BIGNUM *b, BN_MONT_CTX *mont, BN_CTX *ctx);

This function then uses the following routine to convert back from Montgomery form:

    int BN_from_montgomery(BIGNUM *r, BIGNUM *a, BN_MONT_CTX *mont, BN_CTX *ctx);

Here is the extra reduction mentioned above, as it appears in the OpenSSL code (file bn_mont.c):

    if (BN_ucmp(ret, &(mont->N)) >= 0) {
        if (!BN_usub(ret, ret, &(mont->N))) goto err;
    }

Here is the pseudocode of the implementation of the B&B attacking client.

Our server:

    1. Create a TCP socket. Bind the socket to a port.
    2. Generate a 1024-bit key by calling RSA_generate_key() of the crypto library.
    3. Accept a connection from the client.
    4. Send the public modulus (n) to the client.
    5. Disable blinding using RSA_blinding_off().
    6. while (true) {
           Receive the guess.
           Decrypt it using RSA_private_decrypt().
           Send an end-of-decryption message back to the client.
       }

Attacking client:

    1. Create a TCP socket. Connect to the TCP server.
    2. Receive n and q from the server.
    3. Generate a 512-bit guess g.
    4. Set the first 36 bits of g equal to the first 36 bits of q; set all remaining bits to 0.
    5. For i (bit number) = 37 to 256 {
           T = T1 = 0
           g1 = g (except bit i is 1)
           for neighbourhood s = 0 to 800 {
               a. Calculate R as 2^{num bits of q} (mod n)
               b. Calculate ug = R^{-1} * (g + s) (mod n)
               c. Convert the guess to binary using BN_bn2bin()
               d. Send the above guess 7 times and, each time, record the difference in
                  clock ticks from the time the message is sent to the time the
                  end-of-decryption message is received.
               e. t = median of these 7 values
               f. T = T + t
               g. Do the above steps for g1 too; the summed time is T1.
           }
           Calculate Diff = (T - T1). If Diff is large, bit i is 0; otherwise bit i is 1
           (and g is replaced by g1).
       }

Following is the pseudocode for Schindler's attacking client:

    For i (bit number) = 32 to 240 {
        T = T1 = 0
        g1 = g (except bit i is 1)
        h1 = Round(\sqrt{g1}); h = Round(\sqrt{g});
            /* The Round(.) function rounds the argument to the closest integer */
        for neighbourhood s = 0 to 800 {
            a. Calculate R as 2^{num bits of q} (mod n)    /* in our case: R = 2^{512} */
            b. R05 = \sqrt{R}                               /* here: R05 = 2^{256} */
            c. Calculate uh = R05^{-1} * (h + s) (mod n)
            d. Send the above guess 7 times and, each time, record the difference in
               clock ticks from the time the message is sent to the time the
               end-of-decryption message is received.
            e. t = median of these 7 values
            f. T = T + t
            g. Do the above steps for h1 too; the summed time is T1.
        }
        Calculate Diff = (T - T1). If Diff is large, bit i is 0; otherwise bit i is 1
        (and g is replaced by g1).
    }

5 Attack Experiment Results

5.1 Setup

The attacks were performed against OpenSSL 0.9.7d, with RSA blinding turned off. All tests were run under Linux on the bee machine, using gcc 2.96. All keys were generated at random via OpenSSL's key generation routine.

Results were obtained according to the following strategy.

Notation: a = number of bits of the prime factors p and q, typically a = 512. R = 2^a; for a = 512 we have R = 2^{512}. R05 = 2^{a/2}, assuming that a is even; for a = 512 we have R05 = 2^{256}.

After having guessed bit i-1 of the prime p (counting from the left, starting with i = 1), there exist two bounds g_{i-1,low} and g_{i-1,high} with

g_{i-1,high} = g_{i-1,low} + 2^{a-i}

If all guesses so far have been correct, then g_{i-1,low} < p < g_{i-1,high}. Then g := g_{i-1,low} and g_1 := g + 2^{a-i-1}. This is the same for both attacks.

5.1.1 Boneh & Brumley's attack:

Determine the following:

t_L := 1/(2b+1) * \sum_{i=-b}^{b} DecryptTime( (g+i) * R^{-1} (mod n) )
t_H := 1/(2b+1) * \sum_{i=-b}^{b} DecryptTime( (g_1+i) * R^{-1} (mod n) )

then compute \Delta_{BB} := t_L - t_H. Repeat this for b = 100, 200, 300, 400. As different time measurements of the SAME input sometimes yield varying results (due to delays from the network), each term on the right-hand side is measured more than once. Then the median of the values is used.

5.1.2 Schindler's attack:

Notation:

h := Round(\sqrt{g})
h_1 := Round(\sqrt{g_1})

Determine the following:

t_{L,S} := 1/(2b+1) * \sum_{i=-b}^{b} DecryptTime( (h+i) * R05^{-1} (mod n) )
t_{H,S} := 1/(2b+1) * \sum_{i=-b}^{b} DecryptTime( (h_1+i) * R05^{-1} (mod n) )

then compute \Delta_S := t_{L,S} - t_{H,S}. Repeat this for b = 100, 200, 300, 400.
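The two statistics of Sections 5.1.1 and 5.1.2 can be sketched in Python as follows. This is a sketch under stated assumptions: decrypt_time is a caller-supplied function standing in for one timed round trip to the server, the default of 7 repeated measurements per input is borrowed from the client pseudocode, and all function names are chosen for this example.

```python
from math import isqrt
from statistics import median


def round_sqrt(x: int) -> int:
    """Round(sqrt(x)): the integer closest to the square root of x."""
    r = isqrt(x)
    return r + 1 if x - r * r > r else r


def mean_decrypt_time(center, b, scale_inv, n, decrypt_time, repeats=7):
    """Average over i = -b..b of the median time for (center + i) * scale_inv mod n."""
    total = 0
    for i in range(-b, b + 1):
        query = ((center + i) * scale_inv) % n
        total += median(decrypt_time(query) for _ in range(repeats))
    return total / (2 * b + 1)


def delta_bb(g, g1, a, n, b, decrypt_time):
    """t_L - t_H for the B&B attack; inputs are scaled by R^{-1}, R = 2^a."""
    r_inv = pow(1 << a, -1, n)
    return (mean_decrypt_time(g, b, r_inv, n, decrypt_time)
            - mean_decrypt_time(g1, b, r_inv, n, decrypt_time))


def delta_s(g, g1, a, n, b, decrypt_time):
    """t_{L,S} - t_{H,S} for Schindler's attack; inputs are scaled by R05^{-1}."""
    r05_inv = pow(1 << (a // 2), -1, n)
    h, h1 = round_sqrt(g), round_sqrt(g1)
    return (mean_decrypt_time(h, b, r05_inv, n, decrypt_time)
            - mean_decrypt_time(h1, b, r05_inv, n, decrypt_time))
```

Taking the median of repeated measurements of the same input damps network jitter, while averaging over the neighborhood of 2b+1 inputs smooths out the effect of the sliding windows.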

5.2 Graphs

The following figure shows that the time variance for decrypting a particular ciphertext decreases as we increase the number of samples taken.

Fig 1: Timings for bit 57 (value 1 in q) and bit 59 (value 0 in q) with Brumley's algorithm.

This graph depicts the timing differences obtained by the B&B attack. As can be seen, it is very easy to distinguish between a bit of q that is 0 and one that is 1.

The figure above depicts the timings for Schindler's algorithm.

Conclusion

In this report we discussed and investigated two approaches to timing attacks on fast software implementations of RSA exponentiation that use CRT, Montgomery's algorithm and sliding window techniques. The experiments show that, counter to current belief, the timing attack is effective even when carried out against software implementations and in complex environments such as networks with machines separated by multiple routers. To defend against these attacks, the following measures could be adopted: use only one multiplication routine and always carry out the extra reduction in Montgomery's algorithm; quantize all RSA computations; and use blinding, which is currently the preferred defense.

References

[1] David Brumley and Dan Boneh. Remote Timing Attacks are Practical. 2003.
[2] Werner Schindler. A timing attack against RSA with the Chinese remainder theorem. In CHES 2000, pages 109-124, 2000.
[3] Werner Schindler. Optimized timing attacks against public key cryptosystems. Statistics and Decisions, 20:191-210, 2002.
[4] Werner Schindler, François Koeune, and Jean-Jacques Quisquater. Improving divide and conquer attacks against cryptosystems by better error detection/correction strategies. Lecture Notes in Computer Science, 2260:245-267, 2001.
[5] C. K. Koc, T. Acar, and B. S. Kaliski. Analyzing and Comparing Montgomery Multiplication Algorithms. IEEE Micro, vol. 16, no. 3, pp. 26-33, June 1996.
[6] Werner Schindler, François Koeune, and Jean-Jacques Quisquater. Unleashing the full power of timing attack. Technical Report CG-2001/3, 2001.
[7] Paul C. Kocher. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. 1996.
[8] OpenSSL Project. OpenSSL. http://www.openssl.org.
[9] Werner Schindler. A combined timing and power attack. Lecture Notes in Computer Science, 2274:263-279, 2002.