GPU File System Encryption
Kartik Kulkarni and Eugene Linkov
5/10/2012

SUMMARY. We implemented a file system that encrypts and decrypts files using the AES algorithm, computed through a hybrid use of the CPU and GPU.

BACKGROUND. This project was motivated by a clear deficiency in GPU utilization by system software. One area in particular that creates substantial overhead on a machine is security. GPUs are an attractive way to accelerate security workloads: they are readily available on most machines and are cheaper to use than investing in more powerful processors. The algorithm we used for data encryption and decryption was AES, a popular encryption algorithm. We began with a common implementation of AES, specifically the implementation by Karl Malbrain from Berkeley (http://www.geocities.ws/malbrain/aestable2_c.html). This implementation is used by commercial products such as Adobe Media Player (http://www.adobe.com/products/eula/third_party/mediaplayer/AdobeMediaPlayer1_1_ReadMe.pdf) and several others. AES is inherently a byte-parallel algorithm, so we parallelized it as follows. We statically assign one thread per byte: for 16*16 bytes of data, for example, we launch 16 blocks of 16 threads, each thread working on one byte. Each thread performs all the operations required for its byte, including the variants of ShiftRows, MixSubColumns, and AddRoundKey. We cache the integer stage in shared memory and place the S tables in constant memory. The expanded-key computation from the 128-bit key, used by the individual rounds of AES, is done on the CPU since it is a one-time effort.

APPROACH. To get the best performance, we used both the CPU and GPU when encrypting and decrypting files. The challenge was determining when to use which one — that is, balancing latency against throughput.
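The static one-thread-per-byte assignment can be sketched in plain C, with the GPU grid simulated by two loops (on the device, `blockIdx.x` and `threadIdx.x` would replace the loop indices). The per-byte round work here is a placeholder XOR so the sketch is self-contained; the real tables and transforms come from Malbrain's implementation.

```c
#include <stdint.h>
#include <string.h>

#define BLOCKS  16   /* one CUDA block per 16-byte AES state  */
#define THREADS 16   /* one thread per byte within the state  */

/* Placeholder for the per-byte round work (SubBytes via the S table,
 * the ShiftRows/MixSubColumns variants, AddRoundKey).  Here we just
 * XOR with the round-key byte so the sketch stands alone. */
static uint8_t round_op(uint8_t b, uint8_t rk) { return b ^ rk; }

/* CPU stand-in for the kernel: each (block, thread) pair owns exactly
 * one byte of the 16*16-byte buffer, mirroring the static assignment
 * described above. */
static void aes_rounds_sketch(uint8_t *data, const uint8_t *round_keys,
                              int nrounds)
{
    for (int blk = 0; blk < BLOCKS; blk++)      /* blockIdx.x  */
        for (int t = 0; t < THREADS; t++) {     /* threadIdx.x */
            int i = blk * THREADS + t;          /* byte owned  */
            for (int r = 0; r < nrounds; r++)
                data[i] = round_op(data[i], round_keys[r * 16 + t]);
        }
}
```

Because every byte is handled independently, the loops map directly onto a CUDA launch of `<<<16, 16>>>`; only the expanded key, computed once on the CPU, is shared by all threads.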
For small reads and writes, sending data to the GPU takes a long time, because copying to the GPU occurs over a slow bus. However, once reads and writes grew larger, the parallelized AES implementation performed significantly better. With this in mind, we considered which systems would benefit most from this approach and identified computers with low-end processors as the prime candidates for a GPU-assisted implementation. We ordered an ASUS netbook with a 1.8 GHz processor and an NVIDIA graphics card. After countless hours and many failed attempts, however, we could not set up the CUDA drivers on it because of conflicts with NVIDIA Optimus technology. We then attempted to set up the environment on a Lenovo W520, but ran into the same problems. Frustrated, we fell back to a computer in the Gates cluster with an Intel Xeon 2.67 GHz CPU and an NVIDIA Quadro FX 580, both quite powerful. The next major hurdle was creating the file system and pipelining all reads and writes through our encryption and decryption code. This proved quite challenging because the file must be divided and padded correctly for encryption, and debugging these issues took some time. The final challenge was deciding what counts as a small versus a large file. This threshold is unfortunately hardware specific; in our current implementation we found it by running tests with files of different sizes. The results section below has graphs showing how we determined the threshold for the setup we were using. Below are diagrams illustrating the GPUFS design: an overview and the FUSE layer.
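The size-based routing and block padding described above can be sketched in plain C. The function names are illustrative, the stubs only count bytes, and the 16 KB threshold is the empirical crossover from our measurements — on other hardware it would be re-derived by benchmarking.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Empirically determined crossover for our test machine; this value
 * is hardware-specific and was found by benchmarking. */
#define GPU_THRESHOLD (16 * 1024)

/* AES operates on 16-byte blocks, so the buffer is padded up first.
 * Returns the padded length; pad bytes are zeros in this sketch. */
static size_t pad_to_block(uint8_t *buf, size_t len, size_t cap)
{
    size_t padded = (len + 15) & ~(size_t)15;
    if (padded > cap) return 0;       /* caller's buffer too small */
    memset(buf + len, 0, padded - len);
    return padded;
}

/* Stubs standing in for the real paths (aesserial.c / the CUDA code);
 * they only record how many bytes each path received. */
static size_t cpu_bytes, gpu_bytes;
static void aes_encrypt_cpu(uint8_t *buf, size_t len) { (void)buf; cpu_bytes += len; }
static void aes_encrypt_gpu(uint8_t *buf, size_t len) { (void)buf; gpu_bytes += len; }

/* Route a request to the CPU or GPU based on its size. */
static void encrypt_dispatch(uint8_t *buf, size_t len)
{
    if (len < GPU_THRESHOLD)
        aes_encrypt_cpu(buf, len);   /* bus transfer would dominate */
    else
        aes_encrypt_gpu(buf, len);   /* parallel win outweighs copy */
}
```

The dispatch itself is trivial; the hard part, as described above, is choosing `GPU_THRESHOLD` for the machine at hand.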
RESULTS. Overall, we achieved significant speedup, and more importantly we showed that there is potential for even greater speedup with the right tools. After running the implementation on files of various sizes, we obtained the results illustrated in the graphs below. We produced these figures by placing timers around each of the computational methods, as well as between the various system calls and data transfers, so that we had a clear sense of what was causing bottlenecks and where the limits of our model lay. We compared encryption and decryption on both the CPU and the GPU, and we compared GPU-only against CPU-only performance. With encryption we often ran into problems because the timings were affected by caching; we overcame this by avoiding file reuse.
As we can see from the two graphs above, when the data size is very small there is a huge overhead to using the GPU. This is due to the relatively slow bus over which data is sent to and from the GPU. As the data size increases, however, a clear trend emerges: by 16 KB the GPU implementation becomes faster than the CPU version for both encryption and decryption. This trend continues until just over 1 MB, at which point both implementations roughly plateau and the GPU implementation is roughly 6x faster than the CPU. The chart above shows our final results. The absolute speedup of the algorithm is about 13.2x. When we include data transfer, as in the two prior graphs, the speedup drops to 6x. Once we include our filesystem, performance drops further, to roughly 3x. Below we look at some potential reasons why the speedup is so much smaller with the filesystem.

A number of limitations prevented us from achieving the ideal theoretical speedup. One major one was that we were not able to do all the processing at the VFS layer. This means that for every write, for example, the data is copied nine times, which creates a lot of overhead:
1) User space to the VFS layer
2) VFS layer (kernel space) back to FUSE (user space)
3) FUSE layer to a buffer (a flaw in our implementation)
4) CPU to GPU global memory
5) GPU global memory to shared memory (bottleneck)
6) Shared memory back to global memory (bottleneck)
7) GPU global memory back to the CPU
8) GPUFS to kernel space
9) Kernel space to user space
Another limitation to higher speedup, as mentioned earlier, is that we were unable to use the hardware we believe would benefit most from the CPU+GPU hybrid over a CPU-only implementation: computers with slow CPUs, such as netbooks. A few issues could affect our analysis. Although we used a widely deployed implementation of AES, the serial implementation could likely be improved, which would in turn shift the threshold and the relative performance of GPUFS.

Major challenges in real-world adoption:
1) Since we use a hybrid approach in choosing the CPU or GPU for AES, we set a threshold for when to route data to the GPU based on the CPU speed, the GPU speed, and the number of GPU cores available. This means the GPU is used only if the data being read or written is larger than the threshold. Applications such as cp and vi, even when copying or reading an entire large file, read the disk block by block in 4096-byte chunks.
If the threshold happens to be above 4096 bytes (as on fast CPUs), all the AES work is routed to the CPU even though the application intends to read or write far more than the threshold in total. Image applications such as photo viewers usually read more than one block at a time, crossing the threshold and helping us test our system. This problem can be overcome by: 1) optimistically buffering data in memory until the threshold is crossed, then routing it to the GPU — we must decide how long to wait for the threshold to be crossed, since this parameter affects the latency of the file system; 2) streaming data to the GPU as it is read or written, to hide the memory-transfer latency — CUDA provides this feature, but we have yet to use it.
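Option 1 above — accumulating sub-threshold 4096-byte chunks until the batch is worth shipping to the GPU — could look roughly like this. The flush functions are stubs standing in for the real AES paths, and the capacity and threshold values are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GPU_THRESHOLD (16 * 1024)   /* hardware-specific crossover */
#define STAGE_CAP     (64 * 1024)

/* Staging buffer for 4096-byte chunks arriving from FUSE. */
static uint8_t stage[STAGE_CAP];
static size_t  staged;

/* Stubs for the two flush paths; the real ones call into the AES code. */
static size_t flushed_to_gpu, flushed_to_cpu;
static void flush_gpu(const uint8_t *p, size_t n) { (void)p; flushed_to_gpu += n; }
static void flush_cpu(const uint8_t *p, size_t n) { (void)p; flushed_to_cpu += n; }

/* Buffer a chunk; ship the whole batch to the GPU once the threshold
 * is crossed.  How long data may sit here bounds filesystem latency. */
static void stage_write(const uint8_t *chunk, size_t n)
{
    if (staged + n > STAGE_CAP) { flush_gpu(stage, staged); staged = 0; }
    memcpy(stage + staged, chunk, n);
    staged += n;
    if (staged >= GPU_THRESHOLD) { flush_gpu(stage, staged); staged = 0; }
}

/* Called on fsync/close: anything still below the threshold goes to
 * the CPU path rather than paying the bus-transfer cost. */
static void stage_flush(void)
{
    if (staged) { flush_cpu(stage, staged); staged = 0; }
}
```

A production version would also need a timer so that a half-full buffer is not held indefinitely — this is exactly the latency parameter discussed above.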
2) Concurrent file encryption. To support file I/O in parallel with GPU encryption, we would have to use a separate CUDA stream for every file, so that the kernels launched for different files can proceed independently. We have yet to do this as well.

REFERENCES.
http://www.geocities.ws/malbrain/aestable2_c.html
http://www.adobe.com/products/eula/third_party/mediaplayer/AdobeMediaPlayer1_1_ReadMe.pdf
http://www.cs.columbia.edu/techreports/cucs-002-04.pdf

LIST OF WORK BY EACH STUDENT. We worked together on all the setup, coding, presentation preparation, and the final report.

CODE. The code is attached; the main files are aesserial.c, aesparellel.cu, and bbfs.c.