Sharemind - the Oracle of secure computing systems Dan Bogdanov, PhD Sharemind product manager dan@cyber.ee
About Sharemind Sharemind helps you analyse data you could not access before. Sharemind resolves trust issues by removing centralised control and unwanted data access points.
Architecture of an IT service Client: web/mobile apps or other services. Service: business logic incl. data analysis. Storage: data storage and query engine. [Diagram: Client, Service and Storage layers, with data access points for internal and external parties.]
Levels of data encryption [Diagram: three Client / Service / Storage stacks compared side by side, labelled Industry standard, SSE/OPE/SDE and Sharemind; additional labels: Database, Queries.]
Sharemind platform Encrypted computing; Policy enforcement; Audit and verification. Link, sort, correlate. MPC/HE; Intel SGX; Multi-party consensus; Disclosure control; Online verification; Offline audit.
Programming toolchain SecreC language (C-like, no cryptographic knowledge necessary) Standard library with array and matrix operations, oblivious access, statistical testing, sorting, linking, regression modelling, etc. 15 000 lines of reusable SecreC code
Application Server paradigm Interfaces: Java/JavaScript/C/C++/Haskell. Clients: mobile apps, web apps, desktop apps, SQL queries, Rmind statistics package. [Diagram: application servers and database backends on Host 1, Host 2, ..., Host n.]
Sharemind Analytics Engine Rmind
Case study: Government data analytics
IT training has a failure rate [Figure: bar chart of new IT students and of students who quit their studies before November 2012, by year of enrolment, 2006 to 2012; y-axis: number of students.] By 2012, a total of 43% of the students enrolled in the four largest IT higher learning institutions in Estonia during 2006-2012 had quit their studies. Source: Estonian Ministry of Education and Research, CentAR.
Legal breakthroughs January 2014: the Estonian Data Protection Agency declared that Sharemind technology and processes protect data so well that the Personal Data Protection Act doesn't apply. January 2015: after a code audit, the internal oversight at the Tax Board agreed to upload actual income tax records into the Sharemind-based analysis system. February 2015: the Tax Board, Ministry of Education, Information Systems Authority, Ministry of Finance IT Center and Cybernetica signed the world's first secure multi-party data analysis agreement.
Step 1: Import data [Diagram labels: Estonian Education Information System (Ministry of Education and Research); Register of taxable persons (Estonian Tax and Customs Board); Estonian Information System's Authority; Ministry of Finance IT Center; Cybernetica.] Data owners uploaded data with the Sharemind importer. Each value was encrypted at the source, so private data never left the data owner unprotected. Over 600 000 study records (100 MB) used. Over 10 million tax records (1 GB) used. Largest MPC application on real-world data.
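As a rough illustration of what encrypting a value at the source can mean with secret sharing, the sketch below splits a value into additive shares, one per computing host. The function names, the ring size of 2^32 and the three-host setup are assumptions made for this example; it is not the Sharemind importer.

```python
import secrets

MODULUS = 2 ** 32  # assumed ring size for this sketch
NUM_HOSTS = 3      # assumed number of computing hosts


def share(value, num_hosts=NUM_HOSTS):
    """Split value into additive shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(num_hosts - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Recombine all shares; any proper subset looks uniformly random."""
    return sum(shares) % MODULUS


income = 1250  # hypothetical record value held by a data owner
host_shares = share(income)  # one share is sent to each host
print(len(host_shares), reconstruct(host_shares))
```

Only the shares leave the data owner. A single host that sees one share learns nothing about the underlying value.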
Step 2: Run the analysis [Diagram labels: Estonian Information System's Authority; Ministry of Finance IT Center; Cybernetica; Statistician (CentAR); Universities; Companies; Policymakers.] Statisticians used Rmind to post queries. Sharemind ensured that only queries in the study plan were actually executed. Additional microdata protection controls were enforced.
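One way to picture the study-plan check is an allow-list that every host consults before it takes part in a query. The sketch below is hypothetical: the query names, the `StudyPlan` class and the unanimous-approval rule are assumptions for illustration, not Sharemind's actual policy mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyPlan:
    """Allow-list of query identifiers agreed on before the study."""
    allowed_queries: frozenset

    def permits(self, query_id):
        return query_id in self.allowed_queries


def all_hosts_approve(query_id, host_plans):
    """Run a query only if every host's copy of the plan permits it."""
    return all(plan.permits(query_id) for plan in host_plans)


# Hypothetical query names; each host holds its own copy of the plan.
plan = StudyPlan(frozenset({"graduation_rate_by_year", "employment_rate_by_year"}))
hosts = [plan, plan, plan]

print(all_hosts_approve("graduation_rate_by_year", hosts))  # True
print(all_hosts_approve("dump_all_records", hosts))         # False
```

Because each host checks independently, a single honest host is enough to block a query that is outside the plan.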
Operations performed [Flow diagram. Data owners: Tax and Customs Board (employment tax payments); Ministry of Education and Science (higher study events). Operations shown: extract data; secret share and upload; aggregate by month; aggregate by year; aggregate by person; expand by years and aggregate by person; merge by person's ID; compute additional attributes and align tax payments; recover results from shares. Data items shown: monthly income; average yearly income; employment record of a person; university career of a person; complete record of a person; analysis table; analysis results, delivered to the statistical analyst.] Data stored with secret sharing and processed with secure multi-party computation.
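The aggregation steps can build on a property of additive secret sharing: sums can be computed share by share, so each host adds the shares it holds locally and only the final total is recovered. The sketch below assumes three hosts, arithmetic modulo 2^32 and made-up monthly amounts. It covers addition only; merging by ID, sorting and comparisons need interactive protocols that are not shown here.

```python
import secrets

MODULUS = 2 ** 32  # assumed ring size


def share(value, num_hosts=3):
    """Additively share value among num_hosts hosts (mod MODULUS)."""
    parts = [secrets.randbelow(MODULUS) for _ in range(num_hosts - 1)]
    return parts + [(value - sum(parts)) % MODULUS]


def local_sum(shares_held_by_one_host):
    """Each host sums the shares it holds, without seeing any value."""
    return sum(shares_held_by_one_host) % MODULUS


def recover(result_shares):
    """Combine one result share per host into the public result."""
    return sum(result_shares) % MODULUS


# Hypothetical monthly tax payments of one person (made-up numbers).
monthly_payments = [310, 295, 330]
shared = [share(v) for v in monthly_payments]   # one row per value
per_host = list(zip(*shared))                   # one column per host
yearly_shares = [local_sum(col) for col in per_host]
print(recover(yearly_shares))  # 935
```

The individual payments are never reconstructed; only the aggregate is opened to the analyst.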
IT is harder to graduate 2. Results. The results confirm that the share of students who graduate within the nominal study period is low among students in general and among ICT students in particular. Among ICT students in bachelor's studies, the share of graduates within the nominal period varies around 20 percent, which is lower than the corresponding figure for students of other curricula (see Figure 1). The same tendency appears in professional higher education. In master's studies the share of graduates within the nominal period is somewhat higher, varying between 30% and 40% depending on the year, but here too it is lower among ICT students than among others. Figure 1. Share of graduates within the nominal period by year of matriculation, ICT and non-ICT curricula, bachelor's studies. Male students are less likely than female students to graduate within the nominal period (see Figure 2, Figure 3). The lower probability of ICT students graduating within the nominal period appears for both sexes.
All students are working Looking at employment during the nominal study period, it turns out, surprisingly, that ICT students do not work during their studies more than students in non-ICT curricula. In bachelor's studies, across all study years, the employment rate in most years is higher among non-ICT students than among ICT students (see Figure 4). The same conclusion holds for students in professional higher education. In master's studies, where employment rates exceed 80%, the result is the opposite: the employment rate is higher among ICT students than among non-ICT students. Figure 4. Share of students who worked during the nominal period among all students, by year, ICT and non-ICT curricula, bachelor's studies. The employment rate of female students is somewhat higher than that of male students, among both ICT and non-ICT students (Figure 5, Figure 6). Gender differences in employment rates vary by year: in some years the difference is considerable, and in others there is no significant difference.
Case study: Tax fraud detection
VAT evasion is a problem [Figure: chart in MEUR by tax type: VAT, social tax, income tax, alcohol excise, tobacco excise, fuel excise, packaging excise.]
The story of the 1000 law
Secure implementation Benefits: Analyze, combine and build reports without decrypting data. Encryption is applied on the data directly at the source. Confidentiality is guaranteed against all servers and against malicious hackers. Values are only decrypted when all hosts agree to do so. The data is cryptographically protected during processing. No need to unconditionally trust a single organization. [Diagram: secure multi-party computation system with database, hosted on the Tax Office server, the Taxpayer's association's server and the Watchdog NGO server; labels: Taxpayers, Transactions, Tax Office, Risk queries, Risk scores.]
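The statement that values are only decrypted when all hosts agree can be illustrated with additive shares: a result is recoverable only if every host releases its share. The sketch below is a toy under stated assumptions (three hosts named after the servers in the diagram, a ring of size 2^32 and a hypothetical `declassify` helper), not the deployed protocol.

```python
import secrets

MODULUS = 2 ** 32  # assumed ring size
HOSTS = ("tax_office", "taxpayers_association", "watchdog_ngo")


def share(value):
    """Give each host one additive share of value (mod MODULUS)."""
    parts = [secrets.randbelow(MODULUS) for _ in HOSTS[:-1]]
    parts.append((value - sum(parts)) % MODULUS)
    return dict(zip(HOSTS, parts))


def declassify(shares, approvals):
    """Return the value only if every host approves releasing its share."""
    if not all(approvals.get(host, False) for host in HOSTS):
        return None  # without every share the value stays hidden
    return sum(shares[host] for host in HOSTS) % MODULUS


risk_score = 87  # hypothetical risk score
shares = share(risk_score)

print(declassify(shares, {h: True for h in HOSTS}))                    # 87
print(declassify(shares, {"tax_office": True, "watchdog_ngo": True}))  # None
```

With additive sharing, any two of the three shares are jointly uniform and independent of the value, so a host that withholds its share effectively vetoes the disclosure.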
First performance results [Figure: total running time of aggregation (s) against the number of companies processed (100 to 2000), for 1, 2, 4 and 8 aggregators.] We estimated that it would have taken 10 days to process one month of data (50M invoices, 80 000 companies). Matching is a hard problem.
Cloud deployment on AWS
Much larger data sizes
The source data for 100 000 000 transactions had a total size of 35 GB in XML format (about 1 GB in the secret-shared database).
4 Benchmark results
We used three input data sets of different sizes in our benchmarks (see Table 3). The largest data set corresponds to the estimates of Estonia's Tax and Customs Board on the number of taxable persons and performed business transactions in one month in Estonia. Each company's tax declaration is an XML file consisting of a number of sales and purchase transactions with different business partners.
Table 3. Descriptions of the three data sets used in the experiments.
No. of companies | No. of transaction partner pairs | Total no. of transactions
20 000 | 200 000 | 25 000 000
40 000 | 400 000 | 50 000 000
80 000 | 800 000 | 100 000 000
In the upload phase, declarations were uploaded to the 80 Sharemind processes, each process receiving a single declaration at a time. After aggregating the data, the results were moved together into a single process running on three instances, and the remaining instances were closed. Note that each party only moves data shares between instances that it controls. The single process then merged the data and performed the risk analysis computations. We used Amazon CloudWatch to monitor the CPU, network and memory usage of the instances. The running times of all computations are presented in Figure 4. The performance of the prototype has significantly improved compared to the earlier version and is well within practical limits, as the analysis only needs to be performed once in a single tax period (each month). As can be expected, in multi-region deployments the computations are slower due to the
Impressive running times [Figure: stacked bar chart of computation time (0 to 9 hours) against the number of companies (20k, 40k, 80k) for three deployment configurations across US and EU regions, split into upload, aggregation and risk analysis phases. Durations labelled on the chart: 38:44, 01:14:36, 01:23:10, 02:25:12, 02:47:53, 04:26:15, 05:05:16, 08:53:00.]
We build applications Learn about Sharemind http://sharemind.cyber.ee/ Open source prototyping tools http://sharemind-sdk.github.io/ Contact us for more information and collaborations E-mail: sharemind@cyber.ee Twitter: @sharemind