Scuola Politecnica e delle Scienze di Base
Master's Degree Programme in Computer Engineering
Master's Thesis in Distributed Systems

A Visual Interactive Realtime EXplorer for Bitcoin

Academic Year 2013/14

Supervisor: Ch.mo Prof. Stefano Russo
Co-supervisor: Dott. Marco Benedetti
Candidate: matr. M63000017
Introduction
Chapter 1: Bitcoin: the protocol and the currency
  Addresses
  Transactions
  Blocks
  Monetary aspects and incentive
  Bitcoin origins
  Bitcoin ecosystem
Chapter 2: State of the art in bitcoin forensic analysis
  Related literature
  The flexcoin case
  The biggest unsolved case: Mt.Gox eruption
Chapter 3: Considerations about the blockchain
Chapter 4: Requirements specification
Chapter 5: High-level architecture
Chapter 6: The Identity Reasoner
  Address clustering
  Data structure
  Invariants of the identity reasoner database
  Considerations about address graph size
  Implementation details and experimental results
Chapter 7: The Query Engine
  Balance queries
  Flow queries
  Implementation details
Chapter 8: The graphical user interface
  Implementation details
Chapter 9: Experimental results
Future directions
References
Introduction

Bitcoin (BC) is a peer-to-peer, decentralised virtual currency born a few years ago and currently used for trading various services and goods all over the world. The currency is not legally recognised by most governments and is controlled by no central entity. Despite these shortcomings (or perhaps thanks to them), the market capitalisation of BC is already in the order of billions of dollars.

The identity of Bitcoin users is hidden behind pseudonyms, but the ledger of financial transactions is globally visible. Many academic studies deal with the problem of linking pseudonyms to real-world identities and inferring knowledge from the graph of transactions (whose size is in the order of tens of millions and rising quickly). At present, the graph of transactions is explored manually, by ad-hoc scripting, or by using software developed for other goals, such as visualizers for (graph) DBs. Looking for and making sense of interesting or novel transaction patterns can be quite challenging.

This work aims to produce a modular, scalable, adaptable software toolkit meant to assist a human expert in analysing and making sense of a network of bitcoin transactions. The software will be called VIREX-BC, as in Visual Interactive Realtime EXplorer for BC: visual, in that all the information will be presented and explored through sophisticated imagery and infographics generated on the fly depending on the search context; interactive, in that input from users will be accepted at any moment to direct and refine the exploration, and
realtime, in that fresh transactions will be included and analysed on the fly as they are timestamped in the network.

This work is inspired by Bitiodine, an open source tool for extracting intelligence from the bitcoin network, developed by Michele Spagnuolo for his master's thesis at Politecnico di Milano and published at Financial Cryptography [3].
Chapter 1: Bitcoin: the protocol and the currency

Bitcoin is a decentralised global digital currency, based on open-source software that implements a peer-to-peer network which agrees on a logical order of transactions thanks to a distributed algorithm. It first appeared in January 2009, with the first release of the Bitcoin client. In this chapter we describe the working principles of Bitcoin (addresses, transactions and blocks) and the ecosystem of services born around this technology.

Addresses

Before accepting payments in bitcoin, we have to create a bitcoin address. An address is a string of letters and numbers which can be thought of as an International Bank Account Number (IBAN): it's public, it has a reduced risk of transcription error, and it's needed to receive money. An example of an address is 1PeppeMEUXx6XgjubBnEQKtay2xpefnCZT, which currently has a balance of 0.2 bitcoin. Generating a bitcoin address has no cost and can be done using the bitcoin client.

In the privacy model introduced by bitcoin (see picture on the left), transactions are public, so addresses also work as pseudonyms that hide people's real identities. A person can have multiple addresses, therefore one of the principal activities in bitcoin forensic analysis is linking addresses
controlled by the same user. The chapter The Identity Reasoner gives a definition of address control and of cluster of addresses.

Transactions

Bitcoin transactions are public transfers of funds among addresses: they transfer bitcoins from zero or more bitcoin addresses to one or more bitcoin addresses. The following picture shows four transactions in a scenario in which Alice and Bob are two bitcoin users. For each of the four transactions, input and output addresses are represented by colours: the transactions on the far right have no input addresses and one output address; the middle transaction has one input address (green) and two output addresses (red and cyan); finally, the fourth transaction has two input addresses and two output addresses.
Moreover, each input of a transaction has a reference (marked with a broken line) to the output of a previous transaction. These references are needed to prove that an address owns the bitcoins it wants to transfer and, together with the transactions, they form a directed acyclic graph. Ron and Shamir analyse this graph in detail in [6].

The transactions on the far right have neither input addresses nor references to previous transactions: they are the so-called mining transactions. Mining transactions are the way bitcoins are injected into the network, and their amount is a reward for bitcoin miners, the ones who contribute to writing the bitcoin public ledger (see the Blocks section for details).

The middle transaction has a very common structure: it has one input address (the green one, which belongs to Bob) and two outputs (Bob's red address and Alice's cyan address). In this case, Bob is spending his mined 50 bitcoins to pay 30 bitcoins to Alice's cyan address, which already has a balance of 50 bitcoins. Since the bitcoin protocol forces users to spend the whole output of a transaction, Bob needs to spend the whole content of his green address (50 bitcoins), even if he just wants to transfer 30 bitcoins to Alice. In order to collect change, Bob creates a new red address, called a change address, and sends 19 bitcoins to it. As you have probably noticed, the sum of the inputs is not equal to the sum of the outputs: there is a difference of 1 bitcoin. This difference (which in practice is lower than 1 bitcoin) is called a transaction fee, and is necessary to let the network accept and timestamp the transaction.

The last transaction is multi-input. Let's suppose that Alice wants to buy a service for 60 bitcoins. She must prove that she owns 60 bitcoins, so she needs to insert two references to the previous transactions that have deposited 80 bitcoins in her cyan address.
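The arithmetic of Bob's transaction can be written down as a minimal sketch. The names `Output`, `spend` and `fee`, and the address labels, are hypothetical and chosen only for illustration; amounts are in whole bitcoins for readability, and the 1 bitcoin fee is the illustrative figure used in the text, not a realistic one.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Output:
    address: str
    amount: int  # whole bitcoins, for readability only


def spend(input_amounts: List[int], payee: str, amount: int,
          change_address: str, fee_amount: int) -> List[Output]:
    """Spend the referenced outputs entirely: pay the payee, keep the change."""
    change = sum(input_amounts) - amount - fee_amount
    if change < 0:
        raise ValueError("inputs do not cover the amount plus the fee")
    outputs = [Output(payee, amount)]
    if change > 0:
        # Whatever is not paid out or left as fee goes to a change address.
        outputs.append(Output(change_address, change))
    return outputs


def fee(input_amounts: List[int], outputs: List[Output]) -> int:
    """The fee is implicit: sum of inputs minus sum of outputs."""
    return sum(input_amounts) - sum(o.amount for o in outputs)


# Bob spends his 50 mined bitcoins: 30 to Alice, 19 back to himself.
bob_tx = spend([50], payee="alice_cyan", amount=30,
               change_address="bob_red", fee_amount=1)
print(bob_tx, fee([50], bob_tx))
```

The same function covers multi-input transactions such as Alice's: the list of input amounts simply has more than one entry.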
In order to collect change, Alice generates an orange address that will be used in the future. It's not
necessary that all the inputs of a multi-input transaction come from the same address. As we will see, multi-input transactions are the primary source of evidence for linking addresses to the same controller.

Technically speaking, a transaction is a piece of data that is broadcast to the bitcoin network. As shown in the picture on the left, it's signed with the private key of the payer, it includes the public key of the payee, and the reference to the previous transaction is obtained through hashing. All transactions are public, unencrypted and permanently recorded in the blockchain since its origin.

Blocks

Now that we have defined a data structure for transactions, we need a way to logically order them, so that the payee knows that the previous owners of an output did not sign any earlier transaction, thereby avoiding double spending of a transaction output. A common solution to this problem is to introduce a trusted central authority, or mint, that checks every transaction for double spending. Bitcoin proposes a completely distributed solution, in which all transactions are publicly announced and the network agrees on a single history of the order in which they were received [1]. Using the following picture as reference, we now explain the distributed algorithm that each node performs in order to reach this goal.
Step 1: New transactions are broadcast to all nodes.

Step 2: Each node collects new transactions into a block.

Blocks are represented as green boxes: they contain transactions and are stored in an ordered list, called the blockchain, that defines the logical order of transactions in the bitcoin network. Each block contains a hash of the previous one, proving that the data of the previous block must have existed (in order to get into the hash) at the time the next block is added to the blockchain. In the example shown in the figure, we can state that Tx1 and Tx2 must have existed when Tx6 and Tx7 are added to the blockchain. Transactions in the same block have to be considered concurrent, in the sense that it's not possible to logically order them. As we will see in the following sections, all nodes agree on a single blockchain, and a new block is generated about every 10 minutes.

Step 3: Each node works on finding a difficult proof-of-work for its block.

Once all new transactions are collected into a block, a node tries to add this block to the current blockchain and to persuade the other nodes to agree that the next link of the blockchain is the block it forged, but this is a very difficult task. In fact, users in the bitcoin network accept only blocks that carry a proof-of-work, just like the one described in [7]: a piece of data that is difficult to find but immediate to verify. In particular, the proof-of-work for a block is a nonce value that makes the block's hash begin with a fixed number of zero bits. Once the CPU effort has been expended to make it satisfy the proof-of-work, the block cannot be changed without redoing the work. As later blocks are chained after it, the work to change the block would include redoing all the blocks after it. The majority decision is represented by the longest chain, which has the greatest proof-of-work invested in it. This algorithm works if a majority of CPU power is controlled by honest nodes, where a node is said to be honest when it agrees to work on the longest chain it knows. Some considerations about proof-of-work integrity can be found in [1] and [8].

Step 4: When a node finds a proof-of-work, it broadcasts the block to all nodes.

Step 5: Nodes accept the block only if all transactions in it are valid and not already spent.

Step 6: Nodes express their acceptance of the block by working on creating the next block in the chain, using the hash of the accepted block as the previous hash.

Monetary aspects and incentive

Bitcoin is designed as a system where no central monetary authority is involved. New money is created and introduced into the system via the process of validating transactions (i.e. finding valid blocks): by convention, the first transaction in a block is a special transaction that starts a new coin owned by the creator of the block. This, apart from providing a way to initially distribute coins into circulation, adds an incentive for nodes to support the network. The steady addition of a constant amount of new coins is analogous to gold miners expending resources to add gold to circulation. In our case, it is CPU time and electricity that are expended [1].

The supply of money evolves based on an agreement between the users performing the mining activity [2]. Currently, the scheme is designed to supply money at a predictable pace: the number of bitcoins generated per block halves every 4 years, so that the total number of bitcoins in circulation approaches a cap of 21 million, nearly all of which will have been issued by 2040 (see graph from [2]). This solution has several negative macroeconomic implications, such as price instability and a deflationary economy.

When the money supply has reached the plateau, the incentive will come from transaction fees. If the output value of a transaction is less than its input value, the difference is a fee that is added to the incentive value of the block containing the transaction. Both incentives may help encourage nodes to stay honest. If a greedy attacker is able to assemble more CPU power than all the honest nodes, he would have to choose between using it to defraud people by stealing back his payments, or using it to generate new coins. He ought to find it more profitable to play by the rules, which favour him with more new coins than everyone else combined, than to undermine the system and
the validity of his own wealth. Reference [8] assesses integrity by proof-of-work in a scenario in which bitcoin is used as a primary currency for the online transfers currently carried out by credit cards.

Bitcoin origins

The theoretical roots of Bitcoin can be found in the Austrian school of economics and its criticism of the current fiat money system and of the interventions undertaken by governments and other agencies, which, in their view, result in exacerbated business cycles and massive inflation [2]. In 1998, cryptography advocate Wei Dai suggested a system in which the currency would be both regulated and created through crowdsourced cryptography. In 2008, a person (or a group of people) under the pseudonym Satoshi Nakamoto distributed a paper named Bitcoin: A Peer-to-Peer Electronic Cash System [1] and then released open source software named Bitcoin, which was a first attempt to give shape to this idea. The first bitcoin transaction is dated January 3rd, 2009.

In the first years of its life, bitcoin was used in small communities of early adopters: everyone could install the open source software on their personal computer and participate in the network, also by minting new bitcoins. In 2010 Bitcoin was used by an individual to trade a real good for the first time, but the true explosion of its popularity can be dated to mid-2012. Since then a wide variety of service providers have begun to accept bitcoin as a means of payment, and an ecosystem of support services, such as wallet services and exchanges, was born. Some of these third parties are noteworthy.
Bitcoin ecosystem

Wallet services allow bitcoin users to transact with others without installing the bitcoin client. These services manage a bitcoin address on behalf of its owner, who can send and receive bitcoins in a home-banking fashion. greenaddress.it and blockchain.info offer wallet services.

Bitcoin currency exchanges allow users to trade bitcoins for other currencies, earning a commission on each trade. They usually also operate as a wallet service, storing money on behalf of their customers, and allow deposits and withdrawals in different currencies.

The Silk Road was a famous online shop in the deep web that could only be accessed via TOR. The site allowed people to buy a variety of items, but became famous as a market for drugs and other illicit items. On October 2, 2013 the FBI shut down the Silk Road, and its creator was arrested on charges of alleged murder-for-hire and narcotics trafficking violations.

Mining pools are distributed services aimed at transaction validation, in which clients contribute together to the validation of transactions and then split the reward that comes from this activity according to the processing power that each participant contributed. Pooled mining effectively reduces the granularity of the block generation reward, spreading it out more smoothly over time. Deepbit was an example of a mining pool.

The Bitcoin network is a well-suited infrastructure for gambling. Its protocol allows online gambling services to prove that results were calculated fairly, without trusting any external party. Hundreds of gambling sites exploit the bitcoin network, including dice games, casinos, lotteries, slot machines, and poker rooms. Satoshidice and just-dice are two of the most famous dice games.
Nowadays many vendors accept bitcoin as a means of payment, including restaurants and shops, even though it's rather unusual for bitcoins to be used to purchase physical goods or services, mainly because of price instability. coinmap.org shows vendors accepting bitcoin all around the world.

It's important to know that bitcoin is not the only example of a virtual currency: many other currencies, called alt-coins, have been created since 2009 by modifying the bitcoin core source code, such as litecoin, namecoin and dogecoin. Moreover, many currencies, usually called meta-coins, are built upon the bitcoin infrastructure, each adding a particular service (e.g. zerocoin adds strong anonymity to bitcoin).

In general, the bitcoin protocol and its infrastructure (the blockchain), currently used mainly to transfer coins among people, can be used to send public, potentially anonymous, timestamped, permanent and certified messages. It therefore has a wide variety of applications and can replace many forms of intermediation. It's not certain that bitcoin will succeed as a currency, but its technology certainly deserves the attention of entrepreneurs and regulators.
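As a recap of the Blocks section, the chaining of blocks through hashes and the search for a proof-of-work can be sketched in a few lines. This is a toy model under stated assumptions: the block is serialised as a plain string, a single SHA-256 is used, and the difficulty of 12 zero bits is arbitrary. The function names are hypothetical, and the real Bitcoin block header and difficulty rule are more elaborate.

```python
import hashlib
from typing import Sequence, Tuple


def block_hash(prev_hash: str, transactions: Sequence[str], nonce: int) -> str:
    """Hash of a toy block: previous hash, transaction list and nonce."""
    payload = f"{prev_hash}|{','.join(transactions)}|{nonce}".encode()
    return hashlib.sha256(payload).hexdigest()


def leading_zero_bits(hex_digest: str) -> int:
    """Number of zero bits at the start of a 256-bit digest."""
    return 256 - int(hex_digest, 16).bit_length()


def mine(prev_hash: str, transactions: Sequence[str],
         difficulty_bits: int) -> Tuple[int, str]:
    """Try nonces until the block hash begins with enough zero bits."""
    nonce = 0
    while True:
        digest = block_hash(prev_hash, transactions, nonce)
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce, digest
        nonce += 1


def verify(prev_hash: str, transactions: Sequence[str],
           nonce: int, difficulty_bits: int) -> bool:
    """Verification costs one hash, while mining costs many."""
    digest = block_hash(prev_hash, transactions, nonce)
    return leading_zero_bits(digest) >= difficulty_bits


genesis = "0" * 64  # placeholder for the hash of a previous block
nonce1, hash1 = mine(genesis, ["Tx1", "Tx2"], difficulty_bits=12)
# The second block commits to the first one through hash1.
nonce2, hash2 = mine(hash1, ["Tx6", "Tx7"], difficulty_bits=12)
print(nonce1, hash1)
print(nonce2, hash2)
```

Changing any transaction in the first block changes `hash1`, which invalidates the second block as well: this is why rewriting an old block means redoing the work for every block chained after it.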
Chapter 2: State of the art in bitcoin forensic analysis

Related literature

Due to Bitcoin's claimed anonymity, forensic analysis of its network has been a well-studied topic in the literature since 2009.

In 2011 Reid and Harrigan [9] first linked addresses belonging to the same entity and showed some implications for anonymity.

In 2012 the authors of [10] analysed and evaluated the privacy implications of Bitcoin if it were used as a primary currency to support the daily transactions of individuals in a university setting. Through a simulator that faithfully mimics the use of Bitcoin within a university, they show that the profiles of almost 40% of the users can be, to a large extent, recovered.

In 2013 some researchers at the University of California collected information on the web and tried to group bitcoin addresses based on evidence of shared authority. Their work is published in [4].

In 2011, Michele Spagnuolo released the open source software Bitiodine, simultaneously with his thesis at the University of Illinois. His work was later published at Financial Cryptography [3], with the title Bitiodine: Extracting Intelligence From The Bitcoin Network. Bitiodine is able to cluster addresses and classify them using a dataset partially obtained automatically, using scrapers for the major web sources of bitcoin addresses.

Bitiodine has been the main source of inspiration for this work: with the help of its creator, its source code has been studied and analysed in depth. Virex tries to maintain Bitiodine's strengths and to add some improvements, such as an architecture for realtime tracking of transactions and a graphical user interface.
The flexcoin case

Flexcoin (http://www.flexcoin.com), a bitcoin bank, was forced to close because of a theft of 896 bitcoins on March 3rd. The company posted the following statement on its website:

  "The attacker logged into the flexcoin front end from IP address 207.12.89.117 under a newly created username and deposited to address 1DSD3B3uS2wGZjZAwa2dqQ7M9v7Ajw2iLy. The coins were then left to sit until they had reached 6 confirmations. The attacker then successfully exploited a flaw in the code which allows transfers between flexcoin users. By sending thousands of simultaneous requests, the attacker was able to "move" coins from one user account to another until the sending account was overdrawn, before balances were updated. This was then repeated through multiple accounts, snowballing the amount, until the attacker withdrew the coins (1NDkevapt4SWYFEmquCDBSf7DLMTNVggdu, and 1QFcC5JitGwpFKqRDd9QNH3eGN56dCNgy6)"

The information provided is enough to visualise the flows between flexcoin and its attacker, and also to draw some conclusions about where the stolen coins ended up.

The biggest unsolved case: Mt.Gox eruption

Mt. Gox, called "Mount Gox" or "MTGOX", was one of the most widely used bitcoin currency exchange markets: it was launched in July 2010 and by 2013 was handling 70% of all Bitcoin trades (http://blogs.wsj.com/briefly//02/25/5-things-about-mt-goxs-crisis/). The market was closed in February. Mark Karpelès, Mt. Gox CEO, filed for bankruptcy and announced that around 850,000 bitcoins belonging to customers and the company were missing and likely stolen. Although 200,000 bitcoins
have since been found, the reasons for the disappearance (theft, fraud, mismanagement, or a combination of these) are unclear as of March.

The timeline of the events that led to the Mt. Gox shutdown is the following.

- On 07 February, Mt. Gox halted all bitcoin withdrawals. The company said it was pausing withdrawal requests to obtain a clear technical view of the currency processes.
- On 10 February, the company issued a press release stating that the issue was due to transaction malleability, a known bug that affected many bitcoin clients, including the official one. For technical details about transaction malleability, see Decker and Wattenhofer [11].
- On 24 February, Mt. Gox suspended all trading, and hours later its website went offline, returning a blank page.
- On 28 February, Mt. Gox filed for bankruptcy protection in Tokyo, reporting that the company had lost almost 750,000 of its customers' bitcoins, and around 100,000 of its own bitcoins.
- On 20 March, Mt. Gox reported on its website that it had found 200,000 bitcoins in an old-format cold wallet. That brings the total number of lost bitcoins down from 850,000 to 650,000.
Chapter 3: Considerations about the blockchain

In this chapter the reader will find some considerations about the size of the blockchain and the consequent scalability of virex. First, the current state of the blockchain is analysed and some assumptions about the structure of transactions are made. Then, the trend of the total number of transactions is considered in order to attempt a prediction of the future size of the blockchain.

The following table reports some measurements obtained from the blockchain at the time of writing (Tue, 13 May 06:42:37 GMT).

Number of blocks                  300,493
Number of transactions         38,649,948   (~39M)
Number of distinct addresses   35,741,676   (~36M)
Number of outputs              99,854,793   (~100M)
Number of inputs               88,905,461   (~89M)

Each transaction can have an arbitrary number of inputs and outputs, and can generate an arbitrary number of new addresses, but some considerations about the distributions of the number of inputs and outputs per transaction can be made. The following charts show that these distributions peak at 1 input and 2 outputs, respectively.

[Figure: number of transactions (0 to 40,000,000) by #inputs/transaction and #outputs/transaction, for values 0 to 5]
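The totals measured above already imply per-transaction averages, which follow by simple division. The sketch below is only a sanity check on those figures; the variable names are hypothetical, and the last ratio assumes that each block carries exactly one mining transaction, counted as having one input.

```python
# Measurements taken from the blockchain at the time of writing.
blocks = 300_493
transactions = 38_649_948
addresses = 35_741_676
outputs = 99_854_793
inputs = 88_905_461

addresses_per_tx = addresses / transactions
inputs_per_tx = inputs / transactions
# Assumption: one mining transaction per block, counted as one extra input.
inputs_per_tx_with_mining = (inputs + blocks) / transactions
outputs_per_tx = outputs / transactions

for name, value in [
    ("addresses per transaction", addresses_per_tx),
    ("inputs per transaction", inputs_per_tx),
    ("inputs per transaction (mining counted as one input)", inputs_per_tx_with_mining),
    ("outputs per transaction", outputs_per_tx),
]:
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, the four ratios are 0.92, 2.30, 2.31 and 2.58.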
Number of inputs   Probability   Number of outputs   Probability
0                  0.01          0                   0
1                  0.62          1                   0.07
2                  0.21          2                   0.85
3                  0.05          3                   0.04
4                  0.05          4                   0.01
5                  0.02          5                   0

A transaction structure with one input and two outputs is the most common, since the input address is used to collect money, the first output address is controlled by the payee and the second is used to collect change. Sometimes one address is not sufficient to collect a large amount of money to transfer, and more input addresses are needed. We can assume these distributions to be approximately time-invariant, because an increase in the size of a transaction (in terms of number of outputs and inputs) leads to higher transaction fees. The expected values for the above distributions are summarized in the following table.

Current number of transactions                                ~39M
Expected number of addresses per transaction, E[na/nt]        0.92
Expected number of inputs per transaction, E[nin/nt]          2.30
Expected number of inputs per transaction, E2[nin/nt]         2.31
Expected number of outputs per transaction, E[nout/nt]        2.58

NB: E2[nin/nt] counts mining transactions as transactions with one input.

Now let's focus on the total number of transactions and on its derivative, the number of transactions per day (see diagrams below). Both grew quite slowly from 2009 to mid-2012, but straight afterwards they started to grow faster, with a change in
the trend of the number of transactions per day in mid-2012. The steeper slope is consistent with the diffusion of bitcoin among not-very-early adopters, and we will probably experience other trend changes in the future, but it's possible to state that, when the growth significantly slows down, the total number of transactions will level off at an approximately constant value.

In order to reveal the current trend behind the growth of the total number of transactions after mid-2012, and to predict its value in the near future, a simple linear regression between the number of days elapsed since 2012 June 01 (hereafter called the reference date)
and the number of transactions per day is estimated, resulting in a fitted line with an intercept of 29,884 transactions/day and a slope of 53.75 transactions/day per day. Afterwards, to estimate the total number of transactions, it's necessary to integrate this quantity, considering an initial value at the reference date of 3,590,000 transactions.
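The integration step just described can be sketched as follows. The coefficients are the ones reported above; the function names are hypothetical, and each year is assumed to last 365 days, so leap days are ignored.

```python
INTERCEPT = 29_884.0        # transactions/day at the reference date
SLOPE = 53.75               # increase in transactions/day, per day
INITIAL_TOTAL = 3_590_000   # total transactions at the reference date


def transactions_per_day(days: float) -> float:
    """Fitted line: r(t) = INTERCEPT + SLOPE * t."""
    return INTERCEPT + SLOPE * days


def total_transactions(days: float) -> float:
    """Integral of the fitted line: N(t) = N0 + INTERCEPT*t + SLOPE*t^2/2."""
    return INITIAL_TOTAL + INTERCEPT * days + SLOPE * days ** 2 / 2


for years in range(5):
    days = 365 * years  # assumption: 365-day years
    print(2012 + years,
          transactions_per_day(days),
          round(total_transactions(days)))
```

One year after the reference date, for example, the sketch gives 49,502.75 transactions per day and a total of about 18.08 million transactions.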
The results are shown in the following table.

Blockchain estimated size

Date (Jun 01)   Number of transactions per day   Total number of transactions
2012            29,884                           3,590,000
2013            49,502.75                        18,078,082
2014            69,121.5                         39,727,008
2015            88,740.25                        68,536,777
2016            108,359                          104,507,390

These results are obtained through a rough calculation, but they are useful to assess the feasibility of the project: as explained in the chapters The Identity Reasoner and The Query Engine, the database size depends linearly on the number of transactions. Since the number of transactions per day currently grows linearly while, according to Moore's law, memory size doubles every year (or every three years), it should be possible to keep the whole database in memory. However, this analysis is quite optimistic. If bitcoin becomes commonly accepted as a means of payment, transactions will grow at a much higher pace before reaching the saturation plateau.
Chapter 4: Requirements specification

The requirements of VIREX can be summarised as a set of questions that the system tries to answer. They are all queries, in the sense that they don't modify the state of the internal systems; for this reason, the virex interface is often referred to as the virex query language.

A first classification of virex operations separates questions about balances from questions about flows. The first category contains questions about the amount of bitcoins controlled by an address, an entity or a cluster; the second contains questions about bitcoin transfers among addresses, but also about mined bitcoins.

A second (orthogonal) classification separates questions about addresses from questions about clusters of addresses. The first class considers flows among single bitcoin addresses or bitcoin entities, without applying any clustering algorithm, and all the information is extracted from the public ledger of the bitcoin blockchain. The second class of operations answers by applying clustering to bitcoin entities, with the aid of the clustering heuristics and algorithms described in the literature (see the chapter named The Identity Reasoner).

The tables in this section specify all the interface methods. Implementation details are in the chapter named The Query Engine.
F.R.1 BALANCE

Natural language questions:
- What's the balance of the address 1dice8EMZ at May 26 14:07:44 UTC?
- What's the balance of the addresses controlled by Satoshi at May 26 14:07:44 UTC?

Inputs (Name, Type, Description):
- entity (String): The address, or the supposed controller, whose balance we are interested in
- timestamp (Number): The unix timestamp of the date and time

Outputs (Name, Type, Description):
- balance (Number): The balance, in satoshis, of the specified entity at the requested date and time

F.R.2 BALANCE CLUSTERED

Natural language questions:
- What's the balance of the cluster to which address 1dice8EMZ belongs, at 26 May 14:07:44 UTC?
- What's the balance of the cluster controlled by Giuseppe, at 26 May 14:07:44 UTC?

Inputs (Name, Type, Description):
- entity (String): A representative address, or the supposed controller, of the cluster whose balance we are interested in
- timestamp (Number): The unix timestamp of the date and time

Outputs (Name, Type, Description):
- balance (Number): The balance, in satoshis, of the specified cluster at the requested date and time

In the un-clustered version of a balance query, when a controller is specified, virex returns the sum of the amounts of bitcoin deposited in the addresses controlled by the selected entity.
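A minimal sketch of how F.R.1 could be answered is given below. It is not the implementation described in the chapter The Query Engine: the `LedgerEntry` record, the `controlled` mapping from controllers to addresses, the addresses and the timestamps are all hypothetical, and the ledger is assumed to be already reduced to signed balance changes per address.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class LedgerEntry:
    timestamp: int  # unix time at which the transaction was timestamped
    address: str
    delta: int      # satoshis: positive when received, negative when spent


def addresses_of(entity: str, controlled: Dict[str, Set[str]]) -> Set[str]:
    """An entity is either a known controller or a single address."""
    return controlled.get(entity, {entity})


def balance(entity: str, timestamp: int,
            ledger: Iterable[LedgerEntry],
            controlled: Dict[str, Set[str]]) -> int:
    """Balance, in satoshis, of the entity at the requested time."""
    addresses = addresses_of(entity, controlled)
    return sum(e.delta for e in ledger
               if e.address in addresses and e.timestamp <= timestamp)


# Hypothetical data: "Satoshi" is supposed to control two addresses.
controlled = {"Satoshi": {"addr_A", "addr_B"}}
ledger = [
    LedgerEntry(1000, "addr_A", 5_000_000_000),
    LedgerEntry(2000, "addr_A", -3_000_000_000),
    LedgerEntry(2000, "addr_B", 1_900_000_000),
    LedgerEntry(3000, "addr_C", 700_000_000),
]
print(balance("addr_A", 1500, ledger, controlled))
print(balance("Satoshi", 2500, ledger, controlled))
```

Under the same assumptions, the clustered variant (F.R.2) would differ only in the mapping used: the set of addresses would be the whole cluster of the entity instead of the addresses known to be controlled by it.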
F.R.3 FLOW

Natural language questions:
- What's the flow between address 1dice8EMZ and address 1NDpZ2wyFe... in the period of time that goes from 15 Jan 00:00 UTC to 26 May 14:07:44 UTC?
- What's the flow between the addresses controlled by Satoshi and address 1NDpZ2wyFe... in the period of time that goes from 15 Jan 00:00 UTC to 26 May 14:07:44 UTC?

Inputs (Name, Type, Description):
- payer entity (String): The address, or supposed controller, of the payer of the flow we are interested in
- payee entity (String): The address, or supposed controller, of the payee of the flow we are interested in
- from date timestamp (Number): The unix timestamp of the initial date and time
- to date timestamp (Number): The unix timestamp of the final date and time

Outputs (Name, Type, Description):
- flow (Number): The flow between addresses in the specified period

F.R.4 FLOW CLUSTERED

Natural language question:
- What's the flow between the cluster to which address 1dice8EMZ belongs and the cluster controlled by Giuseppe, in the period of time that goes from 15 Jan 00:00 UTC to 26 May 14:07:44 UTC?

Inputs (Name, Type, Description):
- payer entity (String): A representative address, or the supposed controller, of the payer cluster
- payee entity (String): A representative address, or the supposed controller, of the payee cluster
- from date timestamp (Number): The unix timestamp of the initial date and time
- to date timestamp (Number): The unix timestamp of the final date and time

Outputs (Name, Type, Description):
- flow (Number): The flow between clusters in the specified period
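The flow queries can be sketched in the same spirit. The sketch makes a strong simplifying assumption that is not taken from this work: every transaction has already been decomposed into pairwise `Transfer` records from one payer address to one payee address, with amounts in satoshis. How a multi-input, multi-output transaction should be split is left open here. All names and data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Transfer:
    timestamp: int  # unix time at which the transaction was timestamped
    payer: str      # paying address
    payee: str      # receiving address
    amount: int     # satoshis (assumed unit)


def addresses_of(entity: str, groups: Dict[str, Set[str]]) -> Set[str]:
    """An entity is either a named controller/cluster or a single address."""
    return groups.get(entity, {entity})


def flow(payer_entity: str, payee_entity: str, from_ts: int, to_ts: int,
         transfers: Iterable[Transfer], groups: Dict[str, Set[str]]) -> int:
    """Sum of the amounts moved from payer to payee within [from_ts, to_ts]."""
    payers = addresses_of(payer_entity, groups)
    payees = addresses_of(payee_entity, groups)
    return sum(t.amount for t in transfers
               if from_ts <= t.timestamp <= to_ts
               and t.payer in payers and t.payee in payees)


groups = {"Satoshi": {"addr_A", "addr_B"}}
transfers = [
    Transfer(1000, "addr_A", "addr_X", 300),
    Transfer(2000, "addr_B", "addr_X", 200),
    Transfer(2500, "addr_C", "addr_X", 900),
    Transfer(4000, "addr_A", "addr_X", 50),
]
print(flow("addr_A", "addr_X", 0, 3000, transfers, groups))
print(flow("Satoshi", "addr_X", 0, 3000, transfers, groups))
```

With `groups` holding controllers the sketch corresponds to F.R.3; passing cluster memberships instead would give the clustered variant F.R.4.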
F.R.5 MINED BITCOINS (UN-CLUSTERED OR CLUSTERED)

Natural language question:
- What's the amount of bitcoin mined by address 1dice8EMZ (or by the cluster to which the address 1dice8EMZ belongs) in the period of time that goes from 15 Jan 00:00 UTC to 26 May 14:07:44 UTC?

It's important to notice that virex has been designed to answer many more questions, such as:
- How many addresses did 1dice8EMZ pay in the period of time going from 5 Jan 00:00 UTC to 26 May 14:07:44 UTC?
- Who controls the clusters that 1dice8EMZ paid in the period of time going from 5 Jan 00:00 UTC to 26 May 14:07:44 UTC?

These questions enable a deeper analysis of bitcoin flows, but they are not formalized here, and no implementation is available yet.

Now let's define the word realtime in the VIREX acronym. We say that virex is realtime in the sense that all the questions specified with the bitcoin query language shall receive answers updated to the latest confirmed transactions (see https://en.bitcoin.it/wiki/confirmation). After a transaction is broadcast to the bitcoin network, it may be included in a block; when that happens, it is said that one confirmation has occurred for the transaction. With each subsequent block that is added to the blockchain, the number of confirmations is increased by one. To protect against double spending, a transaction should not be considered confirmed until a certain number of blocks have been added. Just like the classic bitcoin client, we consider a transaction confirmed when at least 6 blocks confirm it.
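The confirmation rule above can be sketched as a small function. The names are hypothetical, and block heights are assumed to be consecutive integers along the single chain on which all nodes agree.

```python
from typing import Optional

REQUIRED_CONFIRMATIONS = 6  # threshold adopted in this work


def confirmations(tx_block_height: Optional[int], tip_height: int) -> int:
    """Number of blocks confirming a transaction.

    tx_block_height is None while the transaction is not yet in a block.
    The block that includes the transaction counts as the first confirmation.
    """
    if tx_block_height is None:
        return 0
    return max(0, tip_height - tx_block_height + 1)


def is_confirmed(tx_block_height: Optional[int], tip_height: int) -> bool:
    return confirmations(tx_block_height, tip_height) >= REQUIRED_CONFIRMATIONS


# A transaction included at height 100 is confirmed once the tip reaches 105.
print(is_confirmed(100, 104), is_confirmed(100, 105))
```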
Chapter 5: High-level architecture

The high-level architecture of the VIREX system is shown in the following figure. Backend components are enclosed in white boxes, data flows are represented by lines, and the direction of each arrow identifies the component that takes the initiative (push/pull).

At the origin of the data there is the Bitcoin Network that, block by block, timestamps transactions and injects them into an extended client that is responsible for the realtime tracking of transactions (Realtime Tracker).

The Transaction Manager is responsible for analysing new transactions, extracting from them the essential information needed for address clustering, and updating the information controlled by the Query Engine. In particular, it takes into account the flows and balances generated by the analysed transactions.
The Query Engine is the core of the system and is the component responsible for answering the questions described in the requirements. It must be extremely fast and scalable, in order to support requests coming from the user interface (graphical or not).

The Identity Reasoner tries to link addresses together, using information gathered from the blockchain itself and from the web. It clusters together all the addresses likely to be controlled by the same entity.

Currently, not all the described components have a real implementation. In particular, only prototypes of the Identity Reasoner, the Query Engine and the Web user interface have been implemented. Moreover, all these components have to be orchestrated and synchronised to maintain a consistent state of the bitcoin transaction graph, but this problem is not addressed in this work.
Chapter 6: The Identity Reasoner

The Virex Identity Reasoner is the component responsible for clustering addresses and associating them with entities of the real world (a person, a service, a forum user), with the aid of address clustering and data collection. In particular, it needs to:
Track clusters of addresses in realtime, while transactions are timestamped in the block chain.
Merge clusters that belong to the same entity according to heuristics and user knowledge.
Collect and store information about addresses.

Address clustering

Address clustering in Bitcoin is the activity that seeks to identify groups of addresses that are probably controlled by the same entity. It is possible to reach this goal to some extent thanks to two well-known heuristics, which link addresses based on the structure of the transactions in which they are involved.

Before presenting the heuristics it is important to define the meaning of address control, as in [4]. In short, the controller of an address is the entity expected to be responsible for forming transactions on behalf of that address. Knowledge of the private key is a necessary requirement for address control, but not a sufficient one. Consider, for example, buying physical bitcoins from a vendor such as Casascius. Both the creator and the buyer of the physical bitcoin know the private key but, according to the previous definition, the controller is the buyer. Moreover, it is important to emphasise that this definition of address control is quite different from account ownership. For example, a wallet service or an exchange service is the controller of all the addresses it generates (often used by customers for deposits / withdrawals), but the funds in these addresses are owned by a wide variety of distinct users.

The first linking heuristic is often referred to as the heuristic of multi-input transactions and was already identified by the creator of bitcoin: it is described in the privacy section of the original bitcoin paper [1]. Briefly, under the hypothesis that users don't share their private keys, if two addresses are used as inputs to the same transaction, then they are controlled by the same entity. For a more formal definition of this heuristic, see [3] or [4].

The second linking heuristic is often called shadow address guessing [3] and aims at guessing, for each transaction, the address used for change. According to this heuristic, the address used for change is controlled by the same entity that controls the input addresses. As Satoshi Nakamoto suggests in the original paper, a new key pair should be used for each transaction to keep transactions from being linked to a common owner, and in fact the current bitcoin implementation generates, for each transaction, a new address for collecting change. Many techniques to identify this address are described in the literature, but in this work the most stringent one will be used, i.e. the variant described in [3]:

If there are two output addresses (one payee and one change address, which is true for the vast majority of transactions), and one of the two has never appeared before in the block chain, while the other has, then we can safely assume that the one that never appeared before is the shadow address generated by the client to collect change back.
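The two heuristics just described can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the function names are hypothetical, transactions are reduced to plain lists of addresses, and linking consecutive inputs is an assumption (any set of links that connects all inputs would express the same heuristic).

```python
def multi_input_links(input_addresses):
    """Heuristic 1: all input addresses of one transaction share a controller.

    Returns links between consecutive distinct inputs, which is enough to
    place all of them in the same connected component.
    """
    unique = list(dict.fromkeys(input_addresses))  # drop repeats, keep order
    return list(zip(unique, unique[1:]))


def guess_shadow_address(output_addresses, seen_addresses):
    """Heuristic 2 (strict variant): guess the change address of a transaction.

    Applies only to transactions with exactly two outputs, one of which has
    never appeared before while the other has. Returns None otherwise.
    """
    if len(output_addresses) != 2:
        return None
    first, second = output_addresses
    first_is_new = first not in seen_addresses
    second_is_new = second not in seen_addresses
    if first_is_new and not second_is_new:
        return first
    if second_is_new and not first_is_new:
        return second
    return None  # both new or both already seen: no guess is made
```

Here `seen_addresses` stands for the set of addresses that appeared in earlier transactions, so transactions have to be processed in blockchain order for the guess to be meaningful.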
This version, although effective, has proven significantly less safe than the multi-input transaction heuristic. [4] reports a very high rate of false positives, ending up with a giant super-cluster containing the public keys of Mt.Gox, Instawallet, BitPay, and Silk Road.

Moreover, it is possible to establish that two addresses are controlled by the same user thanks to data collection, by labelling addresses as being controlled by some known real-world entity. Data collection can be performed by transacting with real actors in the bitcoin ecosystem (e.g. playing with just-dice, depositing to and withdrawing from an exchange), but more and more frequently the primary source of this data is the big and unstructured world of the internet. A very large dataset was collected and described in [4]: services include mining pools, wallets, exchanges, vendors and many others, while BitIodine [3] includes scrapers for just-dice, bitcointalk, bitcoin-otc and many other sites. In addition, many users publicly claim their own addresses on the web, and many of these are collected at blockchain.info/tags.

Data structure

The core data structure of the identity reasoner is a graph in which nodes represent addresses and relationships represent links stating that two addresses are controlled by the same entity. Each node has the following properties:
An address, a string representing the bitcoin address of the node.
A controller, a string identifying the controller of the bitcoin address.
A cluster id, a number identifying the cluster to which the address belongs.

There are three types of relationships:
HEURISTIC1, directed, to identify a link between two addresses caused by a multi-input transaction.
HEURISTIC2, directed, to identify a link between two addresses caused by change address guessing.
SAME_CONTROLLER, undirected, to identify a link between two addresses caused by knowledge of shared control between the two addresses.

Each relationship has a description property, giving information about its origin (e.g. for H1 and H2 relationships, the description is an identifier of the transaction that caused the linking).

Given the identity reasoner data structure, it is possible to identify and track clusters of addresses using well-known graph algorithms. It is straightforward to compute the connected components of a graph in linear time (in terms of the number of vertices and edges of the graph) using either breadth-first search or depth-first search. There are also efficient algorithms to dynamically track the connected components of a graph as vertices and edges are added.

Invariants of the identity reasoner database

Some invariants are defined to keep the data structure consistent with the knowledge extracted from the blockchain and from the web.

INVARIANT0: There are no two nodes in the graph with the same address.

INVARIANT1: Two addresses are in the same connected component if and only if they have the same cluster identifier.
INVARIANT2: Given a transaction with M ordered input addresses and any number of output addresses, the identity reasoner contains the following path, with edges of type HEURISTIC1:

INVARIANT3: Given a transaction with an input address in position 0 (first position) and a shadow address, the identity reasoner contains the following edge of type HEURISTIC2:

INVARIANT4: Two addresses have the same controller property if and only if they are linked by a SAME_CONTROLLER relationship.

Algorithms

The data structure should be updated each time one of the following events happens:
A new transaction is confirmed in the blockchain. In this case, the identity reasoner should add new nodes, corresponding to new addresses that appeared in the network, and new edges, corresponding to the heuristics that have been evaluated.
A new controller for an address is discovered. In this case, the identity reasoner should merge clusters that are controlled by the same entity, or separate addresses that are no longer controlled by the same entity.
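The remark in the Data structure section about dynamically tracking connected components as vertices and edges are added can be illustrated with a disjoint-set (union-find) structure. This is a sketch of one standard technique, not necessarily the algorithm used by the author; the class and method names are hypothetical. It supports only additions, so it covers the first event above (new transactions) but not the removal of a SAME_CONTROLLER link.

```python
class ClusterTracker:
    """Disjoint-set forest: each set is one cluster of addresses."""

    def __init__(self):
        self.parent = {}  # address -> parent address in the forest
        self.size = {}    # root address -> number of addresses in its cluster

    def add_address(self, address):
        """Register a new address as a singleton cluster (no-op if known)."""
        if address not in self.parent:
            self.parent[address] = address
            self.size[address] = 1

    def find(self, address):
        """Return the representative of the cluster (with path halving)."""
        self.add_address(address)
        while self.parent[address] != address:
            self.parent[address] = self.parent[self.parent[address]]
            address = self.parent[address]
        return address

    def link(self, a, b):
        """Record that a and b share a controller and merge their clusters."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return root_a
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a  # attach the smaller tree
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]
        return root_a
```

In this sketch the representative returned by `find` plays the role of the cluster identifier, so INVARIANT1 holds by construction as long as edges are only added.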
Primitive operations

In addition to setters and getters for node properties, the Virex identity reasoner graph is supposed to offer a series of primitive operations which do not guarantee the identity reasoner invariants by themselves, but are useful to define the more complex transactional operations described in the following paragraphs.
1. create_node(address): if a node with the specified address doesn't exist in the network, create the node.
2. create_relationship(address1, address2, type): if a relationship between address1 and address2, with the specified type, doesn't exist, create the relationship.
3. delete_relationship(address1, address2, type)
4. traverse_address(address): starting from the node with the specified address, and using a breadth-first or depth-first algorithm, identify all nodes in the same connected component as the starting node, marking them with the same cluster id.
5. merge_clusters(address1, address2): merge the clusters of the selected addresses, without the need to re-traverse a portion of the graph.

Bootstrapping

To initially bootstrap the identity reasoner graph, it is necessary to read the whole blockchain and import into the graph all identified nodes (addresses) and the edges relative to heuristics 1 and 2. Then we need to traverse all nodes of the graph in order to identify connected components for the first time.

Adding a bitcoin transaction

When a new transaction is timestamped into the blockchain, it is necessary to update the virex identity reasoner data structure with all new addresses and new heuristics.
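The five primitives listed above can be sketched with an in-memory graph. This is an illustrative stand-in for the graph database, with several simplifying assumptions: edge direction is ignored, create_relationship also creates missing nodes, fresh cluster ids come from a counter, and merge_clusters relabels by scanning the node map. As in the text, none of these primitives enforces the invariants on its own.

```python
from collections import deque


class IdentityGraph:
    """In-memory sketch of the identity reasoner primitives (hypothetical)."""

    def __init__(self):
        self.cluster_id = {}  # address -> cluster id
        self.edges = {}       # address -> set of (neighbour, relationship type)
        self._next_cluster_id = 0

    def create_node(self, address):
        if address not in self.cluster_id:
            self.cluster_id[address] = self._next_cluster_id
            self._next_cluster_id += 1
            self.edges[address] = set()

    def create_relationship(self, address1, address2, rel_type):
        self.create_node(address1)
        self.create_node(address2)
        self.edges[address1].add((address2, rel_type))
        self.edges[address2].add((address1, rel_type))

    def delete_relationship(self, address1, address2, rel_type):
        self.edges[address1].discard((address2, rel_type))
        self.edges[address2].discard((address1, rel_type))

    def traverse_address(self, address):
        """Breadth-first search; every reachable node gets one fresh cluster id."""
        new_id = self._next_cluster_id
        self._next_cluster_id += 1
        queue, visited = deque([address]), {address}
        while queue:
            node = queue.popleft()
            self.cluster_id[node] = new_id
            for neighbour, _ in self.edges[node]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
        return visited

    def merge_clusters(self, address1, address2):
        """Relabel the cluster of address2 with the cluster id of address1."""
        keep, drop = self.cluster_id[address1], self.cluster_id[address2]
        if keep != drop:
            for address, cid in self.cluster_id.items():
                if cid == drop:
                    self.cluster_id[address] = keep
```

Deleting an edge followed by a traverse_address call on each endpoint restores INVARIANT1, which is the pattern used by the controller-change operations that follow.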
From the point of view of the identity reasoner, a transaction can be considered as a set of addresses and a set of heuristics. In the following example, a new transaction involves Address6 and Address4 and a new heuristic of type 1.

When a new transaction is added to the identity reasoner, there is never any need to delete edges, hence there is no need to re-traverse portions of the graph.

Setting controller

Given an address A, if we are going to set its controller property to C, it is important to guarantee the invariants for each possible state of the network. We summarise this state using three binary variables, as shown in the following table. Six out of the eight possible states are consistent with the invariants and are therefore noteworthy.

(a) The address has a different controller property
(b) The address is linked to another one with a SAME_CONTROLLER relationship
(c) Another address with the same controller as C exists in the network

(a)     (b)     (c)     Case
FALSE   FALSE   FALSE   1
FALSE   FALSE   TRUE    2
FALSE   TRUE    FALSE   Inconsistent
FALSE   TRUE    TRUE    Inconsistent
TRUE    FALSE   FALSE   3
TRUE    FALSE   TRUE    4
TRUE    TRUE    FALSE   5
TRUE    TRUE    TRUE    6

Each of the consistent states will be analysed in detail.

Setting controller 1/6

In the first case, you just need to set the controller property to C for the given address.

Setting controller 2/6

In this second case, after setting the controller property for Address6 to Alice and adding an edge between Address6 and Address5, clusters 2 and 3 need to be merged. The primitive operations to execute are the following:
1. set_controller( Address6, Alice )
2. create_relationship( Address5, Address6, SAME_CONTROLLER )
3. merge_clusters( Address5, Address6 )

Setting controller 3/6

In this example you need to change the controller for Address5 from Alice to Chris. The node Address5 is not connected to other nodes with a SAME_CONTROLLER relationship, and no node with controller Chris exists in the network, so you just need to change the controller property.

Setting controller 4/6

In this example you need to change the controller for Address5 from Alice to Bob. The node Address5 is not connected to other nodes with a SAME_CONTROLLER relationship, but a node with controller Bob already exists in the network. The primitive operations to execute are the same as in case 2.

Setting controller 5/6

In this example, you need to change the controller for Address1 from Bob to Chris. The SAME_CONTROLLER relationship between Address1 and Address5 has to be dropped, and the connected components involving these addresses need to be identified again. The primitive operations to be executed follow:
1. set_controller( Address1, Chris )
2. delete_relationship( Address1, Address5, SAME_CONTROLLER )
3. traverse_address( Address1 )
4. traverse_address( Address5 )

Setting controller 6/6
In the last example you need to change the controller for Address6 from Alice to Bob. The primitive operations to be executed are the following:
1. set_controller( Address6, Bob )
2. delete_relationship( Address6, Address5, SAME_CONTROLLER )
3. traverse_address( Address6 )
4. traverse_address( Address5 )
5. create_relationship( Address1, Address6, SAME_CONTROLLER )
6. merge_clusters( Address1, Address6 )

Considerations about address graph size

The number of nodes is proportional to the number of addresses in the blockchain. Denoting with E[na/nt] the expected number of addresses per transaction, we have that the number of nodes is

\[ n_{nodes} = n_t \cdot E[n_a / n_t] \]

Considering heuristic 1, we have an edge for each couple of addresses in a transaction, so the number of relationships of type HEURISTIC1 can be expressed as

[equation not preserved]

where nt is the number of transactions in the blockchain and E[nin/nt] is the expected number of inputs per transaction. Considering heuristic 2, we have at most a single shadow address per transaction, so an upper bound on the number of relationships of type HEURISTIC2 can be expressed as

\[ n_{H2} \le n_t \]

It is important to note that both the number of nodes and the number of edges are linear in the number of transactions.
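The count of HEURISTIC1 relationships can be tentatively reconstructed from INVARIANT2. The sketch below rests on an assumption that the text does not confirm: that the path required by the invariant chains the M ordered inputs of a transaction with M - 1 edges, and that transactions without inputs contribute no edge.

```latex
% Hypothetical reconstruction: a path over M ordered inputs has M - 1 edges.
n_{H1} \approx n_t \cdot \left( E\!\left[ n_{in} / n_t \right] - 1 \right)
```

Under this reading the expression stays linear in the number of transactions, in agreement with the conclusion above. If each unordered pair of inputs were linked instead, the count would grow with the square of the number of inputs per transaction, which would not match the path described by the invariant.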
Implementation details and experimental results

The identity reasoner has been implemented using neo4j, a well-known graph database. A graph database uses graph structures, such as nodes, edges and properties, to represent and store data, and is a powerful tool for graph-like queries, for example traversing a graph or computing the shortest path between two nodes.

The resulting database size is about 12 GB, and the number of identified clusters for each heuristic in the current blockchain (~ 35.7M addresses) is reported in the following table.

         Expected number   Actual number   Number of identified   Maximum cluster     Average cluster
         of edges          of edges        clusters               size (addresses)    size (addresses)
H1       ~ 50 M            ~ 50 M          ~ 16M                  ~ 1M                ~ 2.18
H2       ~ 39 M            ~ 13 M          ~ 23M                  ~ 3M                ~ 1.54
H1+H2    ~ 89 M            ~ 63 M          ~ 8.5M                 ~ 13M               ~ 4.22

[Figure: bar chart with series H1, H2 and H1+H2 over cluster sizes of 1 to 5 addresses; vertical axis from 0 to 20,000,000]

It is evident that the implemented heuristic 2 is quite unsafe, since it ends up in a giant supercluster of about 40% of addresses; for this reason, it won't be taken into account in the discussions that follow. A refined implementation is described in [4] and should be implemented in the near future.
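Statistics such as those in the table above (number of clusters, maximum and average cluster size, and the distribution of cluster sizes) can be derived from the cluster id stored on each node. The following sketch assumes a plain dictionary from address to cluster id; the function name and the returned keys are hypothetical.

```python
from collections import Counter


def cluster_statistics(cluster_of):
    """Summarise a mapping address -> cluster id."""
    sizes = Counter(cluster_of.values())   # cluster id -> number of addresses
    histogram = Counter(sizes.values())    # cluster size -> number of clusters
    return {
        "clusters": len(sizes),
        "max_size": max(sizes.values()),
        "average_size": len(cluster_of) / len(sizes),
        "histogram": dict(histogram),
    }
```

The average size is simply the number of addresses divided by the number of clusters, which is why fewer clusters over the same address set give a larger average.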
It is now time to add some prior knowledge to the identity reasoner, and to link addresses to their supposed controllers.

A first dataset is taken from the BitIodine software and is composed of about 70,000 addresses potentially belonging to the authors of CryptoLocker, a well-known ransomware that locks computers running MS Windows by encrypting important files with an RSA public key, and then offers to decrypt the data if a payment in bitcoin is made. These addresses have been obtained by searching on Google for extracts of the text of the money request displayed by the malware, and by reading a Reddit thread in which victims and researchers post addresses (footnote 4).

When adding this dataset to the identity reasoner, we end up with a giant supercluster of about 13M addresses that contains addresses controlled by both MtGox and Cryptolocker. This result has two potential implications, which do not exclude each other. The first is that there is some false information in the dataset, i.e. some addresses have been announced as controlled by Cryptolocker, but are actually controlled by, for example, Mt. Gox and have nothing to do with the malware. The second is that there could be a connection between Cryptolocker and MtGox, which may suggest that Cryptolocker was a Mt. Gox customer, and that some coins stored in addresses controlled by Mt. Gox are owned by Cryptolocker. This scenario highlights the central role that exchanges play in the bitcoin ecosystem, since nowadays goods and services are mostly paid for with fiat currencies.

Footnote 4: http://www.reddit.com/r/bitcoin/comments/1o53hl/disturbing_bitcoin_virus_encrypts_instead_of/
A second dataset is obtained from another Reddit thread, started just after MtGox filed for bankruptcy (footnote 5), with the aim of finding confirmation of the story told by its CEO and of figuring out the financial situation of the currency exchange. Many of these addresses belong to the second biggest cluster (about 500k addresses), whose representative is 1LNWw6yCxkUmkhArb2Nf2MPw6vG7u5WG7q, and some of them belong to very small clusters with 1 up to 4 addresses, which are likely to belong to MtGox.

Third, we were able to identify a Bitstamp (footnote 6) hot wallet thanks to the knowledge of one of its addresses, 18xgnWy7HmrPnUsD6NJCc29nu4QL21vaYD.

In the following picture we show, in linear scale, the size, in number of addresses, of the biggest four clusters and of the hot wallets of known entities. As shown in the diagram, we can state that the second cluster is a MtGox hot wallet, but the controllers of the other big clusters are unknown.

Footnote 5: http://www.reddit.com/r/bitcoin/comments/1z14j0/needed_any_bitcoin_addresses_you_have_used_to/
Footnote 6: https://www.bitstamp.net
If we plot the portion of the identity reasoner graph relative to the big clusters and report some clustering statistics, we are able to identify some false-positive HEURISTIC1 edges.

Lessons learned

Dealing with address clustering and identity reasoning is surely one of the most fascinating challenges of bitcoin forensic analysis. We tried to describe a model, based on a graph data structure, that can incrementally evolve with the bitcoin network and with an increasing knowledge of address-controller associations. Unfortunately, this model is highly vulnerable to corruption: if the raw information (e.g. collected on the web) turns out to be affected by errors, it quickly causes clusters (especially the biggest ones, belonging to influential actors of the network) to tie together, hence distorting results. A model for adding edges to the graph is necessary, and it should take into account the size of the clusters that are going to be merged, and the amount and quality of the information collected.

Bitcoin service providers are neither strong nor decentralised (yet), so users have a strong interest in forensic analysis, as confirmed by the discussions about the Mt. Gox and Cryptolocker cases. For this reason it would be valuable if they could play an active role in this activity, by reporting the information they own about the subjects they want to monitor, in a crowd-sourced fashion.
Chapter 7: The Query Engine

The bitcoin query engine is the component responsible for answering the user requests defined in the requirements chapter. It is designed to be extremely fast and scalable.

Balance queries

Let's recall example queries about balances, such as:
What was the balance of 1dice at Mon, 26 May 14:07:44 UTC?
What was the balance of the cluster containing 1dice at Mon, 26 May 14:07:44 UTC?

Given an address a and an instant of time t, the balance is a non-negative value and can be obtained from bitcoin transactions with the following formula:

\[ balance(a, t) = \sum_{T :\, ts(T) \le t} \left( \sum_{\substack{0 \le j < N_T \\ out_T(j) = a}} value(out_T(j)) \; - \sum_{\substack{0 \le i < M_T \\ in_T(i) = a}} value(in_T(i)) \right) \]

where T is a transaction with Nt outputs (Nt>0) and Mt inputs (Mt may be 0 for mining transactions), whose timestamp ts(T) is not greater than t. In other words, evaluating the balance of an address means summing over the boundaries (unspent outputs) of the transaction graph at time t: if an output is already spent at time t, it is necessary to cancel the corresponding positive addend using a negative one. Given this definition of address balance, evaluating a cluster balance just requires summing the balances of all the addresses of that cluster.

To efficiently compute the balance of addresses and clusters, a data structure called balance element is defined. Balance elements are built starting from transactions: given a transaction T, timestamped at time t, with Mt input addresses in(i) and Nt output addresses out(i), for each address among the input addresses or the output addresses a balance element is defined as follows:

field        description
tx id        (optional) identifies the transaction that generated the balance element. Can be either the hash of the transaction or a progressive identifier.
address      identifies the address of the input/output
cluster id   identifies the cluster to which the address belongs
timestamp    the time at which the transaction was timestamped into the blockchain
amount       amount of satoshis transferred from / to the specified address. The amount is positive for output addresses and negative for input addresses.

Using balance elements, the balance of an address can be evaluated by aggregating the amounts of all the relevant balance elements.

Let's consider, for instance, the transaction identified by 58545bb4cdbd0272df60efa969e1f9604944c507d5634926c6bfd113a9712c2d (footnote 7), which has two inputs and two outputs and has been timestamped in the block with height 300934 and timestamp -05-15 23:41:01.

Footnote 7: https://blockchain.info/tx/58545bb4cdbd0272df60efa969e1f9604944c507d5634926c6bfd113a9712c2d
The selected transaction has two inputs (1Ai and 16U) and two outputs (1Et and 1Q9), hence it produces 4 balance elements, with negative variations for the balances of 1Ai and 16U and positive variations for the balances of 1Et and 1Q9. In particular, the following balance elements are inserted into the query engine database.

       tx_id   address   cluster_id (invented)   timestamp     amount
BE1    5854    1Ai       5                       1400197261    -0.01
BE2    5854    16U       5                       1400197261    -0.01
BE3    5854    1Et       5                       1400197261    0.008
BE4    5854    1Q9       3                       1400197261    0.012

Since each transaction produces a balance element for each input and a balance element for each output, the total number of balance elements can be estimated, starting from the number of transactions, using the following formula:

\[ n_{BE} = n_t \cdot \left( E[n_{in} / n_t] + E[n_{out} / n_t] \right) \]

Considering that the expected numbers of inputs and outputs of a transaction are constant in time, it is possible to assume that the number of balance elements is linear in the number of transactions.
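The construction of balance elements and the balance aggregation can be sketched as follows. This is an illustration only: the tuple layout, the function names and the `cluster_of` mapping are assumptions, and amounts are expressed in satoshis (0.01 BTC = 1,000,000 satoshis), so the values below restate the example table rather than add new data.

```python
from collections import namedtuple

BalanceElement = namedtuple(
    "BalanceElement", "tx_id address cluster_id timestamp amount")


def balance_elements(tx_id, timestamp, inputs, outputs, cluster_of):
    """Build one element per input (negative) and per output (positive).

    inputs and outputs are lists of (address, amount in satoshis).
    """
    elements = [BalanceElement(tx_id, a, cluster_of[a], timestamp, -v)
                for a, v in inputs]
    elements += [BalanceElement(tx_id, a, cluster_of[a], timestamp, v)
                 for a, v in outputs]
    return elements


def address_balance(elements, address, t):
    """Sum the amounts of the elements of one address up to time t."""
    return sum(e.amount for e in elements
               if e.address == address and e.timestamp <= t)


def cluster_balance(elements, cluster_id, t):
    """Sum the amounts of the elements of one cluster up to time t."""
    return sum(e.amount for e in elements
               if e.cluster_id == cluster_id and e.timestamp <= t)


# The example transaction, with the (invented) cluster ids of the table.
clusters = {"1Ai": 5, "16U": 5, "1Et": 5, "1Q9": 3}
example = balance_elements(
    5854, 1400197261,
    inputs=[("1Ai", 1_000_000), ("16U", 1_000_000)],
    outputs=[("1Et", 800_000), ("1Q9", 1_200_000)],
    cluster_of=clusters)
```

Taken in isolation, this transaction lowers the balance of cluster 5 by 1,200,000 satoshis and raises the balance of cluster 3 by the same amount.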
Flow queries

Flow queries aim to answer questions like the following:
What was the flow between address1 and address2 between May 26, 2013 and May 01,?
What was the flow between the cluster controlled by Alice and the cluster controlled by Bob between May 26, 2013 and May 01,?

The issue of flow queries deserves more attention. In fact, because of multi-input transactions, defining the flow between addresses is not trivial. For example, let's consider a transaction T with two inputs and two outputs. How does this transaction contribute to the flow between in(0) and out(0)? It may be 5, but also 1, or 0. In short, the flow between addresses is not well defined in the case of multi-input transactions.

However, assuming that bitcoin users do not share their private keys (therefore making the first heuristic deterministic), it is possible to give a definition of the flow between addresses that is consistent with the flow between clusters. The flow between two addresses (a1 and a2) is the sum of the output values deposited to a2 in transactions having a1 among their inputs, each divided by the total number of input addresses of the transaction. Considering the previous transaction, the flow between in(0) and out(0) is equal to 2.5, just like the flow between in(1) and out(0). Since in(0) and in(1) are in the same cluster (first heuristic), it is possible to obtain the flow between clusters by summing over the flows between addresses, obtaining a total flow of 5 between the cluster in(0)-in(1) and out(0). The following formula defines the flow between addresses and clusters.

\[ flow(a_1, a_2) = \sum_{T :\, a_1 \in in_T} \frac{1}{M_T} \sum_{\substack{0 \le j < N_T \\ out_T(j) = a_2}} value(out_T(j)), \qquad flow(C_1, C_2) = \sum_{a_1 \in C_1} \sum_{a_2 \in C_2} flow(a_1, a_2) \]

Another issue with flows is mining. It would be interesting to be able to link balances and flows together with the following formula:

[equation not preserved]

This equation, though simple, does not hold in bitcoin unless we extend the flow definition to transactions with no inputs (mining transactions). For this reason the set of addresses was extended with a special address, called the mine address. So, if you ask the virex query engine for the flow between the mine address and another address, you are asking for the amount of bitcoins mined by that address.

To efficiently compute flows between addresses and clusters, another simple data structure, called flow element, is defined. Flow elements are also built starting from transactions: given a transaction T, timestamped at time t, with Mt>0 input addresses (the mining address is considered as an input address if the transaction has no input addresses) and Nt output addresses out(0) ... out(Nt-1), for each pair (in(i), out(j)) a flow element is defined as follows:

field        description
tx id        (optional) identifies the transaction that generated the flow element
payer        identifies the payer address in(i), or the mining address if the transaction has no inputs
payer_cid    identifies the cluster of the payer address
payee        identifies the payee address out(j)
payee_cid    identifies the cluster of the payee address
timestamp    the time at which the transaction was timestamped into the blockchain
flow         amount of satoshis transferred from the payer to the payee address, evaluated using the definition of flow between addresses

Using flow elements, the flow between two addresses can be evaluated by aggregating the amounts of all the relevant flow elements.

The total number of flow elements can be estimated starting from the number of transactions. For each transaction, a flow element is generated for each pair of input and output addresses, as described by the following formula:

[equation not preserved]

Just like for balance elements, it is possible to conclude that the number of flow elements is linear in the number of transactions.
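The definition of flow elements given above can be sketched as follows. This is an illustration under stated assumptions: the function names and the "MINE" label for the special mine address are hypothetical, input addresses are assumed to be distinct, and the cluster fields (payer_cid, payee_cid) are left out for brevity.

```python
MINE_ADDRESS = "MINE"  # hypothetical label for the special mine address


def flow_elements(tx_id, timestamp, input_addresses, outputs):
    """Build one flow element per (payer, payee) pair of a transaction.

    outputs is a list of (address, amount). Each output value is divided
    by the number of input addresses; a transaction with no inputs is
    attributed to the mine address.
    """
    payers = list(input_addresses) or [MINE_ADDRESS]
    return [
        {"tx_id": tx_id, "payer": payer, "payee": payee,
         "timestamp": timestamp, "flow": amount / len(payers)}
        for payer in payers
        for payee, amount in outputs
    ]


def flow(elements, payer, payee, t_start, t_end):
    """Sum the flow from payer to payee within [t_start, t_end]."""
    return sum(e["flow"] for e in elements
               if e["payer"] == payer and e["payee"] == payee
               and t_start <= e["timestamp"] <= t_end)


# Two inputs and two outputs, with 5 deposited to out0 as in the text.
elements = flow_elements("T", 10, ["in0", "in1"], [("out0", 5), ("out1", 1)])
mined = flow_elements("coinbase", 11, [], [("miner", 50)])
```

Summing the two address-level flows towards out0 gives the cluster-level flow of 5 discussed in the text, and a transaction with no inputs yields a single element whose payer is the mine address.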
Implementation details

The Virex query engine has been implemented using SQLite (footnote 8), an open source software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite has been chosen for its simplicity and speed, but there are probably many alternative and faster solutions, in particular NoSQL databases for realtime analytics.

Balance elements and flow elements are stored in two different tables; the first has the following columns (footnote 9):

Column name       Datatype   Size
tx_id             INTEGER    1 to 8 bytes, depending on the magnitude
address_id        INTEGER    1 to 8 bytes
cluster_id        INTEGER    1 to 8 bytes
timestamp         INTEGER    1 to 8 bytes
amount            INTEGER    1 to 8 bytes
row_id (hidden)   INTEGER    1 to 8 bytes

Moreover, in order to speed up selection operations on address, cluster and timestamp, and aggregation operations on amounts, two covering indices (footnote 10) are defined.

CREATE INDEX x_balance_elements_covering ON balance_elements (address_id,timestamp,amount);
CREATE INDEX x_balance_clusters_covering ON balance_elements (cluster_id,timestamp,amount);

In SQLite, indices are implemented using a B-Tree, so the index size is proportional to the number of indexed elements. Note that each index element has, in addition to the indexed fields, a hidden row id of type INTEGER.

Footnote 8: www.sqlite.org
Footnote 9: www.sqlite.org/datatype3.html
Footnote 10: www.sqlite.org/queryplanner.html
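The balance table and its covering indices can be exercised with Python's built-in sqlite3 module. This is a minimal sketch, not the thesis code: the in-memory database, the numeric address ids 1 to 4, the helper names and the SQL text of the queries are assumptions, while the table columns and index definitions follow the ones listed above. Amounts are in satoshis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE balance_elements (
    tx_id INTEGER, address_id INTEGER, cluster_id INTEGER,
    timestamp INTEGER, amount INTEGER
);
CREATE INDEX x_balance_elements_covering
    ON balance_elements (address_id, timestamp, amount);
CREATE INDEX x_balance_clusters_covering
    ON balance_elements (cluster_id, timestamp, amount);
""")

# The four balance elements of the example transaction (ids are invented).
rows = [
    (5854, 1, 5, 1400197261, -1000000),
    (5854, 2, 5, 1400197261, -1000000),
    (5854, 3, 5, 1400197261, 800000),
    (5854, 4, 3, 1400197261, 1200000),
]
conn.executemany("INSERT INTO balance_elements VALUES (?, ?, ?, ?, ?)", rows)


def address_balance(conn, address_id, t):
    """Balance of one address at time t; every column read is in the index."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM balance_elements "
        "WHERE address_id = ? AND timestamp <= ?", (address_id, t)).fetchone()
    return row[0]


def cluster_balance(conn, cluster_id, t):
    """Balance of one cluster at time t, served by the cluster index."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM balance_elements "
        "WHERE cluster_id = ? AND timestamp <= ?", (cluster_id, t)).fetchone()
    return row[0]
```

Because each query filters on the leading index columns and reads only the amount, SQLite can in principle answer it from the index alone without touching the table rows, which is the purpose of a covering index. Running EXPLAIN QUERY PLAN on the same statements shows which index is actually chosen.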
In the other table, flow elements, the following columns and indices have been defined:

Column name       Datatype   Size
tx_id             INTEGER    1 to 8 bytes
payer             INTEGER    1 to 8 bytes
payer_cid         INTEGER    1 to 8 bytes
payee             INTEGER    1 to 8 bytes
payee_cid         INTEGER    1 to 8 bytes
timestamp         INTEGER    1 to 8 bytes
flow              REAL       8 bytes
row_id (hidden)   INTEGER    1 to 8 bytes

CREATE INDEX x_flow_elements_covering ON flow_elements (payer,payee,timestamp,flow);
CREATE INDEX x_flow_clusters_covering ON flow_elements (payer_cid,payee_cid,timestamp,flow);

The following tables report the table sizes for balance elements and flow elements, and for the respective indexes. As highlighted, there is a slight mismatch between the estimated and the actual number of flow elements.

                   Maximum row size   Expected number   Actual number   Maximum expected   Actual size
                   per element        of elements       of elements     size
Balance Elements   48 bytes           ~188M             ~188M           ~8.4 GB            ~5.5 GB
Flow Elements      64 bytes           ~230M             ~283M           ~16.8 GB           ~ 11 GB

                   Maximum row size   Index size        Actual number   Maximum expected   Actual size
                   per element        per element       of elements     size
Balance Elements   48 bytes           24+24 bytes       ~188M           ~ 17.6 GB          ~ 14.7 GB
Flow Elements      64 bytes           40+40 bytes       ~283M           ~ 39 GB            ~ 27 GB
Chapter 8: The graphical user interface

The Virex graphical user interface is a web-based, single-page application that shows a stacked balance-time graph of different bitcoin entities, with arrows representing the flows among them.

The interaction starts with the input search box (1) in the navigation bar at the top of the page, which allows the user to find an existing address or a known controller. For example, it is possible to search for an address (e.g. 1PeppeMEUXx6XgjubBnEQKtay2xpefnCZT) or for the name of a controller (e.g. Cryptolocker). By clicking on a string that appears in the type-ahead, we add the selected entity to a list of interesting entities (2) and its balance is plotted (3) on a graph with a logarithmic scale.
It is possible to remove an entity from the diagram by clicking on the corresponding X in the list of entities (2). It is important to note that, as confirmed by the checkbox on the navigation bar (4), virex is displaying a clustered diagram, i.e. the selected entity is considered as a representative of its cluster, and the sum of the balances of all the addresses in the same cluster is shown (according to the definition of balance for clusters given in the chapter The Query Engine). It is possible to show the balance of the selected entity alone by turning off the checkbox (4).
We can now switch to the Profile page (5) to get some information about the selected entities (cluster id, cluster size, current balance and alleged controller). It is also possible to modify the current controller by clicking on the edit button (6) and then confirming with the checkmark (7). When the controller of an address is changed, the identity reasoner is involved to check that all invariants (SAME_CONTROLLER above all) are verified. In particular, if a controller is set for an address and another address with the same controller exists in the identity reasoner database, then the clusters of these addresses are merged into one.
Now let's add another entity to the list of interesting entities, by searching for it with the text box on the navigation bar. In addition to balances, you can now see the flows among the interesting entities (8). Bitcoin flows are represented with arrows, and the amount of bitcoin transferred from one entity to another in the interval of time delimited by the white vertical lines is shown next to the arrow.

The entities displayed on the graph can be swapped by using the left column and dragging the selected entity to the desired position. Moreover, it is possible to interact with the diagram by clicking and dragging to select a period of time to zoom in on (10).
Alternatively, you can set up the x-axis by opening the collapsed time setup window (11). In the panel that appears, you can select the starting and ending date (13) and the number of ticks (12). In the above screenshot we have reduced the number of ticks to 5, and hidden the flows using the checkbox on the navigation bar (14).

In the picture above, the number of ticks has been increased and a more detailed balance diagram is shown. Unfortunately it is impossible to read the dates, because they overlap each other.
Where are the mined bitcoins? It is possible to search for the mine address (15), and the amount of mined bitcoins appears in the diagram.
Implementation details

The Virex graphical user interface has been implemented using html5 technologies (html, css, javascript), with the aid of many different libraries.

The web application is entirely orchestrated by the angular-js (footnote 11) library. This open source framework adds a model-view-controller abstraction on top of DOM manipulation and excels at building dynamic views.

User interface components, such as the navigation bar, the type-ahead and the date-time pickers, are bootstrap (footnote 12) components completely rewritten natively in angularjs by the angular-ui project (footnote 13) team.

The draggable list of interesting addresses has been implemented using the angularjs implementation of the jquery-ui draggable component (footnote 14).

The balance diagram is implemented using the SVG stacked diagram component of the Data-Driven-Documents library (footnote 15), also known as d3. D3 is a powerful library that allows you to bind arbitrary data to the DOM, and then apply data-driven transformations to the document. Flow arrows are written entirely from scratch using the D3 library.

Footnote 11: https://angularjs.org
Footnote 12: http://getbootstrap.com
Footnote 13: http://angular-ui.github.io/bootstrap/
Footnote 14: http://jqueryui.com/draggable/
Footnote 15: http://d3js.org
Chapter 9: Experimental results
Let's start by studying the four biggest clusters and the flows of money among them. These clusters, presented in the chapter on the Identity Reasoner, each contain hundreds of thousands of addresses and are likely to be controlled by automated services, since it is implausible that such a number of addresses was generated manually by a single person. According to our data set, the second one is a Mt.Gox hot wallet. As suggested by the diagram, all big clusters are early adopters, and Mt.Gox has been active since October 2010. If we zoom in on the period from March to May 2013 and increase the number of ticks for the diagram to 20, we end up with a figure that shows a significant drop in the Mt.Gox, cluster3 and cluster2 balances, while cluster1 has an approximately constant balance (remember that the scale is logarithmic).
If we display flows among these entities in the period from 2009 Jan to May (full bitcoin history), we see a significant volume of flows between Mt.Gox and clusters 3 and 4, while cluster 1 remains quite isolated.
These results might suggest that clusters 3 and 4 are controlled by Mt.Gox too, but this is just a hypothesis. They all show the same descending trend (potentially due to transaction malleability attacks in the period from 2013 March to March) and are highly coupled by bitcoin flows (transfers of funds among hot wallets). While the heuristics link together addresses that belong to the same wallet (and hence have the same controller), this line of reasoning, which uses flows and balances, could, with the aid of massive data collection, make it possible to cluster together distinct wallets. We will now show that cluster1 and cluster4 are used to launder bitcoins stolen in the flexcoin theft case. Thanks to information provided by flexcoin, we were able to identify two of its hot wallets, whose representatives are addresses 1GEhfbj and 1DSD3B3. These addresses are coloured purple and sky blue respectively.
Moreover, starting from the accused addresses 1NDkeva and 1QFcC5J, declared by flexcoin as belonging to the thief, we identified a cluster of seven addresses of the same wallet, coloured light green. We focused on the period from 2013 March 02 to 2013 March 03 and visualised the theft flows: as described by flexcoin, the thief deposited 0.011 bitcoin to one of the flexcoin addresses (from 21:32 to 00:46 CEST), then transferred 864 bitcoin to his wallet (00:46 to 04:01 CEST), and soon after emptied it. We then used the query engine, directly and without the graphical user interface, to identify the clusters that received these bitcoins. Part of them (408) flows to cluster4 (green, Mt.Gox?), which is also a common friend of flexcoin and its thief, in the sense that there are many flows from 1DSD3B3 to cluster4, some dating back to 2011. If further investigation confirms that cluster4 actually belongs to Mt.Gox, this is just another chance to highlight the current centrality of bitcoin exchanges in forensic analysis.
Another part of the stolen bitcoins (185) flows to cluster1 (violet), whose controller is currently unknown. Flexcoin has no flow relationships with cluster1. Moreover, 175 bitcoins from the theft flow to a single, unclustered address, 12Cxy5, and are then spent, but we didn't follow this track. As depicted by the Virex diagrams, these transactions happen very quickly: the money is stolen and laundered in less than half a day.
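The step of identifying which clusters received the stolen bitcoins can be thought of as grouping the outgoing transfers of the thief's wallet by the cluster of each receiving address. The sketch below illustrates that grouping in Python under stated assumptions. It does not reproduce the query engine's interface, and the `address_to_cluster` mapping, the address names and the amounts are all made up for the example.

```python
from collections import defaultdict
from decimal import Decimal


def outflows_by_cluster(outputs, address_to_cluster):
    """Group outgoing amounts by the cluster of the receiving address.

    `outputs` is an iterable of (destination_address, amount) pairs.
    Addresses missing from `address_to_cluster` are reported on their own,
    keyed by the address itself, since they belong to no known cluster.
    """
    totals = defaultdict(Decimal)
    for address, amount in outputs:
        totals[address_to_cluster.get(address, address)] += amount
    # Largest receivers first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical clustering result and outgoing transfers.
address_to_cluster = {
    "addr_a": "cluster_x",
    "addr_b": "cluster_x",
    "addr_c": "cluster_y",
}
outputs = [
    ("addr_a", Decimal("3")),
    ("addr_c", Decimal("2")),
    ("addr_b", Decimal("1.5")),
    ("addr_d", Decimal("1")),  # not part of any known cluster
]
ranking = outflows_by_cluster(outputs, address_to_cluster)
```

Keeping unclustered addresses in the ranking under their own name mirrors the situation described above, where part of the funds reaches a single address that belongs to no cluster.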
Future directions
Virex4bitcoin enables many kinds of investigation on the blockchain. Thanks to its graphical user interface, it is possible to easily get an idea of what's going on in the bitcoin network. Further developments are listed below:
- Implementation of the real time tracker. The architecture should be scalable and ready for real time tracking of confirmed transactions.
- Formal definition of other flow based queries and their implementation. During a forensic exploration of the bitcoin transaction graph, questions like "Where does this amount of money go?" or "Which entity did this address interact with in this period of time?" usually arise. These questions lack a formal definition and an implementation. Moreover, simple GUI interactions should be considered to allow for a relaxing and intriguing forensic session that includes this kind of query.
- Flow based approach to address clustering. Do flows include significant patterns that enable further heuristics for clustering wallets controlled by the same entity?
- Statistical approach to address tagging. The dataset collected on the web has proven to be affected by errors. A statistical approach, with the aid of machine learning techniques, could reduce the number of incorrect tags.