Blockchain, Throughput, and Big Data Trent McConaghy




Blockchain, Throughput, and Big Data Trent McConaghy Bitcoin Startups Berlin Oct 28, 2014

Outline
  Throughput numbers
  Big data
  Consensus algorithms
  ACID
  Blockchain Big data?
  Conclusion

Throughput numbers
  Bitcoin: typically 1 transaction per second
  Blockchain size is 25 GB, growing by more than 10 GB/yr (chart: about 11 GB in Nov '13, 25 GB in Nov '14)
  Takes 1 day to download
  [blockchain.info, retrieved Oct 29, 2014]

Throughput numbers
  Bitcoin: 1 tps (observed)
  Bitcoin: 7 tps (theoretical max due to block size limit)
  VISA: 2000 tps
  Twitter: 5000 tps
  Twitter: 150,000+ tps at peak
  [https://en.bitcoin.it/wiki/scalability]
  [https://blog.twitter.com/2013/new-tweets-per-second-record-and-how]

Throughput numbers
  What ifs on Bitcoin, all else equal:
  2000 tps: 10 GB/yr * 2000/7 = 2.86 TB/yr = 7.8 GB/day
  150,000 tps: 10 GB/yr * 150,000/7 = 214 TB/yr = 587 GB/day (grows faster than you can download!)
  Q: Why not just focus on unspent outputs?
  A: There are many use cases where we want to see all past transactions: auditing, compliance, transparency, ownership history, other blockchain apps. Maybe most things beyond a simple payment?
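The what-if arithmetic can be replayed with a few lines of Python. This is only a back-of-envelope sketch that evaluates the slide's formula literally. It assumes that "Gb" means gigabytes, decimal units (1 TB = 1000 GB), a 365-day year, and storage growth that scales linearly with throughput; the function and constant names are made up for illustration.

```python
# Back-of-envelope sketch of the slide's formula: 10 GB/yr * target_tps / 7.
# Assumptions (not from the talk): decimal units, 365-day year, linear scaling.

BASELINE_GROWTH_GB_PER_YEAR = 10.0  # ">10Gb / yr" from the throughput slide
BASELINE_TPS = 7.0                  # divisor used in the slide's formula


def projected_growth_gb_per_year(target_tps: float) -> float:
    """Scale the baseline yearly growth linearly to a target throughput."""
    return BASELINE_GROWTH_GB_PER_YEAR * target_tps / BASELINE_TPS


if __name__ == "__main__":
    for tps in (2_000, 150_000):
        per_year = projected_growth_gb_per_year(tps)
        print(f"{tps:>7} tps: {per_year / 1000:.2f} TB/yr, {per_year / 365:.1f} GB/day")
```

Comparing the GB/day figure with the one-day download time for 25 GB gives a rough sense of when a new node could no longer catch up.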

How?
  Scaling up blockchain? Is this the right question to ask?
  What does Big Data have to say?

Block chain: what
  "A block chain is a transaction database shared by all nodes participating in a system based on the Bitcoin protocol."
  https://en.bitcoin.it/wiki/block_chain

From Big Data: Cassandra DB. Scale-up linearity!
  http://1.bp.blogspot.com/-ZFtW7MFMqZQ/TrG5ujuDGdI/AAAAAAAAAWw/heceemd50x4/s1600/scale.png

Cassandra DB: How it works
  http://stevenpoitras.com/the-nutanix-bible/
  http://1.bp.blogspot.com/-Hsr6O8pwyzU/TrG5didDPjI/AAAAAAAAAWM/A3vL3wMkQgw/s1600/global.png

Cassandra DB: Consensus Paxos Algorithm (more later)

Block chain DB: Consensus
  "Bitcoin uses the block chain algorithm to achieve distributed consensus on who owns what coins."
  https://en.bitcoin.it/wiki/alternative_chain

A walk down history lane: a gauntlet was thrown down
  "Can you implement a distributed database that can tolerate the failure of any number of its processes (possibly all of them) without losing consistency, and that will resume normal behavior when more than half the processes are again working properly?"
  -Leslie Lamport to colleagues, 1980 (referring to Paxos)
  http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html

Gauntlet 2, Byzantine problem
  "Before [1], it was generally assumed that a three-processor system could tolerate one faulty processor. This paper shows that "Byzantine" faults, in which a faulty processor sends inconsistent information to the other processors, can defeat any traditional three-processor algorithm. (The term Byzantine didn't appear until [46].) In general, 3n+1 processors are needed to tolerate n faults. However, if digital signatures are used, 2n+1 processors are enough. This paper introduced the problem of handling Byzantine faults. I think it also contains the first precise statement of the consensus problem."
  -Leslie Lamport
  http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html
  [1] L. Lamport et al, "Reaching Agreement in the Presence of Faults", J. ACM 27(2), April 1980
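The two bounds in the quote can be written down directly. The helper below is a hypothetical illustration, not code from the paper: it simply encodes "3n+1 processors without signatures, 2n+1 with digital signatures" as stated above.

```python
def min_processors(faults: int, signed_messages: bool = False) -> int:
    """Processors needed to tolerate `faults` Byzantine faults, per the quoted bounds.

    3n+1 in general; 2n+1 when digital signatures are used.
    """
    if faults < 0:
        raise ValueError("faults must be non-negative")
    multiplier = 2 if signed_messages else 3
    return multiplier * faults + 1


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, min_processors(n), min_processors(n, signed_messages=True))
```

For one faulty processor the unsigned bound is four, which matches the quote's point that three processors are not enough.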

Towards Practical Solutions to Byzantine Generals Problem
  "A fault-tolerant file system called Echo was built at SRC in the late 80s. The builders claimed that it would maintain consistency despite any number of non-byzantine faults, and would make progress if any majority of the processors were working. As with most such systems, it was quite simple when nothing went wrong, but had a complicated algorithm for handling failures based on taking care of all the cases that the implementers could think of. I decided that what they were trying to do was impossible, and set out to prove it."
  http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html

The Practical Solution to Byzantine Generals Problem
  "I decided that what they were trying to do was impossible, and set out to prove it. Instead, I discovered the Paxos algorithm... At the heart of the algorithm is a three-phase consensus protocol...to my knowledge, Paxos contains the first three-phase commit algorithm that is a real algorithm, with a clearly stated correctness condition and a proof of correctness.."
  http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html
  L. Lamport et al, "The Byzantine Generals Problem", ACM Transactions on Programming Languages and Systems 4 (3): 382-401, July 1982

Extensions to Paxos
  "Byzantine Paxos[8][10] adds an extra message (Verify) which acts to distribute knowledge and verify the actions of the other processors"
  http://en.wikipedia.org/wiki/paxos_(computer_science)#byzantine_paxos
  [8] Lamport, Leslie (2005). "Fast Paxos".
  [10] Castro, Miguel (2001). "Practical Byzantine Fault Tolerance".

Opinions on Paxos
  "there is only one consensus protocol, and that's Paxos"
  -Mike Burrows, inventor of Chubby service at Google
  "all other approaches are just broken versions of Paxos."
  "The Paxos protocol.. is famously subtle and a bit difficult to.. it's clear that a good consensus protocol is surprisingly hard to find."
  http://the-paper-trail.org/blog/consensus-protocols-two-phase-commit/

Production Use of Paxos
  Google: all products that use BigTable or Spanner (search, analytics, email,..)
  Clustrix: distributed SQL DB
  Apache Cassandra: distributed NoSQL
  FoundationDB: distributed NewSQL
  Neo4j HA: graph DB
  (and most other modern distributed DBs!)
  Heroku / Salesforce, Microsoft, IBM, [wikipedia, more]

Explaining Paxos: Two Phase Commit (2PC)
  (1) "Do you?"
  (2) "I do!" "I do!"
  Problem: inconsistent result if a node fails after "Do you?"
  http://the-paper-trail.org/blog/consensus-protocols-two-phase-commit/
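The two phases and the failure case can be sketched in a few lines of Python. This is a toy, single-process model written for illustration, not any database's implementation: the Participant class, the state names, and the crash_after_sends parameter are all assumptions. The coordinator asks every participant to vote ("Do you?"), then sends the decision; a crash partway through the second phase leaves the participants in different states.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Participant:
    name: str
    will_vote_yes: bool = True
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1 ("Do you?"): vote, and remember that we voted.
        self.state = "voted_yes" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def finish(self, commit: bool) -> None:
        # Phase 2: apply the coordinator's decision.
        self.state = "committed" if commit else "aborted"


def two_phase_commit(participants: List[Participant],
                     crash_after_sends: Optional[int] = None) -> Optional[bool]:
    """Run 2PC; optionally simulate a coordinator crash during phase 2.

    Returns the decision, or None if the coordinator crashed before
    delivering it to every participant.
    """
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    for sent, p in enumerate(participants):
        if crash_after_sends is not None and sent >= crash_after_sends:
            return None  # remaining participants never learn the outcome
        p.finish(decision)
    return decision


if __name__ == "__main__":
    nodes = [Participant("a"), Participant("b")]
    print(two_phase_commit(nodes, crash_after_sends=1), [p.state for p in nodes])
```

In the crash run one node has committed while the other is still waiting in its voted state, which is the kind of inconsistency the slide points at.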

Explaining Paxos: Three Phase Commit (3PC)
  Break the "I do" part into two phases:
  Prepare to commit. ("Don't listen to any more do-you's for now")
  Commit. "I do!"
  Can still fail: one site is in "prepared to commit" when another is not.
  http://the-paper-trail.org/blog/consensus-protocols-three-phase-commit/

Explaining Paxos: The Algorithm
  http://the-paper-trail.org/blog/consensus-protocols-paxos/
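The slide's diagram is not reproduced in this transcript, so as a stand-in here is a minimal sketch of textbook single-decree Paxos in Python. It is an in-memory, synchronous toy under stated assumptions: there is no networking, no failure handling, and no leader election, and the class and function names are invented for illustration. It may differ from what the linked article or the slide showed.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1         # highest proposal number promised so far
    accepted_n: int = -1       # number of the last accepted proposal (-1 = none)
    accepted_value: Any = None

    def on_prepare(self, n: int) -> Optional[Tuple[int, Any]]:
        # Phase 1b: promise to ignore proposals numbered below n,
        # and report any value already accepted.
        if n > self.promised:
            self.promised = n
            return (self.accepted_n, self.accepted_value)
        return None

    def on_accept(self, n: int, value: Any) -> bool:
        # Phase 2b: accept unless a higher-numbered promise was made.
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: Any) -> Optional[Any]:
    """Try to get a value chosen with proposal number n.

    Returns the chosen value, or None if no majority was reached.
    """
    majority = len(acceptors) // 2 + 1

    # Phase 1a: ask every acceptor for a promise.
    replies = [a.on_prepare(n) for a in acceptors]
    promises = [r for r in replies if r is not None]
    if len(promises) < majority:
        return None

    # If any acceptor already accepted a value, adopt the one with the
    # highest proposal number instead of our own.
    prev_n, prev_value = max(promises, key=lambda p: p[0])
    if prev_n >= 0:
        value = prev_value

    # Phase 2a: ask every acceptor to accept the value.
    accepted = sum(1 for a in acceptors if a.on_accept(n, value))
    return value if accepted >= majority else None


if __name__ == "__main__":
    cluster = [Acceptor() for _ in range(3)]
    print(propose(cluster, 1, "A"))  # first proposal is chosen
    print(propose(cluster, 2, "B"))  # later proposer must adopt the chosen value
```

The key property on display is that once a value is chosen, a later proposer with a higher number learns it in phase 1 and re-proposes it rather than overwriting it.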

ACID Compliance
  A set of properties that guarantee that DB transactions can be processed reliably. [wikipedia]
  Atomicity -- each transaction should be all or nothing
  Consistency -- any transaction will bring the DB from one valid state to another
  Isolation -- concurrent execution of transactions results in the system state that would be obtained if the transactions were executed serially
  Durability -- once a transaction has been committed, it will remain so even in the event of power loss, errors,..
  These guarantees resemble what is needed to prevent double spend.
  Many distributed NoSQL and NewSQL DBs are ACID compliant.
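As a small illustration of how atomicity and consistency relate to double spend, the sketch below uses Python's built-in sqlite3 module on a single in-memory database. The accounts table, the names, and the balances are made up for the example, and a single-node SQLite file is of course not a distributed ledger. A CHECK constraint forbids negative balances, and each transfer runs in one transaction, so a second attempt to spend the same coins is rolled back as a whole.

```python
import sqlite3


def make_ledger() -> sqlite3.Connection:
    """Create an in-memory ledger where balances may never go negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        [("alice", 10), ("bob", 0), ("carol", 0)],
    )
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, sender: str, receiver: str, amount: int) -> bool:
    """Move coins atomically; return False if the transfer was rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts"))


if __name__ == "__main__":
    ledger = make_ledger()
    print(transfer(ledger, "alice", "bob", 10))    # first spend
    print(transfer(ledger, "alice", "carol", 10))  # attempted double spend
    print(balances(ledger))
```

The receiver is credited before the sender is debited on purpose: when the debit violates the constraint, the rollback also undoes the credit, which is the all-or-nothing behaviour the slide describes.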

Block chain: A design decision
  "The block chain is broadcast to all nodes on the networking [sic] using a flood protocol" [https://en.bitcoin.it/wiki/block_chain]
  (Very inefficient!)
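To give a feel for the cost of flooding, here is a toy message count in Python. It models a naive flood in which every node forwards a message to all of its peers except the one it came from, the first time it sees it. This is an assumption made for illustration and is not Bitcoin's actual relay code; the full_mesh helper and the network sizes are hypothetical.

```python
from collections import deque
from typing import Dict, List


def full_mesh(n: int) -> Dict[int, List[int]]:
    """Hypothetical topology: every node is connected to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def flood(neighbors: Dict[int, List[int]], origin: int) -> int:
    """Count messages sent when `origin` floods one item through the network."""
    seen = {origin}
    queue = deque([(origin, None)])  # (node, peer it heard the item from)
    messages = 0
    while queue:
        node, sender = queue.popleft()
        for peer in neighbors[node]:
            if peer == sender:
                continue  # do not echo straight back to the sender
            messages += 1
            if peer not in seen:
                seen.add(peer)
                queue.append((peer, node))
    return messages


if __name__ == "__main__":
    for n in (4, 10, 100):
        print(n, "nodes:", flood(full_mesh(n), 0), "messages vs", n - 1, "strictly needed")
```

In a full mesh this naive flood sends (n-1)^2 messages to deliver something that needs only n-1, so the redundancy grows with network size.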

My thoughts: Two approaches to scale up
  Big data-fy the blockchain
    Builds on man-decades of work
    Significant scalability hurdles?
  <or>
  Blockchain-ify big data
    Builds on man-centuries (millennia?) of work
    Scalability challenges already resolved
    Needs distributed control (which isn't easy either!)

Questions I have
  To what extent does ACID compliance disallow double spend? (Especially for practical purposes)
  Why doesn't the Bitcoin community talk more about Paxos? (Yet it does talk about Byzantine Generals)
  Why isn't the blockchain itself truly distributed? (in the sense that pieces of it are sharded throughout the network, rather than a full duplicate everywhere)
  Am I missing something? Is there something fundamentally wrong with blockchain-ifying big data?
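The storage side of the sharding question can be put in numbers. The sketch below compares a full duplicate on every node with shards kept on a fixed number of replicas. The 25 GB chain size comes from the throughput slide; the node count, the replication factor of 3, and the function name are assumptions chosen only for illustration, and even shard placement is assumed.

```python
from typing import Optional


def storage_per_node_gb(chain_gb: float, nodes: int,
                        replication_factor: Optional[int] = None) -> float:
    """Average storage each node carries.

    replication_factor=None models a full copy on every node; otherwise
    each piece of data lives on `replication_factor` nodes, spread evenly.
    """
    if nodes < 1:
        raise ValueError("need at least one node")
    if replication_factor is None:
        return chain_gb
    if not 1 <= replication_factor <= nodes:
        raise ValueError("replication_factor must be between 1 and nodes")
    return chain_gb * replication_factor / nodes


if __name__ == "__main__":
    chain_gb, nodes = 25.0, 1000  # 1000 nodes is a made-up network size
    print("full copy:", storage_per_node_gb(chain_gb, nodes), "GB per node")
    print("sharded x3:", storage_per_node_gb(chain_gb, nodes, 3), "GB per node")
```

With full duplication the per-node cost stays equal to the whole chain no matter how many nodes join, while with sharding it shrinks as the network grows.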

A Partial A to One Q
  "How does Paxos compare to the Bitcoin protocol? I've seen claims that Bitcoin is the first practical solution to the Byzantine Generals Problem.. but it seems like Paxos is a better solution. (Handles network partition, avoids arbitrary rollback, finite time to reach consensus.)" -kens
  "Bitcoin defends against Sybil attacks by making 'hashing power' the metric for vote weight" -mindslight
  https://news.ycombinator.com/item?id=7108417
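What "hashing power as vote weight" means can be shown with a toy proof-of-work loop. This is a simplified sketch, not Bitcoin's scheme: Bitcoin double-hashes a structured block header against a compact-encoded target, whereas the code below applies a single SHA-256 to arbitrary bytes plus a nonce, and the function names and the 12-bit difficulty are assumptions.

```python
import hashlib


def pow_hash(data: bytes, nonce: int) -> int:
    """SHA-256 of data plus an 8-byte nonce, read as a 256-bit integer."""
    digest = hashlib.sha256(data + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")


def mine(data: bytes, difficulty_bits: int) -> int:
    """Search for a nonce whose hash has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while pow_hash(data, nonce) >= target:
        nonce += 1
    return nonce


if __name__ == "__main__":
    nonce = mine(b"example block", 12)
    print("nonce:", nonce, "hash:", hex(pow_hash(b"example block", nonce)))
```

Each extra difficulty bit doubles the expected number of hashes, so a participant's chance of producing the next valid block is proportional to how many hashes it can compute, not to how many identities it creates.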

Summary
  The blockchain is a DB, as are modern Big Data NoSQL and NewSQL DBs. They're all distributed.
  Distributing a DB by making a full copy on every node scales extremely poorly.
  Distributed DBs need a consensus algorithm.
    Posed as the Byzantine Generals problem (Leslie Lamport)
    Most modern Big Data DBs use the Paxos Algorithm.
    "all other approaches are just broken versions of Paxos."
  Open Q's summary:
    ACID <-> Double spend? (in practice)
    Paxos <-> BTC relation?
    Can we blockchain-ify big data?