Overview of Databases on macOS
Karl Kuehn, Automation Engineer, RethinkDB
Session Goals
- Introduce database concepts
- Show example players
Not goals:
- Cover non-macOS systems (Oracle)
- Teach you SQL
- Answer "What should I use?"
What is a Database?
"I know it when I see it" (US Supreme Court Justice Potter Stewart)
What is a Database?
- Data storage for multi-user apps
- Conservative philosophy
  - Failure is better than partial success
  - All errors should be reported
- Connects multiple simultaneous clients
- Chooses throughput over speed
- Currently dominated by open source
  - Oracle is an exception
Concepts
Durability
- Safe on disk before acknowledged
- Reliably saved despite
  - abrupt termination
  - power failure
- Disk failure should be detected
- Recovery often takes a long time
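
The "safe on disk before acknowledged" rule can be sketched with plain file I/O. This is only an illustration, not how any particular database implements its journal; `durable_append` and the `journal.log` file name are made up for the example.

```python
import os
import tempfile

def durable_append(path, record):
    """Append one record and return only after the OS reports it flushed."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()              # push Python's buffer to the operating system
        os.fsync(f.fileno())   # ask the OS to push its cache to the device
    return True                # only now is the write acknowledged

# Hypothetical journal file in a scratch directory.
log_path = os.path.join(tempfile.mkdtemp(), "journal.log")
acked = durable_append(log_path, b"insert id=1 name=Sam")
```

Waiting for `os.fsync` on every write is a large part of why durable writes are slower than writes that are acknowledged while still in memory.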
Atomicity
- Saves are all-or-nothing
- Data is rolled back on errors
- Know the atom for your database
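
A minimal sketch of all-or-nothing saves, using the SQLite module that ships with Python purely as a convenient stand-in. The `people` table is the example table from the Organization slide; the duplicate key is there only to force an error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("INSERT INTO people VALUES (1, 'Sam', 32)")
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO people VALUES (2, 'Abigail', 28)")
        conn.execute("INSERT INTO people VALUES (1, 'Ron', 23)")  # duplicate key fails
except sqlite3.IntegrityError:
    pass  # the whole transaction was rolled back, including Abigail

names = [row[0] for row in conn.execute("SELECT name FROM people ORDER BY id")]
```

After the failed transaction `names` holds only Sam: the first insert inside the block did not survive the error in the second.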
Queries
- Read or change the data
  - Filtering, aggregating, calculations
  - Insert, update, delete, replace
- Typically do not change the records
- Move the problem, not the data
- Transaction is an atom of queries
  - All queries succeed or fail
  - Wrapped up by a commit/rollback
Isolation
- Transactions build on each other
- Simulate serialization
- Roll back conflicting transactions
- Not visible to others until commit
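
"Not visible to others until commit" can be observed with two connections to the same database file. This sketch assumes SQLite with its default settings; other databases offer several isolation levels and differ in the details. The `writer` and `reader` names are just labels for the two clients.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
reader = sqlite3.connect(path)

writer.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
writer.commit()

# The insert opens a transaction on the writer; nothing is committed yet.
writer.execute("INSERT INTO people VALUES (1, 'Sam')")
before = reader.execute("SELECT count(*) FROM people").fetchall()[0][0]

writer.commit()
after = reader.execute("SELECT count(*) FROM people").fetchall()[0][0]
```

The reader counts zero rows before the commit and one row after it.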
Consistency
- Saved data must fit defined rules
- Never allowed to not fit rules
- One good state to another
- Rules can be programs
- Does not guarantee correct data
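
A small sketch of rules being enforced on save, again using SQLite only as a convenient stand-in. The `CHECK (age >= 0)` rule is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " age INTEGER CHECK (age >= 0))"   # hypothetical rule: no negative ages
)
conn.execute("INSERT INTO people VALUES (1, 'Sam', 32)")

try:
    conn.execute("INSERT INTO people VALUES (2, 'Ron', -5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True   # the record that breaks the rule is never stored

# This row fits every rule, yet the age is a typo for 28: consistent, not correct.
conn.execute("INSERT INTO people VALUES (3, 'Abigail', 82)")
count = conn.execute("SELECT count(*) FROM people").fetchone()[0]
```

The database refuses the negative age but happily stores the mistyped one, which is the sense in which consistency does not guarantee correct data.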
ACID
The gold standard for databases
- Atomicity
- Consistency
- Isolation
- Durability
Organization
- Database: top-level container
- Table and Record/Row
- Primary key is required
  - One or more columns
Example table (each line is a row; ID is the primary key column):
  ID  Name      Age
  1   Sam       32
  2   Abigail   28
  3   Ron       23
  4   Jennifer  47
Indexes
- Quick access to data and ranges
- Usually implemented as a B-tree
  - Lists data in order
  - Search is log(n)
  - 1 million records -> 6 steps
  - Easy access to next and previous
- Multiple indexes for a single table
- Can take up more space than the data
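
The ideas above can be sketched with a sorted list standing in for an index. This is not a B-tree, only an in-order structure searched the same logarithmic way. The `tree_levels` helper is hypothetical, and reading "6 steps" as a fan-out of 10 keys per node is an assumption; real B-tree nodes usually hold far more keys, so the tree is even shallower.

```python
import bisect

ages = sorted([32, 28, 23, 47])        # an index keeps its keys in order

pos = bisect.bisect_left(ages, 28)     # binary search for one key
previous_key = ages[pos - 1]           # neighbours are adjacent in the index
next_key = ages[pos + 1]

# Range lookup: every key from 25 to 40 inclusive.
in_range = ages[bisect.bisect_left(ages, 25):bisect.bisect_right(ages, 40)]

def tree_levels(records, fanout):
    """Levels needed before fanout**levels covers the given number of records."""
    levels, reach = 0, 1
    while reach < records:
        reach *= fanout
        levels += 1
    return levels

steps_fanout_10 = tree_levels(1_000_000, 10)
steps_binary = tree_levels(1_000_000, 2)
```

With a fan-out of 10 a million records need 6 levels; a plain binary search over the same data needs 20 comparisons.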
Drivers
- Specific to the database software
- API to connect to and use the database
- Multiple programming languages
- Can allow network connections
FileMaker Pro
FileMaker Pro
- Includes both app and DB layers
- Create forms without developers
- Relational, but not SQL
- Less programming, more clicking
- Frustrates many SQL developers
- Suitable for smaller data sets (10K)
Relational Databases
A.K.A. SQL Databases
Structured Query Language (SQL)
- The most common form of database
- Standardized, but many dialects
- Declarative language
Examples:
  SELECT id, name FROM people
    columns to return: id, name; table: people
  SELECT id, name FROM people WHERE age > 21
    columns to return: id, name; table: people; limits: age > 21
  SELECT count(*) FROM people WHERE age > 21
    columns to return: count(*); table: people; limits: age > 21
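
The example queries can be run against the sample table from the Organization slide. SQLite is used here only because it ships with Python; the `ORDER BY id` clauses and the extra `age > 30` query are additions for the demo, since everyone in the sample table is over 21.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "Sam", 32), (2, "Abigail", 28), (3, "Ron", 23), (4, "Jennifer", 47)],
)

# Columns to return, then the table.
everyone = conn.execute("SELECT id, name FROM people ORDER BY id").fetchall()

# Same query with a limit on which rows qualify.
adults = conn.execute(
    "SELECT id, name FROM people WHERE age > 21 ORDER BY id"
).fetchall()

# Aggregation happens in the database; only one number comes back.
over_30_count = conn.execute(
    "SELECT count(*) FROM people WHERE age > ?", (30,)
).fetchone()[0]
```

Each statement says what result is wanted, not how to find it; the database decides whether to scan the table or use an index.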
Schemas
- Tables are rigidly defined
- Columns each take one data type
- Data storage can be very efficient
- Types:
  - Strings (fixed-width and varchar)
  - Ints, floats, bytes
  - Vary by vendor
Joins
- Data matched between tables
    SELECT person.name, phone.number
    FROM person, phone
    WHERE person.id = phone.person_id
- Returns only where data matched
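
A runnable sketch of the join above, using SQLite as a stand-in. The sample rows and phone numbers are invented, and the `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE phone (person_id INTEGER, number TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(1, "Sam"), (2, "Abigail"), (3, "Ron")],
)
conn.executemany(
    "INSERT INTO phone VALUES (?, ?)",
    [(1, "555-0100"), (1, "555-0101"), (3, "555-0102")],   # Abigail has no phone
)

rows = conn.execute(
    "SELECT person.name, phone.number "
    "FROM person, phone "
    "WHERE person.id = phone.person_id "
    "ORDER BY person.name, phone.number"
).fetchall()
```

Sam appears twice (two phones) and Abigail not at all, because this form of join returns only the rows where data matched.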
Replication
- Data is copied to multiple servers
- Available even with downed servers
[Diagram: data copied to servers A and B, with server A down]
MySQL Replication
- Master-slave replication
  - One master that allows changes
  - Tree of slaves that allow reads
- Near-line server or load balancing
- Slaves slightly behind master
[Diagram: writes go to the master; reads go to the slaves]
Vendors        License        Features
Oracle MySQL   GPL            Well supported; Oracle backing
MariaDB        GPL            Many table types; more experimental
PerconaDB      GPL            Takes from Oracle and MariaDB
PostgreSQL     BSD            Standards based; JSON columns
SQLite         Public domain  Small and ubiquitous
Scale Up vs. Scale Out
Scale Up
- A few cheap computers have more aggregate power than a single expensive one.
- Take advantage of hardware progress
  - CPU speed
  - CPU cores/multiple CPUs
  - Memory increase
  - SSDs
  - Faster networking
Scale Out - Facebook
[Diagram: PHP servers, query routers, and databases partitioned A - G, H - R, S - Z]
Scale Out
Pros:
- More CPU, memory, and storage
- Fits well with cloud servers
Cons:
- Coordinating servers costs time
- Cluster can partially fail
  - Single server failure
  - Network outages
- Complexity
CAP Theorem
Choose two:
- Consistency: all nodes see the same thing
- Availability: always get success or failure
- Partition tolerance: handles node and network errors
No such thing as CA: a distributed system cannot opt out of partitions
Partitioning
- Servers each take one part of a table
- Data routed to the proper server
[Diagram: keys A - K stored on server A, keys L - Z stored on server B]
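
Routing by key range can be sketched in a few lines. The two ranges follow the A - K / L - Z split on the slide; the server names, the `route` function, and the choice to partition on the first letter of a name are assumptions made for the example.

```python
import bisect

# Hypothetical layout: each server owns first letters up to its upper bound.
UPPER_BOUNDS = ["K", "Z"]
SERVERS = ["server-A", "server-B"]

def route(name):
    """Pick the server that owns the range containing this key."""
    first = name[:1].upper()
    if not "A" <= first <= "Z":
        raise ValueError(f"cannot route key {name!r}")
    return SERVERS[bisect.bisect_left(UPPER_BOUNDS, first)]

placement = {name: route(name) for name in ["Abigail", "Karl", "Ron", "Sam"]}
```

Range partitioning keeps neighbouring keys together, which helps range queries but can overload one server if the keys cluster. Hash-based schemes spread the load more evenly and give up the ordering.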
Map-Reduce
- Function run on every record (Map)
  - Can filter or manipulate records
- Reduce function run to aggregate
  - First run on each server
  - Those results are then aggregated
- Used on enormous data sets
- Results are stored in a table
- Hadoop and Apache Spark
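
The two-stage aggregation described above can be imitated in a single process. This is a toy model, not Hadoop or Spark code: the `servers` dictionary stands in for partitions held on separate machines, and the question asked (average age of people over 25) is invented.

```python
from functools import reduce

# Stand-in for records partitioned across two servers.
servers = {
    "server-1": [{"name": "Abigail", "age": 28}, {"name": "Jennifer", "age": 47}],
    "server-2": [{"name": "Ron", "age": 23}, {"name": "Sam", "age": 32}],
}

def map_record(record):
    """Map: filter and reshape one record into a (count, total_age) pair."""
    return (1, record["age"]) if record["age"] > 25 else (0, 0)

def reduce_pair(a, b):
    """Reduce: combine two partial results into one."""
    return (a[0] + b[0], a[1] + b[1])

# Reduce runs first on each server over that server's mapped records...
partials = [
    reduce(reduce_pair, map(map_record, records), (0, 0))
    for records in servers.values()
]
# ...and then the small per-server results are aggregated.
count, total_age = reduce(reduce_pair, partials, (0, 0))
average_age = total_age / count
```

Only the small `(count, total_age)` pairs cross the network, which is what "move the problem, not the data" looks like at scale.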
NewSQL
NewSQL
- SQL language with multiple servers
- Very different approaches/strengths
- Usually a subset of SQL
NewSQL           Features                                   Trade-off
Percona Cluster  Replication; complete SQL; SQL-fast reads  Writes are slower; full copies
Clustrix         Map-Reduce like speed; most SQL            Slower on most queries
MySQL Cluster    Partitioning; replication; very fast       Very limited joins
Document Databases
JSON Databases
- Usually not ACID
  - No multi-record transactions
  - Atom is usually one record
- Partitioning and replication
  - Can survive failures
- Tables are key + JSON value
  - No schema, so records can be mixed
- Settings for speed vs. safety
JSON
- JavaScript Object Notation
- Strings, numbers, arrays, and dicts
- More types: binary, files, time, etc.
- Complex data without schemas
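
A "key + JSON value" table can be imitated with a dictionary. This is a toy model of the idea, not the API of MongoDB, RethinkDB, or any other product; `put`, `get`, and the sample documents are made up.

```python
import json

table = {}   # stand-in for one table: primary key -> JSON text

def put(key, document):
    """Store a whole document; the single record is the atom."""
    table[key] = json.dumps(document)

def get(key):
    return json.loads(table[key])

# No schema: records in the same table can have different shapes.
put(1, {"name": "Sam", "age": 32})
put(2, {
    "name": "Abigail",
    "phones": ["555-0100", "555-0101"],    # array
    "address": {"city": "Oakland"},        # nested dict
})

with_phones = [key for key in table if "phones" in get(key)]
```

Nested arrays and dicts replace what would be separate joined tables in a relational design; the cost is that nothing stops two records from disagreeing about structure.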
JSON Databases   Features
MongoDB          Map-Reduce; hash sharding
RethinkDB        Map-Reduce; joins; changefeeds
Cassandra        High speed
Apache CouchDB   HTTP data interface; eventual consistency
Other Types
- Timeseries
  - Looking for patterns in time
  - Needle in haystack
- In memory
  - Very fast but small datasets
  - Limited queries
- Graph
  - Store relations between records
  - Friend-of-a-friend type problems
Conclusion
           Use Cases                        Limitations
SQL        Default; well structured data    No partitioning; limited size; rigid structure
NewSQL     Small queries on large data      Limits on queries
MapReduce  Huge data                        Difficult to use; batch operation; not ACID
JSON       Large data; evolving structure   Not ACID
Questions?
Karl Kuehn, Automation Engineer, RethinkDB