From 1000/day to 1000/sec



From 1000/day to 1000/sec
The evolution of our big data system
Yoav Cohen, VP Engineering

This Talk: A walk-through of how we built our big-data system

About Incapsula
Vendor of a cloud-based Application Delivery Controller:
> Web Application Firewall
> Load Balancing
> CDN & Optimizer
> DDoS Protection

How does it work?

Modeling Web Traffic
1. The first request to a website starts a new session
2. Subsequent requests are part of the same session
3. After being idle for 30 minutes, the session ends

    Session 1 starts     10:03:01  GET www.incapsula.com/
    Session 1 request 1  10:03:10  GET www.incapsula.com/ddos
    Session 1 request 2  10:03:12  GET www.incapsula.com/cdn
    Session 1 ends
    Session 2 starts     10:35:05  GET www.incapsula.com/signup
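
A minimal sketch of these three rules (not Incapsula's actual code; keying open sessions by visitor id is an assumption):

    import java.util.HashMap;
    import java.util.Map;

    class Sessionizer {
        private static final long IDLE_TIMEOUT_MS = 30 * 60 * 1000;
        // Hypothetical keying: one open session per visitor id.
        private final Map<String, Session> open = new HashMap<>();
        private long nextSessionId = 1;

        static final class Session {
            final long id;
            long lastSeenMs;
            int requestCount;
            Session(long id, long ts) { this.id = id; this.lastSeenMs = ts; }
        }

        /** Returns the session the request belongs to, opening one if needed. */
        Session onRequest(String visitorId, long timestampMs) {
            Session s = open.get(visitorId);
            if (s == null || timestampMs - s.lastSeenMs > IDLE_TIMEOUT_MS) {
                // No open session, or idle for over 30 minutes: start a new one.
                s = new Session(nextSessionId++, timestampMs);
                open.put(visitorId, s);
            }
            s.lastSeenMs = timestampMs;
            s.requestCount++;
            return s;
        }
    }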

The Data
A stream of messages in Google Protobuf format:

    msgid: 144021710000000001
    type: SESSION_MESSAGE_CREATE
    siteid: 7
    starttime: 1409578192017
    clientip: ******
    countrycode: "US"
    entryurlid: 5544402418256865164
    visitorid: "7e59c804-f663-4595-a0df-35d9b02eb747"
    useragent: "Incapsula Site Monitor - OPS"
    visitorclappid: 209
    requeststarttime: 1410004769258
    responsestarttime: 1410004769258
    responseendtime: 1410004769261
    sessionid: 151009030147748952
    urlid: 5544402418256865164
    request_id: 567472919066130553
    querystring: ""
    postbody: ""
    statuscode: 200
    serialnumber: 1
    content_length: 6350
    protocol: HTTP
    requestresult: REQ_CACHED_FRESH...

The Problem
Transforming the stream of messages into readable data:
> Processing throughput
> Read performance
> Scalability
[diagram: the session/request message stream from the previous slide]

Architecture

System Evolution: Gen 1 (2010-2011) -> Gen 2 (2011-2013) -> Gen 3 (2013-2015) -> Gen 4 (2015)

Gen 1: Code Name rtproc

Gen 1: OLAP Cube
A textbook solution:
> Dimensions: Time x IP x Country x ...
> Counters: # requests, # attacks, ...
> Slice and dice to answer any question (how many attacks from Germany in Jan 2010?)

    select sum(number_of_attacks)
      from Attacks
     where site_id = 140
       and country_code = 'DE'
       and time > 20100100 and time < 20100200

Gen 1: OLAP Cube
Loading the data for individual attacks requires joins.

Gen 1: Analysis
> Generic solution
> Very big tables
> Overly complex (lots of moving parts)
[scorecard: Processing, Read, Scalability]

System Evolution: Gen 1 (2010-2011) -> Gen 2 (2011-2013) -> Gen 3 (2013-2015) -> Gen 4 (2015)

Gen 2: Code Name rtprocng
Main problems to solve:
> Read performance
> Simplify
New approach:
> Count things on the edge instead of centrally
> NoSQL model to improve read performance (no joins)

Gen 2: Simpler Design

Gen 2: Stats NoSQL Storage
> One document per day, containing all the data to build the charts
> Read performance improved (one lookup for all charts)
> Can even load only parts of the data (a MongoDB feature)

    {"_id" : "7_09-04-2014",
     "pageviews" : [ NumberLong(2369), NumberLong(2380), NumberLong(2520),
                     NumberLong(5651), NumberLong(2912), NumberLong(3357),
                     NumberLong(3723), NumberLong(3301), NumberLong(3092),
                     NumberLong(2984), NumberLong(3791), NumberLong(3069) ],
     "humsess" : [ NumberLong(213), NumberLong(258), NumberLong(298),
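
A sketch of that one-lookup chart read with the modern MongoDB Java driver; the database and collection names are assumptions, and the $slice projection stands in for the partial-load feature the slide alludes to:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.model.Projections;
    import org.bson.Document;

    class ChartReader {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost")) {
                // "rtprocng" and "stats" are hypothetical names; "_id" and
                // "pageviews" come from the document shown above.
                Document day = client.getDatabase("rtprocng")
                        .getCollection("stats")
                        .find(new Document("_id", "7_09-04-2014"))
                        // $slice: fetch only the first 12 buckets of the series
                        .projection(Projections.slice("pageviews", 0, 12))
                        .first();
                System.out.println(day);
            }
        }
    }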

Gen 2: Events NoSQL Storage
> One document per session, containing all its actions
> Lookups are easy (no joins)
> Searches use MongoDB indexes (OK, but not great)

    { "_id": 226000330131098770,
      "start": { "$date": "2014-09-09T10:19:00Z" },
      "cc": ["CA"],
      "securityflags": ["rid4"],
      "badbot": true,
      "prxy": [226],
      "clappt": 1,
      "actns": [ { "reqres": 10, "u": "www.incapsula.com/",
                   "attack": [ { "loc": 1, "acode": 0, "act": 7, "rid": 4,
                                 "more": 0, "atype": 314, "hidden": false,
                                 "match": "", "pval": "" }...

Gen 2: Python Processor
Batch process:
> Process the files in the directory for up to X minutes
> Flush to storage and exit
How to achieve good processing throughput? (see the sketch below)
> Cache objects in memory
> When processing messages, update the object in memory
> When the process finishes, flush all the objects from memory to storage
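
The original processor was Python; here is the same batch-and-flush pattern sketched in Java (the language used for all examples on this page), with hypothetical readMessage/flushToStorage helpers:

    import java.util.HashMap;
    import java.util.Map;

    class BatchProcessor {
        record Message(long sessionId) {}

        static final class SessionState {
            final long id;
            int updates;
            SessionState(long id) { this.id = id; }
            void apply(Message m) { updates++; }   // fold the message into the object
        }

        private final Map<Long, SessionState> cache = new HashMap<>();

        void run(long budgetMinutes) {
            long deadline = System.currentTimeMillis() + budgetMinutes * 60_000;
            Message msg;
            // Process messages for up to X minutes; storage is never touched here.
            while (System.currentTimeMillis() < deadline && (msg = readMessage()) != null) {
                cache.computeIfAbsent(msg.sessionId(), SessionState::new).apply(msg);
            }
            // One flush at the end, then the process exits.
            cache.values().forEach(this::flushToStorage);
            cache.clear();
        }

        Message readMessage() { return null; }                 // stub: next message from the input files
        void flushToStorage(SessionState s) { /* write to MongoDB */ }
    }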

Gen 2 Storage Bottleneck
Single DB for all sessions. Reality check:
> MongoDB coarse-grained locking (one lock per DB server)
> When the batch process flushes, UIs get stuck (the lock prefers writes)
> Dropping old data was impossible
> Fragmentation caused excessive disk usage

Gen 2 Storage Re-Factoring
> Single DB -> DB per day
  - Drop DBs that are X days old (sketched below)
> Live sessions -> Live DB, dead sessions -> per-day DB
  - 0% fragmentation in the per-day DBs
  - Daily maintenance of the Live DB (but it's relatively small)
> DB locking not resolved (later MongoDB versions have a lock per DB)
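
A sketch of the per-day scheme: writes go to a database named after the day, and retention is just dropping whole databases, which frees their files outright (so per-day DBs never fragment). The naming convention is hypothetical; only the scheme itself comes from the slide:

    import com.mongodb.client.MongoClient;
    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;

    class Retention {
        static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy_MM_dd");

        // Hypothetical naming: e.g. sessions_2014_02_03
        static String dbFor(LocalDate day) {
            return "sessions_" + day.format(FMT);
        }

        // Dropping a whole DB is cheap; no per-document deletes needed.
        static void dropOlderThan(MongoClient client, int days) {
            LocalDate cutoff = LocalDate.now().minusDays(days);
            for (String name : client.listDatabaseNames()) {
                if (name.startsWith("sessions_")
                        && LocalDate.parse(name.substring(9), FMT).isBefore(cutoff)) {
                    client.getDatabase(name).drop();
                }
            }
        }
    }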

Gen 2: Analysis
> Simple and scalable
> MongoDB is easy to get started with, but TCO increases over time
> Reached the limits of batch processing
[scorecard: Processing, Read, Scalability]

System Evolution: Gen 1 (2010-2011) -> Gen 2 (2011-2013) -> Gen 3 (2013-2015) -> Gen 4 (2015)

Gen 3: Code Name Graceland
Main problems to solve:
> Faster, online processing
> Better search capabilities
New approach:
> Multi-threaded Java-based processor:
  - Faster protobuf library than Python's
  - Keep objects in memory for longer periods and reduce flushes to storage
> Lucene for search
> A DB we can understand and control

Gen 3: Design

Gen 3: Multi-Threaded Java Processor
> One reader thread reads the files and distributes the data between the workers
> Workers process the data:
  - Load the object from cache
  - If not in cache, load it from storage
  - Update the object
  - Flush to storage, periodically or on certain events
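
A sketch of the one-reader / N-workers layout using a bounded queue; the slide describes the design, not this exact code (a real distributor might hash the session id to a per-worker queue so each session stays on one worker, but that detail is an assumption):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    class Pipeline {
        record Message(long sessionId) {}

        // Bounded queue: the reader blocks instead of outrunning the workers.
        private final BlockingQueue<Message> queue = new ArrayBlockingQueue<>(10_000);

        void startWorkers(int n) {
            for (int i = 0; i < n; i++) {
                Thread t = new Thread(() -> {
                    try {
                        while (true) {
                            Message m = queue.take();   // wait for the reader thread
                            process(m);                 // load from cache (or storage), update, flush
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                t.setDaemon(true);
                t.start();
            }
        }

        // The single reader thread: read the files, hand messages to workers.
        void read(Iterable<Message> input) throws InterruptedException {
            for (Message m : input) {
                queue.put(m);
            }
        }

        void process(Message m) { /* see the worker steps above */ }
    }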

Gen 3: Cache Design
> Design goal: a large cache, but not all of it in the JVM heap
> Layered LRU cache (extends LinkedHashMap)
> One layer is the map, with a backing layer on tmpfs or disk
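
A minimal sketch of that layered cache, assuming the backing layer can be modeled as a Map (the real one sat on tmpfs or disk):

    import java.util.LinkedHashMap;
    import java.util.Map;

    class LayeredCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxHeapEntries;
        private final Map<K, V> backingLayer;   // tmpfs- or disk-backed in reality

        LayeredCache(int maxHeapEntries, Map<K, V> backingLayer) {
            super(16, 0.75f, true);             // accessOrder = true gives LRU order
            this.maxHeapEntries = maxHeapEntries;
            this.backingLayer = backingLayer;
        }

        // Called by LinkedHashMap on every put: demote the LRU entry
        // to the backing layer instead of losing it.
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            if (size() > maxHeapEntries) {
                backingLayer.put(eldest.getKey(), eldest.getValue());
                return true;
            }
            return false;
        }

        // On a heap miss, fall back to the backing layer (a real cache
        // would also promote the entry back into the heap layer).
        @Override
        public V get(Object key) {
            V v = super.get(key);
            return v != null ? v : backingLayer.get(key);
        }
    }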

Gen 3 Stats Storage ("Segmented Storage")
> Binary file per day
> Keep recent files separate, archive older files

    2014-02-03   2014-02-03.pbz       0   14325654845
    2014-02-02   2014-02-02.pbz       0   14326542128
    2014-02-01   2014-02-01.pbz       0   14325654845
    2014-01-31   archive.pbz      76515   14325654845
    ...
    2014-01-01   archive.pbz          0   14365428845

Gen 3 Stats Storage ("Segmented Storage")
> Files are served via nginx
> Clients keep a cache

Gen 3 Events Storage
Tried different DBs:
> Lucene
  - Storing the raw session data inside the Lucene index
  - Index memory footprint grew (all the session data got memory-mapped)
> LevelDB, KyotoCabinet
  - Couldn't get these to work reliably
> Cassandra
  - Rule of thumb: if your DB has its own conference, you need a DBA
  - We felt it was easier to write our own than to read the docs

Gen 3 Events Storage ("Indexing Partition")
A partition (directory) per day, containing:
> A Lucene index of sessions
> A big file with the sessions in it (sketched below)
Same approach as in Gen 2 for live sessions:
> Live sessions -> live partition
> Dead sessions -> per-day partitions
> 0% fragmentation
> Complicates searching a bit
> Live partitions require cleanup or re-building
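
A sketch of writing one partition: searchable fields go into the Lucene index, the session body is appended to the big file, and the index stores only its offset and length. The field names and the BigFile interface are assumptions:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.LongPoint;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;

    class PartitionWriter {
        private final IndexWriter index;       // Lucene index of the partition
        private final BigFile sessions;        // append-only file of session blobs

        PartitionWriter(IndexWriter index, BigFile sessions) {
            this.index = index;
            this.sessions = sessions;
        }

        void add(long sessionId, String countryCode, byte[] sessionBlob) throws Exception {
            long offset = sessions.append(sessionBlob);
            Document doc = new Document();
            doc.add(new StringField("cc", countryCode, Field.Store.NO)); // searchable
            doc.add(new LongPoint("sid", sessionId));                    // searchable
            doc.add(new StoredField("off", offset));                     // where the blob lives
            doc.add(new StoredField("len", sessionBlob.length));
            index.addDocument(doc);
        }

        interface BigFile { long append(byte[] blob) throws Exception; }
    }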

Gen 3 Events Storage ("Indexing Partition")
Searches are more efficient:
> Search requests are served directly from the index
> Session data is loaded only on demand, via nginx using the HTTP Range header
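
A sketch of the on-demand load: a search hit carries (offset, length), and the client asks nginx for just that slice of the big file. The URL layout is an assumption:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    class SessionFetcher {
        static byte[] fetch(String day, long offset, long length) throws Exception {
            // Hypothetical URL layout; only the Range mechanism is from the slide.
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("http://stats.example.internal/" + day + "/sessions.bin"))
                    // bytes=first-last, inclusive on both ends
                    .header("Range", "bytes=" + offset + "-" + (offset + length - 1))
                    .build();
            HttpResponse<byte[]> resp = HttpClient.newHttpClient()
                    .send(req, HttpResponse.BodyHandlers.ofByteArray());
            return resp.body();                 // nginx answers 206 Partial Content
        }
    }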

Gen 3: Analysis
> Good processing throughput
> Good read performance
> Running into JVM issues (big heap)
[scorecard: Processing, Read, Scalability]

System Evolution: Gen 1 (2010-2011) -> Gen 2 (2011-2013) -> Gen 3 (2013-2015) -> Gen 4 (2015)

Gen 4: 2015
> Based on Gen 3
> Distribute the work to more than one system:
  - One data server in each POP (20+ POPs)
  - Each POP processes and stores its own data
  - Upload processed outputs to central servers, or search across all POP servers (see the sketch below)
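
A sketch of the search-on-all-POPs path: fan the query out to every POP's data server and merge the results centrally. The names and the merge step are assumptions; the slide only states the topology:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class ScatterGather {
        static List<String> search(List<String> popServers, String query) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(popServers.size());
            try {
                List<Callable<List<String>>> calls = new ArrayList<>();
                for (String pop : popServers) {
                    calls.add(() -> queryPop(pop, query));   // each POP searches its own data
                }
                List<String> merged = new ArrayList<>();
                for (Future<List<String>> f : pool.invokeAll(calls)) {
                    merged.addAll(f.get());                  // the central server just merges
                }
                return merged;
            } finally {
                pool.shutdown();
            }
        }

        static List<String> queryPop(String pop, String query) { return List.of(); } // stub
    }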

Summary
> Understanding how your system works is as important as understanding every other aspect of your business.
> At some point we realized it's better for us to build our software from scratch than to use off-the-shelf products as black boxes:
  - We need to find people who know the products
    (which is crazy, since we tried tons of them over the last 4 years)
  - We usually have fewer requirements
    (who needs multi-DC replication from day 1?)
  - We prefer writing the code to reading documentation and Stack Overflow threads
    (then we can hack it in the middle of the night if needed, and it's way more fun, at least for the developers)

Questions?

Types of Data
Statistics: just numbers, used for charts, billing, etc.

Types of Data
Events: in-depth information, used for forensics and research