Big Data and CDN
Pavlo Baron
www.pbit.org pb@pbit.org @pavlobaron
What is Big Data
Big Data describes datasets that grow so large that they become awkward to work with using on-hand database management tools (Wikipedia)
Huh?
Somewhere a mosquito coughs
and somewhere else a data center gets flooded with data
Huh???
More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) get shared each month on Facebook
Aha
Twitter users are, in total, tweeting an average of 55 million tweets a day, links included
OMG!
But there is much more: cameras, sensors, RFID, logs, geolocation, GIS and so on
kk
There are several perspectives on Big Data
Data storage and archiving
Data preparation
Live data provisioning
Data analysis / analytics
Real-time event and stream processing
Data visualization
Where does Big Data come from
Uncontrolled human activities in the world wide web, or Web 2.0 if you like
Huh?
Every human leaves a vast number of data marks on the web every day: intentionally, accidentally and unknowingly
Huh???
Intentionally: we blog, tweet, upload, flattr, link, etc
And: the web has become an industry of its own. With us in the thick of it
Accidentally: we are humans and we make mistakes
Unknowingly: we get tricked, misled, controlled, logged etc
The vast number of data marks we leave on the web every day gets copied, duplicated. Data explodes.
Panic!
Wait! There's even more!
Huh?
Data flowing on streams at a very high rate from many actors
Huh??
The amount of data flying over the air has become enormous, and it's growing unpredictably
Aha
It's no longer only nuclear reactors that have hi-tech sensors and generate tons of data
Aha
And our physically huge globe
has become a tiny electronic ball. It's completely wired. Data needs just seconds to circumnavigate the world
OMG!
But there's even more!
Huh?
Laws and regulations force us to store and archive all sorts of data, and there is more and more of it
Human knowledge grows extremely fast. It's far too gigantic for one single brain
Oh no
And there s still more!
Huh?
Big Brother Big Data. We get observed, filmed, recorded, logged, geolocated etc.
Panic!
Don't panic. Get over it. Brace yourself for the battle.
First of all, some major changes have happened
Instead of huge expensive cabinets
we can use lots of cheap commodity hardware
Physics hit the wall
and we need to think parallel
Our physically huge globe
has become a tiny electronic ball. It's completely wired
Spontaneous requirements
can be covered by the fog (aka cloud)
And what are my weapons
Cut your data in smaller pieces
Make those pieces bitesize (manageable)
Bring the data closer to those who need it
Bring the data closer to where it s physically accessed
Give up relations where you don t need them
Give up actuality where you don t need it
Find optimal and effective replication mechanisms
Consider latency an adjustment screw if you can
Consider availability an adjustment screw if you can
Be prepared to deal with an unlimited amount of data, depending on the perspective
Know your data
Know your volumes
Know your scenarios
Consider it what it is: a science
Right tool for the job
kk
And how does this technically work
Live data provisioning
What s the problem
Your users are widely spread, maybe all over the world
And you own Big Data, which has many facets: geographic, financial etc.
And your classic silo architecture could break under the weight of such data
And why would I need that
You are starting out and want to become one of those. Aha, ok
You simply grew to a level where
you need to segment your users and thus be faster and more reliable across locations,
keep your servers free of load and thus avoid bottlenecks,
and cut your big data into smaller, more manageable chunks
What are my weapons
If your content is static in web terms, you are already well prepared
In many cases, you can make your dynamic data static (precompute content)
Huh?
Let s take a look at an online bookstore
Hey, the online bookstore is completely dynamic (except images), it's a shop system!
Really?
Book description page: even when you modify the prices and offer Web 2.0 features such as ratings, you can still pre-compute the page at some point in time; you don't need to compute the content while the page is being accessed
Browse mode: this is a classic use case for static content precomputation. There is often simply no need to navigate through dynamically built paths
Book search: even this ultimately dynamic-sounding feature can be (partially) de-dynamized. Consider the index as static content, not necessarily the data itself
You see: many parts of an online bookstore seem dynamic but can actually be pre-computed and delivered as static content in web terms. It's all about the frequency of change and the big data pain
Owning big data doesn't necessarily mean owning 100% dynamic data in web terms
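The pre-computation idea above can be sketched in a few lines: render each book description page once, at publish time, instead of on every request. All names here (the book list, the template, the output paths) are illustrative assumptions, not a real shop system.

```python
# Sketch: pre-compute static HTML for book description pages.
# The data, template and paths are made up for illustration.
import os

BOOKS = [
    {"id": 1, "title": "Big Data Basics", "price": "29.90", "rating": 4.5},
    {"id": 2, "title": "CDN Internals", "price": "39.90", "rating": 4.1},
]

TEMPLATE = """<html><body>
<h1>{title}</h1>
<p>Price: {price} EUR</p>
<p>Rating: {rating} / 5</p>
</body></html>"""

def precompute(books, out_dir="static_pages"):
    """Render every book page once, at publish time, not per request."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for book in books:
        path = os.path.join(out_dir, f"book_{book['id']}.html")
        with open(path, "w") as f:
            f.write(TEMPLATE.format(**book))
        paths.append(path)
    return paths

paths = precompute(BOOKS)
```

Re-run the job whenever prices or ratings change; between runs, the pages are plain static files any web server (or CDN) can serve.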
Aha
And now distribute it with a CDN (content delivery network)
Huh?
Akamai web traffic dominance
Akamai web traffic monitoring
Akamai EdgePlatform
73,000 servers, 70 countries, 1,000 networks, 30% of the world's web traffic (OMG, is the rest Google?)
There are several CDN providers offering such (worldwide) infrastructures
And now let s get a little insane
Huh???
Yeah, something's going on behind the scenes
How does this technically work
A CDN is like a deputy. You make a contract, and it takes over parts of your platform. From there, it delivers to your users the content you tell it to deliver, but it is much closer to them and much more intelligent than you when it comes to managing the load
Huh?
A CDN has its own infrastructure, including actual nodes directly at the backbones, offering web caching, server load balancing, request routing and, built upon these techniques, content delivery services
Aha
What you have seen earlier: based on the IP address of the machine (origin) which made the DNS A query, the DNS server of the CDN decided each time to return a different IP address, e.g. one from the same geographic region
Aha
What you can now expect is that the returned IP address leads you to a load balancer: your gate to a whole sub-infrastructure of the CDN, which balances between web caches, web servers or similar
Aha
A CDN uses different algorithms to decide where to route user requests: based on current load, cost, location etc.
Aha
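The routing decision just described can be sketched as a toy function: prefer edge nodes in the user's region, and among those pick the least loaded one. The node names, regions and load numbers are invented; a real CDN also weighs cost, network distance and more.

```python
# Toy sketch of CDN-style request routing: region first, then load.
# All node data is made up for illustration.
EDGE_NODES = [
    {"name": "edge-eu-1", "region": "eu", "load": 0.7},
    {"name": "edge-eu-2", "region": "eu", "load": 0.2},
    {"name": "edge-us-1", "region": "us", "load": 0.4},
]

def route(user_region, nodes=EDGE_NODES):
    """Prefer nodes in the user's region; among those, take the least loaded.
    If no node serves that region, fall back to the globally least loaded one."""
    candidates = [n for n in nodes if n["region"] == user_region] or nodes
    return min(candidates, key=lambda n: n["load"])

best = route("eu")   # picks edge-eu-2: same region, lowest load
```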
But in the end, your content gets delivered to the user. If it expires, the CDN refreshes it from your servers in the background
Often, you still have to serve the last mile yourself: the very last database access, e.g. the last item view or similar. Here, the user hits your server
Huh?
[Diagram: the user's DNS A query, cache access against CDN cache IPs (e.g. 1.2.3.4, 5.6.7.8, 10.2.3.40), inter-cache updates between the caches, and cache refresh from your servers (e.g. 50.6.7.80)]
kk
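The expire-and-refresh cycle described above can be sketched as a tiny edge cache with a TTL: fresh entries are served from the edge, expired ones are refetched from the origin ("your servers"). The class, the origin callback and the URLs are all made up for illustration.

```python
# Sketch of an edge cache with TTL-based expiry and origin refresh.
import time

class EdgeCache:
    def __init__(self, origin_fetch, ttl_seconds=60):
        self.origin_fetch = origin_fetch   # callback to your servers
        self.ttl = ttl_seconds
        self.store = {}                    # url -> (content, fetched_at)

    def get(self, url, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(url)
        if entry and now - entry[1] < self.ttl:
            return entry[0]                # fresh: serve from the edge
        content = self.origin_fetch(url)   # expired/missing: hit the origin
        self.store[url] = (content, now)
        return content

origin_hits = []
def origin(url):
    """Stand-in for your servers; records every time it is hit."""
    origin_hits.append(url)
    return f"content of {url}"

cache = EdgeCache(origin, ttl_seconds=60)
cache.get("/book/1", now=0)    # miss: origin is hit
cache.get("/book/1", now=30)   # still fresh: served from the edge
cache.get("/book/1", now=120)  # expired: origin is hit again
```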
How can I benefit from this having big data
When you have e.g. images as your big data, you can consider this data static and thus push down- and uploads to the CDN. So you segment your users and keep your own servers free of load. What you might lose is consistency between segments
Or you pre-compute static content out of your dynamic big data, a sort of snapshot, and push it to the CDN. So you keep your database servers free of load and scale only through the web servers. Complexity comes with the snapshot management
Or you can even push some functional parts of your platform, such as searches, to the CDN. You win a lot dealing with big data, but you are more dependent on the CDN provider, and your overall architecture is weaker
Or, if you really want to experiment, you can even try to push whole executed database queries to the CDN, like you would do with memory caches. That's really cool, but even much more complex and unreliable than a cluster-distributed memory cache
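The experimental idea above treats an executed query result as a cacheable object, keyed by a hash of the query text, the way a memory cache would. A minimal sketch, with a plain dict standing in for a real database backend (the SQL string and data are made up):

```python
# Sketch: cache executed query results, keyed by a hash of the SQL text.
# FAKE_DB stands in for a real database; everything here is illustrative.
import hashlib

FAKE_DB = {"SELECT title FROM books WHERE id=1": ["Big Data Basics"]}
query_cache = {}

def cache_key(sql):
    """Stable key for a query: hash of its normalized text."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()

def cached_query(sql):
    """Execute the query once; later calls with the same SQL hit the cache."""
    key = cache_key(sql)
    if key not in query_cache:
        query_cache[key] = FAKE_DB[sql]  # "execute" against the backend
    return query_cache[key]

rows = cached_query("SELECT title FROM books WHERE id=1")
```

The hard parts the slide warns about are exactly what this sketch omits: invalidating the cached result when the underlying data changes, and keeping such caches consistent across distributed CDN nodes.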
If you use CDN to collect your new data, you might need some complex replication mechanisms
Anyway, with all that in mind: you can have a lot of your big data out there with a CDN
Thank you
Most images were licensed from istockphoto.com Several images were taken from corresponding Wikipedia articles, product pages and open sources