Removing Failure Points and Increasing Scalability for the Engine that Drives webmd.com
Matt Wilson, Director, Consumer Web Operations, WebMD
@mattwilsoninc
9/12/2013

About this talk
- Go over original site architecture and challenges
  - How a request goes through the system
  - Caching
  - DB / NAS
- How the challenges were addressed
- Reasons why we picked the technology we did

About WebMD Technology
- WebMD, Medscape, MedicineNet, emedicine, UK cobrand
- Serving nearly 1 billion pageviews a month, 132 million unique visitors
- Running over 200 separate applications, the vast majority developed in-house
- Environments: Dev/Devint, QA01/02, QA00, Production/DR
- Two main data centers, geographically diverse
- OS: mix of Linux and Windows
- Datastores: SQL Server, Oracle, Mongo, Vertica and MySQL
- Web: mix of Apache and IIS
- App: mix of Tomcat, ASP, .NET 2.x - 4.x
- Service: ActiveMQ, Memcache,

Anatomy of a Request
Diagram: WebMD User, Layer 7 Switch, Load Balancer, WebMD Runtime Server, Runtime DB Server (Clustered), NAS
1. The user requests www.webmd.com/allergies.
2. The runtime server asks the runtime DB: "What assets/xml/xsl are associated with this URL?"
3. The DB answers: "Here you go sir." \\nasserver\blah\blah\blah.xml and \\nasserver\blah\blah\blah.xsl
4. The runtime server fetches the content from the NAS.
5. The NAS returns the content: Blah.xml, Blah.xsl.
6. The server processes the XML/XSL and returns the content to the user.

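
The request path above can be sketched in a few lines of Python. This is only an illustration, not WebMD's runtime code: the dictionaries standing in for the runtime DB and the NAS, the function names, and the placeholder `render` step (a real server would run an XSLT transform) are all assumptions. The call counter shows what an uncached request costs: one DB lookup and two NAS reads.

```python
# Hypothetical stand-in for the runtime DB: URL path -> asset locations.
RUNTIME_DB = {
    "/allergies": {
        "xml": r"\\nasserver\blah\blah\blah.xml",
        "xsl": r"\\nasserver\blah\blah\blah.xsl",
    }
}

# Hypothetical stand-in for the NAS: file path -> file contents.
NAS = {
    r"\\nasserver\blah\blah\blah.xml": "<page><title>Allergies</title></page>",
    r"\\nasserver\blah\blah\blah.xsl": "<xsl:stylesheet/>",
}

# Counts backend calls so the cost of one uncached request is visible.
calls = {"db": 0, "nas": 0}


def lookup_assets(url_path):
    """Ask the DB which XML/XSL assets belong to this URL."""
    calls["db"] += 1
    return RUNTIME_DB[url_path]


def fetch_from_nas(path):
    """Read one file from the NAS."""
    calls["nas"] += 1
    return NAS[path]


def render(xml_text, xsl_text):
    """Placeholder for the XML/XSL processing step (no real XSLT here)."""
    return f"rendered({len(xml_text)} chars xml, {len(xsl_text)} chars xsl)"


def handle_request(url_path):
    """Walk the full path: DB lookup, two NAS fetches, then render."""
    assets = lookup_assets(url_path)
    xml_text = fetch_from_nas(assets["xml"])
    xsl_text = fetch_from_nas(assets["xsl"])
    return render(xml_text, xsl_text)
```

Calling `handle_request("/allergies")` once leaves `calls` at one DB call and two NAS calls, which is the load the caching slides that follow try to avoid.
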
Where's the cache, bro?
- Page object is cached in the server's memory for 5 min
- Widgets (code snippets on the page) are cached for different durations: 60 min, 24 hours, 3 hours
- Widget caching is determined at page design time in the content publishing system
- The runtime system caches widgets on disk and/or in memory, which is configurable in the publishing system

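
A minimal sketch of per-object cache lifetimes, assuming a simple in-memory store. The class name, the widget names, and the mapping of widgets to lifetimes are hypothetical; only the durations themselves (5 min for pages; 60 min, 3 hours, 24 hours for widgets) come from the slide. The clock is injectable so expiry can be exercised without waiting.

```python
import time


class TTLCache:
    """In-memory cache where every entry carries its own time-to-live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (expires_at, value)

    def set(self, key, value, ttl_seconds):
        self._items[key] = (self._clock() + ttl_seconds, value)

    def get(self, key):
        """Return the cached value, or None if missing or expired."""
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]
            return None
        return value


PAGE_TTL = 5 * 60  # page objects: 5 minutes

# Hypothetical widget names; the lifetime would be chosen at page design time.
WIDGET_TTLS = {
    "news": 60 * 60,         # 60 minutes
    "tools": 3 * 60 * 60,    # 3 hours
    "footer": 24 * 60 * 60,  # 24 hours
}
```

With these settings a page object expires long before any of its widgets, so a page rebuild can reuse widgets that are still cached.
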
Caching, Caching, Caching
- How are existing cached objects updated?
- How are in-memory page objects updated?
- A background thread calls the NAS/DB for data and replaces the cached object for the widget and page (the cache pulls content)
- The background thread works from a queue-like data structure

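
The pull model above might look roughly like the following sketch. It is not the actual runtime implementation: the class and method names are invented, the cache is a plain dict, and `fetch` is a caller-supplied function standing in for the NAS/DB call. One worker thread takes keys off a queue and swaps in the fresh object, so request threads never wait on the backend.

```python
import queue
import threading


class BackgroundRefresher:
    """Refresh cached objects from a single background thread fed by a queue."""

    def __init__(self, cache, fetch):
        self._cache = cache    # plain dict: key -> cached object
        self._fetch = fetch    # callable standing in for the NAS/DB call
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def request_refresh(self, key):
        """Queue a key for refresh; returns immediately."""
        self._pending.put(key)

    def _run(self):
        while True:
            key = self._pending.get()
            if key is None:  # sentinel: shut down
                self._pending.task_done()
                return
            try:
                self._cache[key] = self._fetch(key)
            except Exception:
                pass  # on failure, keep serving the stale object
            finally:
                self._pending.task_done()

    def wait_idle(self):
        """Block until every queued refresh has been processed."""
        self._pending.join()

    def stop(self):
        self._pending.put(None)
        self._worker.join()
```

Because only one thread per server talks to the backend, refreshes are serialized, which is the "natural barrier" the later slides mention.
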
This is all fine and dandy until
Caching, Caching, Caching
- What's the problem with this method?
- What does this method protect?
- Is it good enough?

Caching, Caching, Caching
- The background thread queues calls to the NAS/DB for update requests, which creates a natural barrier against herding problems when new content arrives
- Not all web servers get the updated content at the same time
- Additional web servers mean more calls to the NAS/DB
- Individual web servers do not have the same cache
- A publishing event could take up to an hour to refresh content

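
To make "additional web servers mean more calls to the NAS/DB" concrete, here is a back-of-the-envelope model. It assumes every server refreshes every cached object once per TTL on its own, and the server and object counts in the usage note are made-up illustration values, not WebMD figures.

```python
def backend_calls_per_hour(web_servers, cached_objects, ttl_minutes):
    """Estimate NAS/DB refresh calls per hour under a per-server pull model.

    Assumes each server independently refreshes each object once per TTL.
    """
    refreshes_per_hour = 60 / ttl_minutes
    return web_servers * cached_objects * refreshes_per_hour
```

For example, 10 servers holding 1,000 objects with a 5-minute TTL give `backend_calls_per_hour(10, 1000, 5)` = 120,000 calls per hour; doubling the servers to 20 doubles that to 240,000, with no extra user traffic involved.
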
NAS Problem
- Problem constraints:
  - Still need a proven storage method
  - Ubiquitous protocol
- Solution constraints:
  - 200 apps to update (or not)
  - Use NAS as a backup method

What can replace a NAS Filestore?
- Does the solution need to provide an SMB / NFS interface?
- Can we use something else?
- Remember, we have 200 apps to update
- Looked at Scality, Cassandra, MongoDB, Couchbase

Why Couchbase?
- Memcached protocol already in use at WebMD
- Add servers to the cluster without client reconfiguration
- Support for hundreds of thousands of transactions per second
- Content stored in memory: fast set/get

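
For readers unfamiliar with the memcached protocol mentioned above, the sketch below shows what a `set` and a `get` look like on the wire in the memcached text protocol. It only builds and parses byte strings; it opens no connection and makes no claim about which client library or protocol variant WebMD used against Couchbase. The function names are invented for this example.

```python
def build_set(key, value, exptime=0, flags=0):
    """Build a memcached text-protocol 'set' command for a bytes value."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def build_get(key):
    """Build a memcached text-protocol 'get' command for one key."""
    return f"get {key}\r\n".encode("ascii")


def parse_get_response(raw):
    """Parse a single-key 'get' reply; return the value bytes or None on a miss."""
    if raw == b"END\r\n":
        return None
    header, rest = raw.split(b"\r\n", 1)
    # Header has the form: VALUE <key> <flags> <bytes>
    _value, _key, _flags, length = header.split(b" ")
    return rest[: int(length)]
```

A store of `blah.xml` content under a key is one short header line plus the payload, and a read is a single line, which is part of why set/get against an in-memory store is fast.
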
We have to fail over the DB Cluster
DB Problem
- Problem constraints:
  - Improve availability
  - Improve scalability
- Solution constraints:
  - No code updates
  - Needs to just work

Data Access Pattern
- "narrow-read workload, many copies on many nodes" (line borrowed from Theo)
- Database workload does not exceed server hardware
- All data is read only
- Load balancing works better than clustering in this pattern

DB Solution
- Read-only DB calls
- Peer-to-peer replication
- Publishing system writes to one SQL Server

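
Why load balancing suits this pattern can be shown with a small sketch: when every replica holds the same read-only data, any healthy node can answer any query, so a failed node is simply skipped instead of triggering a cluster failover. The class below is an illustration only (in production a load balancer would typically do this, not application code), and the replica names are hypothetical.

```python
import itertools


class ReadOnlyPool:
    """Round-robin over identical read-only replicas, skipping unhealthy ones."""

    def __init__(self, replicas):
        self._replicas = list(replicas)
        self._healthy = set(self._replicas)
        self._cycle = itertools.cycle(self._replicas)

    def mark_down(self, replica):
        self._healthy.discard(replica)

    def mark_up(self, replica):
        if replica in self._replicas:
            self._healthy.add(replica)

    def next_replica(self):
        """Return the next healthy replica, or raise if none are left."""
        for _ in range(len(self._replicas)):
            candidate = next(self._cycle)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy read replicas")
```

With a pool of `["sql-a", "sql-b", "sql-c"]`, marking `sql-b` down just removes it from rotation; reads continue on the other two, and capacity grows by adding another replica to the list.
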
Putting it all together
Diagram: Couchbase, Persistent Data Store, Read Only DB Requests, WebMD Web Servers, ActiveMQ, NAS, Publishing System
1. Content is written to DB / NAS and Couchbase. Couchbase gets the same content as the NAS. The DB has metadata about the content. All are part of a transaction.
2. The publishing system publishes object IDs.
3. The web server gets the object IDs off the queue.
4. The web server fetches the xml/xsl content from Couchbase.
5. The web server compiles the page and stores cache objects on disk and in Couchbase.

Add New Server
- New web server gets cache objects from Couchbase
- All web servers have the same cache

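
The push flow and the new-server case can be sketched together. This is a toy model under stated assumptions: a dict stands in for the Couchbase bucket, a `queue.Queue` stands in for ActiveMQ, the key prefixes and all names are invented, `compile_page` is a placeholder for the XML/XSL step, and the DB/NAS writes that belong to the same publishing transaction are left out.

```python
import queue

couchbase = {}              # stand-in for the shared Couchbase bucket
object_ids = queue.Queue()  # stand-in for the ActiveMQ queue


def publish(object_id, xml_text, xsl_text):
    """Publishing system: store the content, then announce the object ID."""
    couchbase[f"content:{object_id}"] = (xml_text, xsl_text)
    object_ids.put(object_id)


def compile_page(xml_text, xsl_text):
    """Placeholder for the real XML/XSL processing."""
    return f"compiled:{xml_text}|{xsl_text}"


class WebServer:
    def __init__(self, name):
        self.name = name
        self.local_cache = {}  # stand-in for the on-disk / in-memory cache

    def drain_queue(self):
        """Take object IDs off the queue, compile, and share the cache object."""
        while True:
            try:
                object_id = object_ids.get_nowait()
            except queue.Empty:
                return
            xml_text, xsl_text = couchbase[f"content:{object_id}"]
            page = compile_page(xml_text, xsl_text)
            self.local_cache[object_id] = page
            couchbase[f"cache:{object_id}"] = page

    def get_page(self, object_id):
        """Serve from the local cache, falling back to the shared cache object."""
        if object_id not in self.local_cache:
            self.local_cache[object_id] = couchbase[f"cache:{object_id}"]
        return self.local_cache[object_id]
```

After `publish("allergies", ...)` and one `drain_queue()` on an existing server, a freshly created `WebServer` can serve the page straight from the shared cache object, without touching the NAS or recompiling.
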
Was it worth it?
- Fixed the DB problem by using peer-to-peer replication and load balancing: no code changes
- Fixed the NAS problem by adding a caching layer to reduce calls to the NAS
- Replaced the cache pull model with a push model for the content publishing system: publishing now reaches all web servers in seconds, and all web servers have the same cached content
- Serving content is now faster: less latency
- Able to virtualize the web servers

Why was this Successful?
- Multi-disciplined team: Operations, Development, QA and Project Management
- Buy-in from senior management
- Creative solutions within constraints: resources, time, problem and solution
- Phased implementation

Questions? www.webmd.com/careers