GigaSpaces Real-Time Analytics for Big Data

GigaSpaces makes it easy to build and deploy large-scale real-time analytics systems

The rapidly increasing use of large-scale, location-aware social media and mobile applications is driving the need for scalable, real-time platforms that can handle streaming analysis and processing of massive amounts of data. Today, creating an analytics system for big data generally means collecting multiple technologies from various providers and building the system yourself. This presents challenges in performance, cost, scalability, real-time responsiveness, and more. GigaSpaces resolves these issues.

You need to handle massive amounts of data in real time, without losing data and at minimum cost. Most analytics systems are not designed for real time: it can take hours or days before the impact of an event appears in reports and you can act on it. The challenge becomes even greater as events are gathered from more sources at significantly higher volumes.

One option: Construct your own solution by combining various available technologies. This can be complex: in addition to messaging, data storage, and processing, you need management and orchestration to automate deployment and ensure continuous availability of the assorted parts.

A simpler option: Just plug in the GigaSpaces Real-Time Analytics solution. You can focus on your business logic and leave the rest to us.

GigaSpaces makes building and deploying a large-scale real-time analytics system simple. You just provide simple event processing business logic, and we handle the scalability, performance, and database integration. Seamlessly.

GigaSpaces delivers software middleware that provides enterprises and ISVs with end-to-end application scalability and cloud-enablement for mission-critical applications, for hundreds of tier-1 organizations worldwide.

It's Open: Use any stack, avoid lock-in.
- Pick your own Big Data database (RDBMS or NoSQL)
- Plug in consistent management and monitoring across the stack without changing your code
- Write event handlers using common languages
- Access your data using standard SQL/JPA APIs

All while minimizing costs. A unique combination of memory and disk-based databases ensures the optimum cost/performance ratio. Leveraging automation and cloud-based deployment reduces operational costs.

The GigaSpaces Real-Time Analytics solution for Big Data applications eliminates the complexity.

[Figure: XAP Real-Time Solution for Big Data. Diagram labels: Cassandra, HBase, MongoDB, Redis]
CURRENT TECHNOLOGIES OVERVIEW

There is no one-size-fits-all technology. Building an analytics application that addresses both real-time and batch requirements calls for a combination of the available technologies. The challenge becomes integrating these various pieces, tuning the system to ensure consistent performance and scaling through the entire stack, and providing consistent management and monitoring across the entire stack.

Most analytics systems can be broken down into three stages of data flow:

- Metrics (real-time): Various metrics are collected into counters. For example, number of requests per day.
- Correlation (near real-time): Correlate metrics for a more aggregated system view. For example, analyze which features hook users.
- Research (batch map/reduce processing): Use this information to run research and trend analysis over a period of time.

Currently, you must integrate different products and technologies to provide the entire analytics functionality. This method has many associated challenges:

Traditional App: Database (RDBMS). Used to run many analytics systems.
Associated Challenges:
- Performance: Not designed for real time
- Scaling: Not designed to grow at the speed and volume of information required in a Big Data environment; doesn't fit well with data that is continuously evolving
- Cost: Most RDBMSs rely on expensive set-up and hardware to maintain reliability and performance

Traditional App: Complex Event Processing (CEP). Designed to correlate data in real time.
Associated Challenges:
- Scaling: It is often necessary to aggregate events into a centralized source, which doesn't scale
- Capacity: Not designed to deal with historical data

Traditional App: Hadoop. Designed for batch analytics and complex correlation.
Associated Challenges:
- Performance: Not designed for real time

Traditional App: In-Memory Data Grid. Fast processing power for storing and processing data.
Associated Challenges:
- Capacity: Storing vast amounts of information in memory doesn't scale, in terms of both system scaling and cost

Traditional App: NoSQL. Designed to handle large data volumes at low cost.
Associated Challenges:
- Processing capability: Sheer amount of data can be challenging
THE SOLUTION

Google, Facebook, and Twitter have already shown us the way by moving many of their analytics systems to real time. The question now is how businesses can build their own Google/Facebook/Twitter-like analytics, but in a significantly simpler way that fits existing applications and skillsets.

Step 1: Collect and Store
Enable collection of large volumes of data from multiple sources in real time. The process must be reliable, to ensure the accuracy of the analytics.
Solution: Use an In-Memory Data Grid
- Memory enables throughput in the hundreds of thousands of messages per second
- Reliability is achieved through redundancy and replication
- Can be accessed through a large set of APIs (Document, JMS, Memcache...)

Step 2: Speed up processing through co-location of business logic with data
By co-locating your business logic and data, you can process events as they enter the system, eliminating extra network hops and serialization/de-serialization overhead. You also reduce the number of moving parts, making the entire system significantly simpler to scale and maintain.

Step 3: Integrate with the Big Data store to meet volume and cost demands
Integrate with the Big Data store through a generic plug-in, compatible with your data store of choice, whether NoSQL or SQL.

Benefits:
- Avoid lock-in to a specific NoSQL API
- Performance: Reduced network hops and serialization overhead
- Simplicity: Fewer moving parts
- Scalability without compromising consistency (strict consistency at the front, eventual consistency for the long-term data)
- JPA/Standard API
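To make Steps 1 and 2 concrete, here is a minimal Python sketch of events being routed to in-memory partitions, with a handler that runs in the same process as the data it touches. It is an illustration only and not the GigaSpaces API: `Partition`, `Grid`, and `count_by_user` are hypothetical names, and the redundancy and replication described in Step 1 are left out for brevity.

```python
import zlib
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, object]


class Partition:
    """Hypothetical in-memory partition: holds events and the handler that runs beside them."""

    def __init__(self, handler: Callable[["Partition", Event], None]) -> None:
        self.events: List[Event] = []
        self.counters: Dict[str, int] = defaultdict(int)
        self.handler = handler

    def write(self, event: Event) -> None:
        # Store the event, then process it in the same process: no extra
        # network hop and no serialization between data and business logic.
        self.events.append(event)
        self.handler(self, event)


class Grid:
    """Hypothetical grid that routes each event to one partition by key."""

    def __init__(self, partition_count: int,
                 handler: Callable[[Partition, Event], None]) -> None:
        self.partitions = [Partition(handler) for _ in range(partition_count)]

    def route(self, key: str) -> Partition:
        # A stable hash keeps all events for one key on the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def write(self, event: Event) -> None:
        self.route(str(event["user"])).write(event)


def count_by_user(partition: Partition, event: Event) -> None:
    """Example business logic: keep a per-user event counter next to the data."""
    partition.counters[str(event["user"])] += 1


grid = Grid(partition_count=4, handler=count_by_user)
for user in ["ann", "bob", "ann", "cat", "ann"]:
    grid.write({"user": user, "action": "click"})

ann_total = grid.route("ann").counters["ann"]
stored_events = sum(len(p.events) for p in grid.partitions)
```

Because every event for a given user lands on the same partition, the handler can update that user's counter locally without coordinating with other partitions.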
PUTTING IT ALL TOGETHER

1. Store events in memory
2. Co-locate business logic with data for RT processing
3. Integrate with Big Data store for long-term data

The architecture is a cluster of in-memory data grids (IMDG) at the front and a Big Data database at the back end:

1. Feeds are stored directly into the IMDG.
2. The feeds trigger a set of co-located processors that process them. The processing can include validation and enrichment of the data, as well as creation of new data sets needed for further correlation and post-processing.
3. Data is forwarded to the back-end Big Data store through the built-in write-behind feature of the IMDG. The IMDG can be used as a processing buffer: after processing by the IMDG, data is stored in the Big Data storage. The IMDG can also be used to store the last day of information.
4. Data sent to the NoSQL data store is stored in batches to maximize write throughput.
5. The analytics application reads the data directly from the NoSQL data store. When the application needs only the last day of activity, it can access the data grid directly through the built-in JPA/SQL interface.
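The write-behind and batching behavior described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the product's implementation: `WriteBehindBuffer` is a hypothetical class, a plain list stands in for the back-end Big Data store, and flushing happens synchronously when a batch fills, whereas a real write-behind mechanism would flush asynchronously.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class WriteBehindBuffer:
    """Hypothetical buffer: accepts single writes, forwards them to the store in batches."""

    def __init__(self, flush_batch: Callable[[List[Record]], None],
                 batch_size: int) -> None:
        self.flush_batch = flush_batch  # called once per batch
        self.batch_size = batch_size
        self.pending: List[Record] = []

    def put(self, record: Record) -> None:
        # The caller returns as soon as the record is buffered in memory.
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Hand everything buffered so far to the store as one batch.
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        self.flush_batch(batch)


# A list of batches stands in for the back-end data store (assumption).
batches_written: List[List[Record]] = []
write_behind = WriteBehindBuffer(batches_written.append, batch_size=3)

for i in range(7):
    write_behind.put({"id": i})
write_behind.flush()  # drain the remainder, e.g. on a timer or at shutdown

batch_sizes = [len(batch) for batch in batches_written]
```

Seven single writes reach the store as three batch writes, which is the effect item 4 relies on to maximize write throughput.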
MAIN FEATURES & BENEFITS

Performance: Maximum throughput is achieved using in-memory devices and by distributing events between nodes and processing them in parallel. The built-in synchronization (write-behind) writes to the database asynchronously and in batches, maximizing throughput to the underlying database.

Simplicity: All you need to do to build your entire Facebook-like analytics system is write your event handler business logic. GigaSpaces takes care of performance, high availability, scalability, and deployment management.

Continuous Availability: Keeping the real-time part decoupled from the long-term store makes it possible to continue serving real-time feeds even when the database is down. It also makes it easier to deal with the planned downtime required when maintaining long-term data, such as for re-sharding.

Cloud Enabled: Works with any private or public cloud, such as CloudStack, VMware, OpenStack, Amazon, Rackspace, and Azure.

Consistent Management: The GigaSpaces cluster management offers built-in integration with many popular databases, such as MySQL, PostgreSQL, Cassandra, and MongoDB, and with popular web platforms such as Tomcat, JBoss, and Node.js, enabling you to deploy the entire application stack with a single click.

Elasticity: Scaling is achieved by adding more machines, without any downtime.

Security: Access to the data is secured both from the feeder side and from the analytics system. You can also set roles that control which data sets are accessible to specific users.

Transactionality and Consistency: The entire processing is done under a transaction, ensuring the consistency and reliability of the data.

Openness: Choose any Big Data database (RDBMS or NoSQL), and plug in consistent management and monitoring across the stack without changing your code. Write event handlers using common languages such as Java, .NET, Groovy, JavaScript, JRuby, and a large set of dynamic languages, and access the data using standard SQL/JPA APIs.

REAL-TIME IN-MEMORY PROCESSING GRID AND BIG DATA STORAGE FEATURES

Real-Time Event Processing: Events are stored in memory. A built-in mechanism enables triggering of events based on SQL templates.

Standard Query: Users can access the data through a standard JPA/SQL interface.

Write/Read Dynamic Scalability: With a NoSQL back-end data store, the system can grow with the data, reducing the costs associated with over-provisioning.

Built-In Pub/Sub: Remote clients and services can subscribe to the processed data directly, without the need for an additional messaging system.

Map/Reduce: Data correlation and aggregation are done through parallel query and code execution across the entire data grid.

Open Database Plug-In: Easily plug in different SQL and NoSQL databases without changing the application code. You can start with SQL databases at small scale and switch to NoSQL later, as your system grows.
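As an illustration of the Map/Reduce feature listed above, the following Python sketch runs the same aggregation on every partition in parallel and then merges the partial results. It is a generic sketch of the technique rather than GigaSpaces code: the `partitions` data, `map_partition`, and `reduce_counts` are hypothetical, and a thread pool stands in for code executing on separate grid nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List

Event = Dict[str, str]

# Three lists stand in for the events held by three grid partitions (assumption).
partitions: List[List[Event]] = [
    [{"feature": "search"}, {"feature": "share"}],
    [{"feature": "search"}],
    [{"feature": "share"}, {"feature": "search"}, {"feature": "like"}],
]


def map_partition(events: List[Event]) -> Counter:
    """Map step: count feature usage using only this partition's local data."""
    return Counter(event["feature"] for event in events)


def reduce_counts(partials: Iterable[Counter]) -> Counter:
    """Reduce step: merge the per-partition counts into one result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Run the map step on all partitions in parallel, then reduce on the caller side.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_counts = list(pool.map(map_partition, partitions))

feature_totals = reduce_counts(partial_counts)
```

Only the small per-partition counters travel back to the caller, not the events themselves, which is what lets this style of aggregation scale with the number of partitions.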
COST BENEFITS

Economic Data Scaling
Leverage commodity hardware and software-based storage to provide a large-scale data store at low cost.

Solution:
- Memory: short-term data
- Disk: long-term data

Combine memory and disk for the optimum cost/performance ratio:
- Memory costs 10x to 100x less than disk at high data access rates (according to Stanford research)
- Disk costs less for high capacity at lower access rates

[Figure: Example of cost versus throughput for RAM and disk. Labels: "Use Disk for this throughput", "Use Memory for this throughput", "Optimum Cost"]

Processing 10K events per second with a 500B message size, and holding them in memory for a one-hour window (until they are pulled into long-term storage), requires only ~16G of memory, at a cost of ~$32 per month per server.

Economic App Scaling
- Automation: Reduce operational cost
- Elastic scaling: Reduce over-provisioning cost
- Cloud portability: Choose the right cloud for the job
- Cloud bursting: Scavenge extra capacity when needed

Industry use cases that particularly need real-time insights from big data sets include:

Social Networking: Measure the immediate impact on your site traffic from social media, whether a new blog post, a tweet, a Like, or even a comment. Knowing this information translates to better conversion and more effective online campaigns.

SaaS: Measuring user behavior and acting upon it is crucial for improving customer satisfaction and conversion rates, which represent immediate increases in revenue.

Financial Services: Determining in real time whether your portfolio is losing money, or whether there is fraud in your system, means that you can prevent disasters as they occur, not after the damage is done. Correlating multiple sources from the market in real time results in a more accurate view of the market and enables more accurate actions to maximize your profit.
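As a back-of-the-envelope check of the memory figure quoted under Economic Data Scaling, the raw payload for a one-hour window can be computed directly. The short Python sketch below uses only the numbers stated in the text (10K events per second, 500-byte messages, one hour); it assumes no indexing, metadata, or replication overhead, and it does not attempt to reproduce the per-server price.

```python
# Inputs taken from the text.
events_per_second = 10_000
message_bytes = 500
window_seconds = 60 * 60  # one-hour window

# Raw payload held in memory for the window (overheads ignored: assumption).
window_bytes = events_per_second * message_bytes * window_seconds

window_gb = window_bytes / 10**9   # decimal gigabytes
window_gib = window_bytes / 2**30  # binary gibibytes

print(f"{window_bytes:,} bytes = {window_gb:.1f} GB = {window_gib:.1f} GiB")
```

The raw payload comes to 18 billion bytes, roughly 16.8 GiB, which is in line with the ~16G quoted above.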
ABOUT GIGASPACES

GigaSpaces Technologies is the pioneer of a new generation of application virtualization platforms, and a leading provider of end-to-end scaling solutions for distributed, mission-critical application environments and of cloud-enabling technologies. GigaSpaces is the only platform on the market that offers truly silo-free architecture, along with operational agility and openness, delivering enhanced efficiency, extreme performance, and always-on availability. Our technology was designed from the ground up to support any cloud environment (private, public, or hybrid) and offers a pain-free, evolutionary path from today's data center to the technologies of tomorrow.

GIGASPACES OFFICES WORLDWIDE

US East Coast Office, New York: Tel: +1-646-421-2830
US West Coast Office, San Jose: Tel: +1-408-878-6982
International Office, Tel Aviv: Tel: +972-9-952-6751
Europe Office, London: Tel: +44-207-117-0213
Asia Pacific Office, Singapore: Tel: +65-65497220