ROCANA WHITEPAPER

Rocana Ops Architecture
CONTENTS

INTRODUCTION
DESIGN PRINCIPLES
EVENTS: A COHESIVE AND FLEXIBLE DATA MODEL FOR OPERATIONAL DATA
DATA COLLECTION
    Syslog
    File Tailing
    Directory Spooling
    Java API
    REST API
    Log4J
PUBLISH-SUBSCRIBE MESSAGING
EVENT PROCESSING AND STORAGE
    Event Storage
    Search Indexing
    Metric Aggregation
    Advanced Analytics
    Data Lifecycle Management
DATA EXPLORATION, VISUALIZATION, AND ANALYSIS
EXTENDING ROCANA OPS
CONCLUSION
ABOUT ROCANA
INTRODUCTION

The infrastructure that runs the modern business looks fundamentally different today than it did twenty, ten, or even five years ago, and operations has been struggling to keep up with the pace of change. Applications that ran on single large machines gave way to specialized servers numbering in the tens or hundreds, followed by the virtualization movement, which turned the hardware substrate into a general-purpose platform. Today, applications themselves look different: they're decomposed into fine-grained services, distributed across many machines, and highly interdependent. Organizations have begun shifting from discrete line-of-business applications to shared services built and maintained by platform groups. The modern application may be the amalgamation of tens of services, each of which operates and scales independently. While incredibly powerful, resilient, and scalable, these systems are also very complex and challenging to operate in mission-critical environments. Traditional IT operations management tools that worked well in the past can no longer provide the necessary insight, nor run at the scale required by modern IT infrastructure. A fundamentally new approach is required: one that treats IT operations analytics as the Big Data problem it has become.

Rocana Ops was built from the ground up to give system, network, and application administrators and developers true insight into modern infrastructure and applications. It was designed with the assumptions that there are hundreds of thousands of machines to be managed, and that the relationships between services, applications, and the underlying infrastructure are ephemeral and can change dynamically. The fundamental goal of Rocana Ops is to provide a single view of event and metric data for all services, while separating the signal from the noise using the advanced analytics and machine learning techniques associated with Big Data challenges in other domains.

DESIGN PRINCIPLES

When building Rocana Ops, we held the belief that it would be used across a large organization as a shared service. As a result, it needed to operate at enterprise scale, handling tens of terabytes per day of event and metric data, with hundreds or even thousands of users engaging with the application concurrently.

We anticipated that collected data would be used for more than the most obvious operational use cases: our customers would want to extend the platform and use the data in new and unexpected ways. We therefore needed to eliminate proprietary formats and inefficient batch exports, to allow customers to truly own their data assets. To further promote interoperability, extensibility, and scalability, we maximized the use of open source platforms in the underlying architecture.

In order to support production operations, we dramatically reduced end-to-end latency from when data is initially produced to when it's available for query and analytics. An operational analytics platform, by definition, also needs to be more available than the systems it monitors; any failures in the system, or in the infrastructure on which it runs, therefore need to be anticipated and handled appropriately.
Pulling together operational data from disparate systems is inherently an exercise in data integration. It is infeasible to require modification of source systems; therefore, extensible data processing and transformation needs to be a first-class citizen.

The system has many different kinds of users. The needs of application, network, and system administrators differ from those of developers, and DevOps staff have still different requirements. To be truly useful, the system needs to support all of these users and use cases.

[Figure: High-Level Logical View of Rocana Ops Architecture]

Analyzing complex operational data is not that different from analyzing security, customer behavior, or financial data, yet IT operators rarely have tools as sophisticated as those of their peers in these other disciplines. There is typically a lot of noise in operational monitoring at scale, and users can easily be overwhelmed by it, lacking the time to separate the signal from the noise. Further, the signal required to solve one problem may be the noise of another. Rocana Ops addresses this by providing out-of-the-box machine learning algorithms that guide the user through the analysis process. The solution's interaction model is based on visualizations that facilitate narrowing the scope of analysis and pinpointing problem areas. Many years of operational experience, big data expertise, and advanced analytics practice have come together to build the next generation of operational analytics for today's modern infrastructure.

One of the primary goals of Rocana Ops is to eliminate data silos by combining the critical features of different operational systems into a single application. That said, much of the data collected by Rocana Ops is the same data used by specialized applications in security, marketing, e-commerce, and other such systems. Rather than force developers to reinvent the kind of scale-out infrastructure that drives Rocana Ops and source this data a second time, we actively encourage direct integration with, and extension of, the Rocana Ops platform.
EVENTS: A COHESIVE AND FLEXIBLE DATA MODEL FOR OPERATIONAL DATA

To simplify collection, transformation, storage, exploration, and analysis, Rocana Ops uses an extensible event data model to represent all data within the system. This event data model provides a simple and flexible way of representing time-oriented discrete events that occur within the data center. An event can represent a miscellaneous log record, an HTTP request, a system or application authentication result, a SQL query audit, or even a group of application metrics at a particular point in time, each designated with an event type. Application and device performance data is also captured as events, categorized under one of the system's built-in event types. Data sources generate events containing one or more metrics in the attributes of a metric event, and the system automatically builds and maintains a time series of each metric over multiple dimensions.

All events have a set of common fields, including a millisecond-accurate timestamp, event type, unique identifier, and event message body, as well as the originating host, location, and service. Since all events have these common fields, it's possible to correlate otherwise disparate data sources without requiring complex and bespoke queries. Additionally, each event contains a set of key-value pairs which can carry event type-specific data. These attributes act as a natural point of customization for user-defined event types. Moreover, because all parts of Rocana Ops understand these attributes, new or business-specific event data can be easily captured, processed, and queried without custom development.

Here's an example of a typical syslog (RFC 3164) event, represented as text. Within the system, events in transit are actually encoded in Apache Avro format, an efficient, compact, binary format designed for the transport and storage of data.

Example: A syslog (RFC 3164) event.

{
  // Common fields present in all events.
  id: JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====,
  event_type_id: 100,
  ts: 1436576671000,
  location: aws/us-west-2a,
  host: example01.rocana.com,
  service: dhclient,
  body: ...,
  attributes: {
    // Attributes are specific to the event type.
    syslog_timestamp: 1436576671000,
    syslog_process: dhclient,
    syslog_pid: 865,
    syslog_facility: 3,
    syslog_severity: 6,
    syslog_hostname: example01,
    syslog_message: DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)
  }
}
This example shows how HTTP request events from an application server can be represented just as easily.

Example: An application server-neutral HTTP request event.

{
  id: JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====,
  event_type_id: 103,
  ts: 1436576671000,
  location: nyc/fac2/rack17,
  host: example01.rocana.com,
  service: httpd,
  body: ...,
  attributes: {
    http_request_url: http://www.example.com/product/104?ref=search&uid=1007,
    http_request_vhost: www.example.com,
    http_request_proto: HTTP/1.1,
    http_request_method: GET,
    http_request_path: /product/104,
    http_request_query: ref=search&uid=1007,
    http_request_scheme: http,
    http_response_code: 200,
    http_response_size: 2452,
    ...
  }
}

When building an operational analytics platform as a service to business units within a larger organization, it's common to onboard new event types constantly. This flexible event format allows the system to capture new event types without requiring configuration within the platform prior to receiving data. Additionally, both raw and extracted data can be retained within an event, allowing for easy reprocessing should data processing errors be discovered.

DATA COLLECTION

As noted before, we see operational analytics as a first-class exercise in data integration: a solution needs to collect data from every system that makes up the larger organizational infrastructure, as well as handle the different ways in which those systems produce logs, metrics, and diagnostics. In simple cases, integration may be possible via configuration; elsewhere, custom plugins may do the trick. Specialized cases may require deeper integration involving direct access to traditionally internal formats and data. All of these methods are supported by Rocana Ops using two main mechanisms:
1. A native Rocana Agent plays three major roles on Linux and Windows operating systems. The Agent acts as a syslog server and provides file tailing and directory spooling data integration for locally written logs. Additionally, the Agent collects OS- and host-level metrics of the machines on which it runs.

[Diagram: The Rocana Ops Agent provides a syslog server, file tailing, and directory spooling]

2. A high-performance Java API is also available to collect data directly from applications. It can be used directly for system extension, or through the wrapper REST and Log4J APIs.

[Diagram: The Java API underlies the Log4J and REST APIs]

SYSLOG

Syslog is the primary source of data for Unix-variant operating systems, as well as for most network devices. The Rocana Agent operates an RFC 3164- and RFC 5424-compliant syslog server, supporting both TCP and UDP transports. Syslog messages are automatically parsed, becoming events within the system.

FILE TAILING

Text-based log files are a common and simple method of exposing application activity. The Rocana Agent supports reliable real-time tailing of log files, with customizable regular expressions for extracting fields from records.

DIRECTORY SPOOLING

While file tailing is used to watch well-known log files that receive a continuous stream of small changes, directory spooling supports the use case of directories that receive larger data files that should be ingested once. If systems dump diagnostic trace or performance files, for example, the Rocana Agent can respond by processing each file as it arrives in a specified filesystem directory.

JAVA API

The Rocana Ops Java API is the highest performance, most direct, and most flexible method of producing or consuming event data. Those who wish to explicitly instrument applications or integrate with custom systems can use this API to produce data directly to, or consume data from, the publish/subscribe messaging system used by the rest of the platform. This same API powers the REST API as well as many of the internal components of Rocana Ops and, as a result, is highly optimized and proven.
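For illustration, the following is a minimal sketch of what publishing an event to the platform's pub/sub layer can look like, written against the standard Apache Kafka producer API that underlies this part of the architecture. The broker address, the topic name ("events"), the partitioning key, and the JSON rendering of the event are all assumptions made for the example; the actual Rocana Ops Java API wraps these details, including the Avro encoding described earlier.

// A minimal sketch, assuming a Kafka topic named "events" and a broker at
// broker01:9092; both are illustrative, not Rocana defaults.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker01:9092");
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    Producer<String, String> producer = new KafkaProducer<>(props);

    // The event is rendered as JSON here for readability; within Rocana Ops,
    // events on the wire are Avro-encoded.
    String event = "{\"event_type_id\": 103, \"ts\": 1436576671000, "
        + "\"host\": \"example01.rocana.com\", \"service\": \"httpd\"}";

    // Keying by host is an illustrative partitioning choice, not necessarily
    // the scheme used by the Rocana Ops data sources.
    producer.send(new ProducerRecord<>("events", "example01.rocana.com", event));
    producer.close();
  }
}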
REST API

A simple REST API is provided on top of the Java API for easy integration of systems where performance is less critical. This API can be used with any language or third-party system that supports HTTP/S and JSON parsing.

LOG4J

An appender plugin for the common Apache Log4J Java logging framework is provided for code-free integration with systems already instrumented with these APIs. Using the Rocana Ops Log4J appender obviates the need to write text logs to disk, sending data directly and reliably to the pub/sub messaging system.

PUBLISH-SUBSCRIBE MESSAGING

At its core, Rocana Ops uses the high-throughput, reliable, scale-out, open source publish-subscribe ("pub-sub") messaging system Apache Kafka for the transport of all data within the system. This pub-sub layer acts as the central nervous system, facilitating all of the real-time data delivery and processing performed by the rest of the system. All event data captured by the sources described in the Data Collection section is sent to a global event stream, which is consumed by the system in different ways. This firehose of event data provides a single, full-fidelity, real-time view of all activity within an entire organization, making it the perfect data integration point for both Rocana Ops and custom applications.

Kafka has a very familiar logical structure, similar to that of any other pub-sub broker, with a few notable deviations that allow it to function at this scale. Just as with traditional pub-sub messaging systems, Kafka employs the notions of producers, consumers, and topics, to which producers send data and from which consumers receive it. All data in Kafka is always persisted to disk in transaction logs; this is the equivalent of the most reliable or durable modes in traditional messaging systems. Rather than assume all data fits on a single broker, however, Kafka partitions each topic, spreading the partitions across multiple brokers that work together as a cluster. Each partition of a topic is also optionally replicated a configurable number of times so broker failures can be tolerated. The Rocana Ops data sources described earlier automatically distribute data across these partitions, taking full advantage of the aggregate capacity of the available brokers. When more capacity is required, additional brokers may be added to the cluster. For additional technical information about Apache Kafka, see http://kafka.apache.org/documentation.html#introduction.
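To illustrate how the global event stream can be consumed directly, for example by a custom downstream application, the sketch below uses the standard Apache Kafka consumer API. The topic name ("events"), the broker address, and the consumer group name are assumptions carried over from the earlier producer sketch, not documented Rocana defaults.

// A minimal sketch, assuming the same illustrative "events" topic as above.
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventConsumerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker01:9092");  // illustrative broker
    props.put("group.id", "custom-consumer");         // illustrative group name
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singletonList("events"));

    // Kafka delivers each partition's events in order; the consumer group
    // mechanism spreads partitions across consumer instances for scale-out.
    while (true) {
      ConsumerRecords<String, String> records = consumer.poll(1000);
      for (ConsumerRecord<String, String> record : records) {
        System.out.println(record.value());  // the event, rendered as a string
      }
    }
  }
}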
EVENT PROCESSING AND STORAGE

Event data is processed in real time by specialized services as it is received from the messaging layer. The following services exist within Rocana Ops to facilitate its different functions.

EVENT STORAGE

All event data is persisted to the Hadoop Distributed File System (HDFS) for long-term storage and downstream processing. Data is stored in Apache Parquet, a highly optimized, PAX-structured format supporting common columnar storage features such as run-length encoding (RLE) and dictionary encoding, as well as traditional block compression. This dataset is partitioned by time to facilitate partition pruning when performing queries against known time ranges (the common case in operational analytics).

SEARCH INDEXING

Each event is indexed for full-text and faceted search within the Rocana Ops application. Just as with the event storage service, search indexes are partitioned by time, and served by a scale-out, parallelized search engine. Indexing is performed in real time, as the data arrives.

METRIC AGGREGATION

In addition to discrete events, Rocana Ops also maintains time series datasets of metric data from device and application activity. Metric data can arrive as events containing host-related metrics collected by the Rocana Agent (described earlier), or it can be extracted from other kinds of events, such as logs. Examples include extracting the number of failed application logins from clickstream data, the number and types of various HTTP errors, or the query time of every SQL statement executed in a relational database. The metric aggregation service records and derives this kind of metric data, writing it to HDFS as a time-partitioned Parquet dataset. Rocana Ops uses this data to build charts and detect patterns using advanced data analysis techniques (a simplified sketch of this kind of roll-up follows at the end of this section).

ADVANCED ANALYTICS

Many of the advanced analytical functions of Rocana Ops operate by observing the data over time and learning the patterns that occur. Rocana Ops employs sophisticated anomaly detection, using machine learning to develop baseline models for the system as a whole, as well as custom models of hosts, services, or locations. For example, Rocana Ops anomaly detection can establish an ever-improving model of disk I/O on a particular machine, then flag any unexpected deviations from that norm. The analytics service controls the execution of these algorithms.

DATA LIFECYCLE MANAGEMENT

Over time, it becomes necessary to control the growth and characteristics of high-volume datasets in order to control resource consumption and cost. Rocana Ops includes a data lifecycle management (DLM) service that enforces policies on the collected data, including control over data retention and optimization.

All services described above are highly available and may be scaled independently to accommodate the size and complexity of a deployment.
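To make the metric derivation and baselining ideas concrete, here is a deliberately simplified sketch, not Rocana's implementation, that rolls HTTP error events up into per-minute counts and flags minutes that deviate sharply from the mean, in the spirit of the metric aggregation and advanced analytics services described above. The class, field names, and threshold are all illustrative; real baseline models account for seasonality, per-host behavior, and much more.

// A toy roll-up and baseline check; illustrative only, not Rocana's algorithms.
import java.util.Map;
import java.util.TreeMap;

public class ErrorRateSketch {
  // Minute-bucket timestamp (ms) -> count of HTTP 5xx responses.
  private final Map<Long, Long> errorsPerMinute = new TreeMap<>();

  // ts is the event's millisecond timestamp (the "ts" field); code is the
  // event's http_response_code attribute.
  public void observe(long ts, int code) {
    if (code >= 500) {
      long minute = (ts / 60_000L) * 60_000L;  // truncate to minute bucket
      errorsPerMinute.merge(minute, 1L, Long::sum);
    }
  }

  // Flags buckets exceeding three times the mean; a stand-in for a real
  // learned baseline model.
  public void reportAnomalies() {
    if (errorsPerMinute.isEmpty()) {
      return;
    }
    double mean = errorsPerMinute.values().stream()
        .mapToLong(Long::longValue).average().orElse(0.0);
    for (Map.Entry<Long, Long> e : errorsPerMinute.entrySet()) {
      if (e.getValue() > 3 * mean) {
        System.out.println("Anomalous minute " + e.getKey()
            + ": " + e.getValue() + " errors (mean " + mean + ")");
      }
    }
  }

  public static void main(String[] args) {
    ErrorRateSketch sketch = new ErrorRateSketch();
    long base = 1436576640000L;  // an arbitrary minute boundary
    for (int m = 0; m < 10; m++) {
      sketch.observe(base + m * 60_000L, 500);  // one error per minute: baseline
    }
    for (int i = 0; i < 20; i++) {
      sketch.observe(base + 10 * 60_000L + i, 503);  // a 20-error spike
    }
    sketch.reportAnomalies();  // flags the spike minute
  }
}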
DATA EXPLORATION, VISUALIZATION, AND ANALYSIS

Rocana Ops includes an interactive user interface for exploring, visualizing, and analyzing the operational data collected by the system. Specialized views exist for interactive data exploration and trend identification, comparing and correlating different kinds of events, full-text search, custom dashboard creation, and more. These views combine metadata extracted from event data, time series metric data, and discrete event data to provide a comprehensive understanding of what's happening within the infrastructure, without requiring operators to learn specialized skills or query syntax.

Rather than attempt to support these visualizations by shoehorning all of their requests into a single general-purpose query engine or storage system, Rocana Ops uses different query engines and data representations to answer different kinds of questions. Interactive full-text search queries, for instance, are handled by a parallel search engine, while charts of time series data use a parallel SQL engine over partitioned Parquet datasets. All parts of the system use first-class parallelized query and storage systems for handling event and metric data, making Rocana Ops the first natively parallelized operational analytics system available.

EXTENDING ROCANA OPS

Rocana Ops is highly customizable, all the way from the user interfaces down to the data platform. Users can extend the system by defining their own event types for specific use cases (see the example below). They can also develop custom producers to support custom sources, as well as custom consumers to execute specialized data processing. In addition, data transformations can be defined using a configuration-based approach to enable quick, simple transformation from and to multiple formats. Anomaly detection targets and thresholds, as well as system-wide data retention policies, are also fully customizable.
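For instance, a user-defined event type for order processing might look like the following, rendered in the same style as the earlier examples. The event_type_id value and the order_* attribute names are hypothetical, chosen purely for illustration; only the common fields are fixed by the platform.

Example: A hypothetical user-defined order-processing event.

{
  id: ...,
  event_type_id: 5001, // hypothetical user-assigned event type
  ts: 1436576671000,
  location: nyc/fac2/rack17,
  host: example01.rocana.com,
  service: orders,
  body: ...,
  attributes: {
    // Hypothetical business-specific attributes.
    order_id: 8675309,
    order_status: SHIPPED,
    order_total: 125.00,
    order_warehouse: nyc-01
  }
}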
[Figure: Simplified Flow Diagram for Integrating External Data with Events]

CONCLUSION

Rocana Ops is all about bringing the power of Big Data analytics into the realm of IT operations. It provides several mechanisms to ingest all of your data, and utilizes the Kafka bus architecture to ensure that the system can scale to the required volumes. It also uses HDFS for long-term data retention in order to minimize the costs associated with storage infrastructure. The use of these, as well as other open source platforms and formats, also ensures that the data is always accessible and is never held hostage. To that end, robust integration mechanisms are provided both into and out of the system.

In order to deliver sophisticated yet straightforward data analysis (of the kind typically associated with other domains, such as finance), Rocana Ops employs anomaly detection algorithms based on machine learning, which drive visualizations that guide the user to the most important data points. These algorithms can be tweaked, but are meant to work right out of the box, without requiring any data science background. Our key goal is to provide a simple, easy-to-use application, while taking advantage of the most sophisticated technologies available today.
ABOUT ROCANA

Rocana is creating the next generation of IT operations analytics software for a world in which IT complexity is growing exponentially as a result of virtualization, containerization, and shared services. Rocana's mission is to provide guided root cause analysis of event-oriented machine data in order to streamline IT operations and boost profitability. Founded by veterans of Cloudera, Vertica, and Experian, the Rocana team has directly experienced the challenges of today's IT infrastructures, and has set out to address them using modern technology that leverages the Hadoop ecosystem.

Rocana, Inc.
548 Market St #22538, San Francisco, CA 94104
+1 (877) ROCANA1
info@rocana.com
www.rocana.com

© 2015 Rocana, Inc. All rights reserved. Rocana and the Rocana logo are trademarks or registered trademarks of Rocana, Inc. in the United States and/or other countries. WP-ARCH-0715