Proceedings of the 2013 Federated Conference on Computer Science and Information Systems

Content Delivery Network Monitoring with Limited Resources

Krzysztof Kaczmarski, Marcin Pilarski
Faculty of Mathematics and Information Science, Warsaw University of Technology
ul. Koszykowa 75, Warszawa, Poland

Bogdan Banasiak, Christophe Kabut
Orange Labs Poland, Telekomunikacja Polska S.A.
ul. Obrzeżna 7, Warszawa, Poland

Abstract. This article presents the results of designing a Content Delivery Network (CDN) monitoring system for resource-limited applications. CDN monitoring is important both for content providers (media companies) and for administrators (Internet Service Providers). It is a challenging task, since network traffic may generate huge volumes of data which must be parsed and analysed in real time. This paper describes the design of a prototype system with a small resource footprint built on a scalable Big Data solution, motivated by real-world use cases.

I. INTRODUCTION

Distributed internet systems monitoring is already an important branch of industrial computer systems. Network data transfer, hardware state and many other aspects of distributed systems need to be constantly tracked in order to detect anomalies, prevent failures and measure hardware and software load. In this field, Content Delivery Network (CDN) monitoring is becoming a fast-developing branch of telecommunication. Network caching is used in a variety of contexts to provide cost savings through decreased bandwidth consumption, but also to reduce network latency. The techniques described in [1] make the RAM/HDD resource ratio important, especially for use cases in developing regions of the world. CDN monitoring requires detailed information, both statistical and sensor-reported, to be available in real time and to cover any given period of time without losing any detail.
For example, if a malfunction is detected, one may need to track its behaviour back to an arbitrary point in time in order to find possible coincidences with other events. Therefore all relevant data concerning CDN operation must be stored in exact shape. There are several highly efficient systems able to collect and store raw logs. Flume [2], for example, can collect log lines, send them to HDFS storage and generate reports on them. Hive [3] can evaluate SQL-like queries over large data volumes collected in Hadoop HDFS [4]. All these classical Big Data technologies require large hardware configurations while storing data inefficiently from the time series point of view. A much better approach is applied in OpenTSDB [5], a dedicated time series database running on HBase [6]. Its data model can be effectively tuned to achieve both very fast querying and a limited hardware footprint. This paper describes optimizations for a CDN monitoring system based on the experience of its deployment in one of the biggest telecommunication companies in Poland, which operates commercial CDN services.

A. Limitations of Network Traffic Monitoring

CDN monitoring becomes especially challenging in developing regions where the storage footprint available for monitoring is very limited. The situation in those regions is especially unfortunate, since the transmission of data logs to public clouds or third-party services cannot be performed for reasons such as cost, security and privacy. This type of monitoring system may be useful in environments where link bandwidth is a constrained resource, e.g. CDN nodes deployed at libraries, schools, remote communities, etc.; basically all locations where uplink bandwidth is limited. The security and privacy constraints are also general in this scope: most ISPs are reluctant to deploy the subset of CDN systems called Transparent Caching solutions, primarily because of logging concerns in third-party services.
The goal of this research was to develop a CDN monitoring tool for use in developing regions with limited resources. We present a system which provides the same level of flexibility as traditional heavyweight monitoring tools.

II. TIME SERIES DATABASE

From a theoretical point of view, the system needs to store time series, understood as collections of observations made sequentially in time [7]. These discrete observations are represented by pairs of a timestamp and a numerical value (t_i, v_i) with the following assumptions:
- the number of data points (timestamps and their values) in one time series is not limited,
- each time series is identified by a name, often called a metric name,
- each time series can additionally be marked with a set of tags describing measurement details,
- observations may not be made at constant time intervals,
- the storage should not limit time series to being piecewise constant or linear (see Fig. 1).
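The assumptions above can be sketched in a few lines of code. This is a minimal illustrative model only; the class and field names are our own choices and are not part of OpenTSDB's API.

```python
from dataclasses import dataclass, field

@dataclass
class TimeSeries:
    metric: str                                 # metric name, e.g. "cdn.mbps.avg5min"
    tags: dict = field(default_factory=dict)    # measurement details
    points: list = field(default_factory=list)  # (t_i, v_i) pairs, any spacing

    def insert(self, timestamp: int, value: float) -> None:
        # observations arrive sequentially in time; the number of
        # points is unbounded and intervals need not be constant
        self.points.append((timestamp, value))

series = TimeSeries("cdn.mbps.avg5min", {"country": "pl", "provider": "disco-tv"})
series.insert(1370000000, 2.5)
series.insert(1370000300, 3.1)   # 5 minutes later
series.insert(1370001200, 0.0)   # irregular gap: still a valid observation
print(len(series.points))
```

Note that nothing in the model forces the series to be piecewise constant or linear; interpolation behaviour is left to the query layer.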
PROCEEDINGS OF THE FEDCSIS. KRAKÓW, 2013

A. Architecture

The general architecture of the system is composed of three components: data collection, data storage and a querying engine. As in many other systems, data are inserted by distributed data collectors, which may work directly on data sources and push data into the data storage. Usually data collection is a result of ongoing measurement or monitoring processes (of the operating system, databases, web servers or network devices; see the OpenTSDB monitoring tools [8]). The system does not limit the data which can be inserted and analysed; the only requirement is that they must have the form of time series. According to existing taxonomies (like in the FAME system [9]), measured values are of two types [10]:
- level values stay the same from one period to the next in the absence of activity. For example, inventory is a level value, because inventory stays the same if you neither buy nor sell.
- flow values are zero in the absence of activity. For example, energy expenses go to zero if there is no consumption.
This distinction turns out to be important when interpolating missing values and for time scale conversion. Our system is open to both kinds by inserting zeros when necessary at data collection time.

As we described in [11], our CDN monitoring system is built on OpenTSDB, which uses HBase and HDFS as its data storage. OpenTSDB is responsible for storing data in two HBase tables: the first contains the main data compacted into one-hour blocks, the second keeps time series and tag IDs. OpenTSDB also evaluates queries by fetching the proper data blocks from HBase and aggregating them into a time series according to the query semantics. The performance of the whole system is influenced by all three components: HDFS, HBase and OpenTSDB.

B. CDN Metrics and Metrics Querying

Fig. 1. Time series with constant (a) and variable (b) sampling, piecewise constant (S_1) and piecewise linear (S_2).

For the purposes of CDN monitoring, the most important metrics from an operational point of view are:
- bandwidth: average number of bits per second transferred from the platform within a given time period [kbps, Mbps, Gbps]
- traffic: sum of bytes transferred in a given time period [kB, MB, GB]
- sessions: number of active sessions in a given time period
- unique clients: number of unique IP addresses
- url hits: number of download events for content from given URL addresses
- byte cache hit ratio: percentage of data volume sent from the cache [%]

Additionally, all metrics must be further divided into the following dimensions:
- node name: CDN node name or IP which sends data to a client
- cluster group name: group of nodes logically grouped together
- http response code: HTTP code sent in response to a client request
- country ISO code: client's geographical location
- AS code: Autonomous System code of a client's Internet Service Provider
- CDN instance name: a logical grouping of nodes working in one or multiple CDNs
- provider name: name of a content provider
- origin server name: name of an origin server which contains the original data cached by a CDN
- url: URL which is tracked by the system

We use the following metric naming schema: [infrastructure].[measurement].[aggregation], where:
- infrastructure indicates the system which is measured (in our case a CDN instance, but it could also be a CPU),
- measurement indicates the name and type of the measurement being made (for example Mbps, MB, requests, etc.),
- aggregation describes the type of aggregation performed during data collection, which results in one data point for a given time interval.
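The naming schema can be illustrated with a small helper that builds a metric name and a single data-point line with its tags. The function names and the exact line layout here are hypothetical, loosely mirroring the example data point given in the text; they are not a real OpenTSDB API.

```python
def metric_name(infrastructure: str, measurement: str, aggregation: str) -> str:
    # [infrastructure].[measurement].[aggregation]
    return f"{infrastructure}.{measurement}.{aggregation}"

def data_point(name, timestamp, value, **tags):
    # one line per data point: name, timestamp, value, then tag=value pairs
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{name} {timestamp} {value} {tag_str}"

name = metric_name("cdn", "mbps", "avg5min")
line = data_point(name, 1370000000, 2.5,
                  node="node01.waw.cdn-lab.pl", group="waw02",
                  httpcode=200, country="pl", **{"as": "as5617"})
print(line)
```

Each distinct combination of tag values produced this way identifies one time series for the metric.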
For example, one of the client requests in Poland downloading content from a hypothetical provider disco-tv could be stored as (a metric name, timestamp and value followed by a list of tags):

cdn.mbps.avg5min :04: 2.5 node=node01.waw.cdn-lab.pl group=waw02 httpcode=200 country=pl as=as5617 cdn=lab provider=disco-tv origin=share.disco.pl url=clip doda

which means that within a 5-minute period the collector calculated the average Mbps to be 2.5 for the server named node01.waw.cdn-lab.pl in cluster group waw02, for a client located in Poland in AS5617, downloading content from a provider named disco-tv originating from the server share.disco.pl, with the use of the CDN instance named lab and referencing the url clip doda. This approach, defining all possible dimensions for each metric, enables very flexible querying of the collected data. For example, one could ask for the average Mbps for a given provider in Germany in a given time period, or for the number of error codes returned for a given provider's network (identified by AS number) when accessing a certain CDN node by a given URL. Please note that each combination of tag values defines one time series. The total number of time series stored for
one metric is then given by v_1 · v_2 · ... · v_l, where l stands for the number of tags attached to the metric and v_i is the maximal number of distinct values possible for tag i.

Let us analyse how many time series can be generated by an average metric and tag set. Approximations for minimal and real-life scenarios are given in Table I. The real-life approximation was made using production data from one working, average-sized CDN system. It assumes that accesses to all URLs may be tracked with all possible http response codes, hitting all possible nodes from each country. Since there might be dependencies between the dimensions, the actually observed values may be smaller; however, the database must be prepared for the worst-case scenario, which may appear for example during a malfunction. In fact, anomalies are the most important case from the analytical point of view. Therefore the system should be able to present data across all possible dimensions and tag value ranges.

TABLE I. APPROXIMATION OF THE POSSIBLE NUMBER OF TIME SERIES FOR ONE METRIC IN MINIMAL AND REAL-LIFE SCENARIOS.

tag name            min values   real-life
node
group               5            10
httpcode            5            12
country             1            90
as                  1            30
cdn                 1            5
provider            1            5
origin              1            20
url
total time series

C. Data Volumes

Another aspect of CDN monitoring is estimating the data volumes which must be processed by the system in real time. CDN traffic monitoring may be based on at least three information sources: CDN proprietary logs, Apache http logs and Syslog events. Due to the latest IETF standardization efforts [12] and [13], we may expect a fourth information source: logs from interconnected CDN systems. One request for content generates one log line (about 0.2 kB). One Smooth Streaming video generates up to 300 entries per 5 minutes (about 60 kB); 10k users watching a 90-minute Smooth Streaming video generate about 10 GB of log data.
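The estimate above can be checked with simple back-of-envelope arithmetic, using only the figures quoted in the text (about 0.2 kB per log line, up to 300 lines per 5-minute interval per stream); the calculation itself is ours.

```python
LINE_KB = 0.2            # size of one log line, kB
LINES_PER_5MIN = 300     # worst-case entries per 5-minute interval per video
kb_per_5min = LINE_KB * LINES_PER_5MIN   # about 60 kB per viewer per interval

viewers = 10_000
minutes = 90
intervals = minutes // 5                 # 18 five-minute intervals

total_kb = viewers * intervals * kb_per_5min
total_gb = total_kb / 1_000_000          # taking 1 GB = 10^6 kB
print(round(total_gb, 1))                # about 10.8, i.e. "about 10 GB"
```

The result of 10.8 GB matches the order-of-magnitude figure quoted above.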
In some cases a log coalescing function may be used; this technique may reduce the log size by a factor of 10 on average. As another example, 100k users downloading a 1 GB game over 5 Mbps connections may generate up to 1 GB of log data. These data need to be downloaded and analysed and cannot wait for batch processing at night or at weekends. CDN systems offering content worldwide may be equally loaded all the time, in which case a monitoring system must collect and process data in real time.

III. OPTIMIZATIONS

It is a typical HBase performance design pattern to build clusters of at least 11 nodes. In more demanding environments, 50 nodes is an average number. This approach would lead to a monitoring infrastructure more expensive than the monitored CDN itself. Therefore the time series database must be deployed on a single-node cluster with a pseudo-distributed configuration, on the one hand allowing for quick extensions in the future and on the other hand generating sensible running costs.

A. Single Node Configuration Performance

Let us now analyse the performance of a single-node configuration. We executed two types of queries:
- Aggregating all available time series for a given metric into one time series in a given time window. For example¹:
  select sum:cdn.bandwidth.15min from to
- Performing a selection of time series using tags before the aggregation, within a given time window. For example:
  select sum:cdn.bandwidth.15min from to where provider=disco-tv and country=de

Due to the behaviour of CDN logs, we used equal sampling with a 5-minute period; therefore each day is described by 24 × 12 = 288 time points. Please note that the time series may not be continuous, and therefore their number may vary throughout a queried time window. In Table II we can see that the average time series processing speed for a query scanning all time series in a given time window is about 660k points per second.
Since we scan all data, the number of points and time series is constantly increasing. Processing 4 days of data takes almost 40 seconds. Queries for periods longer than 4 days failed due to an out-of-memory error. A similar situation appears for the filtered time series querying presented in Table III, where we may observe that the processing speed is much worse due to the more complex data selection from the database. However, the number of processed points is much smaller, allowing for better response times. Although it was possible to run queries covering even 16 days, the response time was 108 seconds, which is totally unacceptable from a user's point of view.

B. Reduction of Data Complexity

One of the conclusions from the previous section is the need to reduce the number of time points and time series in the database. This can be achieved in two ways:
- by aggregating points in time with downsampling, which can be called horizontal compaction;
- by aggregating time series through grouping of the tags of one metric, which can be called vertical compaction.
Both solutions lead to a reduction of the available information but may be acceptable if aligned with user queries. In the most typical case, a user wants brief information about the system and would like to drill down if needed. Therefore both detailed and coarse information

¹ In this paper we use an abstract query language with obvious semantics to express platform-independent queries.
should be kept in the system. For short time periods a user is interested in detailed information; for longer ones the details would not be visible anyway, so the coarse plots suffice.

TABLE II. QUERYING AGGREGATING ALL TIME SERIES. Columns: queried days, proc. time, data points, periods, series, pts/sec.

TABLE III. QUERYING FILTERED TIME SERIES. Columns: queried days, proc. time, data points, periods, series, pts/sec.

1) Downsampling Collectors: There are two possible ways to calculate the downsampled time series. The first and easiest is to perform downsampling during the data collection process. However, this would require storing aggregates for a long time, according to the downsampling period (even one week or more). Keeping a collector's state for a long time may be dangerous in case of a server malfunction. In an industrial prototype we prefer solutions which are stateless, run frequently and return results as quickly as possible. Therefore we propose another way: implementing additional collectors which process data already stored in the database and write it back into other metrics after aggregation. The basic architecture of this solution is shown in Fig. 2, where the new collector is marked in gray. Obviously, there should be one downsampling collector running for each downsampling time period. It does not work continuously but is started at a given time interval. For flexible querying it produces several metrics containing four different aggregations, where sensible: minimum, maximum, sum and average of the values for the sampled time period. Additionally, to enable further average calculations, it also counts the number of processed points and puts the result into another metric called count.
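A minimal sketch of such a downsampling collector is given below: it buckets already-stored points into a coarser period and emits min/max/sum/avg plus a count for further averaging. The function name and the bucket-by-period scheme are our own illustrative assumptions, not OpenTSDB code.

```python
from collections import defaultdict

def downsample(points, period_s=3600):
    """points: iterable of (timestamp, value) pairs.
    Returns {bucket_start: {"min","max","sum","avg","count"}}."""
    buckets = defaultdict(list)
    for ts, v in points:
        # align each point to the start of its downsampling period
        buckets[ts - ts % period_s].append(v)
    out = {}
    for start, vals in buckets.items():
        out[start] = {
            "min": min(vals), "max": max(vals), "sum": sum(vals),
            "avg": sum(vals) / len(vals),
            "count": len(vals),          # kept so averages can be re-derived later
        }
    return out

# twelve 5-minute samples downsampled into one 1-hour point
pts = [(i * 300, float(i)) for i in range(12)]
agg = downsample(pts)
print(agg[0]["count"], agg[0]["avg"])
```

Being stateless, such a collector can simply be re-run over a time window after a malfunction, which is the property argued for above.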
For some metrics (like Mbps) summation downsampling does not make any sense and should be omitted.

2) Tags Grouping During Data Collection: During the prototype evaluation phase we observed that although a client could run queries containing any combination of tag values, they are usually interested in only a subset of two or three tags. The queries concerned: geographical location; traffic per AS, http response code and origin server; and download speed. All time series need to be further divided per content provider. Therefore, instead of one metric with 9 tags, as described in Section II-B, we propose five metrics with six tags each. In the case of the bandwidth metric (cdn.mbps) these could be:
- cdn.mbps-country with tags: node, group, cdn, provider, origin, country
- cdn.mbps-group with tags: node, group, cdn, provider, origin, group
- cdn.mbps-network with tags: node, group, cdn, provider, origin, network
- cdn.mbps-speed with tags: node, group, cdn, provider, origin, speed-group
- cdn.mbps-httpcode with tags: node, group, cdn, provider, origin, httpcode

Fig. 2. The basic architecture of a downsampling collector.

Obviously, a user cannot display, for example, http codes for a given AS or country with the above metrics and tags. However, these detailed queries can be processed with the original unoptimized metrics, since selecting many tags greatly reduces the number of queried time series; furthermore, the query then touches a sensible number of data points. The optimized metrics should be used for general queries aggregating many time series. This optimization is a kind of pre-aggregation done for certain query types. This vertical aggregation can be run as a post-processor, as in downsampling, or during the data collection process. Due to the architecture presented in our previous publication [11], it is much simpler to build it as a MOLAP cube with a reduced number of dimensions.

IV. RESULTS

Adding the vertical compaction optimization, aggregating time series at data collection time, increased the processing time of one CDN log (covering a 5-minute data transfer period from a single node) by a factor of 3, from about 10 to about 30 seconds. The log rotates every 5 minutes (300 seconds), so the remaining time can be used for processing log files even 10 times larger, which will not be the case in a small or medium system. The horizontal time compaction, downsampling existing data and putting it back into the database as new time series, does not introduce any additional cost in on-line log processing and does not need to be studied further here.
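The vertical compaction described in Section III can be sketched as a pre-aggregation pass over the data points of one interval: points carrying the full tag set collapse into a narrower metric that keeps only a chosen tag subset. The kept-tag list follows the metric definitions above, while the function names and summing aggregation are our simplifying assumptions.

```python
from collections import defaultdict

KEEP = ("node", "group", "cdn", "provider", "origin")

def compact(points, extra_tag):
    """points: list of (tags_dict, value) for one time interval.
    Returns {reduced_tag_tuple: aggregated_value}, i.e. the data points
    of a hypothetical reduced metric such as cdn.mbps-country."""
    out = defaultdict(float)
    for tags, value in points:
        # series with the same kept tags collapse into one reduced series
        key = tuple((t, tags[t]) for t in KEEP + (extra_tag,))
        out[key] += value
    return dict(out)

points = [
    ({"node": "n1", "group": "g1", "cdn": "lab", "provider": "p", "origin": "o",
      "country": "pl", "httpcode": "200"}, 2.5),
    ({"node": "n1", "group": "g1", "cdn": "lab", "provider": "p", "origin": "o",
      "country": "pl", "httpcode": "404"}, 1.0),
]
by_country = compact(points, "country")
print(len(by_country))   # both points collapse into the same country series
```

Dropping a tag this way is exactly what reduces the number of stored series: distinct httpcode values no longer multiply the series count of the country-oriented metric.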
TABLE IV. SPEED-UP FOR QUERIES ON OPTIMIZED TIME SERIES. Columns: days, speed-up total, speed-up filtered.

Fig. 3. Response times for the initial unoptimized system (a) and the final optimized version (b). The charts present two types of queries: global, aggregating all time series, and filtering, performing selection before aggregation. For the unoptimized system, the global query for more than 4 days could not be run due to an out-of-memory error. The queries were run on time series without downsampling.

The reduction of time series and tags describing one metric resulted in an outstanding improvement of database querying performance (see Table IV and Figure 3). This effect was possible first through the reduction of HBase table lookups. When processing queries, OpenTSDB first looks up the IDs of the time series in a meta table; the more time series and tags, the more expensive this initial query can be. The second great speed-up is achieved by reducing the complexity of the time series that need to be grouped and interpolated in order to produce the final time series answering the query. The response times for the same queries, working on the remodelled time series, are around 15 times faster for the longer time periods. This optimization decreased the amount of memory necessary to process the time series during aggregation and therefore allowed for the processing of time ranges longer than 4 days, which is obviously important for real-life monitoring platforms.

V. CONCLUSIONS

We presented the problem of a small and medium scale CDN monitoring system in the case of limited resources which do not allow for a real Big Data storage cluster.
The initial state of the system could not meet the industrial requirements, since short-term queries were processed too slowly and longer-term queries could not be evaluated at all. Adding more resources, by inserting computational nodes or increasing the amount of memory, was not possible due to budget constraints. A significant improvement was achieved by introducing optimizations of the time series stored in the system. All initial properties of the system, including real-time processing and a fine-grained data store allowing for detailed queries, were preserved. As the next step we plan to study reducing the number of time series while still allowing arbitrary queries on the optimized series. This could be achieved by adding more metrics with reduced tag sets.

VI. ACKNOWLEDGMENTS

We appreciate the many valuable comments from the anonymous reviewers of the Federated Conference on Computer Science and Information Systems. We thank Sara Oueslati, Paris, France, for her great support in setting up the logging environment in the ISP CDN networks, and Marc Fiuczynski, New Jersey, USA, for additional technical feedback.

REFERENCES

[1] A. Badam, K. Park, V. S. Pai, and L. L. Peterson, "HashCache: Cache storage for the next billion," in NSDI, USENIX Association.
[2] The Apache Software Foundation, Apache Flume.
[3] The Apache Software Foundation, Apache Hive.
[4] The Apache Software Foundation, Apache Hadoop.
[5] B. Sigoure, OpenTSDB: scalable time series database (TSDB). http://opentsdb.net, StumbleUpon.
[6] The Apache Software Foundation, Apache HBase.
[7] C. Chatfield, The Analysis of Time Series: An Introduction. Florida, US: CRC Press, 6th ed.
[8] OpenTSDB, What's OpenTSDB.
[9] MARKETMAP ANALYTIC PLATFORM, FAME.
[10] D. Shasha, Time series in finance: the array database approach. http://cs.nyu.edu/shasha/papers/jagtalk.html.
[11] K. Kaczmarski and M. Pilarski, "Content delivery network monitoring," in FedCSIS (M. Ganzha, L. A. Maciaszek, and M. Paprzycki, eds.).
[12] L. Peterson, J. Hartman, and M. Pilarski, "A simple approach to CDN interconnection," Internet-Draft, no. draft-peterson-cdni-strawman-00, pp. 1-27.
[13] G. Bertrand, I. Oprescu, F. L. Faucheur, and R. Peterkofsky, "CDNI logging interface," Internet-Draft, no. draft-ietf-cdni-logging-04, pp. 1-41, 2013.
Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database
Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Built up on Cisco s big data common platform architecture (CPA), a
Oracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
Big Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
IMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications
Open System Laboratory of University of Illinois at Urbana Champaign presents: Outline: IMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications A Fine-Grained Adaptive
FAWN - a Fast Array of Wimpy Nodes
University of Warsaw January 12, 2011 Outline Introduction 1 Introduction 2 3 4 5 Key issues Introduction Growing CPU vs. I/O gap Contemporary systems must serve millions of users Electricity consumed
Apache Kylin Introduction Dec 8, 2014 @ApacheKylin
Apache Kylin Introduction Dec 8, 2014 @ApacheKylin Luke Han Sr. Product Manager [email protected] @lukehq Yang Li Architect & Tech Leader [email protected] Agenda What s Apache Kylin? Tech Highlights Performance
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
Best Practices for Hadoop Data Analysis with Tableau
Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks
VI Performance Monitoring
VI Performance Monitoring Preetham Gopalaswamy Group Product Manager Ravi Soundararajan Staff Engineer September 15, 2008 Agenda Introduction to performance monitoring in VI Common customer/partner questions
How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory)
WHITE PAPER Oracle NoSQL Database and SanDisk Offer Cost-Effective Extreme Performance for Big Data 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Abstract... 3 What Is Big Data?...
How To Use Big Data For Telco (For A Telco)
ON-LINE VIDEO ANALYTICS EMBRACING BIG DATA David Vanderfeesten, Bell Labs Belgium ANNO 2012 YOUR DATA IS MONEY BIG MONEY! Your click stream, your activity stream, your electricity consumption, your call
LCMON Network Traffic Analysis
LCMON Network Traffic Analysis Adam Black Centre for Advanced Internet Architectures, Technical Report 79A Swinburne University of Technology Melbourne, Australia [email protected] Abstract The Swinburne
HYPER-CONVERGED INFRASTRUCTURE STRATEGIES
1 HYPER-CONVERGED INFRASTRUCTURE STRATEGIES MYTH BUSTING & THE FUTURE OF WEB SCALE IT 2 ROADMAP INFORMATION DISCLAIMER EMC makes no representation and undertakes no obligations with regard to product planning
Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013
Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) and Analytics
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce
Hadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
Real-Time Analysis of CDN in an Academic Institute: A Simulation Study
Journal of Algorithms & Computational Technology Vol. 6 No. 3 483 Real-Time Analysis of CDN in an Academic Institute: A Simulation Study N. Ramachandran * and P. Sivaprakasam + *Indian Institute of Management
Scalability of web applications. CSCI 470: Web Science Keith Vertanen
Scalability of web applications CSCI 470: Web Science Keith Vertanen Scalability questions Overview What's important in order to build scalable web sites? High availability vs. load balancing Approaches
Comprehending the Tradeoffs between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations. Database Solutions Engineering
Comprehending the Tradeoffs between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations A Dell Technical White Paper Database Solutions Engineering By Sudhansu Sekhar and Raghunatha
Information Technology Policy
Information Technology Policy Security Information and Event Management Policy ITP Number Effective Date ITP-SEC021 October 10, 2006 Category Supersedes Recommended Policy Contact Scheduled Review [email protected]
Scalability in Log Management
Whitepaper Scalability in Log Management Research 010-021609-02 ArcSight, Inc. 5 Results Way, Cupertino, CA 95014, USA www.arcsight.com [email protected] Corporate Headquarters: 1-888-415-ARST EMEA Headquarters:
Network Monitoring Comparison
Network Monitoring Comparison vs Network Monitoring is essential for every network administrator. It determines how effective your IT team is at solving problems or even completely eliminating them. Even
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
Apache HBase. Crazy dances on the elephant back
Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage
www.basho.com Technical Overview Simple, Scalable, Object Storage Software
www.basho.com Technical Overview Simple, Scalable, Object Storage Software Table of Contents Table of Contents... 1 Introduction & Overview... 1 Architecture... 2 How it Works... 2 APIs and Interfaces...
Azure Scalability Prescriptive Architecture using the Enzo Multitenant Framework
Azure Scalability Prescriptive Architecture using the Enzo Multitenant Framework Many corporations and Independent Software Vendors considering cloud computing adoption face a similar challenge: how should
Understanding Slow Start
Chapter 1 Load Balancing 57 Understanding Slow Start When you configure a NetScaler to use a metric-based LB method such as Least Connections, Least Response Time, Least Bandwidth, Least Packets, or Custom
Improved metrics collection and correlation for the CERN cloud storage test framework
Improved metrics collection and correlation for the CERN cloud storage test framework September 2013 Author: Carolina Lindqvist Supervisors: Maitane Zotes Seppo Heikkila CERN openlab Summer Student Report
Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.
Beyond Web Application Log Analysis using Apache TM Hadoop A Whitepaper by Orzota, Inc. 1 Web Applications As more and more software moves to a Software as a Service (SaaS) model, the web application has
Real vs. Synthetic Web Performance Measurements, a Comparative Study
Real vs. Synthetic Web Performance Measurements, a Comparative Study By John Bartlett and Peter Sevcik December 2004 Enterprises use today s Internet to find customers, provide them information, engage
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011
Real-time Analytics at Facebook: Data Freeway and Puma Zheng Shao 12/2/2011 Agenda 1 Analytics and Real-time 2 Data Freeway 3 Puma 4 Future Works Analytics and Real-time what and why Facebook Insights
The Future of Data Management
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class
Data Driven Success. Comparing Log Analytics Tools: Flowerfire s Sawmill vs. Google Analytics (GA)
Data Driven Success Comparing Log Analytics Tools: Flowerfire s Sawmill vs. Google Analytics (GA) In business, data is everything. Regardless of the products or services you sell or the systems you support,
Cost Effective Deployment of VoIP Recording
Cost Effective Deployment of VoIP Recording Purpose This white paper discusses and explains recording of Voice over IP (VoIP) telephony traffic. How can a company deploy VoIP recording with ease and at
Stingray Traffic Manager Sizing Guide
STINGRAY TRAFFIC MANAGER SIZING GUIDE 1 Stingray Traffic Manager Sizing Guide Stingray Traffic Manager version 8.0, December 2011. For internal and partner use. Introduction The performance of Stingray
Cisco Application Networking for IBM WebSphere
Cisco Application Networking for IBM WebSphere Faster Downloads and Site Navigation, Less Bandwidth and Server Processing, and Greater Availability for Global Deployments What You Will Learn To address
From Internet Data Centers to Data Centers in the Cloud
From Internet Data Centers to Data Centers in the Cloud This case study is a short extract from a keynote address given to the Doctoral Symposium at Middleware 2009 by Lucy Cherkasova of HP Research Labs
Big Data: Tools and Technologies in Big Data
Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can
SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON
SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON 2 The V of Big Data Velocity means both how fast data is being produced and how fast the data must be processed to meet demand. Gartner The emergence
Improve performance and availability of Banking Portal with HADOOP
Improve performance and availability of Banking Portal with HADOOP Our client is a leading U.S. company providing information management services in Finance Investment, and Banking. This company has a
Database Monitoring Requirements. Salvatore Di Guida (CERN) On behalf of the CMS DB group
Database Monitoring Requirements Salvatore Di Guida (CERN) On behalf of the CMS DB group Outline CMS Database infrastructure and data flow. Data access patterns. Requirements coming from the hardware and
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
Cognos Performance Troubleshooting
Cognos Performance Troubleshooting Presenters James Salmon Marketing Manager [email protected] Andy Ellis Senior BI Consultant [email protected] Want to ask a question?
Google File System. Web and scalability
Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might
Big Data Web Analytics Platform on AWS for Yottaa
Big Data Web Analytics Platform on AWS for Yottaa Background Yottaa is a young, innovative company, providing a website acceleration platform to optimize Web and mobile applications and maximize user experience,
Evaluating Cooperative Web Caching Protocols for Emerging Network Technologies 1
Evaluating Cooperative Web Caching Protocols for Emerging Network Technologies 1 Christoph Lindemann and Oliver P. Waldhorst University of Dortmund Department of Computer Science August-Schmidt-Str. 12
Whitepaper: performance of SqlBulkCopy
We SOLVE COMPLEX PROBLEMS of DATA MODELING and DEVELOP TOOLS and solutions to let business perform best through data analysis Whitepaper: performance of SqlBulkCopy This whitepaper provides an analysis
978-3-901882-50-0 c 2013 IFIP
The predominance of mobile broadband data traffic has driven a significant increase in the complexity and scale of mobile networks. The orthodox performance management approach is for network elements
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
How Comcast Built An Open Source Content Delivery Network National Engineering & Technical Operations
How Comcast Built An Open Source Content Delivery Network National Engineering & Technical Operations Jan van Doorn Distinguished Engineer VSS CDN Engineering 1 What is a CDN? 2 Content Router get customer
Web Load Stress Testing
Web Load Stress Testing Overview A Web load stress test is a diagnostic tool that helps predict how a website will respond to various traffic levels. This test can answer critical questions such as: How
CRITEO INTERNSHIP PROGRAM 2015/2016
CRITEO INTERNSHIP PROGRAM 2015/2016 A. List of topics PLATFORM Topic 1: Build an API and a web interface on top of it to manage the back-end of our third party demand component. Challenge(s): Working with
Graph Database Proof of Concept Report
Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment
BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014
BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 Ralph Kimball Associates 2014 The Data Warehouse Mission Identify all possible enterprise data assets Select those assets
