NetStore: An Efficient Storage Infrastructure for Network Forensics and Monitoring
Paul Giura and Nasir Memon, Polytechnic Institute of NYU, Six MetroTech Center, Brooklyn, NY

Abstract. With the increasing sophistication of attacks, there is a need for network security monitoring systems that store and examine very large amounts of historical network flow data. An efficient storage infrastructure should provide both high insertion rates and fast data access. Traditional row-oriented Relational Database Management Systems (RDBMS) provide satisfactory query performance only for network flow data collected over a period of several hours. In many cases, such as the detection of sophisticated coordinated attacks, it is crucial to rapidly query days, weeks or even months worth of disk-resident historical data. For such monitoring and forensics queries, row-oriented databases become I/O bound due to long disk access times. Furthermore, their data insertion rate is proportional to the number of indexes used, and query processing time increases when unused attributes must be loaded along with the used ones. To overcome these problems we propose a new column-oriented storage infrastructure for network flow records, called NetStore. NetStore is aware of network data semantics and access patterns, and benefits from a simple column-oriented layout without the need to meet general-purpose RDBMS requirements. The prototype implementation of NetStore can potentially achieve more than ten times query speedup and a ninety times smaller storage footprint compared to traditional row-stores, while performing better than existing open-source column-stores for network flow data.

1 Introduction

Traditionally, intrusion detection systems were designed to detect and flag malicious or suspicious activity in real time. However, such systems are increasingly providing the ability to identify the root cause of a security breach.
This may involve checking a suspected host's past network activity, looking up any services run by a host, the protocols used, the connection records to other hosts that may or may not be compromised, etc. This requires flexible and fast access to historical network flow data. In this paper we present the design, implementation details and evaluation of a column-oriented storage infrastructure called NetStore, designed to store and analyze very large amounts of network flow data.

S. Jha, R. Sommer, and C. Kreibich (Eds.): RAID 2010, LNCS 6307, © Springer-Verlag Berlin Heidelberg 2010

Throughout this paper we refer to a flow as a unidirectional data stream between two endpoints, to a flow record as a quantitative description of a flow, and to a flow ID as the key that uniquely identifies a flow. In our research the flow ID is
composed of five attributes: source IP, source port, destination IP, destination port and protocol. We assume that each flow record has an associated start time and end time representing the interval during which the flow was active in the network.

Fig. 1. Flow traffic distribution for one day and one month. In a typical day the busiest time interval is 1PM - 2PM, with 4,381,876 flows, and the slowest is 5AM - 6AM, with 978,888 flows. For a typical month we noticed a slowdown on weekends and peak traffic on weekdays. Days marked with * correspond to a break week.

Challenges. Network flow data can grow very large in the number of records and in storage footprint. Figure 1 shows the network flow distribution of traffic captured from edge routers in a moderately sized campus network. This network, with about 3,000 hosts, commonly reaches up to 1,300 flows/second, an average of 53 million flows daily and roughly 1.7 billion flows in a month. We consider records with an average size of 200 bytes. Besides CISCO NetFlow data [18] there may be other specific information that a sensor can capture from the network, such as IP, transport and application header information. Hence, in this example, the storage requirement is roughly 10 GB of data per day, which adds up to at least 310 GB per month. When working with large amounts of disk-resident data, the main challenge is no longer to ensure the necessary storage space, but to minimize the time it takes to process and access the data. An efficient storage and querying infrastructure for network records has to cope with two main technical challenges: keep the insertion rate high, and provide fast access to the desired flow records. With a traditional row-oriented Relational Database Management System (RDBMS), the relevant flow attributes are inserted as a row into a table as they are captured from the network, and are indexed using various techniques [6].
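The storage sizing above is simple arithmetic; a quick script, using the figures quoted in the text, reproduces the estimates:

```python
# Rough storage sizing for the campus network described above, using the
# figures quoted in the text: ~53 million flows/day at ~200 bytes/record.
FLOWS_PER_DAY = 53_000_000
BYTES_PER_RECORD = 200
DAYS_PER_MONTH = 31

daily_gb = FLOWS_PER_DAY * BYTES_PER_RECORD / 10**9
monthly_gb = daily_gb * DAYS_PER_MONTH
print(f"{daily_gb:.1f} GB/day, about {monthly_gb:.0f} GB/month")
```

This matches the paper's "roughly 10 GB per day" and "at least 310 GB per month" figures.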
On the one hand, such a system has to establish a trade-off between the desired insertion rate and the storage and processing overhead incurred by auxiliary indexing data structures: enabling indexing for more attributes improves query performance, but also increases storage requirements and decreases insertion rates. On the other hand, at query time all the columns of the table have to be loaded into memory even if only a subset of the attributes is relevant for the query, and loading unused columns adds a significant I/O penalty to the overall query processing time.
When querying disk-resident data, an important problem to overcome is the I/O bottleneck caused by large disk-to-memory data transfers. One potential solution is to load only data that is relevant to the query. For example, to answer the query "What is the list of all IPs that contacted IP X between dates d_1 and d_2?", the system should load only the source and destination IPs, as well as the timestamps, of the flows that fall between dates d_1 and d_2. The I/O time can also be decreased if the accessed data is compressed, since less data traverses the disk-memory boundary. Further, the overall query response time can be improved if data is processed in compressed format, saving decompression time. Finally, since the system has to insert records at line speed, all the preprocessing algorithms used should add negligible overhead while writing to disk. The above requirements can be met quite well by utilizing a column-oriented database, as described below.

Column Store. The basic idea of column orientation is to store the data by columns rather than by rows, where each column holds data for a single attribute of the flow and is stored sequentially on disk. Such a strategy makes the system I/O efficient for read queries, since only the attributes required by a query need to be read from disk. The performance benefits of column partitioning were previously analyzed in [9, 2], and some of the ideas were confirmed by results in the academic database research community [16, 1, 21] as well as in industry [19, 11, 10, 3]. However, most commercial and open-source column stores were conceived to follow general-purpose RDBMS requirements; they do not fully use the semantics of the data carried and do not take advantage of the specific types and data access patterns of network forensic and monitoring queries.
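A minimal sketch (ours, not NetStore's actual layout) of why the column layout helps such a query: the table is stored as parallel per-attribute arrays, and only the columns the query names are touched.

```python
# Minimal sketch (not NetStore's actual layout): a table stored column-wise
# as parallel arrays, so a query touches only the columns it names.
table = {
    "start_time": [100, 105, 230, 240],
    "sourceip":   ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
    "destip":     ["1.2.3.4", "1.2.3.4", "5.6.7.8", "1.2.3.4"],
    "srcport":    [5353, 443, 80, 22],   # never read by the query below
}

def ips_that_contacted(dest, d1, d2):
    # Only three columns are scanned; srcport (and any other attribute)
    # would stay on disk in a real column store.
    cols = zip(table["start_time"], table["sourceip"], table["destip"])
    return sorted({src for t, src, dst in cols if dst == dest and d1 <= t <= d2})

print(ips_that_contacted("1.2.3.4", 100, 250))
```

In a row store, answering the same query would drag every attribute of every row across the disk-memory boundary.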
In this paper we present the design, implementation details and evaluation of NetStore, a column-oriented storage infrastructure for network records that, unlike the other systems, is intended to provide good performance for network flow data.

Contribution. The key contributions of this paper include the following:

- A simple and efficient column-oriented design for NetStore, a network flow historical storage system that enables quick access to large amounts of data for monitoring and forensic analysis.
- Efficient compression methods and selection strategies that facilitate the best compression for network flow data and permit accessing and querying data in compressed format.
- Implementation and deployment of NetStore using commodity hardware and open-source software, as well as analysis and comparison with other open-source storage systems currently used in practice.

The rest of this paper is organized as follows: we present related work in Section 2, and our system architecture and the details of each component in Section 3. Experimental results and evaluation are presented in Section 4, and we conclude in Section 5.
2 Related Work

The problem of discovering network security incidents has received significant attention over the past years. Most of the work has focused on near-real-time security event detection, by improving existing security mechanisms that monitor traffic at a network perimeter and block known attacks, detect suspicious network behavior such as network scans, or detect malicious binary transfers [12, 14]. Other systems, such as Tribeca [17] and Gigascope [4], use stream databases and process network data as it arrives, but do not store the data for retroactive analysis. There has been some work on storing network flow records in a traditional RDBMS such as PostgreSQL [6]. In this approach, when a NIDS triggers an alarm, the database system builds indexes and materialized views for the attributes that are the subject of the alarm and could potentially be used by forensics queries in the investigation of the alarm. The system works reasonably well for small networks and is able to support forensic analysis for events that happened over the last few hours. However, queries for traffic spanning more than a few hours become I/O bound, and the auxiliary data used to speed up the queries slows down the record insertion process. Therefore, such a solution is not feasible for medium to large networks, and, given the accelerated growth of Internet traffic, may soon not be feasible even for small networks. Additionally, a time window of several hours is not a realistic assumption when trying to detect the behavior of a complex botnet engaged in stealthy malicious activity over prolonged periods of time. In the database community, many researchers have proposed organizing physical database storage by columns in order to cope with the poor read query performance of traditional row-based RDBMS [16, 21, 11, 15, 3]. As shown in [16, 2, 9, 8], a column store provides many times better performance than a row store for read-intensive workloads.
In [21] the focus is on optimizing the cache-RAM access time by decompressing data in the cache rather than in RAM. This system assumes the working columns are RAM resident, and shows a performance penalty if data has to be read from disk and processed in the same run. The solution in [16] relies on processing parallelism by partitioning data into sets of columns, called projections, indexed and sorted together, independently of other projections. This layout has the benefit of rapid loading of attributes that belong to the same projection and are referred to by the same query, without the use of an auxiliary data structure for tuple reconstruction. However, when attributes from different projections are accessed, the tuple reconstruction process adds significant overhead to the data access pattern. The system presented in [15] emphasizes the use of an auxiliary metadata layer on top of the column partitioning, which is shown to be an efficient alternative to the indexing approach. However, the metadata overhead is sizable, and the design does not take into account the correlation between various attributes. Finally, in [9] the authors present several factors that should be considered when deciding between a column store and a row store for a read-intensive workload. The relatively large number of network flow attributes and the workloads
with a predominant set of queries with large selectivity and few predicates favor the use of a column-store system for historical network flow record storage. NetStore is a column-oriented storage infrastructure that shares some features with these systems, and is designed to provide the best performance for large amounts of disk-resident network flow records. It avoids tuple reconstruction overhead by keeping, at all times, the same order of elements in all columns. It provides fast data insertion and quick querying by dynamically choosing the most suitable compression method available, and by using a simple and efficient design with a negligible metadata layer overhead.

3 Architecture

In this section we describe the architecture and the key components of NetStore. We first present the characteristics of network data and the query types that guide our design. We then describe the technical design details: how the data is partitioned into columns, how columns are partitioned into segments, the compression methods used and how a compression method is selected for each segment. We finally present the metadata associated with each segment (the index nodes) and the internal IPs inverted index structure, as well as the basic set of operators.

3.1 Network Flow Data

Network flow records, and the queries made on them, show some special characteristics compared to other time-sequential data, and we tried to apply this knowledge as early as possible in the design of the system. First, flow attributes tend to exhibit temporal clustering, that is, the range of values is small within short time intervals. Second, the attributes of flows with the same source IP and destination IP tend to have the same values (e.g. port numbers, protocols, packet sizes, etc.). Third, columns of some attributes can be efficiently encoded when partitioned into time-based segments that are encoded independently.
Finally, most attributes that are of interest for monitoring and forensics can be encoded using basic integer data types. Record insertion consists of bulk loads of time-sequential data that is not updated after writing. Having the attributes stored in the same order across the columns makes the join operation trivial when attributes from more than one column are used together. Network data analysis does not require fast random access on all the attributes. Most monitoring queries need fast sequential access to a large number of records and the ability to aggregate and summarize the data over a time window. Forensic queries access specific, predictable attributes, but collected over longer periods of time. To observe their specific characteristics we first compiled a comprehensive list of forensic and monitoring queries used in practice in various scenarios [5]. Based on the data access pattern, we identified five types among the initial list: Spot queries (S) that target a single key (usually an IP address or port number)
and return a list of the values associated with that key. Range queries (R) that return a list of results for multiple keys (usually attributes corresponding to the IPs of a subnet). Aggregation queries (A) that aggregate the data for the entire network and return the result of the aggregation (e.g. traffic sent out by the network). Spot Aggregation queries (SA) that aggregate the values found for one key into a single value. Range Aggregation queries (RA) that aggregate data for multiple keys into a single value. Examples of these query types expressed in plain words:

(S) What applications are observed on host X between dates d_1 and d_2?
(R) What is the list of destination IPs that have source IPs in a subnet between dates d_1 and d_2?
(A) What is the total number of connections for the entire network between dates d_1 and d_2?
(SA) What is the number of bytes that host X sent between dates d_1 and d_2?
(RA) What is the number of hosts that each of the hosts in a subnet contacted between dates d_1 and d_2?

3.2 Column Oriented Storage

Columns. In NetStore, flow records with n attributes are stored in a logical table with n columns and an increasing number of rows (tuples), one for each flow record. The values of each attribute are stored in one column and have the same data type. By default, the values of a column are not sorted. Having the data in a column sorted might yield better compression and faster retrieval, but changing the initial order of the elements requires an auxiliary data structure for tuple reconstruction at query time. We investigated several techniques to ease tuple reconstruction, and all of them added more overhead at query time than the benefit of better compression and faster data access was worth. Therefore, we decided to maintain the same order of elements across columns, to avoid any tuple reconstruction penalty when querying.
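Because all columns keep the same element order, reconstructing a tuple, or joining attributes of the same flow, reduces to indexing each column at the same position. A small illustration with invented values:

```python
# Sketch: with identical element order in every column, "joining" the
# attributes of one flow record is just indexing each column array at the
# same position. No row IDs or auxiliary join structures are needed.
columns = {
    "sourceip": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "destport": [53, 443, 80],
    "protocol": [17, 6, 6],
}

def tuple_at(pos, wanted):
    # Reconstruct the requested attributes of the record at position pos.
    return {att: columns[att][pos] for att in wanted}

print(tuple_at(1, ["sourceip", "destport"]))
```

Sorting any single column independently would break this positional correspondence, which is why NetStore sorts all columns together under one ordering.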
However, since we can afford one column to be sorted without needing any reconstruction auxiliary data, we chose to fully sort only one column and partially sort the rest of the columns. We call the fully sorted column the anchor column. Note that after sorting, given our storage architecture, each segment can still be processed independently. The main purpose of the anchor column selection algorithm is to pick the ordering that facilitates the best compression and fast data access. Network flow data exhibits strong correlation between several attributes, and we exploit this characteristic by keeping strongly correlated columns in consecutive sorting order as much as possible, for better compression results. Additionally, based on the data access pattern of previous queries, columns are arranged by taking into account the probability of each column being accessed by future queries. The columns with higher probabilities are placed at the beginning of the sorting order. For this, we maintain the counting probabilities associated with each column, given by P(c_i) = a_i / t, where c_i is the i-th column, a_i the number of queries that accessed c_i, and t the total number of queries.
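As an illustration of this ordering heuristic (the access counts below are invented): compute P(c_i) = a_i / t for each column, sort in descending order, and take the top column as the anchor.

```python
# Column-ordering heuristic sketch: P(c_i) = a_i / t, where a_i is how many
# of the t past queries touched column c_i. Counts are illustrative only.
access_counts = {"sourceip": 70, "destip": 55, "destport": 30, "protocol": 5}
t = 100  # total queries observed

probs = {c: a / t for c, a in access_counts.items()}
order = sorted(probs, key=probs.get, reverse=True)
anchor = order[0]   # most frequently accessed column is fully sorted first
print(anchor, order)
```

In the paper's workload the source IP column wins this contest, so it becomes the anchor column (see the Column Index discussion below).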
Segments. Each column is further partitioned into fixed-size sets of values called segments. Segment partitioning enables physical storage and processing at a smaller granularity than simple column-based partitioning. These design decisions provide more flexibility for compression strategies and data access. At query time, only the needed segments are read from disk and processed, based on the information collected from the segment metadata structures called index nodes. Each segment has an associated unique identifier called the segment ID. Within each column, segment IDs are auto-incremented numbers, starting at the installation of the system. Segment sizes depend on the hardware configuration and can be set so as to make the best use of available main memory. For better control over the data structures used, segments have the same number of values across all columns. In this way there is no need to store a record ID for each value of a segment; this is one major difference compared to some existing column stores [11]. As we show in Section 4, the performance of the system is related to the segment size used: the larger the segment size, the better the compression performance and query processing times. However, record insertion speed decreases as segment size grows, so there is a trade-off between the query performance desired and the insertion speed needed. Most of the columns store segments in compressed format; we present the compression algorithms used in Section 3.3. Column segmentation is an important difference compared to traditional row-oriented systems, which process data a tuple at a time, whereas NetStore processes data a segment at a time, which translates to many tuples at a time. Figure 3 shows the processing steps for the three processing phases: buffering, segmenting and query processing.

Fig. 2.
NetStore main components: Processing Engine and Column-Store.

Fig. 3. NetStore processing phases: buffering, segmenting and query processing.

Column Index. For each column, we store the metadata associated with each of its segments in an index node corresponding to the segment. The set of all index nodes for the segments of a column represents the column index. The information in each index node includes statistics about the data and different features that are used in the decision about the compression method to use and optimal data
access, as well as the time interval associated with the segment, in the format [min_start_time, max_end_time]. Figure 4 presents an intuitive representation of the columns, segments and index of each column. Each column index is implemented using a time interval tree. Every query is relative to a time window T. At query time, the index of every column accessed is looked up, and only the segments whose time intervals overlap window T are considered for processing. In the next step, the statistics on segment values are checked to decide whether the segment should be loaded into memory and decompressed. This two-phase index processing helps filter out unused data early in query processing, similar to what is done in [15]. Note that the index nodes do not hold data values, but statistics about the segments, such as the minimum and maximum values, the time interval of the segment, the compression method used, the number of distinct values, etc. Therefore, index usage adds negligible storage and processing overhead. From the list of initial queries we observed that the column for the source IP attribute is the most frequently accessed. Therefore, we chose this column as our fully sorted anchor column, and used it as a clustered index for each source IP segment. However, for workloads where the predominant query type is spot queries targeting a specific column other than the anchor column, the use of indexes for values inside the column segments is beneficial, at the cost of increased storage and a slowdown in insertion rate. This can be acceptable for slow networks, where the insertion rate requirements are not too high. When the insertion rate is high, it is best not to use any such index and to rely instead on the metadata in the index nodes.

Internal IPs Index. Besides the column indexes, NetStore maintains another indexing data structure for the network's internal IP addresses, called the Internal IPs index.
Essentially, the IPs index is an inverted index for the internal IPs. That is, for each internal IP address the index stores the list of absolute positions where that IP address occurs in the sourceip or destip column, as if the column were not partitioned into segments. Figure 5 shows an intuitive representation of the IPs index. For each internal IP address, the positions list is an array of increasing integer values that are compressed and stored on disk on a daily basis. Because IP addresses tend to occur at consecutive positions in a column, we chose to compress the positions list by applying run-length encoding to the differences between adjacent values.

3.3 Compression

Each of the segments in NetStore is compressed independently. We observed that segments within a column do not have the same distribution, due to the temporal variation of network activity during working hours, nights, weekends, breaks, etc. Hence segments of the same column were often best compressed using different methods. We investigated methods that allow data processing in compressed format and do not require decompressing all of a segment's values when only one value is requested. We also looked at methods
that provide fast decompression and a reasonable compression ratio and speed.

Fig. 4. Schematic representation of columns, segments, index nodes and column indexes.

Fig. 5. Intuitive representation of the IPs inverted index.

The decision on which compression algorithm to use is made automatically for each segment, based on the data features of the segment, such as the data type, the number of distinct values, the range of the values and the number of switches between adjacent values. We tested a wide range of compression methods, including some we designed for this purpose and some currently used by similar systems [1, 16, 21, 11], with variations where needed. Below we list the techniques that emerged as effective based on our experimentation:

Run-Length Encoding (RLE): used for segments that have few distinct repetitive values. If value v appears consecutively r times, with r > 1, we compress it as the pair (v, r). It provides fast compression as well as the ability to process data in compressed format.

Variable Byte Encoding: a byte-oriented encoding method used for positive integers. It uses a variable number of bytes to encode each integer value as follows: if value < 128, use one byte (highest bit set to 0); for value < 16,384, use 2 bytes (first byte has the highest bit set to 1 and the second to 0), and so on. This method can be used in conjunction with RLE, for both values and runs. It provides a reasonable compression ratio and good decompression speed, allowing the decompression of only the requested value without the need to decompress the whole segment.

Dictionary Encoding: used for columns with few distinct values, sometimes applied before RLE (e.g. to encode the protocol attribute).

Frame Of Reference: considers the interval bounded by the minimum and maximum values as the frame of reference for the values to be compressed [7]. We use it to compress non-empty timestamp attributes within a segment (e.g. start time, end time, etc.)
that are integer values representing the number of seconds since the epoch. Typically the difference between the minimum and maximum timestamp values in a segment is less than a few hours, so the differences can be encoded using 2-byte short values instead of 4-byte integers. This method allows processing data in compressed format, by decompressing each timestamp value individually without the need to decompress the whole segment.
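A sketch of the variable-byte scheme described above, under our reading of the byte layout (a set high bit marks "more bytes follow", so values below 128 fit in one byte and values below 16,384 in two):

```python
def vbyte_encode(value: int) -> bytes:
    # 7 data bits per byte; every byte except the last sets its high bit,
    # so values < 128 take one byte and values < 16,384 take two.
    assert value >= 0
    out = [value & 0x7F]                     # last byte: high bit 0
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)    # continuation bytes: high bit 1
        value >>= 7
    return bytes(reversed(out))

def vbyte_decode(buf: bytes, pos: int = 0):
    # Decode a single value starting at offset pos; returns (value, next_pos).
    # Nothing else in the segment needs to be decompressed.
    value = 0
    while True:
        b = buf[pos]
        pos += 1
        value = (value << 7) | (b & 0x7F)
        if not b & 0x80:
            return value, pos
```

Because each encoded value is self-delimiting, a single value can be decoded from the middle of a segment without touching its neighbors; RLE pairs (v, r) can likewise be stored as two such varints back to back.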
Generic Compression: we use the DEFLATE algorithm from the zlib library, a variation of LZ77 [20]. This method provides compression at the binary level and does not allow values to be accessed individually unless the whole segment is decompressed. It is chosen if it enables faster data insertion and access than the value-based methods presented earlier.

No Compression: listed as a compression method since it represents the base case for our compression selection algorithm.

Method Selection. The selection of a compression method is based on statistics collected in one pass over the data of each segment. As mentioned earlier, the two major requirements of our system are to keep record insertion rates high and to provide fast data access. Data compression does not always provide better insertion and query performance than No Compression, so we developed a model to decide when compression is suitable and, if so, which method to choose. Essentially, we compute a score for each candidate compression method and select the one with the best score. More formally, assume we have k + 1 compression methods m_0, m_1, ..., m_k, with m_0 being the No Compression method. We then compute the insertion time as the time to compress and write to disk, and the access time as the time to read from disk and decompress, as functions of each compression method. For the value-based compression methods, we estimate the compression, write, read and decompression times based on the statistics collected for each segment. For the generic compression method we estimate these parameters from the average results obtained when processing sample segments. For each segment we evaluate:

insertion(m_i) = c(m_i) + w(m_i), i = 1, ..., k
access(m_i) = r(m_i) + d(m_i), i = 1, ..., k

where c, w, r and d denote the compression, write, read and decompression times, respectively. As the base case for each method's evaluation we consider the No Compression method.
We take I_0 to be the time to insert an uncompressed segment, which is just the writing time, since no time is spent on compression; similarly, A_0 is the time to access the segment, which is just the time to read it from disk, since there is no decompression. Formally, following the above equations, we have:

insertion(m_0) = w(m_0) = I_0 and access(m_0) = r(m_0) = A_0

We then keep as candidates only the compression methods m_i for which both:

insertion(m_i) < I_0 and access(m_i) < A_0

Among the candidate compression methods we then choose the one that provides the lowest access time. Note that we primarily consider the access time, and not the insertion time, as the main differentiating factor. Disk reads are the most frequent and time-consuming operations, and a disk read is many times slower than a disk write of a file of the same size on commodity hard drives. Additionally, insertion time can be improved by bulk loading or by other means that take into account that the network traffic rate is not steady and varies greatly over time,
whereas the access mechanism should provide the same level of performance at all times. The model presented above does not take into account whether the data can be processed in compressed format; it assumes that decompression is necessary on every access. For a more accurate compression method selection, we should include in the access time equation the probability that a query processes the data in compressed format. Since forensic and monitoring queries are usually predictable, we can assume, without affecting the generality of our system, that we have a total number of t queries, each query q_j having a probability of occurrence p_j, with sum_{j=1}^{t} p_j = 1. We consider the probability of a segment s being processed in compressed format to be the probability of occurrence of the queries that process s in compressed format. Let CF = {q_j | q_j processes s in compressed format} be the set of all such queries; we then get:

P(s) = sum_{q_j in CF} p_j

Now, a more accurate access time equation can be written taking into account the possibility of not decompressing the segment on each access:

access(m_i) = r(m_i) + d(m_i) * (1 - P(s)), i = 1, ..., k    (1)

Note that the compression selection model can accommodate any compression method, not only the ones mentioned in this paper, and remains valid when the probability of processing the data in compressed format is 0.

3.4 Query Processing

Figure 3 illustrates NetStore's data flow, from network flow record insertion to query result output. Data is written only once, in bulk, and read many times for processing. NetStore does not support transaction processing queries such as record updates or deletes; it is suitable for analytical queries in general, and network forensics and monitoring queries in particular.

Data Insertion. Network data is processed in several phases before being delivered to permanent storage.
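Stepping back, the compression selection model of Section 3.3 fits in a few lines of code. A minimal sketch, with invented per-method timing estimates (in milliseconds) standing in for the statistics-derived values:

```python
# Compression method selection sketch. For each method m we estimate
#   insertion(m) = c + w            (compress + write)
#   access(m)    = r + d * (1 - p)  (read + decompress, discounted by the
#                                    probability p of processing compressed)
# keep only candidates beating the No Compression baseline (I0, A0) on both,
# and pick the candidate with the lowest access time.
def choose_method(methods, i0, a0, p=0.0):
    best_name, best_access = "no_compression", a0
    for name, (c, w, r, d) in methods.items():
        insertion = c + w
        access = r + d * (1.0 - p)
        if insertion < i0 and access < best_access:
            best_name, best_access = name, access
    return best_name

# Invented estimates: (compress, write, read, decompress) in milliseconds.
methods = {
    "rle":   (2.0, 10.0, 11.0, 1.0),
    "vbyte": (4.0, 14.0, 15.0, 3.0),
    "zlib":  (9.0,  8.0,  9.0, 6.0),
}
print(choose_method(methods, i0=30.0, a0=30.0))
```

With p = 0 (always decompress), RLE's cheap decompression wins here; with p = 1 (segments always processed in compressed format, as in equation (1)), zlib's smaller on-disk footprint and cheaper read would win instead.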
First, raw flow data is collected from the network sensors and then preprocessed. Preprocessing includes the buffering and segmenting phases. Each flow is identified by a flow ID represented by the 5-tuple [sourceip, sourceport, destip, destport, protocol]. In the buffering phase, raw network flow information is collected until the buffer is filled. The flow records in the buffer are then aggregated and sorted. As mentioned in Section 3.3, the purpose of sorting is twofold: better compression and faster data access. All columns are sorted following the order determined by the access probabilities and the correlation between columns, using the first sorted column as the anchor.
In the segmenting phase, all the columns are partitioned into segments: once the number of flow records reaches the buffer capacity, the column data in the buffer is considered a full segment and is processed. Each segment is then compressed using the compression method appropriate for the data it carries. The information about the compression method used, together with statistics about the data, is collected and stored in the index node associated with the segment. Note that once the segments are created, the statistics collection and compression of each segment are done independently of the rest of the segments in the same column or in other columns. By doing so, the system takes advantage of the increasing number of cores in a machine and provides good record insertion rates in multi-threaded environments. After preprocessing, all the data is sent to permanent storage. As monitoring queries tend to access the most recent data, some data is also kept in memory for a predefined length of time. NetStore uses a small active window of size W, and all requests from queries accessing data in the time interval [NOW - W, NOW] are served from memory, where NOW represents the current time of the query.

Query Execution. For flexibility, NetStore supports a limited SQL syntax and implements a basic set of segment operators related to the query types presented in Section 3.1. Each SQL query statement is translated into a statement in terms of the basic set of segment operators. Below we briefly present each general operator:

filter_segs(d_1, d_2): Returns the set of segment IDs of the segments that overlap the time interval [d_1, d_2]. This operator is used by all queries.
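A sketch of filter_segs using a plain list of index nodes (the paper implements each column index as a time interval tree; a linear scan is enough to show the overlap test):

```python
# filter_segs sketch: index nodes store per-segment [min_start, max_end]
# time intervals; the operator returns the IDs of segments whose interval
# overlaps the query window [d1, d2]. Intervals here are invented examples.
index_nodes = {            # segment_id -> (min_start_time, max_end_time)
    0: (0, 99),
    1: (95, 210),
    2: (205, 300),
    3: (290, 400),
}

def filter_segs(d1, d2):
    # Two intervals [lo, hi] and [d1, d2] overlap iff lo <= d2 and d1 <= hi.
    return [sid for sid, (lo, hi) in sorted(index_nodes.items())
            if lo <= d2 and d1 <= hi]

print(filter_segs(100, 220))   # only segments 1 and 2 overlap the window
```

Only the surviving segment IDs proceed to the second phase, where per-segment statistics (min/max values, distinct counts) may discard them before any data is read from disk.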
filter_atts(segIDs, pred1(att1), ..., predk(attk)): Returns the list of pairs (segID, pos_list), where pos_list represents the intersection of the attribute position lists in the corresponding segment with ID segID for which attribute att_i satisfies predicate pred_i, with i = 1, ..., k.

aggregate(segIDs, pred1(att1), ..., predk(attk)): Returns the result of aggregating the values of attribute att_k by att_{k-1} by ... by att_1 that satisfy their corresponding predicates pred_k, ..., pred_1, in the segments with IDs in segIDs. The aggregation can be a summation, count, min or max.

The queries considered in Section 3.1 can all be expressed in terms of the above operators. For example, the query What is the number of unique hosts that each of the hosts in the network contacted in the interval [d1, d2]? can be expressed as: aggregate(filter_segs(d1, d2), sourceIP = /16, destIP). After the operator filter_segs is applied, only the sourceIP and destIP segments that overlap with the time interval [d1, d2] are considered for processing, and their corresponding index nodes are read from disk. Since this is a range aggregation query, all the considered segments will be loaded and processed. Consider instead the query What is the number of unique hosts that host X contacted in the interval [d1, d2]?, expressed as: aggregate(filter_segs(d1, d2), sourceIP = X, destIP). For this query the number of relevant segments can be reduced even further by discarding the ones that do
not overlap with the time interval [d1, d2], as well as the ones that do not hold the value X for sourceIP, by checking the corresponding index node statistics. If the value X represents the IP address of an internal node, then the internal IPs index will be used to retrieve all the positions where the value X occurs in the sourceIP column. Then a count operation is performed over all the unique destIP addresses corresponding to those positions. Note that by using the internal IPs index, the data of the sourceIP column is not touched. The only information loaded in memory is the positions list of IP X, as well as the segments in column destIP that correspond to those positions.

4 Evaluation

In this section we present an evaluation of NetStore. We designed and implemented NetStore using the Java programming language on the FreeBSD 7.2-RELEASE platform. For all the experiments we used a single machine with 6 GB DDR2 RAM, two quad-core 2.3 GHz CPUs, and a 1 TB 7200 rpm SATA disk in a RAID-Z configuration. We consider this machine representative of what a medium-scale enterprise would use as a storage server for network flow records. For the experiments we used the network flow data captured over a 24-hour period of one weekday at our campus border router. The raw text file data was about 8 GB in size and contained 62,397,593 network flow records. For our experiments we considered only 12 attributes for each network flow record, that is, only the ones that were meaningful for the queries presented in this paper. Table 1 shows the attributes used, as well as the type and size of each attribute. We compared NetStore's performance with two open source RDBMSs: a row-store, PostgreSQL [13], and a column-store, LucidDB [11]. We chose PostgreSQL over other open source systems because we intended to follow the example in [6], which uses it for similar tasks.
Additionally, we intended to make use of the partial index support for internal IPs, which the other systems do not offer, in order to compare the performance of our inverted IPs index. We chose LucidDB as the column-store to compare with since it is, to the best of our knowledge, the only stable open source column-store that yields good performance for disk resident data and provides reasonable insertion speed. We chose only data captured over one day, with size slightly larger than the available memory, because we wanted to maintain reasonable running times for the other systems that we compared NetStore to. These systems become very slow for larger data sets, and the performance gap compared to NetStore increases with the size of the data.

4.1 Parameters

Figure 6 shows the influence that the segment size has on the insertion rate. We observe that the insertion rate drops as the segment size increases. This trend is expected and is caused by the delay in the preprocessing phase, mostly because of the sorting of larger segment arrays. As Figure 7 shows, the segment
size also affects the compression ratio of each segment: the larger the segment size, the larger the compression ratio achieved. But a high compression ratio is not a critical requirement. The size of the segments is more critically related to the available memory, the desired insertion rate for the network, and the number of attributes used for each record. We set the insertion rate goal at 10,000 records/second, and for this goal we set a segment size of 2 million records, given the above hardware specification and record sizes. Table 2 shows the insertion performance of NetStore. The numbers presented are computed based on the average bytes per record and average packets per record, given the insertion rate of 10,000 records/second. When installed on a machine with the above specification, NetStore can keep up with traffic rates of up to 1.5 Gbit/s for the current experimental implementation. For a constant memory size, this rate decreases with an increase in segment size and in the number of attributes for each flow record.

Table 1. NetStore flow attributes

  Column      Type   Bytes
  sourceip    int    4
  destip      int    4
  sourceport  short  2
  destport    short  2
  protocol    byte   1
  starttime   short  2
  endtime     short  2
  tcpsyns     byte   1
  tcpacks     byte   1
  tcpfins     byte   1
  tcprsts     byte   1
  numbytes    int    4

Table 2. NetStore properties and network rates supported, based on the 24-hour flow records data and the 12 attributes

  Property                        Value          Unit
  records insertion rate          10,000         records/second
  number of records               62,397,594     records
  number of bytes transported     1.17           Terabytes
  bytes transported per record    20,            Bytes/record
  bits rate supported             1.54           Gbit/s
  number of packets transported   2,028,392,356  packets
  packets transported per record                 packets/record
  packets rate supported          325,           packets/second

Fig. 6. Insertion rate for different segment sizes
Fig. 7. Compression ratio with and without aggregation
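The supported line rate in Table 2 follows directly from the insertion rate and the average record size. A rough check (assuming decimal terabytes; the per-record averages in the transcription are partly illegible, so this only approximates the reported 1.54 Gbit/s):

```python
# Back-of-envelope check of Table 2's supported bit rate (assumes decimal
# units; only an approximation of the 1.54 Gbit/s reported in the paper).
records = 62_397_594
total_bytes = 1.17e12            # "1.17 Terabytes" transported
insert_rate = 10_000             # records/second (the stated goal)

bytes_per_record = total_bytes / records
gbit_per_s = insert_rate * bytes_per_record * 8 / 1e9
print(f"{bytes_per_record:.0f} B/record, {gbit_per_s:.2f} Gbit/s")
```

This lands at roughly 1.5 Gbit/s, consistent with the paper's claim that NetStore can keep up with traffic rates of up to 1.5 Gbit/s at 10,000 records/second.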
4.2 Queries

Having described the NetStore architecture and its design details, in this section we consider the queries described in [5], taking into account data collected over the 24 hours for the internal network /16. We consider both the queries and the methodology in [5] meaningful for how an investigator would perform security analysis on network flow data. We assume all the flow attributes used are inserted into a table flow, and we use standard SQL to describe all our examples.

Scanning. A scanning attack refers to the activity of sending a large number of TCP SYN packets to a wide range of IP addresses. Based on the answers received, the attacker can determine whether a particular vulnerable service is running on the victim's host. As such, we want to identify any TCP SYN scanning activity initiated by an external host, with no TCP ACK or TCP FIN flags set, targeted against a number of internal IP destinations larger than a preset limit. We use the following range aggregation query (Q1):

SELECT sourceip, destport, count(distinct destip), starttime
FROM flow
WHERE sourceip <> /16 AND destip = /16
  AND protocol = tcp
  AND tcpsyns = 1 AND tcpacks = 0 AND tcpfins = 0
GROUP BY sourceip
HAVING count(distinct destip) > limit;

External IP address was found scanning starting at time t1. We check whether there were any valid responses after time t1 from the internal hosts, where no packet had the TCP RST flag set, using the following query (Q2):

SELECT sourceip, sourceport, destip
FROM flow
WHERE starttime > t1 AND sourceip = /16 AND destip =
  AND protocol = tcp AND tcprsts = 0;

Worm Infected Hosts. The internal host with IP address was discovered to have responded to a scan initiated by a host infected with the Conficker worm, and we want to check whether the internal host is compromised.
Typically, after a host is infected, the worm copies itself into memory and begins propagating to random IP addresses across the network by exploiting the same vulnerability. The worm opens a random port and starts scanning random IPs on port 445. We use the following query to check the internal host (Q3):

SELECT sourceip, destport, count(distinct destip)
FROM flow
WHERE starttime > t1 AND sourceip = AND destport = 445;
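Outside a database, Q1-style scan detection reduces to filtering SYN-only TCP flows and counting distinct internal destinations per external source. A minimal sketch over in-memory flow tuples (the record layout, internal-prefix test, and threshold are illustrative assumptions, not NetStore's API):

```python
# Sketch of Q1's detection logic over in-memory flow records: group
# SYN-only TCP flows by external source and count distinct internal
# destinations (record layout and threshold are illustrative).
from collections import defaultdict

def find_scanners(flows, limit, is_internal):
    """flows: iterable of (sourceip, destip, protocol, syn, ack, fin) tuples.
    Returns the external sources contacting more than `limit` internal hosts."""
    targets = defaultdict(set)
    for src, dst, proto, syn, ack, fin in flows:
        if (proto == "tcp" and syn and not ack and not fin
                and not is_internal(src) and is_internal(dst)):
            targets[src].add(dst)
    return {src for src, dsts in targets.items() if len(dsts) > limit}

internal = lambda ip: ip.startswith("10.")
flows = [
    ("1.2.3.4", "10.0.0.%d" % i, "tcp", 1, 0, 0) for i in range(5)
] + [("10.0.0.9", "10.0.0.1", "tcp", 1, 0, 0)]
print(find_scanners(flows, limit=3, is_internal=internal))  # {'1.2.3.4'}
```

The column-store advantage described earlier is that such a query only touches the handful of columns it references, rather than whole rows.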
SYN Flooding. This is a network-based denial-of-service attack in which the attacker sends an unusually large number of SYN requests, over a threshold t, to a specific target within a small time window W. To detect such an attack, we filter all the incoming traffic and count the number of flows with the TCP SYN bit set and no TCP ACK or TCP FIN for all the internal hosts. We use the following query (Q4):

SELECT destip, count(distinct sourceip), starttime
FROM flow
WHERE starttime > NOW - W AND destip = /16
  AND protocol = tcp
  AND tcpsyns = 1 AND tcpacks = 0 AND tcpfins = 0
GROUP BY destip
HAVING count(sourceip) > t;

Network Statistics. Besides security analysis, network statistics and performance monitoring are another important use for network flow data. To get this information, we use aggregation queries over all collected data in a large time window, both incoming and outgoing. The aggregation operation can be a summation of the number of bytes or packets, the number of unique hosts contacted, or some other meaningful aggregate statistic. For example, we use the following simple aggregation query to find the number of bytes transported in the last 24 hours (Q5):

SELECT sum(numbytes) FROM flow WHERE starttime > NOW - 24h;

General Queries. The sample queries described above are complex and belong to more than one basic type described in Section 3.1. However, each of them can be separated into several basic types such that the result of one query becomes the input of the next one. We built a more general set of queries starting from the ones described above by varying the parameters so as to achieve different levels of data selectivity, from low to high. Then, for each type, we report the average performance over all the queries of that type. Figure 8 shows the average running times of the selected queries for increasing segment sizes. We observe that for S type queries that do not use the IPs index (e.g.
for attributes other than internal sourceIP or destIP), the performance decreases when the segment size increases. This is an expected result, since for larger segments more unused data is loaded as part of the segment where the spotted value resides. When using the IPs index, the performance benefit comes from skipping the irrelevant segments whose positions are not found in the positions list. However, for internal busy servers that have corresponding flow records in all the segments, all the corresponding attribute segments have to be read, but not the IPs segments. This is an advantage, since an IP segment is in general several times larger than the other attributes' segments. Hence, except for spot queries that use non-indexed attributes, queries tend to be faster for larger segment sizes.

4.3 Compression

Our goal in using compression is not to achieve the best compression ratio, nor the best compression or decompression speed, but to obtain the highest records
insertion rate and the best query performance. We evaluated our compression selection model by comparing the performance when using a single method for all the segments in a column with the performance when using the compression selection algorithm for each segment. To select the method for a column, we first compressed all the segments of the column with each of the six methods presented. We then measured the access performance for each column compressed with each method. Finally, we selected as the compression method of a column the method that provided the best access times for the majority of its segments. For the variable segment compression, we activated the method selection mechanism for all columns and then inserted the data, compressing each segment based on the statistics of its own data rather than those of the entire column. In both cases we did not change anything in the statistics collection process, since all the statistics were used in the query process for both approaches. We obtained on average a 10 to 15 percent improvement per query using the segment-based compression method selection model, with no penalty on the insertion rate. We consider the overall performance of the compression method selection model satisfactory, and its true value resides in the framework implementation, which is limited only by the individual methods used, not by the general model design. If the data changes and other compression methods become more efficient for the new data, only the compression algorithm and the operators that work on the compressed data need to be changed, with the overall architecture remaining the same. Some commercial systems [19] apply on top of the value-based compressed columns another layer of general binary compression for increased performance.
We investigated the same possibility and compared four different approaches to compression on top of the implemented column-oriented architecture: no compression, value-based compression only, binary compression only, and value-based plus binary compression on top of it. For the no-compression case, we processed the data using the same indexing structure and column-oriented layout, but with compression disabled for all the segments. For binary compression only, we compressed each segment using the generic binary compression. In the value-based compression case, we compressed all the segments with the dynamic selection mechanism enabled, and for the last approach we applied another layer of generic compression on top of the already value-based compressed segments. The results of our experiment for the four cases are shown in Figure 9. We can see that compression is a determining factor in the performance metrics. Using value-based compression achieves the best average running time for the queries, while the uncompressed segments scenario yields the worst performance. We also see that adding another compression layer helps neither query performance nor the insertion rate, even though it provides a better compression ratio. However, the general compression method can be used for data aging, to compress and archive older data that is not actively used. Figure 7 shows the compression performance for different segment sizes and how flow aggregation affects the storage footprint. As expected, compression performance is better for larger segment sizes in both cases, with and without aggregation. That is the case because of the compression methods used. The larger the
segment, the longer the runs for columns with few distinct values, and the smaller the dictionary size for each segment. The overall compression ratio of the raw network flow data for a segment size of 2 million records is 4.5 with no aggregation and 8.4 with aggregation enabled. Note that the size of the compressed data also includes the size of both indexing structures: the column indexes and the IPs index.

Fig. 8. Average query times for different segment sizes and different query types
Fig. 9. Average query times for the compression strategies implemented

4.4 Comparison with Other Systems

For the comparison we used the same data and performed system-specific tuning of each system's parameters. To maintain the insertion rate above our target of 10,000 records/second, we created three indexes for each of PostgreSQL and LucidDB: one clustered index on starttime and two un-clustered indexes, one on the sourceip and one on the destip attribute. Although we believe we chose good values for the other tuning parameters, we cannot guarantee they are optimal, and we only present the performance we observed. We show the performance using the data and the example queries presented in Section 4.2. Table 3 shows the relative performance of NetStore compared to PostgreSQL for the same data. Since our main goal is to improve disk resident data access, we ran each query once for each system to minimize the use of cached data. The numbers presented show how many times NetStore is better. To maintain a fair overall comparison, we created a PostgreSQL table for each column of NetStore. As mentioned in [2], row-stores with columnar design provide better performance for queries that access a small number of columns, such as the sample queries in Section 4.2.

Table 3. Relative performance of NetStore versus columns-only PostgreSQL and LucidDB, for query running times and total storage needed

                     Q1   Q2   Q3   Q4   Q5   Storage
  Postgres/NetStore
  LucidDB/NetStore

We observe that NetStore clearly outperforms
PostgreSQL for all the query types, providing the best results for the queries accessing more attributes (e.g., Q1 and Q4), even though PostgreSQL uses 90 times more disk space, including all the auxiliary data. The poor PostgreSQL performance can be explained by the absence of more clustered indexes, the lack of compression, and the unnecessary tuple overhead. Table 3 also shows the relative performance compared to LucidDB. We observe that the performance gap is not of the same order of magnitude as that of PostgreSQL, even when more attributes are accessed. However, NetStore clearly performs better while storing about 6 times less data. The performance penalty of LucidDB can be explained by the lack of the column segmentation design and by the early materialization in the processing phase specific to general-purpose column stores. However, we noticed that LucidDB achieves a significant performance improvement for subsequent runs of the same query by efficiently using memory resident data.

5 Conclusion and Future Work

With the growth of network traffic, there is an increasing demand for solutions to better manage and take advantage of the wealth of network flow information recorded for monitoring and forensic investigations. The problem is no longer the availability and storage capacity of the data, but the ability to quickly extract the relevant information about potential malicious activities that can affect network security and resources. In this paper we have presented the design, implementation, and evaluation of a novel working architecture, called NetStore, that is useful in network monitoring tasks and assists in network forensics investigations. The simple column oriented design of NetStore helps reduce query processing time by spending less time on disk I/O and loading only needed data.
The column partitioning facilitates the use of efficient compression methods for network flow attributes that allow data processing in compressed format, thereby boosting query runtime performance. NetStore clearly outperforms existing row-based DBMS systems and provides better results than general-purpose column-oriented systems because of simple design decisions tailored for network flow records. Experiments show that NetStore can provide more than ten times faster query response compared to other storage systems while maintaining a much smaller storage size. In future work we seek to explore the use of NetStore for new types of time-sequential data, such as host log analysis, and the possibility of releasing it as an open source system.

References

1. Abadi, D., Madden, S., Ferreira, M.: Integrating compression and execution in column-oriented database systems. In: SIGMOD 2006: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. ACM, New York (2006)
2. Abadi, D.J., Madden, S.R., Hachem, N.: Column-stores vs. row-stores: how different are they really? In: SIGMOD 2008: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, New York (2008)
3. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. In: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2006 (2006)
4. Cranor, C., Johnson, T., Spataschek, O., Shkapenyuk, V.: Gigascope: a stream database for network applications. In: SIGMOD 2003: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. ACM, New York (2003)
5. Gates, C., Collins, M., Duggan, M., Kompanek, A., Thomas, M.: More netflow tools for performance and security. In: LISA 2004: Proceedings of the 18th USENIX Conference on System Administration. USENIX Association, Berkeley (2004)
6. Geambasu, R., Bragin, T., Jung, J., Balazinska, M.: On-demand view materialization and indexing for network forensic analysis. In: NETB 2007: Proceedings of the 3rd USENIX International Workshop on Networking Meets Databases. USENIX Association, Berkeley (2007)
7. Goldstein, J., Ramakrishnan, R., Shaft, U.: Compressing relations and indexes. In: Proceedings of the IEEE International Conference on Data Engineering (1998)
8. Halverson, A., Beckmann, J.L., Naughton, J.F., DeWitt, D.J.: A comparison of C-store and row-store in a common framework. Technical Report TR1570, University of Wisconsin-Madison (2006)
9. Holloway, A.L., DeWitt, D.J.: Read-optimized databases, in depth. Proc. VLDB Endow. 1(1) (2008)
10. Infobright Inc.: Infobright,
11. LucidEra: LucidDB,
12. Paxson, V.: Bro: a system for detecting network intruders in real-time. Computer Networks (1998)
13. PostgreSQL: PostgreSQL,
14. Roesch, M.: Snort - lightweight intrusion detection for networks.
In: LISA 1999: Proceedings of the 13th USENIX Conference on System Administration. USENIX Association, Berkeley (1999)
15. Ślȩzak, D., Wróblewski, J., Eastwood, V., Synak, P.: Brighthouse: an analytic data warehouse for ad-hoc queries. Proc. VLDB Endow. 1(2) (2008)
16. Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E., O'Neil, P., Rasin, A., Tran, N., Zdonik, S.: C-store: a column-oriented DBMS. In: VLDB 2005: Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment (2005)
17. Sullivan, M., Heybey, A.: Tribeca: a system for managing large databases of network traffic. In: USENIX (1998)
18. Cisco Systems: Cisco IOS NetFlow,
19. Vertica Systems: Vertica,
20. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23 (1977)
21. Zukowski, M., Boncz, P.A., Nes, N., Héman, S.: MonetDB/X100 - a DBMS in the CPU cache. IEEE Data Eng. Bull. 28(2) (2005)
Oracle BI EE Implementation on Netezza Prepared by SureShot Strategies, Inc. The goal of this paper is to give an insight to Netezza architecture and implementation experience to strategize Oracle BI EE
More informationlow-level storage structures e.g. partitions underpinning the warehouse logical table structures
DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: low-level storage structures e.g. partitions underpinning the warehouse logical table structures low-level structures
More informationSimilarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
More informationAn Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
More informationAlienVault Unified Security Management (USM) 4.x-5.x. Deployment Planning Guide
AlienVault Unified Security Management (USM) 4.x-5.x Deployment Planning Guide USM 4.x-5.x Deployment Planning Guide, rev. 1 Copyright AlienVault, Inc. All rights reserved. The AlienVault Logo, AlienVault,
More informationWhitepaper: performance of SqlBulkCopy
We SOLVE COMPLEX PROBLEMS of DATA MODELING and DEVELOP TOOLS and solutions to let business perform best through data analysis Whitepaper: performance of SqlBulkCopy This whitepaper provides an analysis
More informationBeyond Monitoring Root-Cause Analysis
WHITE PAPER With the introduction of NetFlow and similar flow-based technologies, solutions based on flow-based data have become the most popular methods of network monitoring. While effective, flow-based
More informationWe will give some overview of firewalls. Figure 1 explains the position of a firewall. Figure 1: A Firewall
Chapter 10 Firewall Firewalls are devices used to protect a local network from network based security threats while at the same time affording access to the wide area network and the internet. Basically,
More informationNetwork forensics 101 Network monitoring with Netflow, nfsen + nfdump
Network forensics 101 Network monitoring with Netflow, nfsen + nfdump www.enisa.europa.eu Agenda Intro to netflow Metrics Toolbox (Nfsen + Nfdump) Demo www.enisa.europa.eu 2 What is Netflow Netflow = Netflow
More informationMonitoring System Status
CHAPTER 14 This chapter describes how to monitor the health and activities of the system. It covers these topics: About Logged Information, page 14-121 Event Logging, page 14-122 Monitoring Performance,
More informationLarge-Scale TCP Packet Flow Analysis for Common Protocols Using Apache Hadoop
Large-Scale TCP Packet Flow Analysis for Common Protocols Using Apache Hadoop R. David Idol Department of Computer Science University of North Carolina at Chapel Hill david.idol@unc.edu http://www.cs.unc.edu/~mxrider
More informationEmerald. Network Collector Version 4.0. Emerald Management Suite IEA Software, Inc.
Emerald Network Collector Version 4.0 Emerald Management Suite IEA Software, Inc. Table Of Contents Purpose... 3 Overview... 3 Modules... 3 Installation... 3 Configuration... 3 Filter Definitions... 4
More informationIntegrating Apache Spark with an Enterprise Data Warehouse
Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software
More informationUsing Synology SSD Technology to Enhance System Performance Synology Inc.
Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_WP_ 20121112 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD
More informationIntroducing the Microsoft IIS deployment guide
Deployment Guide Deploying Microsoft Internet Information Services with the BIG-IP System Introducing the Microsoft IIS deployment guide F5 s BIG-IP system can increase the existing benefits of deploying
More informationPerformance Characteristics of VMFS and RDM VMware ESX Server 3.0.1
Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System
More informationAlienVault. Unified Security Management (USM) 5.x Policy Management Fundamentals
AlienVault Unified Security Management (USM) 5.x Policy Management Fundamentals USM 5.x Policy Management Fundamentals Copyright 2015 AlienVault, Inc. All rights reserved. The AlienVault Logo, AlienVault,
More informationUsing the HP Vertica Analytics Platform to Manage Massive Volumes of Smart Meter Data
Technical white paper Using the HP Vertica Analytics Platform to Manage Massive Volumes of Smart Meter Data The Internet of Things is expected to connect billions of sensors that continuously gather data
More informationArchitectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
More informationAmadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator
WHITE PAPER Amadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com SAS 9 Preferred Implementation Partner tests a single Fusion
More informationArchitecture Overview
Architecture Overview Design Fundamentals The networks discussed in this paper have some common design fundamentals, including segmentation into modules, which enables network traffic to be isolated and
More informationINCREASE NETWORK VISIBILITY AND REDUCE SECURITY THREATS WITH IMC FLOW ANALYSIS TOOLS
WHITE PAPER INCREASE NETWORK VISIBILITY AND REDUCE SECURITY THREATS WITH IMC FLOW ANALYSIS TOOLS Network administrators and security teams can gain valuable insight into network health in real-time by
More informationEFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES
ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com
More informationApplication of Netflow logs in Analysis and Detection of DDoS Attacks
International Journal of Computer and Internet Security. ISSN 0974-2247 Volume 8, Number 1 (2016), pp. 1-8 International Research Publication House http://www.irphouse.com Application of Netflow logs in
More informationScaling 10Gb/s Clustering at Wire-Speed
Scaling 10Gb/s Clustering at Wire-Speed InfiniBand offers cost-effective wire-speed scaling with deterministic performance Mellanox Technologies Inc. 2900 Stender Way, Santa Clara, CA 95054 Tel: 408-970-3400
More informationD1.2 Network Load Balancing
D1. Network Load Balancing Ronald van der Pol, Freek Dijkstra, Igor Idziejczak, and Mark Meijerink SARA Computing and Networking Services, Science Park 11, 9 XG Amsterdam, The Netherlands June ronald.vanderpol@sara.nl,freek.dijkstra@sara.nl,
More informationNetwork Intrusion Detection Systems. Beyond packet filtering
Network Intrusion Detection Systems Beyond packet filtering Goal of NIDS Detect attacks as they happen: Real-time monitoring of networks Provide information about attacks that have succeeded: Forensic
More informationSecurity Event Management. February 7, 2007 (Revision 5)
Security Event Management February 7, 2007 (Revision 5) Table of Contents TABLE OF CONTENTS... 2 INTRODUCTION... 3 CRITICAL EVENT DETECTION... 3 LOG ANALYSIS, REPORTING AND STORAGE... 7 LOWER TOTAL COST
More informationnfdump and NfSen 18 th Annual FIRST Conference June 25-30, 2006 Baltimore Peter Haag 2006 SWITCH
18 th Annual FIRST Conference June 25-30, 2006 Baltimore Peter Haag 2006 SWITCH Some operational questions, popping up now and then: Do you see this peek on port 445 as well? What caused this peek on your
More informationAdaptive Flow Aggregation - A New Solution for Robust Flow Monitoring under Security Attacks
Adaptive Flow Aggregation - A New Solution for Robust Flow Monitoring under Security Attacks Yan Hu Dept. of Information Engineering Chinese University of Hong Kong Email: yhu@ie.cuhk.edu.hk D. M. Chiu
More informationInternet Firewall CSIS 4222. Packet Filtering. Internet Firewall. Examples. Spring 2011 CSIS 4222. net15 1. Routers can implement packet filtering
Internet Firewall CSIS 4222 A combination of hardware and software that isolates an organization s internal network from the Internet at large Ch 27: Internet Routing Ch 30: Packet filtering & firewalls
More informationScaling Objectivity Database Performance with Panasas Scale-Out NAS Storage
White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage
More informationEMC Unified Storage for Microsoft SQL Server 2008
EMC Unified Storage for Microsoft SQL Server 2008 Enabled by EMC CLARiiON and EMC FAST Cache Reference Copyright 2010 EMC Corporation. All rights reserved. Published October, 2010 EMC believes the information
More informationAgenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.
Agenda Enterprise Performance Factors Overall Enterprise Performance Factors Best Practice for generic Enterprise Best Practice for 3-tiers Enterprise Hardware Load Balancer Basic Unix Tuning Performance
More informationDATA WAREHOUSING II. CS121: Introduction to Relational Database Systems Fall 2015 Lecture 23
DATA WAREHOUSING II CS121: Introduction to Relational Database Systems Fall 2015 Lecture 23 Last Time: Data Warehousing 2 Last time introduced the topic of decision support systems (DSS) and data warehousing
More informationDistributed storage for structured data
Distributed storage for structured data Dennis Kafura CS5204 Operating Systems 1 Overview Goals scalability petabytes of data thousands of machines applicability to Google applications Google Analytics
More informationand reporting Slavko Gajin slavko.gajin@rcub.bg.ac.rs
ICmyNet.Flow: NetFlow based traffic investigation, analysis, and reporting Slavko Gajin slavko.gajin@rcub.bg.ac.rs AMRES Academic Network of Serbia RCUB - Belgrade University Computer Center ETF Faculty
More informationNetwork Monitoring On Large Networks. Yao Chuan Han (TWCERT/CC) james@cert.org.tw
Network Monitoring On Large Networks Yao Chuan Han (TWCERT/CC) james@cert.org.tw 1 Introduction Related Studies Overview SNMP-based Monitoring Tools Packet-Sniffing Monitoring Tools Flow-based Monitoring
More informationExercise 7 Network Forensics
Exercise 7 Network Forensics What Will You Learn? The network forensics exercise is aimed at introducing you to the post-mortem analysis of pcap file dumps and Cisco netflow logs. In particular you will:
More informationNoDB: Efficient Query Execution on Raw Data Files
NoDB: Efficient Query Execution on Raw Data Files Ioannis Alagiannis Renata Borovica Miguel Branco Stratos Idreos Anastasia Ailamaki EPFL, Switzerland {ioannis.alagiannis, renata.borovica, miguel.branco,
More informationNetwork Security Monitoring and Behavior Analysis Pavel Čeleda, Petr Velan, Tomáš Jirsík
Network Security Monitoring and Behavior Analysis Pavel Čeleda, Petr Velan, Tomáš Jirsík {celeda velan jirsik}@ics.muni.cz Part I Introduction P. Čeleda et al. Network Security Monitoring and Behavior
More informationPerformance Verbesserung von SAP BW mit SQL Server Columnstore
Performance Verbesserung von SAP BW mit SQL Server Columnstore Martin Merdes Senior Software Development Engineer Microsoft Deutschland GmbH SAP BW/SQL Server Porting AGENDA 1. Columnstore Overview 2.
More informationAccelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software
WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications
More informationRichard Bejtlich richard@taosecurity.com www.taosecurity.com / taosecurity.blogspot.com BSDCan 14 May 04
Network Security Monitoring with Sguil Richard Bejtlich richard@taosecurity.com www.taosecurity.com / taosecurity.blogspot.com BSDCan 14 May 04 Overview Introduction to NSM The competition (ACID, etc.)
More informationENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #2610771
ENHANCEMENTS TO SQL SERVER COLUMN STORES Anuhya Mallempati #2610771 CONTENTS Abstract Introduction Column store indexes Batch mode processing Other Enhancements Conclusion ABSTRACT SQL server introduced
More informationFinal exam review, Fall 2005 FSU (CIS-5357) Network Security
Final exam review, Fall 2005 FSU (CIS-5357) Network Security Instructor: Breno de Medeiros 1. What is an insertion attack against a NIDS? Answer: An insertion attack against a network intrusion detection
More informationDBMS / Business Intelligence, SQL Server
DBMS / Business Intelligence, SQL Server Orsys, with 30 years of experience, is providing high quality, independant State of the Art seminars and hands-on courses corresponding to the needs of IT professionals.
More informationBinary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
More informationUsing Synology SSD Technology to Enhance System Performance. Based on DSM 5.2
Using Synology SSD Technology to Enhance System Performance Based on DSM 5.2 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD Cache as Solution...
More informationHypertable Architecture Overview
WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for
More informationLimitations of Packet Measurement
Limitations of Packet Measurement Collect and process less information: Only collect packet headers, not payload Ignore single packets (aggregate) Ignore some packets (sampling) Make collection and processing
More informationLCMON Network Traffic Analysis
LCMON Network Traffic Analysis Adam Black Centre for Advanced Internet Architectures, Technical Report 79A Swinburne University of Technology Melbourne, Australia adamblack@swin.edu.au Abstract The Swinburne
More informationACHIEVING STORAGE EFFICIENCY WITH DATA DEDUPLICATION
ACHIEVING STORAGE EFFICIENCY WITH DATA DEDUPLICATION Dell NX4 Dell Inc. Visit dell.com/nx4 for more information and additional resources Copyright 2008 Dell Inc. THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES
More informationIntrusion Detection in AlienVault
Complete. Simple. Affordable Copyright 2014 AlienVault. All rights reserved. AlienVault, AlienVault Unified Security Management, AlienVault USM, AlienVault Open Threat Exchange, AlienVault OTX, Open Threat
More informationPerformance Guideline for syslog-ng Premium Edition 5 LTS
Performance Guideline for syslog-ng Premium Edition 5 LTS May 08, 2015 Abstract Performance analysis of syslog-ng Premium Edition Copyright 1996-2015 BalaBit S.a.r.l. Table of Contents 1. Preface... 3
More informationFirewalls Overview and Best Practices. White Paper
Firewalls Overview and Best Practices White Paper Copyright Decipher Information Systems, 2005. All rights reserved. The information in this publication is furnished for information use only, does not
More information4 Internet QoS Management
4 Internet QoS Management Rolf Stadler School of Electrical Engineering KTH Royal Institute of Technology stadler@ee.kth.se September 2008 Overview Network Management Performance Mgt QoS Mgt Resource Control
More informationIndex Terms Domain name, Firewall, Packet, Phishing, URL.
BDD for Implementation of Packet Filter Firewall and Detecting Phishing Websites Naresh Shende Vidyalankar Institute of Technology Prof. S. K. Shinde Lokmanya Tilak College of Engineering Abstract Packet
More informationChapter 13 File and Database Systems
Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation
More informationChapter 13 File and Database Systems
Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation
More informationFirewalls, Tunnels, and Network Intrusion Detection
Firewalls, Tunnels, and Network Intrusion Detection 1 Part 1: Firewall as a Technique to create a virtual security wall separating your organization from the wild west of the public internet 2 1 Firewalls
More information