FoodBroker - Generating Synthetic Datasets for Graph-Based Business Analytics
André Petermann 1,2, Martin Junghanns 1, Robert Müller 2 and Erhard Rahm 1

1 University of Leipzig {petermann,junghanns,rahm}@informatik.uni-leipzig.de
2 Leipzig University of Applied Sciences [email protected]

Abstract. We present FoodBroker, a new data generator for benchmarking graph-based business intelligence systems and approaches. It covers two realistic business processes and their involved master and transactional data objects. The interactions are correlated in controlled ways to enable non-uniform distributions for data and relationships. For benchmarking data integration, the generated data is stored in two interrelated databases. The dataset can be arbitrarily scaled and allows comprehensive graph- and pattern-based analysis.

Keywords: synthetic data generation, benchmarks, graph data management, business intelligence

1 Introduction

The operations of a company are reflected in its business processes and their different domain objects such as employees, products or purchase orders. Significant goals for business intelligence are to find correlations between such domain objects, to identify how domain objects are involved in business processes and to determine how this affects the success or failure of processes and the company. For example, consider trading processes where unfriendly salespeople will have a negative influence while fast logistics companies can have a positive impact. Business information systems supporting the execution of business processes store those domain objects influencing the process outcome as master data. Furthermore, they record transactional data such as sales orders or invoices referencing master data during process execution.
While the data of business information systems is usually stored in relational databases, all process-related domain objects and their references can be abstracted as a graph, where objects are nodes and references are edges. So far, graph data models have typically been used to model natural graph scenarios such as social networks or knowledge models (e.g., ontologies). However, recent research projects aim at using graph models for data integration and analytics, for example in the enterprise [10][13] or health care [9] domains. Evaluating or benchmarking such analytical graph applications requires datasets that reflect the nature of business information systems. Unfortunately, it is very difficult to obtain real datasets from enterprises for evaluation purposes, so we see a need to generate appropriate datasets synthetically. Such datasets must have specific features to be useful for graph-based analytics, and established data generators for warehouse and graph benchmarks do not sufficiently support these features. In particular, the scenario covered by the dataset should represent a large number of similar use cases in practice and should allow the analysis of complex relationships and patterns to gain significant business insights. More specifically, we pose the following requirements.

1 - Heterogeneous domain objects: Domain objects belong to different classes representing master or transactional data. Example master data classes are employee, customer or product. Example transactional classes include sales order and rating.

2 - Heterogeneous relationships: Domain objects may be related in different semantic ways represented by different relationship types. Relationships have to involve both master and transactional domain objects in any combination.

3 - Support of graph-based analysis: The recorded data should allow a more comprehensive, graph-based business analysis than with traditional data warehouses. For enterprise data, it should thus represent different business processes with complex, correlated interactions between master data objects and transactional objects. It should be possible to identify and analyze different transactional patterns, e.g., the effect of how an employee interacted with specific customers.

4 - Multiple sources: Companies and organizations typically use multiple information systems. To cover data integration tasks, a synthetic dataset needs to imitate multiple data sources.

5 - Scalability: For benchmarking, the datasets need to be scalable up to realistic sizes with thousands of master data objects and millions of transactions.

To our knowledge, there is no data generator yet that meets all of the stated requirements.
The contribution of this paper is FoodBroker, a data generator for related master and transactional data which are meaningful for a specific domain. FoodBroker is based on simulating processes and process-supporting information systems. The current version of FoodBroker provides an authentic data model representing two interrelated business information systems and generates records by simulating many business process executions (cases). FoodBroker considers the correlation between master data instances and their influence on the development and outcome of each case. The resulting data can be integrated using a graph model and used for evaluation and benchmarking of analytical graph systems. The source code of the FoodBroker data generator can be found on GitHub under GPL v3.

The remainder of the paper is organized as follows. In section 2 we discuss related work on existing benchmark-related data generators. Then, in section 3, we introduce our business process and its simulation. The current implementation is described in section 4. Finally, we conclude with a summary and provide an outlook in section 5.
2 Related Work

Data generators for established OLAP benchmarks such as TPC-H [12] and APB-1 [8] do not fully meet the introduced requirements for graph-based analytics. Although they are scalable and offer heterogeneous domain objects, their relationship heterogeneity is limited. Graph patterns of interest usually involve causal connections among transactional data as well as the involved master data [10]. However, APB-1 and all TPC benchmarks except TPC-DS are focused on a single class of transactional data, and TPC-DS by design provides no relationships between the different transactional objects. The data generator of BigBench [5] extends the TPC-DS data model by related transactional data from web logs and reviews. However, those relationships are generated without considering the impact of master data instances. Furthermore, none of the benchmarks involves the problem of integrating multiple sources.

Due to the growing research interest in graph database systems, many benchmark studies have been published comparing those systems to relational database systems as well as among themselves. As there is currently no standard graph benchmark available, varying data generators have been developed, typically focused on a specific use case. In [14], Vicknair et al. compare a relational and a graph database system. Their generated datasets contain artificial provenance information modeled as a directed acyclic graph and stored in a single data source. Nodes as well as edges are homogeneous and lack specific semantics: nodes carry random payload data, edges have no relationship type and no data attached to them. Graph sizes can vary depending on a user-defined number of nodes. Holzschuher and Peinl also compare a relational and a graph database system, focusing on the evaluation of graph query languages and SQL [7]. The generated datasets resemble the structure and content of online social networks in a property graph and are stored in a single data source.
Nodes and edges can be of different types, and nodes store realistic data based on dictionaries. The relationship information is taken from a real social network, which makes it impossible to scale the graph size without restriction. Dominguez-Sal et al. [4] benchmark different GDBMS. They make use of the recursive matrix (R-MAT [3]) algorithm, which was designed to generate directed graphs that are equivalent to real networks in terms of specific graph invariants, like degree distribution and diameter. Using a single parameter, the graph size can be scaled exponentially. The algorithm is focused on generating a realistic structure but lacks the capability to add semantics to nodes or edges.

In [6], Gupta emphasizes that the characteristics of generating data for heterogeneous graphs differ strongly from those of data generators for data warehouse benchmarks and for graph benchmarks that are specified by topology characteristics. He proposes a data generator for heterogeneous multigraphs with typed nodes and labeled edges representing meaningful data from a drug discovery scenario. Although this approach even provides correlations between domain objects and graph structure, the resulting data considers neither the characteristics of process-related data nor the scenario of multiple data sources.
Pham et al. propose S3G2, a framework for specifying the generation of graphs using a rule-based approach which leads to plausible structural correlations between graph structures and domain objects [11]. Their focus is the generation of social networks with real-world structural and semantic characteristics using dictionaries and manually defined correlation rules. To achieve scalability in terms of graph size and computing time, the framework is implemented using the MapReduce paradigm. A way to adopt the framework for business analytics could be the definition of domain-specific correlation rules. Nevertheless, this would not lead to the generation of complex business processes where the correlation depends not only on directly connected nodes but also on transitive relationships between multiple domain objects.

A promising project which adopts the S3G2 framework is the Linked Data Benchmark Council (LDBC) [1][2], which aims at establishing standardized graph-oriented benchmarks and data generators for different application domains. Like us, the authors also highlight the importance of semantic correlations within the data to address business intelligence and graph analytics. Up until now, the project offers data generators for social network and semantic publishing use cases. FoodBroker describes a complementary, business-oriented use case where data objects are interrelated within diverse business process executions. The FoodBroker dataset will thus likely allow different graph-based analysis tasks for business intelligence than the LDBC use cases. Still, we expect the LDBC results to be valuable input for defining a benchmark based on FoodBroker datasets.

3 Simulation

The FoodBroker simulation reflects the core business of a fictitious company trading food between producers (vendors) and retailers (customers). The company only offers a brokerage service and has no warehouse.
Process-related data is recorded in an enterprise resource planning (ERP) system and a customer issue tracking (CIT) system. The simulation includes the customizable generation of master data as well as transactional data involved in the executions of two interrelated business processes for food brokerage and complaint handling. In the following, we describe the data model and the two simulated business processes. We also discuss how the generated data can be used for graph-based business analytics.

3.1 Data Model

The data model for the ERP and CIT systems is shown in figure 1. The ERP system stores master data objects of the classes Employee, Product, Customer, Vendor and Logistics. Products are categorized into different product categories, such as fruits, vegetables and nuts. In the CIT system, the instances of master class User refer to employees in the ERP system, while master class Client refers to ERP class Customer.
Fig. 1. FoodBroker Data Model: The outer rectangles show the boundaries of the two systems ERP and CIT. Database tables correspond either to classes or n:m associations (*Line). Primary keys are highlighted by italic letters. Associations are shown as solid lines. Foreign keys are attached to associations. Implicit associations between both databases are represented by dotted lines. For each implicit association, there is a corresponding column with prefix erp.
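The graph abstraction behind this data model, where objects become nodes and references become edges, can be illustrated with a minimal sketch. The rows, the column name erpEmployeeId and the edge label sameAs are hypothetical and chosen for illustration only; they are not taken from the FoodBroker schema, which only states that implicit associations use a column with prefix erp.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny property graph: nodes keyed by (system, class, id), typed edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, system, cls, key, **props):
        self.nodes[(system, cls, key)] = props

    def add_edge(self, source, label, target):
        self.edges.append((source, label, target))


# Hypothetical rows; the column names are illustrative assumptions.
erp_employees = [{"id": 1, "name": "Alice", "quality": "good"}]
cit_users = [{"id": 7, "email": "alice@example.org", "erpEmployeeId": 1}]

graph = Graph()
for row in erp_employees:
    graph.add_node("ERP", "Employee", row["id"],
                   name=row["name"], quality=row["quality"])
for row in cit_users:
    graph.add_node("CIT", "User", row["id"], email=row["email"])
    # The erp-prefixed column carries the implicit cross-database association.
    graph.add_edge(("CIT", "User", row["id"]), "sameAs",
                   ("ERP", "Employee", row["erpEmployeeId"]))
```

Keying nodes by system and class keeps identifiers from the two databases apart, so the integrated graph can hold both sources without collisions.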
For each master data object we provide a quality criterion with one of the values good, normal or bad. We use these criteria in the generation of business processes and thus achieve a correlation between the different kinds of master objects and the structure and outcome of process executions. The ERP system further records transactional data about trading and refunds, represented by the classes SalesQuotation, SalesOrder, PurchOrder, DeliveryNote, SalesInvoice and PurchInvoice. The line classes SalesQuotationLine, SalesOrderLine and PurchOrderLine represent n:m associations between the respective transactional classes and Product. The CIT system has only the transactional class Ticket representing customer complaints. All transactional classes have associations to other transactional and master data classes.

3.2 Food Brokerage

A food brokerage starts with a SalesQuotation sent by a random Employee to a random Customer. A quotation has SalesQuotationLines referring to random products. Each SalesQuotationLine provides a salesprice that is determined by adding a sales margin to the product's purchprice. A SalesQuotation can be either confirmed or rejected. To simulate the interaction of employee and customer, the probability of confirmation as well as the sales margin will be significantly higher if a good Employee and a good Customer are involved and correspondingly lower for normal or bad master data objects. A confirmed quotation results in a SalesOrder and a set of SalesOrderLines. The SalesOrder includes a reference to the underlying SalesQuotation as well as a deliverydate. To reflect partial confirmations, there may be fewer SalesOrderLines than SalesQuotationLines. While the Customer is the same as the one of the SalesQuotation, a new Employee processes the SalesOrder. For each SalesOrder, one or more PurchOrders, each with one or more PurchOrderLines, are placed at random Vendors. A random Employee (purchaser) is associated with each PurchOrder.
Actual purchase prices are subject to variations. To simulate the interaction of purchaser and vendor, a good Employee and a good Vendor will lead to a lower purchprice as compared to normal or bad ones. Furthermore, a good Employee will place PurchOrders faster. After a PurchOrder is processed by the Vendor, the company receives information about the DeliveryNote, in particular the date of delivery, the operating Logistics company and the operator-specific trackingcode. The delivery time is influenced by the quality of both Vendor and Logistics company such that good business partners will lead to faster delivery than bad or normal ones. Finally, one SalesInvoice per SalesOrder will be sent to the Customer and one PurchInvoice per PurchOrder will be received from the corresponding Vendor. All transactional data objects created within cases of food brokerage refer to their predecessor (except SalesQuotation) and provide a date property with a value greater than or equal to the one of the predecessor.
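The date ordering stated above lends itself to a simple consistency check on generated data. The following is a minimal sketch, not part of the generator; the record layout with the keys id, pred and date and the sample identifiers are assumptions made for illustration.

```python
from datetime import date


def date_violations(objects):
    """Return ids of objects dated earlier than their predecessor."""
    by_id = {obj["id"]: obj for obj in objects}
    return [
        obj["id"]
        for obj in objects
        if obj["pred"] is not None and obj["date"] < by_id[obj["pred"]]["date"]
    ]


# One hypothetical case; only SalesQuotation has no predecessor.
case = [
    {"id": "sq1", "class": "SalesQuotation", "pred": None, "date": date(2014, 1, 6)},
    {"id": "so1", "class": "SalesOrder", "pred": "sq1", "date": date(2014, 1, 9)},
    {"id": "po1", "class": "PurchOrder", "pred": "so1", "date": date(2014, 1, 9)},
    {"id": "dn1", "class": "DeliveryNote", "pred": "po1", "date": date(2014, 1, 11)},
    {"id": "si1", "class": "SalesInvoice", "pred": "so1", "date": date(2014, 1, 12)},
]

assert date_violations(case) == []
```

Because predecessors are looked up by id rather than by list position, the check also works for cases that branch, such as one SalesOrder with several PurchOrders.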
3.3 Complaint Handling

For every customer complaint, an instance of Ticket referring to the corresponding SalesOrder is created. For the first complaint per Customer, a Client instance is created in addition. The employee handling the complaint is recorded in class User. There are two problems which may cause complaints: late delivery and bad product quality. Late delivery complaints occur if the date of a DeliveryNote is greater than the agreed deliverydate of the SalesOrder, e.g., due to a bad Employee processing the SalesOrder or a PurchOrder, a bad Vendor, a bad Logistics company or combinations of those. Bad quality complaints may be caused by bad Products, a bad Vendor, a bad Logistics company or combinations of such. A Ticket may lead to refunds which are recorded as SalesInvoice objects with negative revenue. While the constellation of a good Employee allocated to the Ticket and a good Customer may lead to low or no refund, bad ones lead to a higher refund. If the problem was caused by the Vendor, there will also be a refund for the company in the form of a PurchInvoice with negative expenses. While the constellation of a good Employee and a good Vendor may lead to a high refund, bad ones lead to a lower or no refund.

3.4 Graph Analytics

The FoodBroker datasets provide complex correlations between master data instances and their involvement within the execution of the described business processes, making them a good basis for graph-based business analytics. In particular, it is possible to analyze the influence of the different master objects (employees, customers, vendors, etc.) on the financial outcome of business processes by case-wise aggregating all revenue- and expense-related properties, e.g., in SalesInvoice and PurchInvoice records of the same case, as well as determining the impact of refunds due to complaints.
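A minimal sketch of this case-wise aggregation, including a refund triggered by a late delivery, might look as follows. The record layout, the case identifiers and the refund amount are assumptions for illustration; only the rule that a delivery is late when its date exceeds the agreed deliverydate, and the recording of refunds as SalesInvoice objects with negative revenue, come from the text.

```python
from collections import defaultdict
from datetime import date


def is_late(delivery_date, agreed_delivery_date):
    """Late delivery: DeliveryNote date after the agreed deliverydate."""
    return delivery_date > agreed_delivery_date


def profit_per_case(records):
    """Sum revenue minus expenses over all invoice records of each case."""
    profit = defaultdict(float)
    for record in records:
        profit[record["case"]] += (
            record.get("revenue", 0.0) - record.get("expenses", 0.0)
        )
    return dict(profit)


# Hypothetical invoice records of two cases.
records = [
    {"case": 1, "class": "SalesInvoice", "revenue": 1200.0},
    {"case": 1, "class": "PurchInvoice", "expenses": 900.0},
    {"case": 2, "class": "SalesInvoice", "revenue": 800.0},
    {"case": 2, "class": "PurchInvoice", "expenses": 700.0},
]

# Case 2 was delivered after the agreed date, so a Ticket leads to a refund,
# recorded as a SalesInvoice with negative revenue (amount is illustrative).
if is_late(date(2014, 3, 14), date(2014, 3, 12)):
    records.append({"case": 2, "class": "SalesInvoice", "revenue": -80.0})

profits = profit_per_case(records)
```

Since refunds are stored as ordinary invoice records with a negative sign, the aggregation needs no special handling for complaints.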
Since all objects involved in the same case are interconnected in the generated dataset, it is possible to analyze the resulting graphs representing business process executions. Hence it is not only possible to aggregate profit-related measures but also to analyze the transactional correlation patterns underlying positive or negative cases to find frequent patterns in such business process executions. For example, it could be found that cases with outstandingly high profit frequently contain patterns such as Customer:ACME <-receivedfrom- SalesOrder -processedby-> Employee:Alice, saying that employee Alice processed the sales order received from customer ACME. Finding such correlations is a non-trivial task due to many-to-many associations between cases and master data objects of the same class. For example, in a single case multiple Employees can be involved in different ways but also in the same way, e.g., multiple purchaser employees for different items.
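To make the idea of such patterns concrete, the following sketch counts how often a customer and employee pair occurs around the SalesOrder of high-profit cases. It is a deliberately simplified stand-in for frequent pattern mining on graphs, and the case records, thresholds and key names are hypothetical.

```python
from collections import Counter


def frequent_pairs(cases, min_profit, min_support):
    """Count (customer, employee) pairs of SalesOrders in high-profit cases."""
    counts = Counter(
        (case["customer"], case["employee"])
        for case in cases
        if case["profit"] >= min_profit
    )
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical cases: customer the SalesOrder was received from,
# employee who processed it, and the aggregated profit of the case.
cases = [
    {"customer": "ACME", "employee": "Alice", "profit": 950.0},
    {"customer": "ACME", "employee": "Alice", "profit": 870.0},
    {"customer": "ACME", "employee": "Bob", "profit": 120.0},
    {"customer": "Globex", "employee": "Alice", "profit": 900.0},
]

patterns = frequent_pairs(cases, min_profit=800.0, min_support=2)
```

A real analysis would have to consider all roles in which master data objects participate in a case, which is where the many-to-many associations mentioned above make the task harder.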
Class            Configuration                            Default Value

Master Data
Employee         number of instances                      SF
                 proportions of good/normal/bad inst.     0.1/0.8/0.1
Product          number of instances                      SF
                 proportions of good/normal/bad inst.     0.1/0.8/0.1
                 list price range
Customer         number of instances                      SF
                 proportions of good/normal/bad inst.     0.1/0.8/0.1
Vendor           number of instances                      SF
                 proportions of good/normal/bad inst.     0.1/0.8/0.1
Logistics        number of instances                      SF
                 proportions of good/normal/bad inst.     0.1/0.8/0.1

Food Brokerage
Process          number of cases                          SF
                 date range
SalesQuotation   products per quotation (lines)
                 quantity per product
                 sales margin/IM                          0.05/0.02
                 confirmation probability/IM              0.6/0.2
                 line confirmation probability            0.9
                 confirmation delay/IM                    0..20/5
SalesOrder       agreed delivery delay/IM                 2..4/1
                 purchase delay/IM                        0..2/2
                 invoice delay/IM                         0..3/-2
PurchOrder       price variation/IM                       0.01/-0.02
                 delivery delay/IM                        0..1/1
                 invoice delay/IM                         2..5/3

Complaint Handling
Ticket           probability of bad quality/IM            0.05/-0.1
                 sales refund/IM                          0.1/-0.05
                 purchase refund/IM                       0.1/0.05

SF abbreviates scale factor.
Number-of-instances configurations are defined by linear functions a + b SF.
IM abbreviates impact per master data instance.
IM > 0 means good/bad master data instances increase/decrease the value.
IM < 0 means good/bad master data instances decrease/increase the value.

Table 1. FoodBroker Configuration Parameters
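One plausible reading of the IM parameters in Table 1 is additive: every good master data instance involved shifts the base value by IM and every bad one by -IM. The sketch below illustrates this reading for the confirmation probability with its defaults 0.6/0.2. The additive rule and the clamping to the interval [0, 1] are assumptions; the paper does not spell out how the generator combines the impacts.

```python
# Assumed mapping of quality criteria to the direction of the impact.
QUALITY_SIGN = {"good": 1, "normal": 0, "bad": -1}


def adjusted(base, impact, qualities):
    """Shift base by +impact per good and -impact per bad instance."""
    return base + impact * sum(QUALITY_SIGN[q] for q in qualities)


def confirmation_probability(employee_quality, customer_quality,
                             base=0.6, impact=0.2):
    """Confirmation probability of a SalesQuotation, clamped to [0, 1]."""
    value = adjusted(base, impact, [employee_quality, customer_quality])
    return min(1.0, max(0.0, value))


# A good Employee with a good Customer confirms far more often than
# a bad Employee with a bad Customer under this reading.
best = confirmation_probability("good", "good")
worst = confirmation_probability("bad", "bad")
```

With a negative IM, such as the -0.1 of the bad quality probability, the same function lets good instances lower the value and bad instances raise it, which matches the table notes.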
4 Implementation

Our current implementation stores the generated dataset in a MySQL database dump with separate databases for the ERP and CIT systems. In both databases every class has a dedicated table with the same name. 1:1 and 1:m associations correspond to foreign keys in class tables where the column names represent the relationship type. n:m associations are stored in separate tables. The simulation is implemented as a Java console application. The database dump includes SQL statements for the creation of tables. All instances of a class are inserted into the table corresponding to their class. Domain-specific string values such as employee or product names are provided by an embedded SQLite database.

Scalability in terms of different data sizes is controlled by a scale factor (SF). This makes it possible to create datasets with equal criteria regarding distributions, value ranges and probabilities but with a different number of instances. The number of master data instances and the number of simulated cases are defined by linear functions. The function for the number of simulated cases has a default slope of 10,000 per scale factor. We take into account that master data instances do not scale proportionally to cases. For example, there is only a limited number of logistics companies. Thus, the functions of all master data classes have, besides a specific slope, a y-intercept to specify a minimum number of instances.

Besides the scale factor and the resulting data growth, the generation of master data as well as the process simulation is customizable by several configuration parameters. For example, one can set the confirmation probability of quotations or the influence of master data quality criteria on that probability. All configurations are set within a single file. An overview of the configuration parameters is provided in table 1.
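The linear scaling described above can be sketched in a few lines. The slope of 10,000 cases per scale factor is taken from the text; the zero intercept for cases and the intercept and slope used for the Logistics class are hypothetical values chosen for illustration.

```python
def instance_count(intercept, slope, scale_factor):
    """Linear function a + b * SF for the number of generated instances."""
    return intercept + slope * scale_factor


def number_of_cases(scale_factor):
    # Slope from the text; an intercept of 0 is an assumption.
    return instance_count(0, 10_000, scale_factor)


def number_of_logistics(scale_factor):
    # Hypothetical coefficients: the intercept guarantees a minimum number
    # of instances, the small slope keeps growth below that of the cases.
    return instance_count(20, 10, scale_factor)


for sf in (1, 10, 100):
    print(sf, number_of_cases(sf), number_of_logistics(sf))
```

Under these illustrative coefficients, a hundredfold increase of the scale factor multiplies the number of cases by 100 but the number of logistics companies only by 34, which is the kind of non-proportional growth the y-intercept is meant to achieve.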
While using FoodBroker for benchmarks requires fixed configurations and a variable scale, variable configurations at a fixed scale can be used to evaluate business intelligence applications in different scenarios. Table 2 shows dataset measures for different scale factors using the standard configuration. With respect to runtime, our current implementation shows linear behavior for increasing scale factors. All datasets were generated on a workstation containing an Intel Xeon Quadcore, 8GB RAM and a standard HDD.

        Master Data         Transactional Data
SF      Objects   Time      Objects   Relationships   Time     Dump Size
1       1.1K      4s        73K       380K            4s       42MB
        K         4s        725K      3.8M            25s      426MB
        K         4s        7.2M      38M             4min     4.1GB
        K         8s        68M       360M            35min    39GB

Table 2. Measures of FoodBroker datasets for different scale factors (SF).
5 Summary

We presented the FoodBroker data generator for graph-based analytics based on simulated business process executions. FoodBroker creates domain-specific master and transactional data as well as meaningful relationships between them. Quality criteria provided for all master data objects influence the execution of business processes and thus the creation of transactional data and the process outcome. The complex correlations between master and transactional objects within graph-like business process representations make the FoodBroker datasets a good candidate for a variety of graph- and pattern-based analysis tasks. Data generation is customizable and can be scaled to large sizes by a simple scale factor. In future work, we want to define a benchmarking workload for FoodBroker datasets and use it to evaluate different systems including current GDBMS and our own BIIIG approach [10].

6 Acknowledgments

This work is partly funded within the EU program Europa fördert Sachsen of the European Social Fund.

References

1. R. Angles et al. The Linked Data Benchmark Council: a Graph and RDF industry benchmarking effort. ACM SIGMOD Record, 43(1).
2. P. Boncz. LDBC: benchmarks for graph and RDF data management. In Proc. of the 17th Int. Database Engineering & Applications Symposium. ACM.
3. D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In SDM, volume 4. SIAM.
4. D. Dominguez-Sal et al. Survey of graph database performance on the HPC scalable graph analysis benchmark. In Web-Age Information Management. Springer.
5. A. Ghazal et al. BigBench: Towards an industry standard benchmark for big data analytics. In Proc. of the 2013 Int. Conf. on Management of Data. ACM.
6. A. Gupta. Generating large-scale heterogeneous graphs for benchmarking. In Specifying Big Data Benchmarks. Springer.
7. F. Holzschuher and R. Peinl. Performance of graph query languages: comparison of Cypher, Gremlin and native access in Neo4j. In Proceedings of the Joint EDBT/ICDT 2013 Workshops. ACM.
8. OLAP Council. APB-1 OLAP Benchmark. research/bmarkly.htm.
9. Y. Park et al. Graph Databases for Large-Scale Healthcare Systems: A Framework for Efficient Data Management and Data Services. In Data Engineering Workshops (ICDEW), IEEE 30th Int. Conf. on.
10. A. Petermann, M. Junghanns, R. Müller, and E. Rahm. BIIIG: Enabling Business Intelligence with Integrated Instance Graphs. In Data Engineering Workshops (ICDEW), IEEE 30th Int. Conf. on.
11. M.-D. Pham et al. S3G2: A scalable structure-correlated social graph generator. In Selected Topics in Performance Evaluation and Benchmarking. Springer.
12. Transaction Processing Performance Council. TPC Benchmarks. org/information/benchmarks.asp.
13. E. Vasilyeva et al. Leveraging flexible data management with graph databases. In 1st Int. Workshop on Graph Data Man. Experiences and Systems. ACM.
14. C. Vicknair et al. A comparison of a graph database and a relational database: a data provenance perspective. In Proc. of the 48th Ann. Southeast Reg. Conf. ACM, 2010.
SCALABLE GRAPH ANALYTICS WITH GRADOOP AND BIIIG
SCALABLE GRAPH ANALYTICS WITH GRADOOP AND BIIIG MARTIN JUNGHANNS, ANDRE PETERMANN, ERHARD RAHM www.scads.de RESEARCH ON GRAPH ANALYTICS Graph Analytics on Hadoop (Gradoop) Distributed graph data management
InfiniteGraph: The Distributed Graph Database
A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086
Evaluating Online Payment Transaction Reliability using Rules Set Technique and Graph Model
Evaluating Online Payment Transaction Reliability using Rules Set Technique and Graph Model Trung Le 1, Ba Quy Tran 2, Hanh Dang Thi My 3, Thanh Hung Ngo 4 1 GSR, Information System Lab., University of
SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs
SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs Fabian Hueske, TU Berlin June 26, 21 1 Review This document is a review report on the paper Towards Proximity Pattern Mining in Large
Understanding the Benefits of IBM SPSS Statistics Server
IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster
Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014
Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)
IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD
Journal homepage: www.mjret.in ISSN:2348-6953 IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD Deepak Ramchandara Lad 1, Soumitra S. Das 2 Computer Dept. 12 Dr. D. Y. Patil School of Engineering,(Affiliated
Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013
Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013 James Maltby, Ph.D 1 Outline of Presentation Semantic Graph Analytics Database Architectures In-memory Semantic Database Formulation
Energy aware RAID Configuration for Large Storage Systems
Energy aware RAID Configuration for Large Storage Systems Norifumi Nishikawa [email protected] Miyuki Nakano [email protected] Masaru Kitsuregawa [email protected] Abstract
Converting Relational to Graph Databases
Converting Relational to Graph Databases Roberto De Virgilio Università Roma Tre Rome, Italy [email protected] Antonio Maccioni Università Roma Tre Rome, Italy [email protected] Riccardo Torlone
Information Architecture
The Bloor Group Actian and The Big Data Information Architecture WHITE PAPER The Actian Big Data Information Architecture Actian and The Big Data Information Architecture Originally founded in 2005 to
LDIF - Linked Data Integration Framework
LDIF - Linked Data Integration Framework Andreas Schultz 1, Andrea Matteini 2, Robert Isele 1, Christian Bizer 1, and Christian Becker 2 1. Web-based Systems Group, Freie Universität Berlin, Germany [email protected],
Maximum performance, minimal risk for data warehousing
SYSTEM X SERVERS SOLUTION BRIEF Maximum performance, minimal risk for data warehousing Microsoft Data Warehouse Fast Track for SQL Server 2014 on System x3850 X6 (95TB) The rapid growth of technology has
JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra
JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra January 2014 Legal Notices Apache Cassandra, Spark and Solr and their respective logos are trademarks or registered trademarks
Mining Signatures in Healthcare Data Based on Event Sequences and its Applications
Mining Signatures in Healthcare Data Based on Event Sequences and its Applications Siddhanth Gokarapu 1, J. Laxmi Narayana 2 1 Student, Computer Science & Engineering-Department, JNTU Hyderabad India 1
A Practical Approach to Process Streaming Data using Graph Database
A Practical Approach to Process Streaming Data using Graph Database Mukul Sharma Research Scholar Department of Computer Science & Engineering SBCET, Jaipur, Rajasthan, India ABSTRACT In today s information
TopBraid Insight for Life Sciences
TopBraid Insight for Life Sciences In the Life Sciences industries, making critical business decisions depends on having relevant information. However, queries often have to span multiple sources of information.
Rule-Based Engineering Using Declarative Graph Database Queries
Rule-Based Engineering Using Declarative Graph Database Queries Sten Grüner, Ulrich Epple Chair of Process Control Engineering, RWTH Aachen University MBEES 2014, Dagstuhl, 05.03.14 Motivation Every plant
TRENDS IN THE DEVELOPMENT OF BUSINESS INTELLIGENCE SYSTEMS
9 8 TRENDS IN THE DEVELOPMENT OF BUSINESS INTELLIGENCE SYSTEMS Assist. Prof. Latinka Todoranova Econ Lit C 810 Information technology is a highly dynamic field of research. As part of it, business intelligence
1-Oct 2015, Bilbao, Spain. Towards Semantic Network Models via Graph Databases for SDN Applications
1-Oct 2015, Bilbao, Spain Towards Semantic Network Models via Graph Databases for SDN Applications Agenda Introduction Goals Related Work Proposal Experimental Evaluation and Results Conclusions and Future
Big Data and Analytics: Challenges and Opportunities
Big Data and Analytics: Challenges and Opportunities Dr. Amin Beheshti Lecturer and Senior Research Associate University of New South Wales, Australia (Service Oriented Computing Group, CSE) Talk: Sharif
LDBC Social Network Benchmark @ Neo Technology
LDBC Social Network Benchmark @ Neo Technology Alex Averbuch 1 1 Neo Technology Alex Averbuch LDBC & Neo4j 1 / 19 Table of Contents 1 Introduction 2 3 4 Alex Averbuch LDBC & Neo4j 2 / 19 Introduction LDBC
Using Big Data in Healthcare
Speaker First Plenary Session THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? David R. Holmes III, PhD Mayo Clinic College of Medicine Rochester, MN, USA Using Big Data in Healthcare
Moving Large Data at a Blinding Speed for Critical Business Intelligence. A competitive advantage
Moving Large Data at a Blinding Speed for Critical Business Intelligence A competitive advantage Intelligent Data In Real Time How do you detect and stop a Money Laundering transaction just about to take
Database Design Patterns. Winter 2006-2007 Lecture 24
Database Design Patterns Winter 2006-2007 Lecture 24 Trees and Hierarchies Many schemas need to represent trees or hierarchies of some sort Common way of representing trees: An adjacency list model Each
Graph Database Performance: An Oracle Perspective
Graph Database Performance: An Oracle Perspective Xavier Lopez, Ph.D. Senior Director, Product Management 1 Copyright 2012, Oracle and/or its affiliates. All rights reserved. Program Agenda Broad Perspective
Database Marketing, Business Intelligence and Knowledge Discovery
Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining, Addison Wesley; and Cios / Pedrycz / Swiniarski
How To Write A Bigbench Benchmark For A Retailer
BigBench Overview Towards a Comprehensive End-to-End Benchmark for Big Data - bankmark UG (haftungsbeschränkt) 02/04/2015 @ SPEC RG Big Data The BigBench Proposal End to end benchmark Application level
Microblogging Queries on Graph Databases: An Introspection
Microblogging Queries on Graph Databases: An Introspection ABSTRACT Oshini Goonetilleke RMIT University, Australia [email protected] Timos Sellis RMIT University, Australia [email protected]
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
Scalability and Performance Report - Analyzer 2007
- Analyzer 2007 Executive Summary Strategy Companion's Analyzer 2007 is enterprise Business Intelligence (BI) software that is designed and engineered to scale to the requirements of large global deployments.
Using the HP Vertica Analytics Platform to Manage Massive Volumes of Smart Meter Data
Technical white paper Using the HP Vertica Analytics Platform to Manage Massive Volumes of Smart Meter Data The Internet of Things is expected to connect billions of sensors that continuously gather data
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
How To Find Influence Between Two Concepts In A Network
2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Influence Discovery in Semantic Networks: An Initial Approach Marcello Trovati and Ovidiu Bagdasar School of Computing
Performance Workload Design
Performance Workload Design The goal of this paper is to show the basic principles involved in designing a workload for performance and scalability testing. We will understand how to achieve these principles
SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA
SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA J.RAVI RAJESH PG Scholar Rajalakshmi engineering college Thandalam, Chennai. [email protected] Mrs.
Creating Synthetic Temporal Document Collections for Web Archive Benchmarking
Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Kjetil Nørvåg and Albert Overskeid Nybø Norwegian University of Science and Technology 7491 Trondheim, Norway Abstract. In
Complex Network Visualization based on Voronoi Diagram and Smoothed-particle Hydrodynamics
Complex Network Visualization based on Voronoi Diagram and Smoothed-particle Hydrodynamics Zhao Wenbin 1, Zhao Zhengxu 2 1 School of Instrument Science and Engineering, Southeast University, Nanjing, Jiangsu
Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
TopBraid Life Sciences Insight
TopBraid Life Sciences Insight In the Life Sciences industries, making critical business decisions depends on having relevant information. However, queries often have to span multiple sources of information.
Log Mining Based on Hadoop's Map and Reduce Technique
Log Mining Based on Hadoop's Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
PartJoin: An Efficient Storage and Query Execution for Data Warehouses
PartJoin: An Efficient Storage and Query Execution for Data Warehouses Ladjel Bellatreche 1, Michel Schneider 2, Mukesh Mohania 3, and Bharat Bhargava 4 1 IMERIR, Perpignan, FRANCE [email protected] 2
CloudRank-D:A Benchmark Suite for Private Cloud Systems
CloudRank-D:A Benchmark Suite for Private Cloud Systems Jing Quan Institute of Computing Technology, Chinese Academy of Sciences and University of Science and Technology of China HVC tutorial in conjunction
DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.
DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,
Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner
Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner Rekha S. Nyaykhor M. Tech, Dept. Of CSE, Priyadarshini Bhagwati College of Engineering, Nagpur, India
Review of Graph Databases for Big Data Dynamic Entity Scoring
Review of Graph Databases for Big Data Dynamic Entity Scoring M. X. Labute, M. J. Dombroski May 16, 2014 Disclaimer This document was prepared as an account of work sponsored by an agency of the United
Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations
Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations Binomol George, Ambily Balaram Abstract To analyze data efficiently, data mining systems are widely using datasets
Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015
E6893 Big Data Analytics Lecture 8: Spark Streams and Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing
Big Data Mining Services and Knowledge Discovery Applications on Clouds
Big Data Mining Services and Knowledge Discovery Applications on Clouds Domenico Talia DIMES, Università della Calabria & DtoK Lab Italy [email protected] Data Availability or Data Deluge? Some decades
Big Data: Tools and Technologies in Big Data
Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can
Information Management course
Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli ([email protected])
Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC
Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC Agenda Quick Overview of Impala Design Challenges of an Impala Deployment Case Study: Use Simulation-Based Approach to Design
Scaling from Workstation to Cluster for Compute-Intensive Applications
Cluster Transition Guide: Scaling from Workstation to Cluster for Compute-Intensive Applications IN THIS GUIDE: The Why: Proven Performance Gains On Cluster Vs. Workstation The What: Recommended Reference
An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
Actian SQL in Hadoop Buyer's Guide
Actian SQL in Hadoop Buyer's Guide Contents Introduction: Big Data and Hadoop... 3 SQL on Hadoop Benefits... 4 Approaches to SQL on Hadoop... 4 The Top 10 SQL in Hadoop Capabilities... 5 SQL in Hadoop
Simulating a File-Sharing P2P Network
Simulating a File-Sharing P2P Network Mario T. Schlosser, Tyson E. Condie, and Sepandar D. Kamvar Department of Computer Science Stanford University, Stanford, CA 94305, USA Abstract. Assessing the performance
What's New in SAS Data Management
Paper SAS034-2014 What's New in SAS Data Management Nancy Rausch, SAS Institute Inc., Cary, NC; Mike Frost, SAS Institute Inc., Cary, NC, Mike Ames, SAS Institute Inc., Cary ABSTRACT The latest releases
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington
Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012
Overview on Graph Datastores and Graph Computing Systems -- Litao Deng (Cloud Computing Group) 06-08-2012 Graph - Everywhere 1: Friendship Graph 2: Food Graph 3: Internet Graph Most of the relationships
IBM Cognos 10: Enhancing query processing performance for IBM Netezza appliances
IBM Software Business Analytics Cognos Business Intelligence IBM Cognos 10: Enhancing query processing performance for IBM Netezza appliances 2 IBM Cognos 10: Enhancing query processing performance for
STINGER: High Performance Data Structure for Streaming Graphs
STINGER: High Performance Data Structure for Streaming Graphs David Ediger Rob McColl Jason Riedy David A. Bader Georgia Institute of Technology Atlanta, GA, USA Abstract The current research focus on
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Advanced In-Database Analytics
Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??
Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances
INSIGHT Oracle's All-Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA
The basic data mining algorithms introduced may be enhanced in a number of ways.
DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,
EXTENDING JMETER TO ALLOW FOR WEB STRUCTURE MINING
EXTENDING JMETER TO ALLOW FOR WEB STRUCTURE MINING Agustín Sabater, Carlos Guerrero, Isaac Lera, Carlos Juiz Computer Science Department, University of the Balearic Islands, SPAIN [email protected], [email protected],
The Edge Editions of SAP InfiniteInsight Overview
Analytics Solutions from SAP The Edge Editions of SAP InfiniteInsight Overview Enabling Predictive Insights with Mouse Clicks, Not Computer Code Table of Contents 3 The Case for Predictive Analysis 5 Fast
Accelerating Business Intelligence with Large-Scale System Memory
Accelerating Business Intelligence with Large-Scale System Memory A Proof of Concept by Intel, Samsung, and SAP Executive Summary Real-time business intelligence (BI) plays a vital role in driving competitiveness
Attack graph analysis using parallel algorithm
Attack graph analysis using parallel algorithm Dr. Jamali Mohammad ([email protected]) Ashraf Vahid, MA student of computer software, Shabestar Azad University ([email protected]) Ashraf Vida, MA
Daniel J. Adabi. Workshop presentation by Lukas Probst
Daniel J. Adabi Workshop presentation by Lukas Probst 3 characteristics of a cloud computing environment: 1. Compute power is elastic, but only if workload is parallelizable 2. Data is stored at an untrusted
Benchmarking the Performance of Storage Systems that expose SPARQL Endpoints
Benchmarking the Performance of Storage Systems that expose SPARQL Endpoints Christian Bizer 1 and Andreas Schultz 1 1 Freie Universität Berlin, Web-based Systems Group, Garystr. 21, 14195 Berlin, Germany
A Case Study in Integrated Quality Assurance for Performance Management Systems
A Case Study in Integrated Quality Assurance for Performance Management Systems Liam Peyton, Bo Zhan, Bernard Stepien School of Information Technology and Engineering, University of Ottawa, 800 King Edward
A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader
A Performance Evaluation of Open Source Graph Databases Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader Overview Motivation Options Evaluation Results Lessons Learned Moving Forward
Handling the Complexity of RDF Data: Combining List and Graph Visualization
Handling the Complexity of RDF Data: Combining List and Graph Visualization Philipp Heim and Jürgen Ziegler (University of Duisburg-Essen, Germany philipp.heim, [email protected]) Abstract: An
W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
Integrating Open Sources and Relational Data with SPARQL
Integrating Open Sources and Relational Data with SPARQL Orri Erling and Ivan Mikhailov OpenLink Software, 10 Burlington Mall Road Suite 265 Burlington, MA 01803 U.S.A, {oerling,imikhailov}@openlinksw.com,
Data Store Interface Design and Implementation
WDS'07 Proceedings of Contributed Papers, Part I, 110-115, 2007. ISBN 978-80-7378-023-4 MATFYZPRESS Web Storage Interface J. Tykal Charles University, Faculty of Mathematics and Physics, Prague, Czech
MEAP Edition Manning Early Access Program Neo4j in Action MEAP version 3
MEAP Edition Manning Early Access Program Neo4j in Action MEAP version 3 Copyright 2012 Manning Publications For more information on this and other Manning titles go to www.manning.com brief contents PART
IBM WebSphere DataStage Online training from Yes-M Systems
Yes-M Systems offers the unique opportunity to aspiring fresher's and experienced professionals to get real time experience in ETL Data warehouse tool IBM DataStage. Course Description With this training
Supply chain intelligence: benefits, techniques and future trends
MEB 2010 8th International Conference on Management, Enterprise and Benchmarking June 4-5, 2010 Budapest, Hungary Supply chain intelligence: benefits, techniques and future trends Zoltán Bátori Óbuda
International Journal of Engineering Research-Online A Peer Reviewed International Journal Articles available online http://www.ijoer.
REVIEW ARTICLE ISSN: 2321-7758 UPS EFFICIENT SEARCH ENGINE BASED ON WEB-SNIPPET HIERARCHICAL CLUSTERING MS.MANISHA DESHMUKH, PROF. UMESH KULKARNI Department of Computer Engineering, ARMIET, Department
Benchmarking Cassandra on Violin
Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract
A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1
A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1 Yannis Stavrakas Vassilis Plachouras IMIS / RC ATHENA Athens, Greece {yannis, vplachouras}@imis.athena-innovation.gr Abstract.
