Data Modeling for Big Data




by Jinbao Zhu, Principal Software Engineer, and Allen Wang, Manager, Software Engineering, CA Technologies

In the Internet era, the volume of data we deal with has grown to terabytes and petabytes. As the volume of data keeps growing, the types of data generated by applications become richer than before. As a result, traditional relational databases are challenged to capture, store, search, share, analyze, and visualize data. Many IT companies attempt to manage big data challenges using a NoSQL ("not only SQL") database, such as Cassandra or HBase, and may employ a distributed computing system such as Hadoop. NoSQL databases are typically key-value stores that are non-relational, distributed, horizontally scalable, and schema-free.

So, do we still need data modeling today? Traditional data modeling focuses on resolving the complexity of relationships among schema-enabled data. These considerations do not apply to non-relational, schema-less databases, so the old ways of data modeling no longer apply. We need a new methodology to manage big data for maximum business value.

Big Data Model

The big data model is an abstract layer used to manage the data stored in physical devices. Today we have large volumes of data, in different formats, stored in devices around the globe. The big data model provides a visual way to manage these data resources and creates a fundamental data architecture, so that we can build more applications that optimize data reuse and reduce computing costs. Figure 1 illustrates the general architecture of big data.

Figure 1

About the authors

Jinbao Zhu is a Principal Software Engineer at CA Technologies in Beijing, where he is responsible for designing solutions and developing key features for the data modeling product line. He has more than 11 years' experience in developing and deploying enterprise-level software solutions focused on data processing and applications. Before joining CA Technologies, Jinbao worked in areas such as WebGIS architecture and solutions, the MS Exchange API, natural language processing, and parallel message analysis. Jinbao volunteers with a big data group researching Hadoop clusters and subprojects, and is developing several POC prototypes.

In Figure 1 there are three model layers. The physical data layer is the data we hold in a big data system; it can contain different data types such as video, audio, logs, business tables, and so on. The data modeling layer is the abstract data model we build to manage the physical data. The computing modeling layer is the application layer we build to retrieve information for business value. With these three layers, we build data models that separate physical data from data use: an application accesses data through the data model instead of accessing the physical data directly. This makes applications flexible and data manageable. To construct a big data model, we must first create data blocks based on data storage, data type, relationships, read-write requirements, and so on. Further, we must have modeling applications maintain these models, so that the data models always display and store the latest data.

Hybrid Data Modeling

Though NoSQL databases evolved to resolve specific issues of managing big data, there are good reasons for SQL's enduring popularity, and the current trend is to provide both SQL and NoSQL features in the same database. Some traditional RDBMS vendors, such as Microsoft and Oracle, provide NoSQL features in recent releases, such as the columnar storage support in Microsoft SQL Server 2012 and Teradata Warehouse 14. On the NoSQL side, projects such as Hadoop's Hive subproject and Cassandra's CQL query language have grown quickly to provide an SQL-like facade over a NoSQL database. Google, a leader in big data, is also developing a hybrid distributed database, Megastore, that can run SQL-specific features on a NoSQL storage layer. That's not surprising: both the quantity and the variety of data are growing rapidly, and we need many models and tools to process it. In the past we tried to cram data of all types into an RDBMS; now we must keep some of it out.

As big data drives changes in the collection and processing of data, we need to reconsider existing data and storage models to allow for increasingly important implementation issues, such as the performance penalty imposed by join operations and the pressure on storage space as data grows to exceed the capacity of hardware storage. There are new options to resolve these issues: we can elect to migrate big data into a NoSQL database, possibly running a hybrid system as a tradeoff, taking advantage of the low risks and costs of these promising new options.

About the authors (continued)

Allen Wang is a development manager for CA ERwin at CA Technologies in Beijing. He has worked at CA Technologies for over six years and has experience with information governance and data modeling. Allen leads the CA ERwin team in the China Technology Center, and is interested in Agile methodology, data modeling, and big data.

Remodeling Tools and Techniques

In the age of big data, popular data modeling tools such as CA ERwin Data Modeler continue to help us analyze and understand our data architectures by applying hybrid data modeling concepts. Instead of creating a pure relational data model, we can now embed NoSQL submodels within a relational data model. In general, data size and performance bottlenecks are the factors that help us decide which data goes to the NoSQL system. Furthermore, we can redesign the models. For example, we can denormalize to reduce the dependency on relationships, and aggregate attributes from different entities into a single entity in ER diagrams. Since we need to present the hierarchy of the original data in the documentation, we need a good data representation format to assist in understanding the original data relationships. The attribute-value orientation of JSON is well-suited to represent the key-value structure of many NoSQL stores, though XML still has its place.
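As a small illustration of the denormalization step described above, the following Python sketch aggregates attributes from two related entities into a single JSON document of the kind a key-value store might hold. The entity names and fields are hypothetical, chosen only for illustration:

```python
import json

# Hypothetical relational rows: a customer entity and its related orders.
customer = {"customer_id": 42, "name": "Acme Corp"}
orders = [
    {"order_id": 1, "customer_id": 42, "total": 99.5},
    {"order_id": 2, "customer_id": 42, "total": 12.0},
]

def denormalize(customer, orders):
    """Aggregate attributes from related entities into one document,
    removing the need for a join at query time."""
    doc = dict(customer)
    doc["orders"] = [
        {"order_id": o["order_id"], "total": o["total"]}
        for o in orders
        if o["customer_id"] == customer["customer_id"]
    ]
    return doc

doc = denormalize(customer, orders)
print(json.dumps(doc, indent=2))
```

The tradeoff is the one noted above: the document can be fetched by a single key with no join, at the cost of duplicating customer data if orders are also stored elsewhere.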
With this transformation in representation, modeling tools like CA ERwin Data Modeler can analyze schema changes and generate a code template to build a bridge between the current RDBMS and a target NoSQL system. This assists with data migration from an RDBMS to a NoSQL database, and gives the whole picture of data deployed across various data systems. Figure 2 illustrates the data migration design.

Figure 2: Data Migration

NoSQL data. Data with the following characteristics is well-suited to a NoSQL system:
1. Data volume growing rapidly
2. Columnar growth of data
3. Document and tuple data
4. Hierarchical and graph data

RDBMS data. Data with the following characteristics might be better suited to a traditional RDBMS:
1. OLTP (on-line transaction processing) required
2. ACID (atomicity, consistency, isolation, durability) requirements
3. Complex data relationships
4. Complex query requirements

Data Consumption

It often takes more effort to migrate data consumers to new storage in a new system than to migrate the data itself. Once we have used a hybrid model to migrate the data, applications designed to access it using relational methodology must be redesigned to adapt to the new data architecture. Hopefully, future hybrid databases will provide built-in migration tools to help leverage the potential of NoSQL. For the present, remodeling data is critical before an existing system is migrated to NoSQL. In addition, modeling tools must be evaluated to verify that they meet the new requirements.

Physical Data-aware Modeling

For the past 30 years, "structure first, collect later" has been the data modeling approach: we determine the data structure before any data goes into the data system. In other words, the data structure definition is
determined by how the data is used. With big data, this traditional approach no longer applies, as the data doesn't necessarily have fixed schemas or formats. As a result, we need a new modeling methodology that suits big data's characteristics and deployment. I propose a new approach for managing the variety of big data.

Dynamic Schema Definition

Even with big unstructured or semi-structured data, we still need to define schemas, because data relationships can be more complex than before, and hiding data relationship logic in a program is not a good way to manage data complexity. Because big data uses the "structure later" approach, in most cases we can only know the data schema after the data has been created. The proposed methodology is to define the schema on existing data after it has been collected and stored, and to use that schema to retrieve the data at runtime during processing. In the definition, a schema function should be as atomic and isolated as possible. To achieve this, we must determine the scope of the data the schema function applies to, and the versions of data the schema function can work with. This is called a dynamic schema, to distinguish it from the traditional fixed schema defined before data is collected. Figure 3 shows an example of a schema function that fetches BirthDate and PostCode from an identity number.

Figure 3

Data Block

A data block is physical data defined with metadata. We must know the actual location of the data, so that we can model it on the modeling layer and so that the computing program knows where to retrieve it.
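To make the dynamic schema idea concrete, here is a minimal sketch of a schema function like the Figure 3 example, which fetches BirthDate and PostCode from an identity number. The 18-digit identity-number layout assumed here (a six-digit region/post code followed by an eight-digit YYYYMMDD birth date) is an illustrative assumption, as is the sample value; the point is that the function is defined after the data exists and is applied at read time:

```python
from datetime import date

def schema_identity(record):
    """Dynamic schema function: applied to raw data at read time,
    not enforced when the data was written. Assumes a hypothetical
    18-digit identity number: positions 1-6 hold a region/post code,
    positions 7-14 hold the birth date as YYYYMMDD."""
    id_number = record["id"]
    post_code = id_number[0:6]
    birth = date(int(id_number[6:10]),
                 int(id_number[10:12]),
                 int(id_number[12:14]))
    return {"PostCode": post_code, "BirthDate": birth}

# A raw record collected before any schema was defined.
raw = {"id": "110101198605230011"}
print(schema_identity(raw))
# {'PostCode': '110101', 'BirthDate': datetime.date(1986, 5, 23)}
```

Because the function is atomic and isolated, it can be scoped to exactly the records and versions it applies to, as the section above requires.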
Include the following fields in the metadata definition:
- Region servers: the physical location of the data storage
- Data access path: the path a program can use to access the data
- Data format: the format of the saved data
- Other attributes, added as needed when the data is used

As a general rule, to manage similar types of data, place data with identical formats and schema functions in the same data block. The data in a data block should be stored in physically close locations. For example, data that might be used at the same time should be stored in the same rack, or at least in the same region.

Data Version

In a real-time production scenario, data may have different formats at different times in its lifecycle. Data models should manage these versions of data, so that schema functions can ensure compatible data updates. Schema functions should be version-aware, so that historic data can be retrieved with the appropriate legacy schema function.

Data Relationship

Relationships among multiple data blocks still exist in practice, although most of the data in a big data system is designed to be relationship-less (that is, without joins) to facilitate scalability. However, denormalization may lead to data duplication in storage, and to problems in a distributed system if the data has to be updated frequently. We need to balance duplicating data for scalability and query performance against practical requirements for update performance. Relationships are defined on the schema functions in data blocks; the relationships are just data links. Optional proxy applications can maintain consistency through data changes, such as deleting data blocks or changing a data format. The important benefit is that computing tasks can fetch data using the relationship definitions across a set of data blocks. This can also be designed proactively in the modeling layer.

Data Computing Modeling

On top of the physical data model, we normally need to create data flows and computing tasks for business requirements. With physical-data modeling, it is possible to create a computing model that presents the logical path of computing over the data. This helps computing tasks to be well designed, and enables more efficient data reuse.
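The data block metadata fields and version-aware schema functions described above can be combined in one small sketch. All names, paths, and formats here are hypothetical, invented only to show the shape of the idea:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class DataBlock:
    """Metadata describing a block of physical data: where it lives,
    how a program reaches it, and how it is formatted."""
    region_servers: List[str]   # physical location of the data storage
    access_path: str            # path a program uses to access the data
    data_format: str            # format of the saved data, e.g. "csv"
    # Version-aware schema functions: historic data is retrieved with
    # the legacy function that matches its version.
    schema_functions: Dict[int, Callable] = field(default_factory=dict)

    def read(self, raw_record: dict, version: int) -> dict:
        return self.schema_functions[version](raw_record)

# Two versions of the same logical schema: version 2 renamed a field.
block = DataBlock(
    region_servers=["rack1-node3"],
    access_path="/data/customers/2012/",
    data_format="csv",
    schema_functions={
        1: lambda r: {"name": r["nm"]},    # legacy record format
        2: lambda r: {"name": r["name"]},  # current record format
    },
)

print(block.read({"nm": "Acme"}, version=1))    # {'name': 'Acme'}
print(block.read({"name": "Acme"}, version=2))  # {'name': 'Acme'}
```

Both record formats yield the same logical result, which is exactly what lets schema functions ensure compatible reads across a data block's lifecycle.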
Hadoop provides a new distributed data processing model, and its HBase database provides an impressive solution for data replication, backup, scalability, and so on. Hadoop also provides the Map/Reduce computing framework to retrieve value from data stored in a distributed system. Map/Reduce is a framework for parallel processing, using mappers that divide a problem into smaller sub-problems to feed reducers that process the sub-problems and produce the final answer.

As a business grows, computing requirements become more and more complex. For example, an ordinary computing task might need several Map/Reduce subtasks to meet specific business requirements, so we need a better design for the dataflow of complex computing. Reuse of existing Mappers and Reducers is also a potential requirement for effective production environments.

Computing Unit

A Mapper and Reducer pair is the minimum unit for computing the final output from the input. A good design is a simple one, because simplicity aids debugging and provides a more stable environment. The key point is how to make the Mapper and Reducer pair more independent of the physical data, so that we can reuse these computing units to accelerate development and reduce cost.

Computing Flow

As Figure 4 shows, rather than writing ever-bigger Mappers and Reducers, we can combine existing Mappers and Reducers in a workflow: the output of the first Map-Reduce becomes the input of the second. The whole workflow can be visualized and managed based on the data model introduced above. Data computing modeling helps manage the computing algorithms applied to data stored in raw format. As new tools are developed, the data model will assist in carrying the computing task forward, and help monitor the progress of data processing. Meanwhile, we can embed the data model into a big data system like Hadoop, so that once data is moved or imported, Hadoop can synchronize with the data model first.

Integration with Hive

Hive creates separate data systems for better querying and computing, and its tables have well-defined schemas. We can reverse engineer those schemas, import them into specific data blocks, and integrate Hive data and external raw data in Hadoop's HBase database within the same data model. This provides a useful basic architecture when Hive has to extract data from raw data and run computing tasks on both data sources during data mining.

Figure 4

Conclusion

The complexity of a relational database limits the scalability of data storage, but makes it very easy to query data through an SQL engine. Big data NoSQL systems have the opposite characteristics: unlimited scalability with more limited query capabilities. The challenge of big data is querying data easily. Creating data models on the physical data and the computing path helps manage raw data. The future will bring more hybrid systems combining the attributes of both approaches. Meanwhile, the dynamic schema model proposed in this paper, and systems such as Hive, offer help in the challenge of managing big data.
References

1. NoSQL Data Modeling Techniques, http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
2. Operational Data, Hadoop and New Modeling, http://erwin.com/expert_blogs/detail/opetarational_data_hadoop_and_new_modeling/
3. Big data, http://en.wikipedia.org/wiki/big_data

NOTICES

Copyright 2012 CA. All rights reserved. All trademarks, trade names, service marks and logos referenced herein belong to their respective companies. The information in this publication could include typographical errors or technical inaccuracies, and CA, Inc. ("CA") and the authors assume no responsibility for its accuracy or completeness. The statements and opinions expressed in this publication are those of the authors and are not necessarily those of CA. Certain information in this publication may outline CA's general product direction. However, CA may make modifications to any CA product, software program, service, method or procedure described in this publication at any time without notice, and the development, release and timing of any features or functionality described in this publication remain at CA's sole discretion. CA will support only the referenced products in accordance with (i) the documentation and specifications provided with the referenced product, and (ii) CA's then-current maintenance and support policy for the referenced product. Notwithstanding anything in this publication to the contrary, this publication shall not: (i) constitute product documentation or specifications under any existing or future written license agreement or services agreement relating to any CA software product, or be subject to any warranty set forth in any such written agreement; (ii) serve to affect the rights and/or obligations of CA or its licensees under any existing or future written license agreement or services agreement relating to any CA software product; or (iii) serve to amend any product documentation or specifications for any CA software product.

Any reference in this publication to third-party products and websites is provided for convenience only and shall not serve as the authors' or CA's endorsement of such products or websites. Your use of such products, websites, any information regarding such products or any materials provided with such products or on such websites shall be at your own risk. To the extent permitted by applicable law, the content of this publication is provided "AS IS" without warranty of any kind, including, without limitation, any implied warranties of merchantability, fitness for a particular purpose, or non-infringement. In no event will the authors or CA be liable for any loss or damage, direct or indirect, arising from or related to the use of this publication, including, without limitation, lost profits, lost investment, business interruption, goodwill or lost data, even if expressly advised in advance of the possibility of such damages. Neither the content of this publication nor any software product or service referenced herein serves as a substitute for your compliance with any laws (including but not limited to any act, statute, regulation, rule, directive, standard, policy, administrative order, executive order, and so on (collectively, "Laws")) referenced herein or otherwise or any contract obligations with any third parties. You should consult with competent legal counsel regarding any such Laws or contract obligations.