Solution Spotlight PREPARING A DATABASE STRATEGY FOR BIG DATA




Companies need to explore the newer approaches to handling large data volumes and begin to understand the limitations and challenges that come with technologies like Hadoop and NoSQL databases, if they are to avoid being swept away by the big data tidal wave. Read this SearchDataManagement.com E-Guide to learn why big data presents new database challenges and opportunities. Discover why, according to experts, massively parallel processing (MPP) and distributed computing approaches are becoming a popular choice to handle growing data volumes.

'BIG DATA' APPLICATIONS BRING NEW DATABASE CHOICES, CHALLENGES

I started out my career as a systems programmer and database administrator, working on what was then the state of the art in the world of databases: IMS, from IBM. Companies needed somewhere to put (and sometimes even retrieve) their data, and once things had moved beyond basic file systems, databases were the way to go. The volumes of data that had to be handled back then seem amusingly modest by the standards of today's big data applications, with IBM's 3380 mainframe disk able to store what seemed like a capacious 2.5 GB of data when it was launched in 1980.

To put data into IMS, you needed to understand how to navigate the physical structure of the database itself, and it was a radical step indeed when IBM launched DB2 in 1983. In this new approach, programmers would write in a language called SQL, and the database management system itself would figure out the best access path to the data via an optimiser. I recall some of my colleagues' deep scepticism about such dark arts, but the relational approach caught on, and within a few years the database world was a competitive place, with Ingres, Informix, Sybase and Oracle all battling it out along with IBM in the enterprise transaction processing market.
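To make that shift concrete, here is a minimal sketch of the declarative style, using Python's built-in sqlite3 module purely as a stand-in for the engines of that era (the table and index names are invented for illustration): the programmer states what data is wanted, and the engine's optimiser chooses the access path, which EXPLAIN QUERY PLAN lets you inspect.

import sqlite3

# A sketch of declarative SQL: the engine, not the programmer, chooses
# the access path. sqlite3 here stands in for any relational engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_customer ON orders (customer)")
conn.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                 [("acme", 120.0), ("globex", 75.5), ("acme", 42.0)])

# The query says what to fetch, never how to navigate the storage.
rows = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE customer = ?", ("acme",)).fetchall()
print(rows)  # [(162.0,)]

# Ask the optimiser which access path it chose; on this data it reports
# a search using the index rather than a full table scan.
for step in conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'acme'"):
    print(step)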

A gradual awareness that such databases were all optimised for transaction processing performance allowed a further range of innovation, and the first specialist analytical databases appeared. Products from smaller companies, with odd names like Red Brick and Essbase, became available, and briefly a thousand database flowers bloomed.

By the dawn of the millennium, the excitement had died down, and the database market had seen dramatic consolidation. Oracle, IBM and Microsoft dominated the landscape, having either bought out or crushed most of the competition. Object databases were snuffed out, and the database administrator beginning his or her career in 2001 could look forward to a stable world of a few databases, all based on SQL. Teradata had carved out a niche at the high-volume end, and Sybase had innovated with columnar instead of row-based storage, but the main debates in database circles were reduced to arguments over arcane revisions to the SQL standard. The database world had seemingly grown up and put on a cardigan and slippers.

How misleading that picture turned out to be. Few appreciated at the time that rapid growth in both the volume and types of data that companies collect was about to challenge the database incumbents and spawn another round of innovation.

Whilst Moore's Law was holding for processing speed, it was most decidedly not working for disk access speed. Solid-state drives helped some, but they were, and still are, very expensive. Database volumes were increasing faster than ever, due primarily to the explosion of social media data and machine-generated data, such as information from sensors, point-of-sale systems, mobile phone networks, Web server logs and the like. In 1997 (according to Winter Corp., which measures such things), the largest commercial database in the world was 7 TB in size, and that figure had only grown to about 30 TB by 2003. Yet it more than tripled to 100 TB by 2005, and by 2008 the first petabyte-sized database appeared. In other words, the largest databases increased tenfold in size between 2005 and 2008. The strains of analysing such volumes of data started to stretch and exceed the capacity of the mainstream databases.

ENTER MPP, COLUMNAR AND HADOOP

The database industry has responded in a number of ways. Throwing hardware at the problem was one way. Massively parallel processing (MPP) databases allow database loads to be split amongst many processors. The columnar data structure pioneered by Sybase turned out to be well suited to analytical processing workloads, and a range of new analytical databases sprang up, often combining columnar and MPP approaches.
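A tiny illustration may help here (plain Python lists and invented numbers, not any vendor's actual storage format): an analytical aggregate over one attribute touches a single contiguous column in the columnar layout, instead of stepping through every field of every row.

# The same table stored row-wise and column-wise. An aggregate over one
# attribute scans just one contiguous column in the columnar layout.

# Row-oriented: one tuple per record, as a transactional store keeps it.
rows = [
    (1, "acme", 120.0),
    (2, "globex", 75.5),
    (3, "acme", 42.0),
]
total_row_layout = sum(r[2] for r in rows)  # steps through every record

# Column-oriented: one array per attribute.
columns = {
    "id": [1, 2, 3],
    "customer": ["acme", "globex", "acme"],
    "amount": [120.0, 75.5, 42.0],
}
total_col_layout = sum(columns["amount"])  # reads only the one column

assert total_row_layout == total_col_layout == 237.5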

The giant database vendors responded with either their own versions or by simply purchasing upstart rivals. For example, Oracle brought out its Exadata offering, IBM purchased Netezza and Microsoft bought DATAllegro. A range of independent alternatives also remains on the market.

However, the big data challenge is of such a scale that more radical approaches have appeared as well. Google, having to deal with exponentially growing Web traffic, devised an approach called MapReduce, designed to work with a massively distributed file system. That work inspired an open source technology called Hadoop, along with an associated file system called HDFS. New databases followed that spurned SQL entirely or in large part, endeavouring to allow more predictable scalability and eliminating the constraints of a fixed database schema. This NoSQL approach brings with it a range of issues. A generation of programmers and software products have relied on a common standard for database access, with the removal of the need to understand internal database structure allowing considerable productivity gains.
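For readers who have never seen the pattern, here is a minimal single-machine sketch of MapReduce (plain Python, simulating in one process what Hadoop spreads across HDFS blocks and many nodes): map each record to key-value pairs, shuffle the pairs into groups by key, then reduce each group.

from collections import defaultdict

def map_phase(record):
    # Emit one (word, 1) pair per word, word-count style.
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    return (key, sum(values))

records = ["big data needs new tools", "big data big volumes"]

# Shuffle: group all mapped values by key.
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

results = [reduce_phase(k, v) for k, v in sorted(groups.items())]
print(results)  # [('big', 3), ('data', 2), ...]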

Programming for big data applications is an altogether trickier affair, and IT departments staffed with people who understand SQL are ill-equipped to tackle the world of MapReduce programming, parallel programming and key-value databases that is starting to represent the state of the art in tackling very large data sets. There are also considerable challenges for the new database technologies in coping with high availability, guaranteed consistency and tolerance to hardware failure, things which many organisations had previously started to take for granted.

Of course, not everyone is equally affected by such developments. The big data issues are most acutely felt in certain industries, such as Web marketing and advertising, telecoms, retail and financial services, and in certain government activities. Understanding the relationships between data is important in areas as diverse as fraud detection, counter-terrorism, medical research and energy metering. However, the recent data explosion is going to make life difficult in many industries, and those companies that can adapt well and gain the ability to analyse such data will have a considerable advantage over those that lag. New skill sets are going to be needed, and these skills will be scarce.
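As a taste of the key-value, schema-free style mentioned above, here is a minimal sketch in which a plain Python dictionary stands in for a real key-value store (the keys and records are invented): there is no fixed schema, and the only built-in access path is by key.

# Minimal key-value sketch: an in-process dict standing in for a real
# NoSQL store. No fixed schema, so values under different keys can have
# entirely different shapes, and lookups happen only by key.
store = {}

store["user:1001"] = {"name": "Ada", "email": "ada@example.com"}
store["user:1002"] = {"name": "Lin"}                    # fewer fields: fine
store["session:9f3a"] = ["page:home", "page:pricing"]   # different shape: fine

print(store["user:1001"]["name"])  # fast lookup by key: 'Ada'

# The flip side: there is no declarative query layer. Finding every user
# without an email means scanning and filtering in application code.
no_email = [k for k, v in store.items()
            if k.startswith("user:") and isinstance(v, dict) and "email" not in v]
print(no_email)  # ['user:1002']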

Companies need to explore the newer approaches to handling large data volumes and begin to understand the limitations and challenges that come with technologies like Hadoop and NoSQL databases, if they are to avoid being swept away by the big data tidal wave.

ANDY HAYLER is co-founder and CEO of analyst firm The Information Difference and a regular keynote speaker at international conferences on MDM, data governance and data quality. He is also a respected restaurant critic and author (see www.andyhayler.com).

IT INDUSTRY VETERAN DEMYSTIFIES THE SCALE UP VS. SCALE OUT DEBATE

Choosing sides in the scale-up vs. scale-out debate used to be easy. Choices were few, and for most organizations it was symmetric multiprocessing (SMP) all the way: the classic scale-up approach to computing. But with the rise of commodity hardware, and with more organizations looking to capitalize on the Internet-driven "big data" explosion, scaling out has become a more viable option. As a result, massively parallel processing (MPP) and distributed computing approaches are growing more popular all the time, according to Tony Iams, a senior vice president with IT analyst firm Ideas International in Port Chester, N.Y.

SearchDataManagement.com got on the phone with Iams to learn more about the longstanding scale-up vs. scale-out debate. Iams explained the most common uses of SMP, MPP and distributed computing and offered some advice for those seeking to match database workloads with the correct architecture approach.

Could you describe the prevailing approaches to hardware architecture and how they fit into the scale-up vs. scale-out debate?

Tony Iams: The first is symmetric multiprocessing, or SMP, which is the classic form of scaling up. That's where you have lots of processing units inside of a single computer. And I say "processing units" because that line is blurring. It used to be "processors," but now processors have multiple cores inside of them, and those cores might have a lot of threads. But the point is that all of it is inside of one enclosure, one computer system. That is the traditional approach to scaling up.

What are the other two major approaches?

Iams: Then you have scaling out, and the most extreme form of that is what you might call distributed computing. That is where you have lots and lots of computers that are cooperating to solve a problem.

The third approach is massively parallel processing (MPP), and that kind of sits in between [SMP and distributed computing] in the sense that you still have lots of machines that are collaborating on solving a problem. The difference is that with massively parallel processing, there is usually some assumption that there is some sharing of something. At a minimum, you have shared management, in that there may be many separate computers but you manage them as if they are a single computer. With MPP, there is also usually some form of sharing memory.

How does MPP differ from SMP in terms of sharing memory?

Iams: With SMP, by definition, all reading and writing can be done by any thread, core or processor. They can all get to the memory equally easily for reading or for writing purposes. In MPP, again, there is usually some form of sharing, but depending on the implementation (and there are many different implementations of MPP) you have different compromises that you have to make in terms of who can get to what memory; whether there is a penalty for reading the memory; and whether there is a penalty for writing the memory. With MPP, users have to consider how that memory sharing works: Who can read? Who can write? The answers to those questions are going to vary significantly [depending on] the implementation.
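A small Python illustration of the two ends of the spectrum Iams describes (it makes no claims about any particular vendor's hardware): threads within one process share a single address space that any of them can read or write, which is the SMP model, while separate processes own private memory and must exchange results explicitly, which is the distributed, scale-out model.

import threading
import multiprocessing

def main():
    # Scale up: threads update one shared structure directly, because
    # they all live in the same address space.
    shared = {"total": 0}
    lock = threading.Lock()

    def add(n):
        with lock:  # coordination is cheap when memory is common
            shared["total"] += n

    threads = [threading.Thread(target=add, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Scale out: worker processes share nothing; each sums its own
    # private slice, and results travel back as messages.
    with multiprocessing.Pool(4) as pool:
        partials = pool.map(sum, [[0], [1], [2], [3]])

    print(shared["total"], sum(partials))  # 6 6

if __name__ == "__main__":
    main()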

Why do I feel like I'm hearing vendor references to MPP more often than in the past?

Iams: I think because the industry in general is trending towards scaling out. What I just explained to you is purely an architectural view. But the more important aspect of this is now matching [the computer hardware architecture] with workloads. Depending on what kind of workload you're trying to host, each of these approaches is going to make more sense or less sense. You always have to be careful about matching the right workload with the right scalability approach.

How has the process of choosing the right scalability approach changed over the years?

Iams: The rules 10, 15 or 20 years ago were pretty clear. If you wanted to do heavy-duty database processing, you wanted it to scale up and you would use SMP. That was it. End of story. The number of workloads for which you would want to use distributed computing or even MPP was extremely limited. But nowadays, with the Internet and Web-based computing and all of these services that people are using on the Web, like Facebook and Google, a lot of those work really well with a scale-out approach, and distributed computing and MPP are starting to be applied more widely.

How should an IT organization go about matching database workloads with the right scalability approach?

Iams: If you're just talking about the classic transaction-driven database workloads that drive the day-to-day operations of a typical business, those still work best on SMP-type systems. That is because with transactions, you're going to be writing a lot of data by definition. If you're going to write something, you have to have very efficient access to that memory, and that is why you need SMP. There is no more efficient way to access memory than with SMP.

What if the database workload is created for business intelligence or analytics purposes?

Iams: More and more database work is based on analysis, which is not necessarily writing data. In this case, you're more interested in reading the data, because you're going to go through there and analyze it, looking for patterns, business opportunities and trends. That is increasingly where the value is, and that is not a new thing. People have been doing data warehouses and stuff like that for a long time. But now there has been an uptick in the volume of data. Internet usage is generating data, mobile phone usage is generating data, and all that stuff is tracked now. That has produced an explosion in data. [Data warehouses and associated analytics projects] have to scale more than ever before, and MPP is the right answer for that.
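To see why that read-heavy work scales out so naturally, consider this sketch (invented data, worker processes standing in for MPP nodes): each worker scans only its own partition, nothing is written, and only the small partial results are combined.

import multiprocessing

def scan_partition(partition):
    # Each worker reads only its own slice; the analysis is pure reading,
    # so no cross-worker coordination is needed.
    return sum(amount for amount in partition if amount > 50.0)

def main():
    # Hypothetical sales amounts, pre-partitioned across four workers.
    partitions = [
        [120.0, 42.0, 75.5],
        [18.0, 260.0],
        [99.0, 51.0, 12.5],
        [300.0],
    ]
    with multiprocessing.Pool(len(partitions)) as pool:
        partials = pool.map(scan_partition, partitions)
    print(sum(partials))  # 905.5: total over amounts > 50.0

if __name__ == "__main__":
    main()

In a real MPP database the partitions live on separate nodes with their own disks and memory, but the shape of the computation is the same: independent local scans followed by a cheap combination step.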
