Solution Spotlight
PREPARING A DATABASE STRATEGY FOR BIG DATA
Companies need to explore the newer approaches to handling large data volumes and begin to understand the limitations and challenges that come with technologies like Hadoop and NoSQL databases, if they are to avoid being swept away by the big data tidal wave. Read this SearchDataManagement.com e-guide to learn why big data presents new database challenges and opportunities. Discover why, according to experts, massively parallel processing (MPP) and distributed computing approaches are becoming a popular choice for handling growing data volumes.
'BIG DATA' APPLICATIONS BRING NEW DATABASE CHOICES, CHALLENGES

I started out my career as a systems programmer and database administrator, working on what was then the state of the art in the world of databases: IMS, from IBM. Companies needed somewhere to put (and sometimes even retrieve) their data, and once things had moved beyond basic file systems, databases were the way to go. The volumes of data that had to be handled back then seem amusingly modest by the standards of today's big data applications, with IBM's 3380 disk unit able to store what seemed like a capacious 2.5 GB of data when it was launched in 1980. To put data into IMS, you needed to understand how to navigate the physical structure of the database itself, and it was a radical step indeed when IBM launched DB2 in 1983. In this new approach, programmers would write in a language called SQL and the database management system itself would figure out the best access path to the data via an optimiser. I recall some of my colleagues' deep scepticism about such dark arts, but the relational approach caught on, and within a few years the database world
was a competitive place, with Ingres, Informix, Sybase and Oracle all battling it out along with IBM in the enterprise transaction processing market. A gradual awareness that such databases were all optimised for transaction processing performance allowed a further range of innovation, and the first specialist analytical databases appeared. Products from smaller companies, with odd names like Red Brick and Essbase, became available, and briefly a thousand database flowers bloomed. By the dawn of the millennium, the excitement had died down, and the database market had seen dramatic consolidation. Oracle, IBM and Microsoft dominated the landscape, having either bought out or crushed most of the competition. Object databases were snuffed out, and the database administrator beginning his or her career in 2001 could look forward to a stable world of a few databases, all based on SQL. Teradata had carved out a niche at the high-volume end, and Sybase had innovated with columnar instead of row-based storage, but the main debates in database circles were reduced to arguments over arcane revisions to the SQL standard. The database world had seemingly grown up and put on a cardigan and slippers. How misleading that picture turned out to be. Few appreciated at the time that rapid growth in both the volume and types of data that companies collect
was about to challenge the database incumbents and spawn another round of innovation. Whilst Moore's Law was holding for processing speed, it was most decidedly not working for disk access speed. Solid-state drives helped some, but they were, and still are, very expensive. Database volumes were increasing faster than ever, due primarily to the explosion of social media data and machine-generated data, such as information from sensors, point-of-sale systems, mobile phone networks, Web server logs and the like. In 1997 (according to Winter Corp., which measures such things), the largest commercial database in the world was 7 TB in size, and that figure had only grown to about 30 TB by 2003. Yet it more than tripled to 100 TB by 2005, and by 2008 the first petabyte-sized database appeared. In other words, the largest databases increased tenfold in size between 2005 and 2008. The strains of analysing such volumes of data started to stretch and exceed the capacity of the mainstream databases.

ENTER MPP, COLUMNAR AND HADOOP

The database industry has responded in a number of ways. Throwing hardware at the problem was one way. Massively parallel processing (MPP) databases allow database loads to be split amongst many processors. The columnar data
structure pioneered by Sybase turned out to be well suited to analytical processing workloads, and a range of new analytical databases sprang up, often combining columnar and MPP approaches. The giant database vendors responded with either their own versions or by simply purchasing upstart rivals. For example, Oracle brought out its Exadata offering, IBM purchased Netezza and Microsoft bought DATAllegro. A range of independent alternatives also remains on the market. However, the big data challenge is of such a scale that more radical approaches have also appeared. Google, having to deal with exponentially growing Web traffic, devised an approach called MapReduce, designed to work with a massively distributed file system. That work inspired an open source technology called Hadoop, along with an associated file system called HDFS. New databases followed that spurned SQL entirely or in large part, endeavouring to allow more predictable scalability and to eliminate the constraints of a fixed database schema. This NoSQL approach brings with it a range of issues. A generation of programmers and software products have relied on a common standard for database access, with the removal of the need to understand internal database structure allowing considerable productivity gains. Programming for big
data applications is an altogether trickier affair, and IT departments that are staffed with people who understand SQL are ill-equipped to tackle the world of MapReduce programming, parallel programming and key-value databases that is starting to represent the state of the art in tackling very large data sets. There are also considerable challenges for the new database technologies in coping with high availability, guaranteed consistency and tolerance to hardware failure, things which many organisations had previously come to take for granted. Of course, not everyone is equally affected by such developments. The big data issues are most acutely felt in certain industries, such as Web marketing and advertising, telecoms, retail and financial services, and certain government activities. Understanding the relationships between data is important in areas as diverse as fraud detection, counter-terrorism, medical research and energy metering. However, the recent data explosion is going to make life difficult in many industries, and those companies that can adapt well and gain the ability to analyse such data will have a considerable advantage over those that lag. New skill sets are going to be needed, and these skills will be scarce. Companies need to explore the newer approaches to handling large data volumes and begin to
understand the limitations and challenges that come with technologies like Hadoop and NoSQL databases, if they are to avoid being swept away by the big data tidal wave.

ANDY HAYLER is co-founder and CEO of analyst firm The Information Difference and a regular keynote speaker at international conferences on MDM, data governance and data quality. He is also a respected restaurant critic and author (see www.andyhayler.com).

IT INDUSTRY VETERAN DEMYSTIFIES THE SCALE UP VS. SCALE OUT DEBATE

Choosing sides in the scale up vs. scale out debate used to be easy. Choices were few, and for most organizations it was symmetric multiprocessing (SMP) all the way: the classic scale up approach to computing. But with the rise of commodity hardware and more organizations looking to capitalize on the Internet-driven "big data" explosion, scaling out has become a more viable option. As a result, massively parallel processing (MPP) and distributed computing approaches are growing more popular all the time,
according to Tony Iams, a senior vice president with IT analyst firm Ideas International in Port Chester, N.Y. SearchDataManagement.com got on the phone with Iams to learn more about the longstanding scale up vs. scale out debate. Iams explained the most common uses of SMP, MPP and distributed computing and had some advice for those seeking to match database workloads with the correct architecture approach.

Could you describe the prevailing approaches to hardware architecture and how they fit into the scale up vs. scale out spectrum?

Tony Iams: The first is symmetric multiprocessing, or SMP, which is the classic form of scaling up. That's where you have lots of processing units inside of a single computer. And I say "processing units" because that line is blurring. It used to be "processors," but now processors have multiple cores inside of them and those cores might have a lot of threads. But the point is that all of it is inside of one enclosure, one computer system. That is the traditional approach to scaling up.

What are the other two major approaches?

Iams: Then you have scaling out, and the most extreme form of that is what you might call distributed computing. That is where you have lots and lots of
computers that are cooperating to solve a problem. The third approach is massively parallel processing (MPP), and that kind of sits in between [SMP and distributed computing] in the sense that you still have lots of machines that are collaborating on solving a problem. The difference is that with massively parallel processing, there is usually some assumption that there is some sharing of something. At a minimum, you have shared management, in that there may be many separate computers but you manage them as if they are a single computer. With MPP, there is also usually some form of sharing memory.

How does MPP differ from SMP in terms of sharing memory?

Iams: With SMP, by definition, all reading and writing can be done by any thread, core or processor. They can all get to the memory equally easily for reading or for writing purposes. In MPP, again, there is usually some form of sharing, but depending on the implementation (and there are many different implementations of MPP) you have different compromises that you have to make in terms of who can get to what memory; whether there is a penalty for reading the memory; and whether there is a penalty for writing the memory. With MPP, users have to consider how that memory sharing works: Who can read? Who can write? The answers to those questions are going to vary significantly [depending on] the implementation.
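The distinction Iams draws can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not any vendor's implementation: both functions sum the same partitioned data, but the first follows the SMP model (every worker reads and writes one shared total), while the second follows the scale-out model (each worker owns its partition and communicates only by sending a partial result as a message). Threads stand in for separate machines, and all names are illustrative.

```python
import queue
import threading

# Illustrative workload: sum the numbers 0..999, split across four workers.
DATA = list(range(1000))
PARTITIONS = [DATA[i::4] for i in range(4)]

def smp_style_sum(partitions):
    """SMP model: all workers share one memory location for the total."""
    state = {"total": 0}
    lock = threading.Lock()  # shared, writable memory requires coordination

    def worker(part):
        s = sum(part)
        with lock:           # any thread may write the shared total
            state["total"] += s

    threads = [threading.Thread(target=worker, args=(p,)) for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["total"]

def mpp_style_sum(partitions):
    """Scale-out model: each worker touches only the partition it owns and
    sends its partial result back as a message; there is no shared state.
    (Threads simulate what would be separate nodes in a real MPP system.)"""
    mailbox = queue.Queue()

    def worker(part):
        mailbox.put(sum(part))   # a message, not a write to shared memory

    threads = [threading.Thread(target=worker, args=(p,)) for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in partitions)
```

The compromise Iams describes shows up directly in the code: the shared-memory version needs a lock on every write, while the message-passing version needs none, because nothing is shared until the final merge.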
Why do I feel like I'm hearing vendor references to MPP more often than in the past?

Iams: I think because the industry in general is trending towards scaling out. What I just explained to you is purely an architectural view. But the more important aspect of this is now matching [the computer hardware architecture] with workloads. Depending on what kind of workload you're trying to host, each of these approaches is going to make more sense or less sense. You always have to be careful about matching the right workload with the right scalability approach.

How has the process of choosing the right scalability approach changed over the years?

Iams: The rules 10, 15 or 20 years ago were pretty clear. If you wanted to do heavy-duty database processing, you wanted it to scale up and you would use SMP. That was it. End of story. The number of workloads that you would want to use with distributed computing or even MPP was extremely limited. But nowadays, with the Internet and Web-based computing and all of these services that people are using on the Web, like Facebook and Google, a lot of those work really well with a scale out approach, and distributed computing and MPP are starting to be applied more widely.
How should an IT organization go about matching database workloads with the right scalability approach?

Iams: If you're just talking about the classic transaction-driven database workloads that drive the day-to-day operations of a typical business, that still works best on SMP-type systems. That is because with transactions, you're going to be writing a lot of data by definition. If you're going to write something, you have to have very efficient access to that memory, and that is why you need SMP. There is no more efficient way to access memory than with SMP.

What if the database workload is created for business intelligence or analytics purposes?

Iams: More and more database work is based on analysis, which is not necessarily writing data. In this case, you're more interested in reading the data, because you're going to go through there and analyze it looking for patterns, business opportunities and trends. That is increasingly where the value is, and that is not a new thing. People have been doing data warehouses and stuff like that for a long time. But now there has been an uptick in the volume of data. Internet usage is generating data, mobile phone usage is generating data, and all that stuff is tracked now. That has produced an explosion in data. [Data warehouses and associated analytics projects] have to scale more than ever before,
and MPP is the right answer for that.
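The read-heavy analytical workloads Iams describes are a natural fit for MPP because they can follow a scatter-gather pattern: each node aggregates only the partition it owns, and a coordinator merges the small partial results. The sketch below is a minimal illustration of that pattern for a GROUP BY-style aggregation; threads stand in for database nodes, and the (region, amount) record layout and function names are purely illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partial_aggregate(partition):
    """Scatter step: one 'node' aggregates only the rows it owns."""
    totals = Counter()
    for region, amount in partition:
        totals[region] += amount
    return totals

def mpp_style_group_by(partitions):
    """Run the partial aggregations in parallel, then merge them.
    Only small partial results cross 'node' boundaries, never raw rows,
    which is why scan-heavy analytics scales out so well."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(partial_aggregate, partitions)
    # Gather step: the coordinator merges the partial results.
    result = Counter()
    for p in partials:
        result.update(p)  # Counter.update adds counts rather than replacing
    return dict(result)
```

For example, splitting sales records across two partitions and grouping totals by region produces the same answer as a single-node scan, but each partition can live on, and be scanned by, a different machine.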
FREE RESOURCES FOR TECHNOLOGY PROFESSIONALS

TechTarget publishes targeted technology media that address your need for information and resources for researching products, developing strategy and making cost-effective purchase decisions. Our network of technology-specific Web sites gives you access to industry experts, independent content and analysis, and the Web's largest library of vendor-provided white papers, webcasts, podcasts, videos, virtual trade shows, research reports and more, drawing on the rich R&D resources of technology providers to address market trends, challenges and solutions. Our live events and virtual seminars give you access to vendor-neutral, expert commentary and advice on the issues and challenges you face daily. Our social community IT Knowledge Exchange allows you to share real-world information in real time with peers and experts.

WHAT MAKES TECHTARGET UNIQUE?

TechTarget is squarely focused on the enterprise IT space. Our team of editors and network of industry experts provide the richest, most relevant content to IT professionals and management. We leverage the immediacy of the Web, the networking and face-to-face opportunities of events and virtual events, and the ability to interact with peers, all to create compelling and actionable information for enterprise IT professionals across all industries and markets.