Modernizing Your Data Warehouse Architecture with Hadoop
MARCH 2014 | TDWI E-Book

Modernizing Your Data Warehouse Architecture with Hadoop

Contents
- Q&A: Best Practices for Offloading Tasks to Hadoop
- How (and Why) Hadoop Is Changing the Data Warehousing Paradigm
- Hadoop and Data Management: A Close Relationship
- About Syncsort

Sponsored by Syncsort | tdwi.org
Q&A: Best Practices for Offloading Tasks to Hadoop

The growing costs of IT infrastructure and larger data volumes are driving enterprises to offload data management chores to Hadoop. We explore how enterprises can make the most of this shift with Jorge A. Lopez, director of product marketing at Syncsort. With over 15 years of experience in business intelligence and data integration, Lopez is responsible for product marketing and strategy at Syncsort.

TDWI: How would you describe the current state of big data integration?

Jorge A. Lopez: For decades, organizations have struggled with critical performance and scalability shortcomings of conventional data integration. These shortcomings forced them to push heavy data integration workloads down to the data warehouse. As a result, core data integration experienced a shift from extract, transform, and load (ETL) to extract, load, and transform (ELT). Although this worked in the short term, it also created a whole new set of problems for the IT organization, from longer batch windows to shorter data retention and rapidly growing database costs.

Today, big data is aggravating these problems. More than ever, your organization's survival depends on your ability to transform data into actionable insights. This need to analyze more data, from a more diverse set of sources, in less time, while keeping costs under control, is creating a lot of tension in existing data integration architectures.

What is driving organizations to incorporate Hadoop into their data management environments?

It's precisely this tension between the evolving needs of the business and the growing costs of IT infrastructure that's driving many Hadoop implementations. Hadoop offers a better alternative for data integration, with an approach that is economically feasible while providing the required levels of performance and massive scalability. Hadoop has the potential to become the ideal staging area where you can store and archive all of your data, both structured and unstructured, but you can also pre-process it, executing all of the batch workloads there, and then feed it to other pieces of your architecture.
By effectively offloading data and ELT workloads from the data warehouse into Hadoop, organizations can significantly reduce batch windows, keep readily available data as long as they need it, and free up significant data warehouse capacity. This means no trade-offs and no tension, just the data you need to drive your business.

What challenges do organizations face as they offload their data warehouse into Hadoop?

It's key to understand that Hadoop is not a complete ETL solution. In my opinion, Hadoop is much closer to an operating system: it provides services for developers and vendors to create big data applications. However, although it offers powerful utilities and massive horizontal scalability, it does not provide the set of functionality users need to deliver enterprise ETL capabilities. That's why offloading data and ETL workloads to Hadoop can be intimidating. Some of the key challenges involve identifying the right tools to close the functional gaps between enterprise ETL and Hadoop. Where do you begin? How do you know which workloads to move? Do you have all the tools necessary to access and move your data and processing? How can you optimize processing once it's inside Hadoop?

What are some best practices to overcome these challenges?

First of all, let me be clear: the data warehouse is not going away. The goal of offloading is to free up database capacity to reduce costs, improve query response time for database users, and use that premium database capacity more wisely. To that end, most organizations follow a three-step approach.

The first step consists of identifying infrequently used (cold) data and heavy ELT workloads in your data warehouse. We have heard from partners and customers alike that, in many cases, ELT processes performed on this cold data waste significant premium storage and CPU resources in the data warehouse while adding zero value. Similarly, heavy transformations, including changed data capture (CDC), slowly changing dimensions, ranking functions, volatile tables, multiple merges, joins, cursors, and unions, can consume up to 80 percent of resources.

The second step is all about moving the data and replicating the existing ELT workloads into Hadoop. The best way to do this is by leveraging existing skills within your organization. Tools with point-and-click interfaces can help accelerate development and ongoing maintenance of the new environment.

Finally, once you've offloaded data and workloads from the data warehouse, you will need enterprise-grade tools to manage, secure, and operationalize the new environment. Here, it is important to look for solutions that support common security standards such as Kerberos and LDAP, as well as monitoring and management tools.
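Lopez's first step, inventorying cold data and heavy ELT workloads, typically starts from the warehouse's own query logs or usage views. As a rough, hypothetical illustration only (the export file name, column names, and thresholds below are invented for the example, not part of any vendor tool), a few lines of Python can flag tables that have gone untouched for 90 days and the transformations consuming the most CPU:

```python
import csv
from datetime import datetime, timedelta

# Hypothetical export of warehouse query-log statistics, one row per table:
# table_name, last_accessed (YYYY-MM-DD), monthly_cpu_seconds, size_gb
LOG_EXPORT = "query_log_summary.csv"
COLD_AFTER = timedelta(days=90)

def find_offload_candidates(path, today=None):
    """Flag cold tables (no access in 90 days) and the heaviest ELT consumers."""
    today = today or datetime.now()
    cold, heavy = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            last_used = datetime.strptime(row["last_accessed"], "%Y-%m-%d")
            cpu = float(row["monthly_cpu_seconds"])
            if today - last_used > COLD_AFTER:
                cold.append((row["table_name"], float(row["size_gb"])))
            if cpu > 10_000:  # arbitrary threshold for a "heavy" workload
                heavy.append((row["table_name"], cpu))
    return sorted(cold, key=lambda t: -t[1]), sorted(heavy, key=lambda t: -t[1])

if __name__ == "__main__":
    cold, heavy = find_offload_candidates(LOG_EXPORT)
    print("Cold data (archive to Hadoop first):", cold[:10])
    print("Heaviest ELT consumers (replicate in Hadoop):", heavy[:10])
```

Everything a pass like this flags is a candidate for the second step: moving the data and replicating the corresponding ELT logic in Hadoop.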
Where does the cloud fit within this picture?

Hadoop presents a great opportunity to collect, process, and analyze extreme data volumes at a much lower cost. However, procuring, deploying, and maintaining a Hadoop environment can be a daunting task, and that's exactly where the cloud comes into the picture. Cloud services such as Amazon EMR, Google Cloud, and others allow organizations to instantly provision a Hadoop framework, effectively lowering the barriers to wider adoption and leveling the playing field. Not only that, it means any organization, regardless of its size, can have a Hadoop cluster with virtually unlimited scalability. That's why the convergence of cloud and Hadoop is so disruptive.

Does legacy data have a place in Hadoop?

Absolutely. Historically, the most data-intensive businesses have relied on mainframes to manage their big data. These organizations, in retail, financial services, healthcare, and telecommunications, know they cannot neglect mainframe data. However, they also need to be aware of the skills, integration, and cost gaps between the mainframe and Hadoop in order to provide fast, reliable, and secure access to mainframe data.

What products or services does Syncsort offer for offloading data and processing to Hadoop?

Syncsort provides targeted solutions to address the challenges of offloading data and workloads from legacy systems, such as data warehouses and mainframes, to Hadoop. Our fully integrated approach gives you the tools to automatically identify data and processing suitable for offload, easily migrate them to Hadoop with the help of a graphical user interface, and, once there, optimize and secure your Hadoop environment. This is true for both the enterprise data warehouse and the mainframe. Our mainframe heritage means you can also analyze mainframe workloads and easily access, translate, and move mainframe data to Hadoop. Our solutions can be deployed on premises with Syncsort DMX-h or in the cloud with Ironcluster for Amazon EMR.
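Lopez's point about the gap between mainframe and Hadoop data is easy to see at the byte level: mainframe files are typically EBCDIC-encoded, fixed-width records, while Hadoop tooling expects delimited text or a self-describing format. Purely as an illustration (the record layout below is invented, and real copybook-driven conversions also handle packed decimals, redefines, and more), a few lines of Python show the kind of translation step involved before such data can land in HDFS:

```python
# Toy conversion of an EBCDIC fixed-width record into a delimited line.
# Hypothetical layout: account id (10 bytes), name (20 bytes), balance (8 bytes as text).
FIELDS = [("account_id", 10), ("name", 20), ("balance", 8)]

def ebcdic_record_to_csv(record: bytes) -> str:
    """Decode one fixed-width EBCDIC record (code page 037) into a CSV line."""
    text = record.decode("cp037")  # cp037 is Python's built-in US EBCDIC codec
    out, pos = [], 0
    for _, width in FIELDS:
        out.append(text[pos:pos + width].strip())
        pos += width
    return ",".join(out)

if __name__ == "__main__":
    # Build a sample record and encode it as EBCDIC for the demonstration.
    sample = ("0000123456" + "JANE DOE".ljust(20) + "00104250").encode("cp037")
    print(ebcdic_record_to_csv(sample))  # -> 0000123456,JANE DOE,00104250
```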
How (and Why) Hadoop Is Changing the Data Warehousing Paradigm

By Jack Norris

Hadoop will not replace relational databases or traditional data warehouse platforms, but its superior price/performance ratio can help organizations lower costs while maintaining their existing applications and reporting infrastructure. How should your enterprise get started?

The emergence of new data sources and the need to analyze virtually everything, including unstructured data and live event streams, have led many organizations to a startling conclusion: a single enterprise data warehousing platform can no longer handle the growing breadth and depth of analytical workloads. Because it is purpose-built for big data analytics, Hadoop is becoming a strategic addition to the data warehousing environment, where it can fulfill several roles.

Why Hadoop (and Why Now)

Organizations across all industries are confronting the same challenge: data is arriving faster than existing data warehousing platforms are able to absorb and analyze it. The migration to online channels, for example, is driving unprecedented volumes of transaction and clickstream data, which are, in turn, driving up the cost of data warehouses, ETL processing, and analytics. Compounding the challenge, much of this new data is unstructured. Many businesses, for example, now want to analyze more complex, high-value data types (such as clickstream and social media data, as well as un-modeled, multi-structured data) to gain new insights. The problem is that these new data types do not fit the existing massively parallel processing model in most data warehouses, which was designed for structured data.

The cost to scale traditional data warehousing technologies is high and eventually becomes prohibitive. Even if the cost could be justified, the performance would be insufficient to accommodate today's growing volume, velocity, and variety of data. Something more scalable and cost-effective is needed, and Hadoop satisfies both needs.

Hadoop is a complete, open-source ecosystem for capturing, organizing, storing, searching, sharing, analyzing, visualizing, and otherwise processing disparate data sources (structured, semi-structured, and unstructured) in a cluster of commodity computers. This architecture gives Hadoop clusters incremental and virtually unlimited scalability, from a few to a few thousand servers, each offering local storage and computation. Hadoop's ability to store and analyze large data sets in parallel on a large cluster of computers yields exceptional performance, while the use of commodity hardware results in a remarkably low cost. In fact, Hadoop clusters often cost 50 to 100 times less on a per-terabyte basis than today's typical data warehouse. With such an impressive price/performance ratio, it should come as no surprise that Hadoop is changing the data warehousing paradigm.
Hadoop's Role in the New Data Warehousing Paradigm

Hadoop's role in data warehousing is evolving rapidly. Initially, Hadoop was used as a transitory platform for extract, transform, and load (ETL) processing. In this role, Hadoop is used to offload processing and transformations previously performed in the data warehouse. This replaces an ELT (extract, load, and transform) process that required loading data into the data warehouse as a means to perform complex and large-scale transformations. With Hadoop, data is extracted and loaded into the Hadoop cluster, where it can then be transformed, potentially in near real time, with the results loaded into the data warehouse for further analysis. In all fairness, ELT processes began as a way of taking advantage of the parallel query processing available in the data warehouse platform. Offloading transformation processing to Hadoop frees up considerable capacity in the data warehouse, thereby postponing or avoiding an expensive expansion or upgrade to accommodate the relentless data deluge.

Hadoop has a role to play at the front end, performing transformation processing, as well as at the back end, offloading data from the data warehouse. With virtually unlimited scalability at a per-terabyte cost that is more than 50 times lower than that of traditional data warehouses, Hadoop is well suited to data archiving. Because Hadoop can perform analytics on the archived data, only the specific result sets (and not the full, large set of raw data) need to be moved to the data warehouse for further analysis. Appfluent, a data usage analytics provider, calls this the Active Archive, an oxymoron that accurately reflects the value-added potential of using Hadoop in today's data warehousing environment. Appfluent has found that, for many companies, about 85 percent of tables go unused, and that in the active tables, up to 50 percent of the columns go unused. The combined savings from eliminating dead data at the ETL stage and relocating dormant data to a low-cost Hadoop Active Archive can be truly extraordinary.

Although Hadoop offers a superior price/performance ratio at both the front and back ends of a data warehouse, its best role may well be as an end in and of itself. This is particularly true given how much Hadoop has evolved since its early days of batch-oriented analysis of Web content for search engines. Consider, for example, the inclusion of HBase in the Hadoop ecosystem. HBase is a non-relational, NoSQL database that sits atop the Hadoop Distributed File System (HDFS). In certain distributions, HBase applications gain several advantages, including high performance and consistently low latency for database operations.

Of course, Hadoop's original MapReduce framework, purpose-built for large-scale parallel processing, is also eminently suitable for data analytics in a data warehousing environment. In fact, MapReduce is fully capable of everything from complex analyses of structured data to exploratory analyses of un-modeled, multi-structured data. An exploratory analysis, for example, could derive structure from unstructured data, enabling the data to be loaded into HBase, Hive, or the existing data warehouse for further analysis. Such preprocessing is so effective and cost-effective that a growing number of ETL processes are being rewritten as MapReduce jobs. These efforts are often assisted by Hive's ability to convert ETL-generated SQL transformations into MapReduce jobs.
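To make the ETL-as-MapReduce idea concrete, here is a deliberately minimal sketch, not taken from the e-book, of a transformation that is often pushed down to the warehouse as ELT (a group-by aggregation over raw clickstream records) expressed as a Hadoop Streaming mapper and reducer in Python. The field positions, file paths, and streaming-jar location are assumptions for the example; Hive would generate an equivalent MapReduce job from a SQL GROUP BY.

```python
#!/usr/bin/env python3
"""Hadoop Streaming sketch: count page views per URL from raw clickstream lines.

Assumed input format (tab-separated): timestamp, user_id, url
Illustrative invocation (paths and jar name vary by distribution):
  hadoop jar hadoop-streaming.jar \
    -input /landing/clickstream -output /staging/pageviews \
    -mapper "python3 pageviews.py map" -reducer "python3 pageviews.py reduce" \
    -file pageviews.py
"""
import sys

def mapper():
    # Emit (url, 1) for every well-formed click record.
    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            print(f"{parts[2]}\t1")

def reducer():
    # Streaming delivers input sorted by key; sum the counts per URL.
    current_url, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip("\n").split("\t")
        if url != current_url:
            if current_url is not None:
                print(f"{current_url}\t{count}")
            current_url, count = url, 0
        count += int(value)
    if current_url is not None:
        print(f"{current_url}\t{count}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()
```

The output of a job like this, a small aggregate rather than the raw clickstream, is what would then be loaded into the warehouse, which is exactly the "move only the result sets" pattern described above.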
Although these MapReduce conversions work well, performance can be improved by rewriting the intermediate shuffle phase that occurs after the Map and before the Reduce functions. Optimizing the shuffle benefits the sorting, aggregation, hashing, pattern-matching, and other processes that are integral to ETL/ELT. Because it is quite common with MapReduce for the output of one job to become the input of another, Hadoop effectively makes ETL integral to, and seamless with, data analytics and archival processing.

It is this beginning-to-end role in data warehousing that has given impetus to what may be Hadoop's ultimate role: an enterprise data management hub in a multi-platform data analytics environment. Indeed, it is almost as if Hadoop is destined to fulfill this role based on its versatility, scalability, compatibility, and affordability. Although Hadoop appears perfectly suited for use as an enterprise data management hub, there is (as always) a caveat: some Hadoop distributions and/or configurations lack enterprise-class capabilities. As a hub, the Hadoop cluster must offer mission-critical high availability and robust data protection. The former can be achieved by eliminating any single points of failure, the latter by supporting both snapshots for point-in-time data recovery and remote mirroring for disaster recovery.
[Figure: An Enterprise Data Hub. The diagram shows an enterprise data hub fed by sources such as SCM, billing, CRM, sales, public and location data, Web logs, social media, clickstreams, production data, and sensor data. Caption: A data hub combines different data sources, minimizes data movement, and uses one platform for analytics.]

Conclusion

The data deluge, with its three equally challenging dimensions of variety, volume, and velocity, has made it impossible for any single platform to meet all of an organization's data warehousing needs. Hadoop will not replace relational databases or traditional data warehouse platforms, but its superior price/performance ratio will give organizations an option to lower costs while maintaining their existing applications and reporting infrastructure.

So get started with Hadoop at the front end with ETL, at the back end with an Active Archive, or in between by supplementing existing technologies with Hadoop's parallel processing prowess for both structured and unstructured data, depending on your greatest need. For those still reluctant to make the investment at this time, consider getting started in the cloud, where Hadoop is now available as an on-demand service. However your organization gets started, be prepared to become a believer in the new multi-platform data warehousing paradigm in general, and in Hadoop as a potential and powerful enterprise data management hub.

Jack Norris is the chief marketing officer of MapR Technologies and leads the company's worldwide marketing efforts. He has over 20 years of enterprise software marketing and product management experience in defining and delivering analytics, storage, and information delivery products. Jack has also held senior executive roles with EMC, Rainfinity, Brio Technology, SQRIBE, and Bain and Company. He earned an MBA from UCLA Anderson and a BA in economics with honors and distinction from Stanford University.
Hadoop and Data Management: A Close Relationship

Hadoop doesn't require you to rewire your existing data management best practices, only revise them.

Contrary to what you might have heard, Hadoop doesn't rewrite the data management (DM) rulebook. Not in whole, and not necessarily in (large) part. In fact, Hadoop and traditional data management are highly complementary. From a DM perspective, Hadoop enables fundamentally new practices and excels in contexts in which the data warehouse (DW), that linchpin of traditional DM, founders. For this reason, Hadoop and the DW are not mutually antagonistic. Far from it: many of the new (or non-traditional) use cases that Hadoop enables in turn enable new and non-traditional use cases in DM. This doesn't so much require rewriting the DM rulebook as revising it; in most cases, revising the rulebook means incorporating changes as new additions to the standard text. This is what it means to extend traditional DM with Hadoop: it's a question of identifying new and existing use cases and best practices that leverage Hadoop's strengths in a complementary capacity.

"Hadoop can be a powerful platform that enables scalability and handles diverse data types for certain components of your DW architecture," notes Philip Russom, TDWI Research director for data management. For example, Hadoop can function in any of several roles as a long-term repository in which to store, persist, and manage data. Some of these roles are by now well known (the Hadoop "landing zone"), while others are relatively new. From a DM perspective, all of these roles tend to be highly complementary: it simply is not cost-effective, for example, to use an RDBMS or an MPP DBMS platform to implement a landing zone- or data lake-like scheme.

However, just because Hadoop is complementary doesn't mean that it doesn't also constitute a significant enhancement or extension of, and in some respects a departure from, existing technology and practices. Hadoop enhances and extends traditional data management practices and enables entirely new practices or use cases. It is both continuous with traditional DM and at the same time much bigger: it enables an altogether new kind of big data management, a category that subsumes traditional DM and centers on the Hadoop environment itself.

Russom describes a use case in which Hadoop functions simultaneously as a low-cost replacement for, and, in the form of a massive online data archive, an extension of the otherwise-indispensable operational data store (ODS). To free up capacity on a data warehouse, many organizations manage detailed source data on an ODS consisting of a standalone hardware server running a DBMS instance. "There's a need for ODS platforms that cost-effectively handle massive data volumes and more diverse data, which Hadoop can do," he points out. The source data stored long term in ODSs can approach petabyte scale. Examples include call detail records in telco, sessionized clickstreams in e-commerce, and customer data in financial services.
To cope with large volumes, some data is archived offline, which puts it beyond the reach of analytics. Hadoop can keep that data online for constant access.

DM Best Practices, Hadoop Style

The data warehouse was designed in part to address a data access challenge: the DW facilitates query access to business information by centralizing data, by imposing strict constraints on data types, and by organizing data in a strict, rigidly defined schema. In effect, the DW model brings the data to the analytic logic. Hadoop and big data effectively upend this model, Russom notes: in the Hadoop model, after all, both data and analytic logic are (or can be made to be) in the same place. "For decades, most analytic tools required that data be transformed to a special model and moved to a special database or file prior to analysis," he explains. "Given the volumes of today's big data, this is no longer feasible. However, Hadoop was designed for processing data in place. Think of how MapReduce and Hive access and process Hadoop data without moving or remodeling it first."

That said, many DM best practices can and should be extended to Hadoop. For example, if an information system is identified as a valuable source for analytics, it, too, should be managed and curated. As with a data warehouse, this means preloading its data into Hadoop. "In the long run, this is faster than retrieving the data prior to each run of an analytic process, especially if the data is voluminous," writes Russom. Here, too, DM best practices apply: a DW isn't just preloaded with data, after all; data must be updated or refreshed over time. This means synchronization, which effectively means changed data capture (CDC) capabilities: "Preloading data into HDFS means you must devise processes that keep Hadoop data up to date. Look for changed data capture functionality in data integration tools that interface with HDFS."

In this model, as in most such integration or extension schemes, Hadoop isn't a replacement for the data warehouse. For example, one loudly trumpeted use case casts Hadoop as an analytic or data integration "sandbox": i.e., as a place in which (respectively) to consolidate and test extremely large data sets or to test joins, aggregations, and transformational logic. Even though the scope of both activities is open-ended or experimental, the ultimate aim is to produce stable or persistent structures: for example, tested and refined analytic insights or data integration artifacts. "It's ironic that data analysts, data scientists, and similar users scan gigantic volumes of data to understand a business problem (e.g., what's the root cause of the latest form of churn?) or opportunity," he indicates. "[T]hey typically boil it all down to a relatively small data set [that is] expressed in a model that represents their epiphany." Too often, analysts share the epiphany with a few peers and managers, then move on to the next analytic assignment. Instead, analysts should always take the outcome of analytics to the BI and DW team in case the team sees the need to operationalize in reports what was initially discovered via analysis.

One of Hadoop's biggest strengths is that it can accommodate types and volumes of data that traditional RDBMSs or even MPP DBMSs cannot. This isn't a categorical "cannot," however: it's possible to scale an MPP warehouse into the double-digit petabyte range, albeit at a cost that's orders of magnitude higher than that of a comparable Hadoop deployment.
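Russom's earlier point that preloaded HDFS data must be kept current is usually handled by CDC features in data integration tools, but the underlying idea is simple enough to sketch. The Python fragment below is an illustration only, using a watermark-based incremental extract as the simplest stand-in for true CDC; the database, table, column, and path names are invented, and the delta is shipped with the standard `hdfs dfs -put` command.

```python
import csv, sqlite3, subprocess, tempfile
from datetime import datetime

# Hypothetical names: a source table with an updated_at column, a target HDFS directory.
SOURCE_DB = "source.db"
TABLE = "orders"
HDFS_DIR = "/warehouse_offload/orders/"
WATERMARK_FILE = "last_extract.txt"

def read_watermark():
    try:
        return open(WATERMARK_FILE).read().strip()
    except FileNotFoundError:
        return "1970-01-01 00:00:00"  # first run: take everything

def incremental_refresh():
    """Extract rows changed since the last run and append them to HDFS."""
    since = read_watermark()
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    conn = sqlite3.connect(SOURCE_DB)
    rows = conn.execute(
        f"SELECT * FROM {TABLE} WHERE updated_at > ?", (since,)
    ).fetchall()
    conn.close()
    if rows:
        with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
            csv.writer(tmp).writerows(rows)
            local_path = tmp.name
        # Ship the delta file into the Hadoop landing zone.
        subprocess.run(["hdfs", "dfs", "-put", local_path, HDFS_DIR], check=True)
    open(WATERMARK_FILE, "w").write(now)  # advance the watermark for the next run

if __name__ == "__main__":
    incremental_refresh()
```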
To paraphrase the subtitle of Stanley Kubrick's Dr. Strangelove, DM practitioners must learn to stop worrying and love Hadoop. This means looking for opportunities to offload to Hadoop the data or processes for which the DW itself is fundamentally unsuited. This includes data types that few DWs were designed for, such as detailed source data and "any file-based data, [e.g.] logs, XML, text documents, [and] unstructured data," Russom writes. It also includes most ETL and data integration processes, especially those that must run at massive scale (e.g., aggregating tens of terabytes or sorting millions of call detail records). Hadoop is designed for these data types and operations, and Hadoop capacity is far less expensive than DW capacity.

Russom sees offloading as a win-win for both the DW and Hadoop: "[O]ffloading allows the DW to do what it does best: provide squeaky clean, well-modeled data with a well-documented audit trail for standard reports, dashboards, performance management, and OLAP."

Using Hadoop as a landing zone, sandbox, or staging area for storing and managing new data, or for accommodating experimental analytic and data integration workloads, has other benefits, too. For one thing, doing so makes it much easier to quickly expose new data sources: there's simply less risk of something going awry. Data integration or analytic kinks can be worked out in Hadoop; squeaky-clean data or data structures can thereafter be moved into the data warehouse.
Russom cites the relatively new idea of using Hadoop as a data lake, i.e., as an inexhaustible pool for most of the raw data that a business ultimately uses to feed its DW and analytic apps. This would be a nonstarter in a data warehouse environment; there's a reason data is conformed and transformed before it's loaded into the warehouse, after all. By the same token, the data lake concept wouldn't be cost-effective or (as a function of scaling issues) practicable with an ODS. Yet certain kinds of analysis, from BI discovery to advanced analytics, can make use of this raw data. What's more, information from other non-traditional sources, such as mainframes, which generate detailed transaction information, or machines and sensors, which generate log and event data, can conceivably be pooled in a Hadoop data lake, too, Russom argues.

DW professionals are often hesitant when it comes to integrating data from a new source into the warehouse because it takes time to model new data structures and design ETL jobs. In addition, disaggregating poor-quality or untrustworthy data from the DW's calculated values, time series, and dimensional structures "is so difficult as to be impossible," he points out. "With a data-lake approach to HDFS, modeling and ETL are not required, and disaggregation can be as simple as altering virtual views or analytic algorithms so they ignore files containing questionable data."

"DW teams are concluding that a single-platform DW is no longer desirable," Russom says. "Instead, they maintain a core DW platform for traditional workloads (reports, performance management, and OLAP) but offload other workloads to other platforms," he concludes. The DW is not going away; it's just being complemented by additional data platforms tuned to workloads that can and should be offloaded from the core warehouse. The underlying issue is whether a single-platform data warehouse can be designed and optimized such that all workloads run optimally, even when concurrent.

DM Mainstays Such as ETL Are More Important than Ever

This doesn't mean that data modeling and ETL requirements will go away. Far from it: if or when analytic artifacts are shifted from Hadoop to the data warehouse, modeling and ETL will be huge factors. The practical effect of using Hadoop to land and stage new data; inexpensively parallelize ETL workloads; identify, test, and refine analytic insights; or perfect data integration workloads is that it becomes possible to instantiate all of these artifacts in the data warehouse more quickly. In the traditional DW model, there can be a huge time lag between when a feature or change is requested and when it's actually delivered. This is in part because the DW model presupposes a comparatively static world, so much so that making changes is a nontrivial task. Incorporating Hadoop into DM as a versatile test and development, analytic discovery, or data processing platform can help mitigate this issue.
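Russom's observation that disaggregation in a data lake "can be as simple as altering virtual views or analytic algorithms so they ignore files containing questionable data" is easy to picture with a small example. The sketch below is illustrative only: the directory layout and the quarantine-marker convention are assumptions, and a real deployment would more likely use a Hive view or partition filter. An analysis job reads raw files from an HDFS-style data lake path but skips any load directory flagged as quarantined.

```python
from pathlib import Path

# Hypothetical data lake layout: /data/lake/clickstream/<load_date>/part-*.csv
# Questionable loads are flagged by dropping a _QUARANTINE marker file in the directory.
LAKE_ROOT = Path("/data/lake/clickstream")

def trusted_files(root: Path):
    """Yield raw data files, ignoring any load directory marked as quarantined."""
    for load_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if (load_dir / "_QUARANTINE").exists():
            continue  # "disaggregate" bad data by simply skipping it
        yield from load_dir.glob("part-*.csv")

if __name__ == "__main__":
    total_rows = 0
    for path in trusted_files(LAKE_ROOT):
        with open(path) as f:
            total_rows += sum(1 for _ in f)
    print(f"Rows available to analytics (quarantined loads excluded): {total_rows}")
```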
About Syncsort

Syncsort provides fast, secure, enterprise-grade software spanning Big Data solutions in Hadoop to Big Iron on mainframes. We help customers around the world collect, process, and distribute more data in less time, with fewer resources and lower costs. Eighty-seven of the Fortune 100 companies are Syncsort customers, and Syncsort's products are used in more than 85 countries to offload expensive and inefficient legacy data workloads, speed data warehouse and mainframe processing, and optimize cloud data integration. To learn more about Syncsort solutions for Hadoop and try them for yourself, visit the Syncsort website.

About TDWI

TDWI, a division of 1105 Media, Inc., is the premier provider of in-depth, high-quality education and research in the business intelligence and data warehousing industry. TDWI is dedicated to educating business and information technology professionals about the best practices, strategies, techniques, and tools required to successfully design, build, maintain, and enhance business intelligence and data warehousing solutions. TDWI also fosters the advancement of business intelligence and data warehousing research and contributes to knowledge transfer and the professional development of its members. TDWI offers a worldwide membership program, five major educational conferences, topical educational seminars, role-based training, on-site courses, certification, solution provider partnerships, an awards program for best practices, live Webinars, resourceful publications, an in-depth research program, and a comprehensive website, tdwi.org.

© 2014 by TDWI (The Data Warehousing Institute™), a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to [email protected]. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.