NOSQL, NEW SQL, HADOOP: NEW TECHNOLOGIES FOR THE ERA OF BIG DATA
Best Practices Series


In this issue:
Teradata: Harnessing the Value of Big Data
MarkLogic: The Big Data Dilemma for Financial Services
Akiban: SQL or NoSQL: A Choice You Don't Have to Make
Progress DataDirect: Hadoop: Latest Trend in Big Connectivity
Objectivity: 5 Myths About Big Data
Vertica: HP Vertica Analytics Platform Ready to Grow With Cardlytics
Denodo Technologies: Data Virtualization Is Vital for Maximizing Your NoSQL and Big Data Investment

BRINGING BIG DATA INTO THE ENTERPRISE FOLD

Are organizations' systems and data environments ready for the big data surge? Far from it, a new survey shows. The survey of 298 members of the Independent Oracle Users Group (IOUG), conducted by Unisphere Research and sponsored by Oracle Corp., finds fewer than one out of five data managers and executives are confident that their IT infrastructure will be capable of handling the surge of big data. And big data is already here: more than one out of 10 survey respondents report having in excess of a petabyte of data within their organizations, and a majority report their levels of unstructured data are growing. ("Big Data, Big Challenges, Big Opportunities: 2012 IOUG Big Data Strategies Survey.")

Since big data incorporates so many different data types, in varying volumes and from many different sources, it would make both data managers' and end users' lives easier if it all could be brought into a single comprehensive framework that can be easily managed and accessed. This, in fact, has long been the holy grail of the IT and database industries, a vision that, unfortunately, has yet to be realized. If it was possible to easily manage structured and unstructured data in the same frameworks, the legacy vendors would have already solved all these problems and new innovative technologies would never have taken root, Sanjay Mehta, vice president of product marketing for Splunk, points out.

The challenge is that existing database environments, especially relational databases, were not designed for the tsunami of data that organizations are being asked to absorb in today's world, adds Russ Kennedy, vice president of product strategy for Cleversafe. They do well with small, fast transactions but were not designed to handle this changing landscape.

For many organizations, then, big data requires newer technology strategies, especially platforms including the open-source Hadoop framework and NoSQL databases that maintain hierarchical structures. The IOUG-Unisphere survey, for example, found adoption of Hadoop and NoSQL technologies will double over the coming year, from 7% to 16% for Hadoop adoption and from 11% to 15% for NoSQL adoption.

Relational and big data environments will co-exist in many organizations. Collecting click streams from a website will have very different uses than recording customer purchases, advises David Champagne, chief technology officer for Revolution Analytics. Hadoop may be optimal for storing massive streams of data, while a traditional data warehouse is likely the best choice for customer transactions where low-latency access to the information is critical.

Plus, big data often requires different skill sets, as well as different philosophical approaches, than traditional environments. A realignment of resources and technologies is taking place within many organizations looking to manage these dual data environments. Traditional data warehouse and BI applications are getting redefined and call for adopting new skills and technologies that are yet to mature, though they are already being actively used by businesses, says Prasanna Venkatesan, practice director for big data and analytics with HCL Technologies. Such new skill sets include Hadoop, big data storage, in-memory analytics, visualization, and reporting. Many of these skills and technologies will require the ability to transition between these data environments.
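Champagne's split between streaming event data and transactional records can be pictured as a simple routing decision. The Python sketch below is purely illustrative; the record fields and sink names are assumptions, not part of any product discussed here.

```python
# A minimal sketch of the "dual environment" idea: high-volume event streams
# land in a Hadoop-style store for later batch analysis, while low-latency
# transactional records go to the relational warehouse.
EVENT_TYPES = {"click", "page_view", "search"}         # bulk, append-only streams
TRANSACTION_TYPES = {"purchase", "refund", "payment"}  # need low-latency, ACID access

def route(record: dict) -> str:
    """Decide which environment a record belongs in."""
    kind = record.get("type")
    if kind in EVENT_TYPES:
        return "hadoop"      # e.g., append to HDFS files for batch MapReduce jobs
    if kind in TRANSACTION_TYPES:
        return "warehouse"   # e.g., insert into the relational data warehouse
    return "staging"         # unknown data is parked for later classification

if __name__ == "__main__":
    sample = [
        {"type": "click", "url": "/pricing", "user": 42},
        {"type": "purchase", "order_id": 1001, "amount": 59.90},
    ]
    for rec in sample:
        print(rec["type"], "->", route(rec))
```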
It's critical that effective tools become available to connect and translate between the old relational world and the new nonrelational one, Darin Bartik of Quest Software (now a part of Dell), tells DBTA. And transitional capabilities are key, as both relational and big data environments will co-exist in many organizations. The days of one-size-fits-all are over, says Mike Miller, cofounder and chief scientist at Cloudant. Just as we don't force workers to use a single device such as a desktop PC for all of their computing needs, so too successful companies have understood that application developers are most productive when allowed to choose data management solutions tailor-made for their specific problem: RDBMS for relational data, document stores for document data, graph stores for graph problems, and Hadoop for warehousing.

There will always be a need for traditional relational databases, addressing needs where relationships between entities, referential integrity, and

ACID [atomicity, consistency, isolation, durability] transactions are among the key requirements, notes Michael Kopp, technology strategist at Compuware APM. However, he says, new forms of data that are streaming in, such as Facebook likes, webpage searches and indexes, and shoppers' recommendations, do not fit the traditional RDBMS model very well. The advantage is that big data does not need to strictly adhere to ACID. If someone updates their social media profile, it is not critical that everyone in the network has to see the same update at the same time, says Kopp.

Relational databases have their limits, however. Existing data environments are heavily focused on structured, relational data models, and most traditional and new-generation RDBMS vendors offer highly scalable massive parallel processing options that can scale to hundreds of terabytes, Craig Nies, global head of software development at Opera Solutions, tells DBTA. But they are still limited by their relatively inflexible relational data models. Data growth will be deep (number of rows), which these models handle well, and wide (number of columns and variety of data), which these models handle less well. And these models don't handle unstructured data at all, even though it now represents the majority of all big data.

Data warehouses, long seen as providing the enterprise view of data, may also have limits when it comes to big data, says Peter Wang, president of Continuum Analytics. The average data warehouse is designed to schematize and clean transactional data into a form that can be consumed by analytical modeling tools, which can then be used for confirmatory analysis, he observes. However, turnaround time for iterating is frequently months. Hadoop is capable of doing quick iteration on schema and large-scale predictive analytics on raw data.

Ultimately, the ideal is to eventually be able to blend emerging big data environments into existing technologies such as relational databases and data warehouses. In reality, it isn't a question of the data warehouse versus big data; it's more about how to integrate big data into your existing infrastructure of operational and analytical systems, Tamara Dull, director of content development and deployment for SAS, says. Along with handling unstructured and semistructured data such as multimedia files, social media data, and weblogs, Dull notes, big data can also include data from your operational and analytical systems. The results of big data processing can be stored in Hadoop, data warehouses, data marts, or operational systems, among other platforms.

One promising approach being explored is data virtualization, in which disparate forms of data from across the enterprise are abstracted into a common service fabric. For years, data warehouses and extract, transform, and load (ETL) have been the primary methods of accessing and archiving multiple data sources across enterprises. Data virtualization promises to advance the concept of the federated data warehouse to deliver more timely and easier-to-access enterprise data. Data virtualization revives the original goals of data warehousing, David Besemer, CTO of Composite Software, pointed out at the vendor's recent conference on the topic: the idea that I can access all of my data without having to rip and replace all the physical systems we've built all these years.
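Besemer's description of data virtualization, accessing all of your data without replacing the physical systems that hold it, can be sketched in a few lines. The example below is a hedged, minimal illustration; the in-memory SQLite "warehouse," the document list standing in for a NoSQL store, and all field names are assumptions rather than any vendor's actual implementation.

```python
# A minimal sketch of the data virtualization idea: one query interface over two
# very different physical sources, without moving or replacing either of them.
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER, name TEXT, region TEXT)")
warehouse.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                      [(1, "Acme Corp", "EMEA"), (2, "Initech", "AMER")])
warehouse.commit()

# Documents with a flexible, evolving shape, as a NoSQL store might hold them.
activity_store = [
    {"customer_id": 1, "channel": "web", "clicks": 120},
    {"customer_id": 2, "channel": "mobile", "clicks": 45, "app_version": "2.1"},
]

def customer_activity(region=None):
    """A virtual 'view' joining relational customers with document-based activity."""
    rows = warehouse.execute("SELECT id, name, region FROM customers").fetchall()
    result = []
    for cid, name, reg in rows:
        if region and reg != region:
            continue
        docs = [d for d in activity_store if d["customer_id"] == cid]
        result.append({"name": name, "region": reg,
                       "clicks": sum(d.get("clicks", 0) for d in docs)})
    return result

print(customer_activity(region="EMEA"))   # one unified answer from two sources
```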
Data virtualization represents the answer to the new normal that is now seen across many enterprise IT and data environments, with big data, new client devices, predictive analytics, and self-service business intelligence at the forefront of corporate plans, Besemer said. The technology doesn't necessarily replace but supplements data warehouses, added Rick van der Lans of R20/Consultancy at the event. Most production systems don't track history, he said, and you need historical data for analysis. In addition, he added, data warehouses make an excellent, cleansed data source for the data virtualization layer.

To illustrate the role of data virtualization, van der Lans compared the technology to the service window in a restaurant: customers place their specific orders but do not have to be concerned about what goes on on the other side of the window, in the kitchen. The benefits of data virtualization are seen in the more rapid access users gain to reports and insights. Data models have been way too abstract for end users, van der Lans pointed out. They don't ask for data warehouses, and don't care about data structures. Virtualization helps show users the data itself and makes it more practical, more down-to-earth for them.

Ultimately, to bring these two environments together, there will need to be consolidation and the establishment of standards, says Peter Zaitsev, cofounder and CEO of Percona, Inc. For example, for relational databases there is SQL, which describes how data can be defined and uses queries to help in application portability between solutions. There is nothing like this in big data yet, so the market is fragmented with a lot of different concepts and available query languages.

Until convergence is feasible, many organizations will continue to maintain two separate data infrastructures, each with distinct purposes. For example, NoSQL database approaches are a must if dealing with varied information, states Chris Biow, CTO and vice president for MarkLogic. To achieve a better fit into existing organizations, he advocates Agile development processes, in which IT professionals work closely with business users to deliver incremental releases in a rapid fashion. Business users should not be waiting many months or years for an all-in project to deliver value on big data, he says.
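Zaitsev's point about fragmentation is easier to see side by side. The sketch below asks the same question of a relational store (in portable SQL) and of a document store (in ad hoc code, since no common query standard exists); both stores are illustrative stand-ins, not specific products.

```python
# The same question ("orders over $100") asked two ways: once in standard SQL,
# once against a document collection that has no shared query language.
import sqlite3

# Relational side: portable, declarative SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, "alice", 250.0), (2, "bob", 40.0)])
big_sql = db.execute("SELECT id, customer FROM orders WHERE total > 100").fetchall()

# Document side: each NoSQL product exposes its own API or dialect; here it is
# plain Python filtering over JSON-like documents.
orders = [{"_id": 1, "customer": "alice", "total": 250.0},
          {"_id": 2, "customer": "bob", "total": 40.0}]
big_docs = [d for d in orders if d["total"] > 100]

print(big_sql)   # [(1, 'alice')]
print(big_docs)  # [{'_id': 1, 'customer': 'alice', 'total': 250.0}]
```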

Harnessing the Value of Big Data
How to Gain Business Insight Using MapReduce and Apache Hadoop with SQL-Based Analytics

EXECUTIVE SUMMARY

In business publications and IT trade journals, the buzz about big data challenges is nearly deafening. Rapidly growing volumes of data from transactional systems like enterprise resource planning (ERP) software and non-transactional sources such as web logs, customer call center records, and video images are everywhere. A tsunami of data, some experts call it.

Most companies know how to collect, store, and analyze their operational data. But these new multi-structured data types are often too variable and dynamic to be cost-effectively captured in a traditional data schema using only SQL for analytics. Leading organizations are exploring alternative solutions that use the MapReduce software framework, such as Apache Hadoop. While Hadoop can cost-effectively load, store, and refine multi-structured data, it is not well suited for low-latency, iterative data discovery or classic enterprise business intelligence (BI). These applications require a strong ecosystem of tools that provide ANSI SQL support as well as high performance, scalability, and interactivity.

The more complete solution is to implement a unified data architecture that includes workload-specific technologies for data warehousing; big data analytics and discovery; and data landing, refinement, and storage. The Teradata unified data architecture, including the Teradata Integrated Data Warehouse, the Aster Big Analytics & Discovery platform, and Apache Hadoop, is one such solution. The solution combines the power of the MapReduce analytic framework with SQL-based BI tools that are familiar to analysts. The result is a unified solution that helps companies gain valuable business insight from new and existing data, using existing BI tools and skill sets as well as enhanced MapReduce analytic capabilities.

INTRODUCING TERADATA UNIFIED DATA ARCHITECTURE

To maximize the value of traditional and multi-structured data assets, companies need to deploy technologies that integrate Hadoop and relational database systems. Although the two worlds were separate not long ago, vendors are beginning to introduce solutions that effectively combine the technologies. For example, market leaders like Teradata and Hortonworks are partnering to deliver reference architectures and innovative product integration that unify Hadoop with data discovery platforms and integrated data warehouses.

What should companies look for to get the most value from Hadoop? Most importantly, you need a unified data architecture that provides answers to any question, by any user, at any time, against any data, by tightly integrating the Hadoop/MapReduce programming model with traditional SQL-based enterprise data warehousing. (See Figure 1, Unified Data Architecture.)

The unified data architecture is based on a system that can capture and store a wide range of multi-structured raw data sources. It uses MapReduce to refine this

data into usable formats, helping to fuel new insights for the business. In this respect, Hadoop is an ideal choice for capturing and refining many multi-structured data types with unknown initial value. It also serves as a cost-effective platform for retaining large volumes of data and files for long periods of time.

The unified data architecture also preserves the declarative and storage-independence benefits of SQL, without compromising MapReduce's ability to extend SQL's analytic capabilities. By offering the intuitiveness of SQL, the solution helps less-experienced users exploit the analytical capabilities of existing and packaged MapReduce functions, without needing to understand the programming behind them. With this architecture, enterprise architects can easily and cost-effectively incorporate Hadoop's storage and batch processing strengths together with the relational database system.

A critical part of the unified data architecture is the Aster Big Analytics & Discovery platform, which leverages the strengths of Hadoop for scale and processing while bridging the gaps around BI tool support, SQL access, and interactive analytical workloads. SQL-MapReduce helps bridge this gap by providing a distinct execution engine within the discovery platform. This allows the advanced analytical functions to execute automatically, in parallel across the nodes of the machine cluster, while providing a standard SQL interface that can be leveraged by BI tools. The discovery platform includes a library of prebuilt analytic functions, such as path, pattern, statistical, graph, text, and cluster analysis, and data transformation, that help speed the deployment of analytic applications. Users can write custom functions as needed, in a variety of languages, for use in both batch and interactive environments. Finally, an interactive development tool reduces the effort required to build and test custom-developed functions. Such tools can also be used to import existing Java MapReduce programs.
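To make the SQL-MapReduce idea more concrete, the sketch below shows, in plain Python, what a prebuilt "path" function does conceptually: partition click events by user, then look for an ordered pattern in each sequence. It is a hedged illustration of the mechanics only, not Aster's actual SQL-MapReduce syntax, function library, or API.

```python
# Conceptual mechanics of a "path" analytic: partition rows (map) and scan each
# ordered sequence for a pattern (reduce). Field values are illustrative.
from collections import defaultdict

clicks = [  # (user_id, timestamp, page)
    (1, 10, "home"), (1, 11, "product"), (1, 12, "checkout"),
    (2, 10, "home"), (2, 15, "product"),
]

def map_partition(rows):
    """Group rows by user, as PARTITION BY would in a parallel engine."""
    by_user = defaultdict(list)
    for user, ts, page in rows:
        by_user[user].append((ts, page))
    return by_user

def reduce_paths(by_user, pattern=("home", "product", "checkout")):
    """Count users whose ordered page sequence contains the target path."""
    hits = 0
    for user, events in by_user.items():
        pages = [p for _, p in sorted(events)]
        it = iter(pages)
        # subsequence check: each pattern step must appear, in order
        if all(step in it for step in pattern):
            hits += 1
    return hits

print(reduce_paths(map_partition(clicks)))  # 1 user completed home->product->checkout
```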

MAKING THE UNIFIED DATA ARCHITECTURE WORK

As big data challenges become more pressing, vendors are introducing products designed to help companies effectively handle the huge volumes of data and perform insight-enhancing analytics. But selecting the appropriate solution for your requirements need not be difficult. With the inherent technical differences in data types, schema requirements, and analytical workloads, it's no surprise that certain solutions lend themselves to optimal performance in different parts of the unified big data architecture.

The first criterion to consider should be what type of data and schema exist in your environment. Possibilities include:

• Data that uses a stable schema (structured). This can include data from packaged business processes with well-defined and known attributes, such as ERP data, inventory records, and supply chain records.
• Data that has an evolving schema (semi-structured). Examples include data generated by machine processes, with known but changing sets of attributes, such as web logs, call detail records, sensor logs, JSON (JavaScript Object Notation), social profiles, and Twitter feeds. (A short example of working with this kind of data appears at the end of this section.)
• Data that has a format, but no schema (unstructured). Unstructured data includes data captured by machines with a well-defined format, but no semantics, such as images, videos, web pages, and PDF documents. Semantics can be extracted from the raw data by interpreting the format and pulling out required data. This is often done with shapes from a video, face recognition in images, and logo detection.

Each of these three schema types may include a wide spectrum of workloads that must be performed on the data. Table 1 lists several common data tasks and workload considerations.

Table 1. Matching Data Tasks and Workloads
• Low-cost storage and retention: Retains raw data in a manner that provides low TCO-per-terabyte storage costs; requires access in deep storage, but not at the same speeds as in a front-line system.
• Loading: Brings data into the system from the source system.
• Pre-processing/prep/cleansing/constraint validation: Prepares data for downstream processing by, for example, fetching dimension data, recording a new incoming batch, or archiving an old window batch.
• Transformation: Converts one structure of data into another structure. This may require going from third-normal form in a relational database to a star or snowflake schema, or from text to a relational database, or from relational technology to a graph, as with structural transformations.
• Reporting: Queries historical data, such as what happened, where it happened, how much happened, and who did it (e.g., sales of a given product by region).
• Analytics (including user-driven, interactive, or ad hoc): Performs relationship modeling via declarative SQL (e.g., scoring or basic stats).
• MapReduce: Performs relationship modeling via procedural code (e.g., model building or time series).

TERADATA, ASTER, AND HADOOP: WHEN TO USE WHICH SOLUTION

Figure 2 (Choosing Solutions by Workload and Data Type) offers a framework to help enterprise architects most effectively use each part of a unified data architecture. This framework allows a best-of-breed approach that you can apply to each schema type, helping you achieve maximum performance, rapid enterprise adoption, and the lowest TCO. The following use cases demonstrate how you can apply this framework to your big data challenges.

Stable Schema

Sample Applications: Financial analysis, ad hoc/OLAP queries, enterprise-wide BI and reporting, spatial/temporal, and active execution (in-process, operational insights).

Characteristics: In applications with a stable schema, the data model is relatively fixed. Transactions collected from point-of-sale, inventory, customer relationship management, and accounting systems are known and change infrequently. This business requires ACID (atomicity, consistency, isolation, durability) properties or transaction guarantees, as well as security; well-documented data models; extract, transform, and load (ETL) jobs; data lineage; and metadata management throughout the data pipeline, from storage to refining through reporting and analytics.

Recommended Approach: Leverage the strength of the relational model and SQL. You may also want to use Hadoop to support low-cost, scale-out storage and retention for some transactional data, which requires less rigor in security and metadata management.
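The evolving-schema category described above is easiest to see with concrete records: machine-generated events share a core shape, but attributes appear and disappear over time, so structure is bound late, at query time, rather than fixed up front. The sketch below is illustrative only; the event fields are assumptions.

```python
# Late binding over semi-structured events: the stored data never changes,
# but the "schema" is chosen at query time and tolerates missing attributes.
import json

raw_events = [
    '{"ts": "2012-11-01T10:00:00", "user": 7, "page": "/home"}',
    '{"ts": "2012-11-02T10:05:00", "user": 7, "page": "/cart", "ab_test": "B"}',
    '{"ts": "2012-11-03T10:06:00", "user": 9, "device": "mobile", "page": "/home"}',
]

def project(events, fields):
    """Pull out whichever fields are requested; absent attributes become None."""
    for line in events:
        doc = json.loads(line)
        yield {f: doc.get(f) for f in fields}

# The same stored data can be queried with different "schemas" as needs evolve.
for row in project(raw_events, ["ts", "page", "ab_test"]):
    print(row)
```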

Suggested Products: Teradata provides multiple solutions to handle deep-dive analytics, low-cost storage and retention applications, as well as loading and transformation tasks. For example, customers that want to analyze and store large data volumes and perform light transformations can use the Teradata Extreme Data Appliance. This platform offers deep-dive analytics and low-cost data storage with high compression rates at a highly affordable price. For CPU-intensive transformations, the Teradata Data Warehouse Appliance supports mid-level data storage with built-in automatic compression engines. Customers that want to minimize data movement and complexity, and are executing transformations that require reference data, can use the Teradata Active Enterprise Data Warehouse.

Evolving Schema

Sample Applications: Interactive data discovery, including web click stream, social feeds, set-top box analysis, sensor logs, and JSON.

Characteristics: Data generated by machine processes typically requires a schema that changes or evolves rapidly. The schema itself may be structured, but the changes occur too quickly for most data models, ETL steps, and reports to keep pace. Company e-commerce sites, social media, and other fast-changing systems are good examples of evolving schema.

Recommended Approach: The design of web sites, applications, third-party sites, search engine marketing, and search engine optimization strategies changes dynamically over time. Look for a solution that eases the management of evolving-schema data by providing features that:

• Leverage the back end of the relational database management system (RDBMS), so you can easily add or remove columns
• Make it easy for queries to do late binding of the structure
• Optimize queries dynamically by collecting relevant statistics on the variable part of the data
• Support encoding and enforcement of constraints on the variable part of the data

Suggested Products: Teradata Aster is an ideal platform for ingesting and analyzing data in an evolving schema. The product provides a big analytics and discovery platform, which allows evolving data to be stored natively without predefining how the variable part of the data should be broken up. This platform helps companies to bridge the gap between esoteric data science technologies and the language of business. It combines the developer-oriented MapReduce platform with the SQL-based BI tools familiar to business analysts.

Also, if your process includes known batch data transformation steps that require limited interactivity, Hadoop MapReduce can be a good choice. Hadoop MapReduce enables large-scale data refining, so you can extract higher-value data from raw files for downstream data discovery and analytics. In an evolving schema, Hadoop and Teradata Aster are a perfect complement for ingesting, refining, and discovering valuable insights from big data volumes.

No Schema

Sample Applications: Image processing, audio/video storage and refining, storage, and batch transformation and extraction.

Characteristics: With data that has a format, but no schema, the data structure is typically a well-defined file format. However, it is less relational in nature, lacks semantics, and does not easily fit into the notion of traditional RDBMS rows and columns. There is often a need to store these data types in their native file formats.
Recommended Approach: Hadoop MapReduce provides a large-scale processing framework for workloads that need to extract semantics from raw file data. By interpreting the format and pulling out required data, Hadoop can discern and categorize shapes from video and perform face recognition in images.

Suggested Products: When running batch jobs to extract metadata from images or text, Hadoop is an ideal platform. You can then analyze or join this metadata with other dimensional data to provide additional value. Once you've used Hadoop to prepare the refined data, load it into Teradata Aster to quickly and easily join the data with other evolving- or stable-schema data.

TERADATA
For more information about how you can bring more value to the business through a unified data architecture, contact your Teradata or Teradata Aster representative, or visit us on the web at Teradata.com or Asterdata.com.
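The refinement flow described above (extract metadata from schema-less files, then load the structured result for joining and analysis) can be illustrated with a single-process stand-in. In a real deployment this step would run as a Hadoop MapReduce job over files in HDFS; the script below, and its choice of metadata fields, is only a hedged sketch of the idea.

```python
# A batch "refinement" stand-in: walk raw, schema-less files, extract simple
# metadata, and emit structured rows ready to be loaded downstream.
import csv
import os
import sys

def extract_metadata(path):
    """Map step: derive structured metadata from one raw file."""
    stat = os.stat(path)
    return {
        "file_name": os.path.basename(path),
        "extension": os.path.splitext(path)[1].lstrip(".").lower(),
        "size_bytes": stat.st_size,
        "modified_unix": int(stat.st_mtime),
    }

def refine(paths, out_stream):
    """Write one load-ready row per raw file."""
    writer = csv.DictWriter(
        out_stream,
        fieldnames=["file_name", "extension", "size_bytes", "modified_unix"])
    writer.writeheader()
    for p in paths:
        writer.writerow(extract_metadata(p))

if __name__ == "__main__":
    # Example: refine every file in the current directory into structured rows.
    files = [f for f in os.listdir(".") if os.path.isfile(f)]
    refine(files, sys.stdout)
```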

The Big Data Dilemma for Financial Services

Today's firms are looking for new ways to solve Big Data challenges. From front-office risk management to back-office trade operations, new infrastructure is needed to handle the explosive volume, velocity, and variety of data. Firms must be able to collect, store, and analyze rapidly changing, petabyte-scale data to maximize profits, reduce risk, and meet increasingly stringent regulatory requirements. According to research firm IDC, this Big Data dilemma is significantly impacting three challenge areas for financial sector health this year: operational efficiency, new product development, and compliance.

THE OPERATIONS CHALLENGE

Big Data is hitting at a time when global financial services companies are attempting to consolidate and streamline inefficient operations spawned from mergers. With anticipated growth, now is the time to upgrade the variety of systems spanning these vast multi-national entities to a future-proof platform. The challenge comes from uniting disparate data from a variety of third-party systems designed to meet the diversity across regions regarding language, regulations, currency, time zone, and more. This disparateness of systems, devices, and data makes it difficult to aggregate and extract business value. For example, the trading desks of global firms follow the sun: when one exchange closes, another opens. Leveraging this constant flow of data in real time is critical to making the right trading decisions.

To maintain efficient and cost-effective operations, firms need to manage not only an explosive volume of data, but also more types of unstructured data than ever. Firms that cannot perform real-time analysis across all data stores create a potentially devastating knowledge gap, which consequently increases risk. Data warehousing and new technologies all promise to close that gap, but at high cost. Operational managers worry about the costs of long-term data storage and the impracticality and expense of migrating from existing systems. They also do not have time to consolidate all data into a single model.

THE PRODUCT DEVELOPMENT CHALLENGE

To offset flat or declining revenue streams, financial services firms need to develop new products while also targeting existing products to new audiences. The ability to analyze the full spectrum of data is critical to discovering the patterns, trends, and relationships that are the source of innovation. At the root of the problem are data silos, which prevent firms from gaining insights on customers, products, and sales channels. In a recent poll by Capgemini, 85 percent of executives said the issue is not about volume but the ability to analyze and act on the data in real time. Organizations face a Big Data challenge that requires new business intelligence sourced from social media surveillance and predictive analytics.

THE COMPLIANCE CHALLENGE

The financial services industry continues to face heavy regulatory burdens. Across the globe, existing regulations change and new regulations emerge. It is a constant struggle to avoid fines and penalties. Recently, regulations have expanded beyond simple reporting to include liquidity requirements for capital reserves to cover exposures. Without a complete picture of institutional exposure, too little capital reserve could result in regulatory penalties, while too much in reserve reduces leveraged investments.
Global firms have to tackle the even bigger challenge of multiple regulatory bodies, laws, and enforcement rules. To quickly and accurately comply with new regulatory requirements stemming from Dodd-Frank, Basel III, and Solvency II (for insurers), as well as various Know-Your-Customer laws and unforeseeable future regulations, firms need a flexible data infrastructure and a proactive strategy to maintain compliance.

In a highly competitive global financial services market, differentiation is the key to gaining competitive advantage. Firms can differentiate themselves by reducing operational costs, developing new revenue streams, and adopting a proactive strategy for compliance.

OVERCOMING THESE CHALLENGES

When trends change by the second, agility counts. Gaining agility starts with an assessment of existing processes and systems. Firms must identify which existing practices will not support the progress they need. Incremental enhancements to their current environment will only lead to marginal gains. To get the significant advantage firms seek in today's competitive landscape, they should pursue technologies that can support new, innovative practices.

A PROVEN, INNOVATIVE TECHNOLOGY TO SUPPORT FUTURE GROWTH

One such technology, MarkLogic Server, is an enterprise NoSQL database proven to handle today's data challenges. Designed for petabyte scalability and high transaction volume, MarkLogic Server unifies the data located in disparate systems. Structured, unstructured, and geospatial data are combined in a single document-centric database with powerful, real-time search and analytics capabilities.

Tier-1 banks, derivatives trading operations, and front-office commodity traders are using MarkLogic Server to ingest and analyze massive volumes of data in real time. As a result, these firms can make proactive decisions based on market trends and conditions, minimize operational cost and risk, develop innovative new products (including critical decision support tools), and meet compliance requirements.

MarkLogic is the foundation of many leading financial services operations, powering:

• The trade store of one of the world's top derivatives trading banks, enabling real-time monitoring, analyses, and actions.
• The authoring and delivery solution for the equities research division of a global Tier-1 investment bank.
• The front-office commodity trading operations of a Global Fortune 500 energy corporation, giving their traders a powerful decision support tool to predict demand and to forward trade against it.
• A financial market information intelligence firm that relies on MarkLogic to create its Mood Index, which quantifies the social mood surrounding global securities.

MARKLOGIC SERVER: SWISS ARMY KNIFE OF BIG DATA

To create a powerful platform for a variety of financial services applications, MarkLogic combines three critical technologies: an enterprise NoSQL (Not Only SQL) database, enterprise-grade search, and an application server.

The document-centric database uses XML (Extensible Markup Language) as its data model. This data model, which is both software- and hardware-independent, enables MarkLogic to easily ingest all types of content, from structured and semi-structured trade messages to unstructured customer onboarding data, contracts, news, and social media updates, bringing it all together in a single operational database. MarkLogic has the power and speed to handle the thousands of data transactions per second that are common in trading operations. As data volume increases, performance can be maintained with easy scaling through the addition of affordable commodity hardware servers.

Unlike a traditional relational database, MarkLogic is schema-less. It is not necessary to know the attributes of the data up front, and attributes can be added without amending the database. To ensure accuracy, MarkLogic can enforce strict data typing to validate these attributes. This flexibility is especially important in an industry where data is always changing and volume is always growing. Financial services firms do not have the time or money to waste tuning traditional databases that are buckling under increasing data volumes, or updating schemas to comply with the latest regulations and ever-evolving business requirements. With MarkLogic, these resources can be redirected to a more important objective: creating new products.

Search is built into the core MarkLogic database instead of being a separate, bolted-on application. The secret behind MarkLogic's search capabilities is the universal index, which indexes the full text as well as the entire document structure, including elements, attributes, and hierarchy. By indexing the full text of the data, MarkLogic enables users to compose meaningful queries that are executed in milliseconds. As new data flows into MarkLogic, a fully transactional update of the index is created in real time. Query results are always based on what is currently in the database, eliminating the downtime (often hours or even days) required by typical systems while the index and database synchronize.
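Two of the ideas above, schema-less ingestion and a single index covering both text and document structure, can be sketched conceptually. The Python below is not MarkLogic's actual API or index design; it is a toy illustration of how indexing words and element/value pairs together lets one query mix full-text and structural conditions.

```python
# Toy "universal index": documents of any shape are stored as-is, and both
# their words and their element/value structure land in one index.
from collections import defaultdict

documents = {}            # doc_id -> document
index = defaultdict(set)  # word or (element, value) -> set of doc_ids

def ingest(doc_id, doc):
    """Store a document and index its words and its element/value structure."""
    documents[doc_id] = doc
    for element, value in doc.items():
        index[(element, str(value))].add(doc_id)      # structural index entry
        for word in str(value).lower().split():
            index[word].add(doc_id)                   # full-text index entry

def search(*terms):
    """Return doc_ids matching every term (words or (element, value) pairs)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

# Trade messages and unstructured notes can live side by side, no schema changes.
ingest("t1", {"type": "trade", "instrument": "Brent Crude", "desk": "commodities"})
ingest("n1", {"type": "news", "body": "Labor unrest in Italy may delay shipments"})

print(search(("type", "trade"), "brent"))   # {'t1'}
print(search("italy"))                      # {'n1'}
```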
This gives firms a broader and deeper view into data that has an immediate impact on risk exposure and financial performance. In addition to user-initiated searches, MarkLogic provides an alerting framework that enables high-performance, real-time query processing on live, incoming data streams. This framework enables instantaneous alerts when critical, relevant information arrives (a minimal sketch of this pattern appears at the end of this section). Some examples of dynamic queries that a commodities trader might use include:

• When any ship deviates from its projected course by 5 miles and is carrying Brent Crude oil greater than x tonnage, email me an alert.
• Text me when a story about labor unrest in Italy impacts my positions.
• Send me an alert when long-term forecasting predicts hurricane-force winds in Galveston, Texas.
• Email me State Department updates about the status of hostilities at a specified port.

MarkLogic also has a built-in application server, which enables firms to build powerful financial applications, such as next-generation trade stores, research portals, and customer onboarding systems, delivered through secure web services. The service-oriented architecture provides simple integration with existing systems, so you can avoid the time and cost associated with replacement. The MarkLogic application server gives you the agility needed for valuable and complex data analysis, while simplifying the overall complexity of your infrastructure, reducing total cost of ownership (TCO) and time-to-market.

Like the dependable Swiss Army Knife, MarkLogic combines an extensive array of capabilities, giving you a multi-use technology tool that helps you solve your toughest challenges: maximizing operational efficiency, increasing revenue, and complying with industry regulations.
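Here is the promised sketch of the alerting pattern: standing queries are registered once, and every incoming document is checked against them on arrival. This is a conceptual Python illustration, not MarkLogic's alerting API; the thresholds and field names are assumptions drawn loosely from the trader examples above.

```python
# Standing queries evaluated against a live stream of incoming documents.
alerts = []   # list of (name, predicate, action)

def register_alert(name, predicate, action):
    alerts.append((name, predicate, action))

def on_new_document(doc):
    """Called for each document as it streams in; fire any matching alerts."""
    for name, predicate, action in alerts:
        if predicate(doc):
            action(name, doc)

def notify(name, doc):
    print(f"ALERT [{name}]: {doc}")

register_alert(
    "ship-off-course",
    lambda d: d.get("type") == "ship"
              and d.get("cargo") == "Brent Crude"
              and d.get("deviation_miles", 0) > 5,
    notify,
)
register_alert(
    "hurricane-forecast",
    lambda d: d.get("type") == "forecast"
              and d.get("location") == "Galveston, TX"
              and d.get("wind_mph", 0) >= 74,   # hurricane-force threshold
    notify,
)

on_new_document({"type": "ship", "cargo": "Brent Crude", "deviation_miles": 8.2})
on_new_document({"type": "forecast", "location": "Galveston, TX", "wind_mph": 95})
```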

BIG DATA SUCCESSES IN FINANCIAL SERVICES WITH MARKLOGIC

A Tier-1 Investment Bank Relies on MarkLogic for a Single Source of Truth

One highly successful MarkLogic client is a credit, interest rate, and equity derivatives trading operation that executes one-third of the world's total derivative trades. The firm replaced 20 disparate systems with a single MarkLogic-based trade lifecycle-processing engine that captures trades across all asset classes in real time. The new trade store enables the firm to effectively manage market and counterparty credit positions with sub-second updating and analysis response time. The firm can now add new data sources in a matter of hours, a process that used to take days or weeks. With MarkLogic as the foundation of its trade store, the firm has improved performance, increased scalability, and achieved a much lower total cost of ownership. The firm has also cut development time for future implementations and requirements, such as new regulations. With trade data aggregated accurately across the firm's derivatives portfolio, risk management stakeholders can:

• Assess the true enterprise risk profile
• Conduct predictive analyses using accurate data
• Adopt a forward-thinking approach

A GLOBAL RESEARCH GROUP ACHIEVES FIRST-TO-MARKET WITH MARKLOGIC

A team of 400 analysts, located across 20 countries, conducts equity and fixed income research as well as product-specific analysis for individual and institutional clients. To stay ahead of the competition, the bank needed to achieve faster time-to-market, improve research quality, reduce duplication of effort, and meet client demands for mobile content delivery and alerting. The bank built an equity research authoring and delivery solution based on MarkLogic Server, replacing a patchwork of disparate systems around the globe that lacked the flexibility to quickly add new features. The bank can now offer clients up-to-date research on global developments, ahead of the competition, in their preferred delivery method. The solution has enabled analysts to conduct searches at a component level for more efficient reuse of historical research. New features and functionality can be developed in 3-4 weeks instead of 6-12 months, reducing the total cost of ownership by at least 50 percent. With MarkLogic, the bank has achieved first-to-publish status, boosting revenue and its reputation as the go-to research firm.

A GLOBAL FORTUNE 500 ENERGY COMMODITIES TRADER TRANSFORMS DECISION SUPPORT

When the value of cargo is tied to market demand, traders can predict the value based on their knowledge of when that cargo will arrive at port. But their models are thrown off when unknown forces such as weather, political strife, or pirates at sea cause delays. By building an agile system that allows the real-time aggregation of a variety of content sources (structured, geospatial, and unstructured data, including news feeds, market, and shipping information), analysts can now adjust pricing forecasts on the fly, enabling them to more accurately predict demand and make profitable trading decisions.

These financial powerhouses chose MarkLogic for its ability to handle high-speed, high-volume financial transactions and real-time data analysis without burdensome database changes. MarkLogic allows them to gain the agility they need to compete and succeed in today's market.

SUMMARY

Big Data is making a big impact on financial services firms. Exponential growth in data volume and variety has created new information management challenges. Traditional database technologies cannot keep pace with the rate of change in the industry.
And most firms spend too much time, money, and resources maintaining inflexible legacy systems when they could be innovating with new decision-support and real-time analytics tools. The value in Big Data is in being able to aggregate all these new information sources into a unified platform to gain new insights and knowledge.

MarkLogic Server enables financial services firms to gain competitive advantage in a global marketplace. By providing a single unified platform for all your data, MarkLogic transforms a complex environment into an agile business operation. With MarkLogic, financial services firms can:

• Reduce risk by acquiring real-time visibility into transactions and complying with current and future regulations
• Cut costs by simplifying IT infrastructure and quickly building Big Data applications
• Increase revenue by spotting trends before competitors and developing new products that meet customer demands

Most important, businesses can accomplish these objectives at a lower cost and in a fraction of the time compared to other technologies. MarkLogic makes it possible to extract maximum value from your Big Data. With a robust, flexible data infrastructure, firms can find new ways of rolling up different types of data into impactful decision support tools for tomorrow.

MARKLOGIC
For more information about how MarkLogic can help your enterprise get real-time value out of your enterprise data, visit MarkLogic on the web. MarkLogic Corporation Headquarters: 999 Skyway Road, Suite 200, San Carlos, CA

SQL or NoSQL: A Choice You Don't Have to Make

Akiban's database exposes the same data as documents in JSON or as a collection of tables using SQL, allowing developers to move quickly while administrators and report generators use the tools they are comfortable with.

Bridge the Gap: Documents and SQL

Akiban is the only database that stores semi-structured data and structured tables in a single unified structure called Table Groups. Using Table Groups, developers can manipulate rich documents or objects via a REST interface while businesses can access the same data using structured SQL.

Address Complexity with Simplicity

As database schemas and object models grow in complexity, the inefficiency of object-relational mapping tools increases. Akiban's Table Groups enable applications based on ORMs like Hibernate and ActiveRecord to efficiently access the objects as pre-joined, pre-assembled records.

Perform with Possibility: Fast SQL without ETL

Table Groups remove serious bottlenecks in relational databases: joins are eliminated and transactions are optimized for today's hardware. Akiban enables real-time decision support against operational data without an expensive ETL effort.

BENEFITS

• Integrate NoSQL with Existing Systems: While developers work with documents, the business can continue to use standard clients and tools (including SQL) to derive instant value from structured data and semi-structured documents.
• Build Amazing Applications: Add competitive and powerful real-time search capabilities across document and relational data, integrating faceted, geospatial, and graph-based search in one SQL statement.
• Drive Revenue with Performance: Optimize problematic queries and gain a consistent x performance improvement to reduce Time to Interaction (TTI), a critical driver of customer satisfaction in online systems.
• Manage Increasingly Complex Data Portfolios: Use traditional tools to extract powerful insights from complex profiles, geospatial data, and multidimensional relationship (graph) data as structured data (SQL) or documents (JSON).
• Reduce Cost: Take advantage of JSON and semi-structured data without retooling your architecture or the added cost of retraining and reimplementation.

FAST PATH TO NEW CAPABILITIES

Dynamic Search: Choose Your Own Adventure

Instead of restricting your search capabilities to pre-defined paths and options, Akiban enables your users to create personalized content navigation and an experience optimized just for them. Users can easily craft sophisticated Boolean queries through small, simple incremental steps as dynamic search filters are automatically generated based on actual query results. Because the search is executed against operational data, users are assured of getting up-to-the-moment information.

SoLoMo/Geospatial: It's All About Me

The dramatic increase in the number of mobile users today, and their accompanying GPS capabilities, has conditioned users to increasingly search for what's around them, bringing about a new type of search: Social Local Mobile. SoLoMo lets retailers and service providers deliver the most compelling, personalized end-user experience and gain valuable insight about customer preferences and behavior. Akiban's geospatial capability provides a fast, simple, and agile means of providing exciting new features such as:

• Localized content and point-in-location searches that enable business or user location data to be used to deliver the best local match.
• Nearest Neighbor lets you find the nearest users to a given event or location, or allows users to determine their proximity to others.
• Geo-fencing creates a virtual perimeter around a geographic area that can be used for location-based mobile marketing. For example, when a user enters a geo-fenced area, the retailer or service provider can push text messages and other media to present coupons and deals where an individual is most likely to buy.
• Geo-density search determines the optimal location to run a promotion, create a service depot, or target a new set of services.

Relationship Discovery: I Didn't Know That

Akiban excels at discovering relationships between diverse data in your database. Users or applications can quickly and easily return results from millions of complex data profiles or documents based on a series of facets that might include company, location, relationship, industry, and more.

Instant Insight: When You Need It Now

Some answers just can't wait. Whether it is a business user who needs instant insight into campaign effectiveness or sales results, or a security analyst looking for suspicious activity, Akiban enables immediate access to your operational data without negatively impacting operational performance. This also eliminates the ETL effort and the time delay of transferring operational data into a data mart or data warehouse.

Email: info@akiban.com | Try it: akiban.com/downloads
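The document/SQL duality that runs through this piece can be illustrated with a toy example: the same customer-and-orders data read as one nested JSON document by a developer and queried as ordinary tables by an analyst. The sketch below uses SQLite purely as a stand-in and is not Akiban's Table Groups implementation or REST API.

```python
# One data set, two views: a nested document for application code, plain SQL
# tables for reporting. Table and column names are illustrative.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
db.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(10, 1, 250.0), (11, 1, 99.5)])
db.commit()

def customer_as_document(customer_id):
    """Assemble the relational rows into one nested, pre-joined document."""
    cid, name = db.execute(
        "SELECT id, name FROM customers WHERE id = ?", (customer_id,)).fetchone()
    orders = db.execute(
        "SELECT id, total FROM orders WHERE customer_id = ?", (customer_id,)).fetchall()
    return {"id": cid, "name": name,
            "orders": [{"id": oid, "total": total} for oid, total in orders]}

# Developer view: one document per customer.
print(json.dumps(customer_as_document(1)))

# Analyst view: ordinary SQL over the same data.
print(db.execute(
    "SELECT c.name, SUM(o.total) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id GROUP BY c.name").fetchall())
```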

Hadoop: Latest Trend in Big Connectivity

Organizations are realizing they can mine and analyze vast amounts of data, and transform new insights into action. But how can they effectively and efficiently connect to and collect volumes of data from their data warehouses? As companies look for ways to leverage the massive amount of data they have collected, the Apache Hadoop framework is quickly becoming all the rage. In many environments today, Hadoop is used in the data warehouse and holds some of the richest information available. Processing Hadoop data into meaningful forms can yield intelligence that complements traditional analytics. But as it becomes synonymous with Big Data, it's important to step back and consider what Hadoop is really about.

1. Hadoop stores massive amounts of data. Hadoop is an open-source (a.k.a. free) storage system that can manage extremely large quantities of all kinds of data (structured or unstructured). Exactly how much data can it handle? Well, Facebook recently announced that just one of its Hadoop clusters is 100 petabytes in size. Maria Korolov, technology writer and president of Trombly International, calculates that to be over a thousand years of video at a data transfer rate of roughly 1 gigabyte per hour. Thinking about how much data that is can be dizzying, hence the moniker Big Data.

2. Hadoop saves money. Because it's free, Hadoop is extremely cost-effective compared to traditional data extraction, transformation, and loading (ETL) processes. It can also extract data from nearly any type of system and load it onto a database. So how does it work? Hadoop can run on multiple machines that don't share memory or disks (also known as a cluster). This makes it possible to use many different servers simultaneously to work on the same sets of data, thereby dramatically increasing the speed at which you can process and store the data.

3. Hadoop aids in data analysis. Yes, Hadoop can handle large quantities of data. But then what? Well, Hadoop can process and analyze a large amount of data at very high speeds, boiling it down quickly to be consumed and questioned. Think of Hadoop as a waiting room for your data, just hanging out there waiting to be asked the questions it has interesting answers for. Once you get someone who can ask the questions and communicate effectively with it (like a data scientist), you can start to analyze problems such as finding and fixing all of Boston's potholes or helping recruiters sift through dozens of resumes. Hadoop lays the foundation for gaining insight from your data, allowing you to make intelligent decisions about your business, your community, and your world, and we can all benefit from that.

The challenging step is connecting existing SQL-based business intelligence and data analytics tools to Hadoop data. Without such connectivity, companies' analysts and decision makers are locked out of the insights contained in Hadoop. Until now. The DataDirect driver for Apache Hive is the only fully compliant ODBC driver supporting multiple Hadoop distributions out of the box, and includes:

• Support for Apache, Cloudera, MapR, and Amazon EMR Hadoop distributions
• Windows, Red Hat, Solaris, SUSE, AIX, and HP-UX platform support
• Support for SQL-92 syntax
• Full ODBC spec support, including metadata

As enterprise organizations tackle the challenges of assimilating Big Data within existing data infrastructures, high-performance, scalable, and reliable data connectivity is an imperative.
Moreover, only when IT organizations employ next-generation data connectivity technologies, such as Progress DataDirect Connect ODBC and JDBC drivers or ADO.NET data providers, do they:

• Guarantee the availability of any size data from any source
• Easily move Apache Hadoop Big Data into any data source
• Manage single-driver connectivity to a wide array of enterprise databases and platforms
• Deliver the best possible bulk load performance, scalability, and reliability
• Deploy with no application code changes or database vendor tools
• Reduce the time, cost, and risk of making new data sets available to enterprise users

With Progress DataDirect drivers and providers, including the Connect for ODBC Apache Hadoop Hive driver, organizations can optimize performance within the ever-so-critical arena of database connectivity and meet the growing requirements for high-performance import and export of Big Data. For more information, check out our latest webinar, Big Data, Analytics, and Hadoop in the Enterprise. Progress DataDirect offers a 15-day free trial on all products, including the Connect for ODBC Apache Hadoop Hive driver.

PROGRESS DATADIRECT
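From application code, the connectivity described above looks like any other ODBC data source. The sketch below assumes an ODBC driver for Apache Hive (such as the DataDirect driver discussed in this piece) is installed and that the DSN named HiveDSN, the web_logs table, and the credentials shown have already been configured; those names are illustrative assumptions, not values from this article.

```python
# Querying Hadoop-resident data through a configured Hive ODBC data source.
import pyodbc  # third-party: pip install pyodbc

def top_pages(limit=10):
    # The DSN, user, password, and table below are illustrative assumptions.
    conn = pyodbc.connect("DSN=HiveDSN;UID=analyst;PWD=secret", autocommit=True)
    try:
        cursor = conn.cursor()
        # Plain SQL issued through ODBC, executed against data stored in Hadoop.
        cursor.execute("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")
        rows = cursor.fetchall()
    finally:
        conn.close()
    rows.sort(key=lambda r: r[1], reverse=True)   # order client-side for simplicity
    return rows[:limit]

if __name__ == "__main__":
    for page, hits in top_pages():
        print(page, hits)
```

Because the driver presents a standard ODBC interface, the same data source can be used from existing BI and reporting tools without application changes.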

5 Myths About Big Data

Almost every organization today is swimming in data. And with cloud applications, virtualization, and new social interfaces coming to the enterprise, this data is just going to get bigger. But what is Big Data beyond the media hype? How do you start to get control of your Big Data? And once you have it in control, how do you leverage it for business? Whether you are a marketer, a business analyst, or a developer, there are a few things you need to know about your data and the tools and technologies that are available to support it.

MYTH #1: BIG DATA ISN'T THAT BIG

There are a lot of solutions out there that claim to manage Big Data. But how much data is big data? And what is the significance? As we said before, if you have Big Data, chances are the data you have will continue to grow exponentially. Big Data today is scaling from terabytes to petabytes and exabytes. Companies today are processing the equivalent of the Library of Congress every 24 hours. But what about data complexity? Today, we are storing multiple types of data in a single database (video, text, voice, images). As data grows in complexity and size, organizations will require solutions to manage and analyze all of this complex content in order to identify the opportunities, relationships, and insights within Big Data for improved value and competitive advantage.

MYTH #2: HADOOP DOES EVERYTHING

Hadoop is a powerful technology that is on the cutting edge of Big Data management. But it's important to understand what Hadoop can and cannot do. Hadoop is able to manage large amounts of data and allow companies to access that data to be analyzed by other solutions, using a divide-and-conquer methodology, a decades-old technique. Hadoop is a framework that supports data-intensive distributed applications via MapReduce and a distributed file system. Hadoop is not a true database and, alone, is unable to query data to find relevant relationships and connections that add value for the business.

MYTH #3: RELATIONAL DATABASES DO RELATIONSHIPS

Relational databases have been around for a long time and are easy to leverage for critical applications. Although relational databases allow you to understand your data and deliver standard business statistics, such as demographics, purchase frequencies, and average sales, they can't tell you about the relationships or connections between that data. New opportunities arise when companies are able to see the links between all of their data. For an insurance company, for example, connecting the dots around a single customer can lead to new products and upselling by understanding their purchase needs (a recent home purchase requires insurance products) and behaviors (researching colleges could lead to an upsell for savings products or extended benefits to a child). Connecting the dots between millions of customers, in real time, creates a whole new set of business opportunities.

MYTH #4: BIG DATA ANALYTICS CAN'T BE DISTRIBUTED OR REAL-TIME

One of the challenges of Big Data analytics technologies is their inability to work in distributed environments. Most companies today are global, and their critical data lives in multiple locations. This means that commercial technologies that deliver true, deep analysis must be distributed and scalable in order to pull data from all locations and connect the dots between them. Regardless of your vertical, today's enterprise organizations need to be able to access it all.
In mission-critical situations like national defense, fraud detection, a call center preparing to upsell a customer, or a doctor prescribing medications, you need data analysis in real time. This means within seconds or minutes, at the point of interest, so that you can immediately take action.

MYTH #5: ONE DATABASE TECHNOLOGY IS ALL YOU NEED

This is one of the big myths of all technologies. Just look around in your company today: you use multiple solutions to manage the simplest things. Big Data is no different. There is no single solution for Big Data. Whether you want to manage it, measure it, truly analyze it, or do all three, you are going to need to take a polyglot, or multi-technology, approach. This means understanding your current database environment and finding tools and technologies that are able to work with it. Looking for technologies that remove the costly expense of upgrading your legacy systems, engines that can power massive amounts of data now and in your future, and analysis technologies that go beyond relational statistics is the key to moving forward.

As companies take that critical first step into leveraging their data, they need to go beyond the hype and understand what is real and what to expect from the technologies available today to solve real challenges. As with anything else in this world, knowledge is power.

OBJECTIVITY
For more information on InfiniteGraph, The Distributed Graph Database, visit the Objectivity website.
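Myth #3 is about relationships rather than rows. The toy sketch below, plain Python rather than InfiniteGraph's actual API, shows the kind of question a graph model answers directly: how two customers are connected through shared households or products, which is awkward to express as standard relational statistics.

```python
# Relationship discovery on a tiny graph: breadth-first search for the shortest
# chain linking two customers. Node and edge names are illustrative.
from collections import deque

graph = {
    "customer:alice": {"household:12", "product:home_insurance"},
    "customer:bob": {"household:12", "product:auto_loan"},
    "customer:carol": {"product:auto_loan"},
    "household:12": {"customer:alice", "customer:bob"},
    "product:home_insurance": {"customer:alice"},
    "product:auto_loan": {"customer:bob", "customer:carol"},
}

def connection(start, goal):
    """Return the shortest path of nodes linking start to goal, or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(connection("customer:alice", "customer:carol"))
# ['customer:alice', 'household:12', 'customer:bob', 'product:auto_loan', 'customer:carol']
```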

HP Vertica Analytics Platform Ready to Grow With Cardlytics

Cardlytics, a transaction-driven marketing company focused on tracking and understanding consumer buying behavior and delivering actionable intelligence, has been growing tremendously, ramping up from 10,000 pilot customers to its current reach of 70 million households. Founded in 2008 and based in Atlanta, Georgia, Cardlytics partners with financial institutions such as Bank of America, PNC Bank, Regions Bank, and Fiserv. The partnership works well, given the need for financial services institutions, particularly retail-facing divisions, to identify new, mostly fee-based revenue streams after the housing market crash. Associating fees with long-standing account-based services, such as checking and savings accounts, has the potential to negatively impact customer loyalty. Cardlytics has a proven track record of delivering effective rewards programs, working with some of the largest financial institutions in the U.S.

"Our clients are looking for new ways to attract new and reward existing customers. HP Vertica Analytics Platform enables us to fine-tune and personalize offers, and we are providing this service at hyper-speed," says Scott Grimes, Cardlytics CEO.

While Cardlytics is pleased with its success, the company's growth also presents it with business and technology challenges. For several years, the company relied on a traditional RDBMS (SQL row store database) platform. However, each time the company added a new partner, Cardlytics' volume of data mushroomed ten-fold. Faced with rapid expansion and the addition of larger banking partners, it was difficult to predict the company's future hardware and software requirements. Cardlytics knew that it needed a scalable architecture. The company also hoped a new platform would improve processing speed.

CHOOSING THE HP VERTICA ANALYTICS PLATFORM

After Cardlytics determined that it required a new analytics platform, Jon Wren, Cardlytics' director of Data Innovation, began researching potential options. Cardlytics ruled out several vendors whose solutions did not support what he saw as an essential requirement: commodity hardware. "There are certain products that have distributed data, but I didn't want to get locked in to buying proprietary hardware. I wanted more flexibility over the infrastructure," Wren says.

After considering all options, Cardlytics selected the HP Vertica Analytics Platform. "I like that there is a predictable cost to scale out. That was one of our key criteria in selecting HP Vertica. That and flat-out speed," Wren says. While Cardlytics already has seen tremendous increases in speed, the company knows it can add more nodes, scale out, and go twice as fast. "HP Vertica meets all of our requirements. I know how much it will cost and how much we will gain, and I appreciate that predictability," Wren explains. "An analytics database I was involved with at another organization was not as large in scale as what Cardlytics has now, and that investment was $20 million. With HP Vertica Analytics Platform I could have built it at 5% of the cost," says Wren.

NEW PROSPECTS LEAD TO MORE INVENTORY, INCREASED REVENUE

One of the primary business benefits Cardlytics is realizing with its powerful analytics database is the ability to target prospective candidates for its merchant rewards programs.
"We can query our data quickly, in seconds, to identify specific merchants in a given geography and direct salespeople to talk to them. We can now target at least 200 new merchant prospects on a weekly basis, with the ability to rapidly scale out to reach many more customers, versus only 20 per week on our legacy platform. The more merchants we have, the more inventory of advertising offers we have, and that leads to more revenue," Wren says.

In the past, a query typically took 20, 30, or 40 minutes to return, and sometimes as long as 20 hours. Now, the HP Vertica Analytics Platform returns queries in one-half to one minute on average, and five minutes at most; a typical query runs 40 to 80 times faster than on the legacy platform. And not only is there more data, but analysts, or anyone asking a question of the HP Vertica Analytics Platform, can rapidly analyze this voluminous new data. The frustrations associated with working on an underperforming data warehouse have been eliminated. Instead of waiting for queries to return, the staff can ask more insightful questions and perform additional analysis that ultimately leads to better customer insight, all in far shorter cycle times. "We have found that HP Vertica works well, and we are way more productive," Wren says. "Since the deployment I have heard nothing but 'this is awesome.'"

VERTICA
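As an illustration of the geographic prospecting query described above, the following is a hedged sketch, not Cardlytics' actual schema or the Vertica client API. The table and column names (merchant_transactions, region, amount) are assumptions, and sqlite3 stands in for the analytics platform so the example stays self-contained.

```python
# Sketch of a geographic merchant-prospecting query: rank merchants in one
# region by transaction volume to build a weekly call list for sales.
# sqlite3 is used only so the example runs anywhere; the schema is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE merchant_transactions (
        merchant_id INTEGER, merchant_name TEXT, region TEXT, amount REAL
    );
    INSERT INTO merchant_transactions VALUES
        (1, 'Corner Cafe', 'Atlanta', 12.50),
        (1, 'Corner Cafe', 'Atlanta', 8.75),
        (2, 'Book Nook',   'Atlanta', 30.00);
""")

prospects = conn.execute("""
    SELECT merchant_name, COUNT(*) AS txns, SUM(amount) AS volume
    FROM merchant_transactions
    WHERE region = ?
    GROUP BY merchant_name
    ORDER BY volume DESC
    LIMIT 200
""", ("Atlanta",)).fetchall()

for name, txns, volume in prospects:
    print(name, txns, volume)
```

Queries of roughly this shape, run over far larger transaction volumes on a scale-out column store, are the kind of workload the speed figures above refer to.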

Data Virtualization is Vital for Maximizing Your NOSQL and Big Data Investment

Over the last few years, the growth of non-relational data technologies has continued unabated, driven by the need for faster and smarter analytics as well as the increasing size of data itself. Technical buzzwords like NOSQL and Big Data are now commonplace. In this new post-relational age, it remains important to realize that simply throwing money and resources at the latest NOSQL or Hadoop technology won't solve business problems by itself. Having a strategy in place is vital so your company can actually use Big Data, not merely manage it. But with any new technology investment in Big Data, a tangible risk lurks: visions of creating even more data silos, integration problems, and data governance nightmares can frighten any CIO. Enter Data Virtualization (DV). DV helps integrate old and new data sources, exposing them through a unified data services layer that allows all users to leverage all data assets.

NOSQL IS NOT A TURNKEY SOLUTION

Considering its nascent status, NOSQL brings both advantages and disadvantages to the table. Its benefits are well documented: the ability to process large amounts of data quickly, combined with the added flexibility of storing both structured and unstructured data. On the other hand, it is important to realize that NOSQL databases aren't the general-purpose data management solutions of the past. Different types of NOSQL databases exist to solve different problems. Apache Cassandra is perfect for processing a huge number of writes against a single key-value table, but don't try to use it for running aggregates. Differences in data models can lead to difficulties in integration with existing data. Non-standard APIs make client programming more time-consuming and less productive, ultimately costing more in resources, which impacts your bottom line. Choose NOSQL technologies for their advantages, but leverage those gains through an easy-to-consume Data Services architecture to provide maximum value to your business.

DATA VIRTUALIZATION TO THE RESCUE

Data Virtualization allows for universal data access, ensuring a standardized interface to disparate sources of data while retaining their unique advantages. The enterprise benefits from virtualization, as additional applications are now able to consume the entirety of your organization's data. Data Virtualization also offers flexible integration, helping to prevent those dreaded data silos. Advanced Data Virtualization tools are able to import hierarchical content or provide SQL query capabilities for non-relational databases like HBase, which runs on top of the Hadoop Distributed File System (HDFS). These capabilities allow data to flow freely throughout the enterprise, facilitating the creation of integrated data feeds and views. Leveraging the Data Services provided through Data Virtualization enhances the use of data within your business, helping to achieve a faster ROA, or Return on (Data) Assets. Additionally, the absence of data silos means implementing data security and governance programs just became a lot easier!

DENODO HELPS CUSTOMERS MAXIMIZE BENEFITS

Not all Data Virtualization tools are created equal. Denodo provides a flexible DV solution with the broadest data access to Big Data as well as most structured, semi-structured, and unstructured sources.
It features a unique extended relational model to normalize these sources, ultimately publishing data services in over a dozen familiar formats so they can be leveraged widely. This advantage is evident for Denodo customers across a wide range of industries. Whether integrating millions of point-of-interest (POI) data points for a leading navigational data service, enabling smart network analytics from machine-level Big Data, or enhancing a mutual fund company's MongoDB implementation with user-defined annotations and attributes offered through a virtual data service layer, Denodo brings real-world, proven success to Data Virtualization leveraging Big Data.

Data Virtualization needs to be a fundamental part of any data strategy, for new initiatives as well as traditional systems. To derive meaningful ROA from investments in Big Data, Cloud Computing, and NOSQL, consider adopting Denodo's Data Virtualization solution from the get-go.

DENODO TECHNOLOGIES is the leader in Data Virtualization. Please contact us at info@denodo.com or to discuss your next project.
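To illustrate the unified data services idea described in this piece, here is a minimal sketch, assuming a Python environment. The connector classes and view names are hypothetical stand-ins, not Denodo's or HBase's APIs; a real Data Virtualization product supplies its own connectors, SQL engine, and publishing formats.

```python
# Minimal sketch of a data services layer: one access point that hides
# whether a logical view is backed by a relational table or a non-relational
# store. Both back ends here are simple stand-ins for the real thing.
import sqlite3


class RelationalSource:
    """Stand-in for a relational database exposed through the layer."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
        self.db.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
        self.db.commit()

    def fetch(self, key):
        return self.db.execute(
            "SELECT id, name FROM customers WHERE id = ?", (key,)
        ).fetchone()


class KeyValueSource:
    """Stand-in for a wide-column or document store such as HBase or MongoDB."""

    def __init__(self):
        self.rows = {1: {"last_click": "/pricing", "segment": "smb"}}

    def fetch(self, key):
        return self.rows.get(key)


class DataServicesLayer:
    """Routes a logical view name to whichever source actually holds it."""

    def __init__(self):
        self.views = {"customer": RelationalSource(), "activity": KeyValueSource()}

    def get(self, view, key):
        return self.views[view].fetch(key)


layer = DataServicesLayer()
print(layer.get("customer", 1))   # served from the relational source
print(layer.get("activity", 1))   # served from the key-value source
```

Consumers ask for a view by name and never learn, or care, which engine answered; that indirection is what keeps new Big Data and NOSQL sources from turning into new silos.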