Microsoft Big Data and Analytics


Executive Summary

Microsoft has established a firm foothold in the world of traditionally structured data with Microsoft SQL Server* and an even firmer foothold in the world of data analysis with tools such as Microsoft Excel*. However, the big data era requires solutions to store, query, and analyze data beyond that which is traditionally structured in relational databases or spreadsheets. Microsoft has responded to this big data challenge not only by offering a new big data solution, but also by describing a broad solution for comprehensive data management and analysis that is supported by a combination of new and old Microsoft products.

The big data trend in recent years has been largely driven by the popular, open-source software framework of Apache Hadoop*. Apache Hadoop allows massive amounts of data that is not structured into relational databases to be stored in clusters of commodity servers and then analyzed for correlations, trends, and other potentially valuable information. So popular has Apache Hadoop become as a big data solution that to many, the terms "big data" and "Apache Hadoop" have become synonymous.

Microsoft is offering an Apache Hadoop component with Microsoft HDInsight*, a set of services built on Hortonworks Data Platform* (HDP*) for Windows*. More specifically, HDInsight can refer to either of two separate Microsoft products, both still in preview and months away from general release: HDInsight Server, an on-premises solution, and Windows Azure HDInsight Service*, a completely cloud-based solution.

Although Microsoft does offer these two new Apache Hadoop products for storing and mining both semi-structured and unstructured data, the company has also been keen to steer the big data conversation away from the need for big data solutions per se and toward the need for a universal data management and analysis solution.
Until recently, in fact, Microsoft used the term "big data" to refer to this universal vision, but its most recent messaging makes a distinction between the big data of Apache Hadoop and other forms of data.

Microsoft's broader vision is supported in part by Microsoft SQL Server 2012 Parallel Data Warehouse* (PDW), which is a data-warehouse hardware appliance that stores only structured data but that also supports queries of both structured and unstructured data through Microsoft's proprietary PolyBase technology. Microsoft also positions SQL Server Analysis Services (SSAS), Excel, and Microsoft SharePoint Server* as part of its "all data" tool set, along with optional analysis add-ons for Microsoft Office* such as PowerPivot, Power View, Power Map, and Power Query.

Contents

Executive Summary
Evaluating the Microsoft Data Platform
Is Microsoft Really Democratizing Big Data?
Does Microsoft Offer a Truly Comprehensive Data-Management Solution?
Conclusion
Microsoft's Big Data Vision
Microsoft's General Claims about its Comprehensive Data Solution
Claim: The Microsoft big data solution offers an integrated platform for managing data of any type or size
Claim: Microsoft's big data solution gives you the power to enable anyone in your organization to easily glean insight from your data so they can make smarter decisions
Microsoft HDInsight*: Microsoft's Apache Hadoop* Solution
Creating HDInsight Service Clusters
HDInsight Storage Options
HDInsight Management
Getting Data in and out of HDInsight
Technical Notes about HDInsight
Microsoft's Claims about HDInsight
Claim: [HDInsight lets you] accelerate the deployment with the cloud by deploying an Apache Hadoop cluster on Windows Azure* in just 10 minutes
Claim: Microsoft simplifies programming on Apache Hadoop
Claim: [Microsoft big data lets you] seamlessly extend privileges across HDInsight with Active Directory*
Claim: HDInsight is 100% compatible with Apache Hadoop
SQL Server 2012* Parallel Data Warehouse: An (Almost) All-in-One Data Solution
PDW Hardware Specifications
Dell Parallel Data Warehouse Appliance
HP AppSystem for Microsoft SQL Server 2012 Parallel Data Warehouse
How PDW Works
Comparison of Data Warehousing Appliances
Big Data Integration
PolyBase
CREATE EXTERNAL TABLE Statement
CREATE TABLE AS SELECT Statement
Querying the Data
Pushing Data to Apache Hadoop from PDW
Roadmap for PolyBase
ETL in PDW
Microsoft's Claims about PDW
Claim: PolyBase for PDW provides seamless integration of Apache Hadoop data with the data warehouse in a single query
Claim: HDFS Bridge in PolyBase enable[s] direct communication between HDFS data nodes and PDW compute nodes
Business Intelligence and Analytics
Apache Hive* ODBC Driver
PowerPivot
Power Query
Power View
Power Map
Microsoft's Claims about BI
Claim: HDInsight democratizes the power of big data BI
Claim: [Microsoft lets you] analyze big data with familiar tools
Conclusion
Notes

Evaluating the Microsoft Data Platform

Microsoft makes two alluring pitches for its suite of data products. The first is that its solution can bring the power of big data to the masses, making queries easier to submit and data easier to analyze with tools that are already ubiquitous. The second is that the Microsoft solution offers a single, comprehensive way to manage all enterprise data regardless of size, structure, or speed.

Is Microsoft Really Democratizing Big Data?

Despite the near-exuberant rhetoric about bringing big data analysis to the masses, Microsoft's progress on this count has been somewhat modest. Microsoft is indeed lowering the barrier to entry for big data, but only incrementally. Its clearest success along these lines comes in deployment and management. Whether on premises or in the cloud, HDInsight is easy to set up and manage compared to other big data solutions, especially for IT personnel who lack Linux* expertise. This solid innovation, however, does not simplify deriving value from data in the cluster. For this ultimate purpose, HDInsight only modestly reduces the difficulty of searching, analyzing, and mining Apache Hadoop data compared to other Apache Hadoop solutions.

Microsoft's unique contribution toward simplifying data mining from Apache Hadoop clusters is a set of programming libraries that allows programmers to run operations against Apache Hadoop data in simpler programming languages, such as JavaScript* and .NET* languages such as C# and F#. It also offers an interactive JavaScript console that allows programmers to run JavaScript commands against data in Apache Hadoop files one line of code at a time. As a comparison, the classic WordCount program in Apache Hadoop requires approximately 60 lines of code in Java*, but only 15 in JavaScript. Such advancements will allow more people to gain insights from data stored in Apache Hadoop files, but that wider group must still be programmers.
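For readers unfamiliar with WordCount, the following minimal sketch shows the map, shuffle, and reduce steps that such a program performs. It is written in plain Python purely for illustration and runs in a single process; it does not use the HDInsight JavaScript libraries, the .NET SDK, or any Apache Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped values to get one total per word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three steps in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))
```

Even in this compact form, the programmer still has to think in terms of keys, values, and grouping, which is consistent with the point that the audience remains programmers.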
One area where Microsoft is truly democratizing data analysis and visualization is on the client end, in Excel. Excel has the ability to take data stored in individual Apache Hadoop files, run traditional database queries against this data, perform analysis on tables in this data, and finally present this data in impressive visualizations that can provide valuable insights. However, it is essential to understand first that Excel can load data from any Apache Hadoop source, not just from HDInsight. Excel allows users to import Apache Hadoop data from any source by means of a special add-on driver (the Apache Hive* Open Database Connectivity [ODBC] driver from Microsoft). As much as Microsoft is attempting to connect Excel and HDInsight as part of a single solution, there is no substantial advantage to choosing HDInsight as the particular backend source of Apache Hadoop data in Excel.

Moreover, Excel does not allow users to perform complex operations, such as machine learning, that analyze or mine vast amounts of data from an Apache Hadoop cluster in the way the term "big data" suggests. With Excel, information workers can merely import individual Apache Hadoop files and perform analyses and visualizations on tables stored in these files.

Does Microsoft Offer a Truly Comprehensive Data-Management Solution?

The closest Microsoft comes to a comprehensive data solution today is with its PDW hardware appliance, which includes SQL Server 2012 to store structured data and which can also connect to Apache Hadoop data from an external source. PDW thus enables unified access to both structured and unstructured data. However, PDW does not currently favor any particular Apache Hadoop solution as the external source of unstructured data, making it a big data solution far from specific to Microsoft. Microsoft's current big data solution is also limited in that none of its components can handle streaming unstructured data, such as from social media or user clickstreams.
PDW might have clear limitations today, but in the future the appliance is likely to fulfill Microsoft's promise of delivering truly comprehensive data management on-premises, if at a high price. This comprehensive vision is set to be realized with the next version of PDW, which will likely include a preinstalled version of HDInsight Server (at least as an option). The ability to perform real-time queries of unstructured data streams is also likely to be incorporated into future versions of HDInsight, making Microsoft's data-handling capabilities truly comprehensive. Microsoft has not confirmed that it will include PolyBase in a future, general release of SQL Server outside of PDW, but such a move is also plausible. Adding PolyBase broadly to SQL Server would bring the capabilities of handling structured and unstructured data to the wider database market.

A cloud-based, on-demand solution that meets Microsoft's promise of comprehensive data management is also likely to arrive eventually. Windows Azure* already allows users to create, store, and manage databases in Windows Azure SQL Database* online, so all of the components of a comprehensive data solution will be available through Windows Azure when Windows Azure HDInsight Service matures. It is unclear if a wider distribution of PolyBase to SQL Server would extend to a cloud-based version, however. Even if it does, the bottleneck of upload speeds on large, proprietary data sets could limit the usefulness of the cloud-only option for some data-heavy firms. However, the cloud-only solution will present an attractive option for firms that generate data online or that work with public data files.

Conclusion

Microsoft provides a vision for big data within a larger context of all data, structured and unstructured. While this vision is tantalizing for the future, it ultimately lacks substance today. Democratizing big data would hold some of the same revolutionary promise that personal computing and later the Internet realized in the last three decades, yet it is far from clear that Microsoft will ultimately consummate this revolution. PolyBase shows potential for managing and analyzing structured and semi-structured enterprise data by using familiar database skills, but it is currently only available in a high-end data-warehouse appliance. Using Excel as a frontend for big data analysis is another alluring vision, but it too is limited to dealing with structured and semi-structured data. Moreover, if Excel continues to be agnostic about the big data backend supporting it, it does not provide an argument for companies to pick HDInsight over any other Apache Hadoop solution.

Most fundamentally, Microsoft's solution for unstructured big data is still not released, and only time and general usage will truly reveal its strengths and its faults.

Despite these reservations, there are reasons to be optimistic about Microsoft's chances of bringing big data to the masses in the future. Compared to other companies, Microsoft has more of the components in place for a comprehensive data solution, including popular database management software in SQL Server, a rapidly maturing cloud provider in Windows Azure, widely used business intelligence tools, and the resources to invest in this comprehensive vision for the long term.

Microsoft's Big Data Vision

Microsoft is currently developing a big data solution whose main components are likely to be released over the next year. These products have not yet been finalized, but their features have been made public, and Microsoft's own statements about its soon-to-be-released big data tools provide insight not only into the company's big data strategy, but also into its broader data strategy in general. This paper provides an overview of this broader strategy and an analysis of Microsoft's big data claims.

Big data as a trend relies technically on the open-source software framework of Apache Hadoop. Developed in large part at Yahoo!, Apache Hadoop allows nearly unlimited amounts of unstructured or semi-structured data (such as is found in log files) to be stored in clusters of inexpensive servers and then analyzed for correlations, trends, causal relationships, and other insights. Apache Hadoop has become the industry standard for big data, and for many, the terms "big data" and "Apache Hadoop" have become synonymous. For many companies selling a big data solution, the conversation about big data begins and ends with Apache Hadoop.

Microsoft's vision of big data differs from many others in that it has publicly positioned Apache Hadoop as only one component of a more comprehensive data strategy. This comprehensive strategy includes not only the unstructured and semi-structured data that are the accepted mainstays of big data, but also data that is structured (such as in the traditional database tables of a data warehouse), along with the business intelligence tools used to analyze all data, whether unstructured, semi-structured, or structured. This broader "all data" vision allows Microsoft to draw into the big data conversation the company's existing strengths in products such as SQL Server and Excel.
By re-imagining the business staples of SQL Server and Excel as having a role in a big data solution, Microsoft is targeting its suite of big data products toward the many businesses that have already invested heavily in these tools and accumulated large amounts of potentially useful data in them. Microsoft is also targeting the many companies that are highly skilled in common software tools but that lack the specialized knowledge reserved for data scientists and pure Apache Hadoop experts.

The most central component of Microsoft's big data strategy is HDInsight, an Apache Hadoop solution built from a particular Apache Hadoop distribution, namely HDP for Windows. (HDP for Windows, developed by Hortonworks, Inc., is in fact the first distribution of Apache Hadoop that runs natively on Windows, and it is already publicly available as a free tool.) HDInsight can actually refer to either of two separate Apache Hadoop products, both still available only as preview versions: HDInsight Server, an on-premises solution, and Windows Azure HDInsight Service, a completely cloud-based solution. Both of these options are touted as versions of Apache Hadoop that are easier to set up and use than the Apache Hadoop products offered by competitors.

More recently, Microsoft has also described HDInsight as a solution for analyzing semi-structured data in particular, such as data sourced from smartphones, web sites, RFID tags, and Twitter feeds. Microsoft has also hinted that its search engine technologies, Bing* and Microsoft FAST Search, will act as the solutions for interacting with completely unstructured data, such as documents. Code from both products was in fact incorporated into the search function in Microsoft SharePoint 2013*. However, Microsoft has not elaborated on the particular role it sees for its search engines within its comprehensive data strategy.
A second cornerstone of Microsoft's big data vision is SQL Server 2012 PDW, a hardware appliance that supports queries of data stored both in SQL tables and in Apache Hadoop files through Microsoft's proprietary PolyBase technology. PDW is already available at a price of approximately $1.5M. (Note that data warehouses commonly cost as much as $30M, so while high, the cost of PDW is actually low relative to that of the competition.)

The third, and currently final, component of Microsoft's big data solution is its business intelligence (BI) and visualization tools. These tools include Excel most importantly, but also Microsoft SharePoint Server and Microsoft Office 365*, along with optional analysis add-ons such as PowerPivot, Power View, Power Map, and Power Query.

Future components might be added to this suite of big data products as they become available. For example, the next version of SQL Server will include an in-memory online transaction processing (OLTP) engine, currently code-named Hekaton. Hekaton will allow any new products based on it to efficiently process data captured in real time, such as from data streams. It is plausible that Microsoft's big data strategy will eventually reflect this new functionality provided by Hekaton and include a real-time data analysis tool.

Although Microsoft is describing these various products as components of an integrated big data solution, they do not function cohesively today. It is more accurate to view these components as a list of separate tools that might slowly become integrated over time.

Another limitation to keep in mind about Microsoft's big data solution is that its central component, HDInsight, is still a work in progress and many months away from release. Moreover, there is even a question about whether HDInsight will be outdated when it finally is released. HDInsight is currently based on HDP for Windows 1.x, which in turn is based on the Linux-exclusive Apache Hadoop 1.0. The next version of Apache Hadoop based on Linux, version 2.0, is currently in community preview and is scheduled for general release in late summer 2013; it offers an architectural overhaul that promises to dramatically improve performance and extensibility.
Hortonworks's port of Apache Hadoop 2.0 to HDP for Windows 2.0 is currently being targeted for late […]. Any future version of HDInsight that incorporates the updates in Apache Hadoop 2.0 can only be built after HDP for Windows 2.0 is finalized in late […].

Microsoft's General Claims about its Comprehensive Data Solution

Microsoft's claims surrounding its all-data solution fall into three broad categories: that Microsoft provides a platform to manage data of any type and size, that the Microsoft solution provides a way to analyze all data, and that the Microsoft solution enables information worker generalists to glean insights from big data. While these claims are generally accurate, careful examination of each claim yields a more nuanced picture.

Claim: The Microsoft big data solution offers an integrated platform for managing data of any type or size.¹

In discussing its comprehensive data solution, Microsoft places data into two broad categories for management: structured data (managed by SQL Server) and semi-structured and unstructured data (managed by HDInsight). The fact that Microsoft points to two products actually hints at the lack of an integrated platform for data management: data for SQL Server and data for Apache Hadoop are not integrated into a single platform.

Even within each discrete product, data is not necessarily integrated. On the one hand, it is true that SQL Server is the management tool for structured data. On the other hand, managing data in HDInsight is more complex. For companies choosing the cloud-based Windows Azure HDInsight Service as their Apache Hadoop option, both semi-structured and unstructured data are likely to be stored and managed in Windows Azure blob storage. For firms choosing the on-premises HDInsight Server option, semi-structured and unstructured data are likely to be managed in separate locations. Semi-structured data will likely be stored and managed in Apache Hadoop Distributed File System (HDFS). Unstructured data, such as documents, spreadsheets, presentations, videos, and audio recordings, will likely be managed not in Apache Hadoop but in SharePoint, in a Microsoft product-centric IT deployment.

The technology that currently comes closest to realizing the claim of an integrated platform is PolyBase. PolyBase should not be viewed as a silver bullet, however. Beyond being currently locked away in a specialized, expensive data warehouse appliance, it is unclear to what extent it will integrate with Microsoft's principal tool for unstructured data querying, Bing, or with Microsoft FAST Search for queries in SharePoint. As with so many other aspects of the Microsoft data vision, time alone will tell how and to what degree organizations can implement them. In general, Microsoft does not currently offer a comprehensive data management solution but rather a set of tools and products that allows organizations to handle structured, semi-structured, and unstructured data.

Claim: Microsoft's big data solution gives you the power to enable anyone in your organization to easily glean insight from your data so they can make smarter decisions.²

This claim exaggerates the democratizing power of the Microsoft big data solution. Microsoft's integration of ubiquitous and well-understood tools for big data analytics (particularly Excel) should not be confused with making big data queries and analysis inherently easier. Using laymen's tools for big data work is not the same as putting big data insights within reach of all laymen.

That said, this represents a key part of Microsoft's competitive advantage in the big data arena, particularly with the saturation of Excel in the enterprise productivity market. Many more knowledge workers are familiar with Excel than with even SQL queries, for example, opening up direct examination of big data sets to a larger pool of analysts who previously had to work through middlemen like data scientists. Moreover, Excel add-ins such as PowerPivot, Power View, Power Map, and Power Query definitely put more analytical power in the hands of end users than before.
IT organizations looking at these solutions, however, should keep their eyes wide open for the behind-the-scenes work that can go into preparing data sets for wider use within a company. A sample data set of electrical usage of households in two Dallas suburbs, used to demonstrate Power Map and Power View in Excel, provides a telling example. The Microsoft team loaded Dallas County Appraisal District flat-file records into SQL Server, converted geographical coordinates within them from a planar to an ellipsoidal projection with a third-party tool, and calculated the centroid of each land parcel in SQL Server to obtain a longitude and latitude figure for each plot before exporting the data to Excel. (All of this came before adding details to the data set, such as simulated rates of electricity usage.) The result was a rich data set that could be dissected by information workers across a variety of dimensions, including time. The route to get there was anything but trivial, however.

Microsoft HDInsight*: Microsoft's Apache Hadoop* Solution

HDInsight is the brand Microsoft has assigned to its two upcoming Apache Hadoop products: the cloud-based Windows Azure HDInsight Service and the on-premises HDInsight Server. Both of these solutions are built from a core of HDP for Windows. HDInsight in both cases thus refers to a product composed of this basic Hortonworks Apache Hadoop distribution in addition to extensive software customizations added by Microsoft. (HDInsight and HDP for Windows do not, in other words, refer to distinct components that communicate with each other.)

Of the two versions of HDInsight, Microsoft has promoted the cloud-based Windows Azure HDInsight Service to a much greater degree. This product, hosted on Windows Azure, is also expected to be released first, most likely in Q[…]. The emphasis on the cloud-based HDInsight suggests that this version of the product aligns more closely with Microsoft's chosen market positioning for HDInsight in general.

The HDInsight Service web page (found at windowsazure.com/en-us/services/hdinsight/) describes the service by featuring words and phrases such as "gain insight from any data, any size, anywhere," "provides simplicity," "ease of management," "simplicity of Windows Azure," "simple and straightforward," "seamless scale," "quickly create," "cost savings only possible on a cloud environment," "glean insights on all your data with familiar tools," and "analyze all your data easily." The messaging is clear: HDInsight Service is simple, cost-efficient, and takes advantage of existing knowledge.

Simple as it might be, HDInsight Service is not the only cloud-based Apache Hadoop solution. Other such products include Amazon's Elastic MapReduce*, Joyent Solution for Apache Hadoop*, and InfoChimps Cloud::Hadoop*. Microsoft's offering differs from these others most obviously in that it runs on Windows and that it is integrated into the Windows Azure platform.

Another idiosyncrasy of HDInsight is that it is currently based on Apache Hadoop […] and HDP for Windows 1.1.0,³ even though (as of August 2013) the most recent stable releases of Apache Hadoop based on Linux are Apache Hadoop […] and HDP […]. Even the most recent version of HDP for Windows is a later version: […]. Because Apache Hadoop is a quickly maturing platform, the difference in incremental updates can be significant. For example, HDP features a revision of the Hive query language called the Stinger Initiative that supports 50 times faster performance and increased compatibility with the SQL query language, but this technology is not currently included in HDInsight.

In addition, the next full version of Apache Hadoop, Apache Hadoop 2.0, is expected to be released in Q[…] and to be incorporated into HDP for Windows in Q4. Apache Hadoop 2.0 is an important update that will dramatically improve the efficiency and extensibility of the platform, but it is not clear when these updates will reach HDInsight.
Creating HDInsight Service Clusters

Clusters created in HDInsight are intended to be disposable as a way to minimize costs. HDInsight was designed with the expectation that users will create an HDInsight cluster, load the data needed, run the analyses desired, and then destroy the cluster.

HDInsight promises to be simple, and as far as the procedure to create a new cluster is concerned, it lives up to this promise. With the Quick Create option in particular, the user merely chooses the cluster size (as defined by the number of nodes) and then assigns a name, password, and storage account for the cluster. Once the user clicks the option to create the cluster, the process takes 15 to 20 minutes.

HDInsight Storage Options

HDInsight allows data to be stored in the local HDFS file system, as does any Apache Hadoop distribution. However, an option unique to HDInsight Service is the Azure Storage Vault (ASV) protocol, which builds on the HDFS API to map Apache Hadoop operations to Windows Azure blob storage instead of to local HDFS. Through ASV, customers can keep their Apache Hadoop data in an inexpensive Windows Azure blob storage account and avoid having to import this data into the physical compute nodes of the HDInsight cluster. Because the data accessed through ASV isn't physically stored in the HDInsight cluster, the data remains in Windows Azure blob storage before clusters are created and after they are destroyed. After users spin up an HDInsight cluster, they can point operations such as Hive queries toward data that has been stored in Windows Azure blob storage by using a URI beginning with asv:// or asvs://.

The drawback to ASV is that, because this data is not stored in the Apache Hadoop cluster itself, performance is not always optimized.
However, write performance on Windows Azure blob storage is much faster than it is on HDFS, and with large file reads, temporary writes can be used so often that ASV can actually result in better overall performance than local HDFS storage. Figure 1 shows the setting to configure ASV for HDInsight.
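As a rough illustration of the asv:// and asvs:// addressing mentioned above, the sketch below assembles such a URI from its parts. The source states only the two scheme prefixes; the `container@account.blob.core.windows.net/path` layout used here is an assumption, and the helper name, container, account, and path are hypothetical.

```python
def asv_uri(container: str, account: str, path: str, secure: bool = False) -> str:
    """Build an ASV-style URI for a blob (hypothetical helper).

    Assumption: the authority part has the form
    <container>@<account>.blob.core.windows.net, which is not stated in the text.
    """
    scheme = "asvs" if secure else "asv"  # asvs:// is the secure variant
    host = f"{account}.blob.core.windows.net"
    return f"{scheme}://{container}@{host}/{path.lstrip('/')}"


# Example with made-up names: a log file kept in blob storage rather than local HDFS.
example = asv_uri("weblogs", "examplestore", "/2013/08/access.log")
```

A URI of this kind would then be supplied wherever an operation such as a Hive query expects a data location, so that the data can outlive any individual cluster.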

If optimal performance is important, it is advisable to run tests with data stored in both ASV and local HDFS and compare the results. Note, however, that the cost of storing data in HDFS on HDInsight node instances is much higher than the cost of storing a comparable amount of data in Windows Azure blob storage. Another drawback of HDFS compared to ASV is that data stored in HDFS is removed when the cluster is destroyed. Figure 2 illustrates the relationship between an HDInsight cluster, HDFS, and ASV.

Figure 1. HDInsight cluster management screen

Figure 2. Relationship between an HDInsight cluster, HDFS, and Windows Azure Blob Storage

HDInsight Management

Windows Azure HDInsight Service and its on-premises counterpart HDInsight Server share the same web-based management interface, shown in Figure 3. The graphical user interface (GUI) provides options such as an interactive JavaScript and Hive console for a cluster, a remote desktop connection to the name (main) node, and monitoring data.

Figure 3. HDInsight cluster dashboard

For more fine-grained management of HDInsight clusters and their associated storage, Windows PowerShell* is available. Windows PowerShell cmdlets for HDInsight are currently in version 0.9 and are available through the Microsoft .NET SDK For Apache Hadoop web site on Codeplex (codeplex.com/releases/view/109811).

Beyond these current tools, Microsoft has stated that in the future, Microsoft System Center* will provide tools to manage HDInsight. Given this information, it seems most likely that this System Center integration will become available in the first full release of System Center after the official public release of HDInsight.

Getting Data in and out of HDInsight

HDInsight offers a number of standard Apache Hadoop ecosystem tools for loading and unloading data, such as the Apache Hadoop command or, if the source is a relational database, the Apache Sqoop* tool (included in all Apache Hadoop distributions).
To load log file data, the standard Apache Hadoop ecosystem tool Apache Flume* is used. To load data into or out of Windows Azure blob storage (as opposed to HDFS), users have more options. For example, one can use any number of tools that work directly with Windows Azure blob storage, such as the free graphical tools Azure Storage Explorer* and CloudXplorer* or the command-line tool AzCopy*. One can also use JavaScript via the interactive console, the Apache Hadoop command line (using the Apache Hadoop command), or a .NET language such as C#. Yet another option is Windows PowerShell.

After data has been unloaded, it's typically necessary to clean it before it can be consumed, analyzed, or displayed in a visualization. These data cleaning operations are often referred to as extract, transform, and load (ETL). For ETL operations with HDInsight, the standard Apache Hadoop tool Apache Pig* can be used. However, Microsoft also makes ETL for Apache Hadoop possible through SQL Server Integration Services (SSIS), by means of the Hive ODBC Driver; the Hive ODBC Driver allows external applications such as Excel and SQL Server to connect to Apache Hadoop data.

Technical Notes about HDInsight

HDInsight was developed with ease of use in mind and has not been optimized for other features such as performance. In addition, it is unlikely that HDInsight will ever be built on the very latest version of Apache Hadoop because these versions are written for Linux. As a result, HDInsight will be late to adopt cutting-edge features and frameworks such as Intel's Project Rhino, which provides a common security framework for Apache Hadoop; Intel Advanced Encryption Standard New Instructions (Intel AES-NI), which speeds performance on encryption; or cell-level security in Apache Hadoop, such as is being developed in the Apache Accumulo* project. Regarding security, the only claims Microsoft is in fact making about HDInsight relate to its integration with Active Directory* Domain Services.

Microsoft's Claims about HDInsight

Microsoft's main claims about HDInsight usually suggest that the product makes Apache Hadoop easier. What follows are some representative claims from Microsoft, each followed by a brief analysis.
Claim: [HDInsight lets you] accelerate the deployment with the cloud by deploying an Apache Hadoop cluster on Windows Azure* in just 10 minutes. 7

The claim is specific and easy to verify, but it also suggests something general: that creating an HDInsight cluster in Windows Azure is a trivial exercise and is far easier than setting up one's own hardware cluster. Although it takes closer to 20 minutes to set up an HDInsight cluster, it is true that by using Windows Azure HDInsight Service the circumscribed process of setting up an HDInsight cluster is quick and easy.

However, this statement is essentially misleading because it ignores the necessary aspect of uploading data into the cloud. This uploading process is necessary unless the enterprise data destined for Apache Hadoop is already stored in Windows Azure blob storage (an uncommon scenario). To upload 1 TB of uncompressed data at a rate of 1 MB/second would require approximately 12 days. Compression can reduce the transfer times by 80 to 90 percent, but even assuming the rate can be increased to a brisk 1 TB per day, the process of uploading 100 TB would still take 100 days. (Windows Azure does not yet allow customers to ship physical disks to speed the process of loading data, but this service is planned before the end of )

In addition, regardless of how complicated or time-consuming the process of deploying an Apache Hadoop cluster might be, this difficulty of deployment is not a major deterrent to the sound use of Apache Hadoop. In the broader scheme, ease of installation is a nice-to-have feature of HDInsight that does not help businesses derive any value whatsoever from an Apache Hadoop cluster. Note that for the on-premises version of this product, the true ease of installation cannot yet be verified because the current preview of HDInsight Server (for on-premises deployment) can only be installed as a single node.
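The upload-time figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a sustained transfer rate; the function name and its parameters are illustrative and not part of any Microsoft tool.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)


def upload_days(data_bytes, rate_bytes_per_sec, compression_saving=0.0):
    """Return the days needed to upload data_bytes at a sustained rate.

    compression_saving is the fraction of bytes removed by compression,
    for example 0.8 when compression makes the data 80 percent smaller.
    """
    if not 0.0 <= compression_saving < 1.0:
        raise ValueError("compression_saving must be in [0, 1)")
    bytes_on_wire = data_bytes * (1.0 - compression_saving)
    return bytes_on_wire / rate_bytes_per_sec / SECONDS_PER_DAY


# 1 TB uncompressed at 1 MB/second: a little under 12 days.
one_tb = upload_days(1 * TB, 1 * MB)

# The same upload with an 80 percent compression saving.
one_tb_compressed = upload_days(1 * TB, 1 * MB, compression_saving=0.8)

# 100 TB at a rate of 1 TB per day.
hundred_tb = upload_days(100 * TB, TB / SECONDS_PER_DAY)
```

Under these assumptions the uncompressed 1 TB upload comes to roughly 11.6 days, which matches the "approximately 12 days" in the text, and the 100 TB case comes to 100 days.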

Claim: Microsoft simplifies programming on Apache Hadoop. 9

The claim that a procedure has been simplified can mean either that it has been made simple, or that it has merely been made simpler. In this case it is true that Microsoft has made programming on Apache Hadoop a little simpler, but it is not true that it has made programming on Apache Hadoop simple.

Microsoft's programmatic addition to Apache Hadoop has been to create a .NET software development kit (SDK) and a set of JavaScript libraries for HDInsight, in addition to providing an interactive JavaScript console to Apache Hadoop. (The .NET SDK allows programmers to write essential Apache Hadoop MapReduce jobs in all .NET languages such as C# and F#.) These additions in principle should make programming for Apache Hadoop easier for the many programmers who are not Java specialists. However, programming MapReduce jobs will remain fundamentally complex even in these other languages. For the IT decision maker, the take-away is that developers comfortable in any .NET language or JavaScript will now be able to program MapReduce jobs and quickly perform queries in a console against data stored in Apache Hadoop.

Claim: [Microsoft big data lets you] seamlessly extend privileges across HDInsight with Active Directory*. 10

The implication of this claim is that the integration of HDInsight with Active Directory Domain Services makes managing HDInsight easier. Apache Hadoop is in fact integrated with Active Directory Domain Services, but not yet to the high degree that is suggested in the claim. The locus of integration is currently with user accounts, authentication, and authorization: Windows accounts are used to manage Apache Hadoop, and it's not necessary to create user accounts within HDInsight itself. In fact, with HDInsight, no aspect of authentication and authorization remains siloed in Apache Hadoop; security is handled by Windows Azure, Active Directory Domain Services, or local Windows security. In addition, HDInsight creates a special Windows user account named hadoop that the services related to Apache Hadoop use for logon credentials. These services, and the hadoop logon account, are shown in Figure 4.

Figure 4. HDInsight services

In general, IT should not soon expect dramatic improvements in the manageability of Apache Hadoop because of its loose integration in Active Directory Domain Services. However, it is likely that Apache Hadoop and Active Directory Domain Services will become more integrated over time, leading to (for example) specific HDInsight group policy objects (GPOs) and other administrative benefits. HDInsight will likely need some years to mature before that will happen, however.

Claim: HDInsight is 100% compatible with Apache Hadoop. 11

Buried within Microsoft's general claim that HDInsight makes Apache Hadoop easier is the implicit claim that HDInsight really is Apache Hadoop. Is it? In general, yes. Apache Hadoop runs inside HDInsight, and it is true that Apache Hadoop files from other Apache Hadoop distributions are 100 percent compatible with it. In addition, one can download an Apache Hadoop component such as Apache Mahout* straight from the Apache web site, and it will run on an HDInsight cluster without errors.

However, it is not true (as the claim might be interpreted) that HDInsight has the same features as all standard versions of Apache Hadoop. At the time of this writing, for example, HDP and Apache Hadoop support features that have not yet appeared in HDInsight. This lag time between Apache Hadoop versions is likely to persist indefinitely, and it remains to be seen whether in some cases it could actually lead to file or code incompatibilities.

In general, the take-away for the IT decision maker is that HDInsight is likely to be running a slightly outdated version of standard Apache Hadoop. Today, code and syntax are 100 percent portable from standard Apache Hadoop, but in the future, exceptions to this rule cannot be ruled out. Ultimately, however, Microsoft has made clear that it wants to remain 100 percent compatible with Apache Hadoop, so if such an incompatibility should arise, it will likely be a temporary problem.

SQL Server 2012* Parallel Data Warehouse: An (Almost) All-in-One Data Solution

Another pillar in Microsoft's all-data product lineup is SQL Server 2012 PDW. PDW is a massively parallel processing (MPP) data warehousing appliance that combines custom software built on SQL Server 2012 with commodity hardware. Currently, the appliance is sold in various scalable configurations only by Dell and Hewlett-Packard. At the lowest end, both vendors sell a one-quarter rack version (of a standard 42U rack). The Dell appliance can scale up to 6 racks, and the HP counterpart can scale up to 7 racks.

A key concept in understanding PDW is that it represents a scale-out solution, as opposed to a scale-up solution. When users run T-SQL queries against PDW, the queries are broken down and distributed among all required nodes. The processing itself is therefore distributed and not centralized. As nodes are added to the appliance, the raw processing power of PDW increases in an essentially linear manner.

Storage in the PDW appliance is both replicated and distributed. Smaller tables (approximately 5 GB or smaller) are replicated among all nodes for improved performance. Larger tables are broken up and distributed across nodes.
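The scale-out idea described above (small tables replicated, large tables split across nodes, and query work done per node before being combined) can be illustrated with a short sketch. This is a conceptual model only, not PDW's implementation: the CRC32 hash, the function names, and the sample rows are all assumptions chosen for illustration, and only the 5 GB threshold comes from the text.

```python
import zlib
from collections import defaultdict

REPLICATION_LIMIT_GB = 5  # approximate threshold quoted in the text


def placement(table_size_gb):
    """Decide whether a table is copied to every node or split across nodes."""
    return "replicated" if table_size_gb <= REPLICATION_LIMIT_GB else "distributed"


def distribute(rows, key, node_count):
    """Assign each row to a node by hashing its distribution key.

    CRC32 is a stand-in hash (assumption); the real appliance's hash
    function is not described in the source.
    """
    nodes = defaultdict(list)
    for row in rows:
        node_id = zlib.crc32(str(row[key]).encode("utf-8")) % node_count
        nodes[node_id].append(row)
    return nodes


def distributed_sum(nodes, column):
    """Each node sums its own rows; a coordinating step adds the partial sums."""
    partial_sums = [sum(row[column] for row in rows) for rows in nodes.values()]
    return sum(partial_sums)


# Hypothetical click data: 1,000 rows spread over 7 distinct URLs.
rows = [{"url": f"/page/{i % 7}", "hits": i} for i in range(1000)]
nodes = distribute(rows, "url", node_count=4)
total = distributed_sum(nodes, "hits")
```

Because rows with the same key always hash to the same node, per-key work can be done locally on each node, and only the small partial results need to be combined centrally.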
PDW Hardware Specifications

The PDW versions from Dell and Hewlett-Packard are not identical, but they do share some common specifications. First, both vendors assign 256 GB of RAM to each physical node in the appliance. Second, for both Dell and HP, the first rack in the appliance (or only rack, if there's only one) includes one node assigned control and management responsibilities. Microsoft also specifies that one extra node per rack should remain essentially unused and be included for failover, so this is another common element from both vendors. Finally, in both the Dell and HP solutions, nodes are connected with InfiniBand* and Ethernet, both of which are implemented with redundancy. These control and failover nodes along with the redundant networking components occupy 6U in the first (or only) rack, and 5U in all subsequent racks (because the control node is needed only in the first rack).

Dell Parallel Data Warehouse Appliance

Dell's PDW product is officially called the Dell Parallel Data Warehouse Appliance. The following list provides additional detailed hardware specifications about the Dell PDW configuration options, beyond the elements described above:

- Basic scale unit of 10U: 3 servers in a 2U enclosure, and two 4U drive arrays
- Basic scale unit = 3 Dell PowerEdge R620* compute nodes, 2 Dell PowerVault MD3060e* JBOD SAS arrays (102 drives)
- Up to 3 scale units (9 compute nodes) per rack
- ¼ to 6 racks
- 3 to 54 compute nodes total
- 1, 2, or 3 TB storage capacity per drive
- ,223.1 TB raw free storage space
- 79 to 6,116 TB user storage (with compression)
- 6U available for customer space on first rack, 7U on other racks

HP AppSystem for Microsoft SQL Server 2012 Parallel Data Warehouse

HP's PDW product is called the HP AppSystem for Microsoft SQL Server 2012 Parallel Data Warehouse. The HP AppSystem offers a different range of hardware options:

- Basic scale unit of 7U: two 1U servers and one 5U drive array
- Basic scale unit = 2 HP ProLiant DL360 Gen8 compute nodes, 1 HP P6000 JBOD SAS array (70 drives)
- Up to 4 scale units (8 compute nodes) per rack
- ¼ to 7 racks
- 2 to 56 compute nodes
- 1, 2, or 3 TB storage capacity per drive
- ,268.4 TB raw free storage space
- 53 to 6,342 TB user storage (with compression)
- 8U available for customer space on first rack, 9U on other racks

These different hardware specifications for the first rack from each vendor are shown in Figure 5.

Figure 5. Comparison of SQL Server 2012 Parallel Data Warehouse hardware specifications between Dell and HP

How PDW Works

Despite the many components included in PDW, to external clients the appliance looks just like a single instance of SQL Server. T-SQL queries to PDW are directed from clients toward the PDW control node, and the control node eventually responds to the client with the results of the query. To answer the query, the control node uses its metadata to break up an original query into smaller parts and send these smaller component queries to the appropriate nodes. The control node then compiles into one response the results received from these various nodes and then sends this response to the client.

PDW virtualizes all servers on its physical nodes and uses failover clustering to protect these virtualized workloads. No one node (including the control node) represents a single point of failure. Figure 6 shows a view of the PDW from the perspective of an administrator.

Figure 6. SQL Server 2012 Parallel Data Warehouse management portal

Table 1. Comparison of hardware specifications for full-rack implementations of data warehousing appliances from several vendors
Columns: Vendor and Appliance; Memory (GB); Total Cores; Compression; User Storage (TB, Compressed); List Price
Appliances: EMC Greenplum Data Computing Appliance*; IBM PureData System for Analytics N *; Microsoft SQL Server 2012 Parallel Data Warehouse (Dell)*; Oracle Exadata Database Machine X3-2*; Teradata Data Warehouse Appliance 2690*
Values: to $2,000,000 n/a to $1,599,000 2, to $1,569,970 2, to $13,580, to $1,168,000

Table 2. Comparison of input/output (I/O) rates among three data warehouse appliances

Vendor      I/O Bandwidth (GB/sec)   Price per GB/sec of I/O Bandwidth
EMC         24                       $83,333
Microsoft   108                      $14,537
Oracle      100                      $136,440

Comparison of Data Warehousing Appliances

Within the playing field of data warehousing appliances, Microsoft makes essentially three pitches in favor of PDW: that it offers a great value, that it has excellent performance, and that it connects seamlessly to Apache Hadoop. Table 1 compares hardware specifications for full-rack implementations of data warehousing appliances from various vendors. 12 Table 2 compares input/output (I/O) rates among three data warehouse appliances. 13

With respect to value, an advantage highlighted by Table 1 is that, compared to other solutions, the SQL Server 2012 PDW displays a low cost per unit storage. Microsoft is able to attain these cost reductions mainly by using direct-attached storage (DAS) with its nodes instead of storage area network (SAN) storage, an option made possible because of a Windows Server 2012 feature called Storage Spaces. Storage Spaces allows flexible SAN-like storage provisioning from a JBOD SAS array that is attached to one node only.

With respect to performance, Table 2 shows that the I/O throughput of PDW compares favorably with that of the EMC and Oracle solutions. (Data from IBM and Teradata are not available.)
Microsoft claims PDW is also able to speed I/O performance (by over 10 times) through the use of columnstore indexing and batch processing, both members of the xVelocity* family of memory-optimized technologies in SQL Server.

Regarding the integration of PDW and Apache Hadoop, Microsoft is careful not to claim that it is unique among data warehouses in offering this capability. In fact, all of the data warehouse appliance vendors mentioned in Table 1 have presented a product roadmap involving some integration with Apache Hadoop. Of these, however, the PolyBase roadmap is distinctive in its plan to deeply integrate Apache Hadoop processing with PDW processing. The next section provides more detail about PolyBase and its product roadmap.
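The "Price per GB/sec of I/O Bandwidth" column in Table 2 appears to be the list price divided by the I/O bandwidth. The sketch below checks that reading for the two vendors whose upper list price survives intact in Table 1. The pairing of each list price with its vendor is an assumption inferred from the order of the rows; the Oracle row is omitted because its list price is truncated in the source.

```python
def price_per_gb_per_sec(list_price_usd, io_bandwidth_gb_per_sec):
    """Return the list price divided by I/O bandwidth, in dollars per GB/sec."""
    return list_price_usd / io_bandwidth_gb_per_sec


# Upper list prices from Table 1 (vendor pairing assumed) and bandwidths from Table 2.
emc = price_per_gb_per_sec(2_000_000, 24)
microsoft = price_per_gb_per_sec(1_569_970, 108)

# How many times more EMC costs per unit of I/O bandwidth under these figures.
emc_to_microsoft_ratio = emc / microsoft
```

Rounded to whole dollars, these give $83,333 for EMC and $14,537 for Microsoft, which agrees with Table 2 and suggests the table was derived from the full-rack list prices.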

Big Data Integration

PolyBase

PolyBase is a PDW-only feature that provides a means to integrate Apache Hadoop data with SQL Server and to make this data accessible through T-SQL queries. The manner in which PolyBase integrates T-SQL with Apache Hadoop is illustrated in Figure 7.

Figure 7. PolyBase integration of T-SQL with Apache Hadoop

To achieve this integration, PDW must first be connected to an Apache Hadoop source. Administrators can then integrate the external Apache Hadoop data into SQL data on PDW by using either a CREATE EXTERNAL TABLE statement or a CREATE TABLE AS SELECT (CTAS) statement. Administrators can also push data from PDW to Apache Hadoop by means of a CREATE EXTERNAL TABLE AS SELECT (CETAS) statement.

CREATE EXTERNAL TABLE Statement

When an external table is created from Apache Hadoop data, PDW frames a SQL structure around the external data. Users can then query the external table as if it were a normal table residing in a SQL database. If the data is updated in the Apache Hadoop source, query results will show the updated data. However, query performance isn't optimized. Figure 8 shows an example of a CREATE EXTERNAL TABLE statement that creates a table called ClickStream from an Apache Hadoop file called employee.tbl.

Figure 8. Example of a CREATE EXTERNAL TABLE statement from Apache Hadoop

CREATE TABLE AS SELECT Statement

The CTAS statement can be run after an external table is created. When a PDW administrator creates a table as a select statement from an external table, this external data is physically copied into a SQL table that resides in PDW. In this case, PDW can perform parallel processing on the remote Apache Hadoop data, and when the table is created, the administrator can optimize its storage in PDW by distributing it across nodes. The imported Apache Hadoop data then persists in PDW until the new table is deleted. Creating a table as a select statement optimizes query response times, but the imported data is not updated from its source if that source data should ever change.

The following example shows a basic CTAS statement:

    CREATE TABLE ClickStream_PDW
    WITH (DISTRIBUTION = HASH(url))
    AS SELECT url, event_date, user_IP
    FROM ClickStream

Note that Apache Hadoop data does not need to persist as an isolated table. Imported data can also be mashed up with native relational data through JOIN statements.
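The practical difference between the two statements (an external table reflects later changes to its source, while a CTAS copy does not) can be shown with a small analogy. This is a sketch using Python's built-in sqlite3 and csv modules, not PolyBase itself: the in-memory "file", the pipe delimiter, the sample rows, and the helper names are all hypothetical stand-ins.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for a delimited file stored in Apache Hadoop.
hadoop_file = io.StringIO(
    "/home|2013-06-01|10.0.0.1\n"
    "/cart|2013-06-01|10.0.0.2\n"
)


def read_external(source):
    """External-table behavior: parse the source again on every query."""
    source.seek(0)
    return [tuple(row) for row in csv.reader(source, delimiter="|")]


def create_table_as_select(conn, source):
    """CTAS behavior: copy the rows that exist right now into a local table."""
    conn.execute(
        "CREATE TABLE ClickStream_PDW (url TEXT, event_date TEXT, user_IP TEXT)"
    )
    conn.executemany(
        "INSERT INTO ClickStream_PDW VALUES (?, ?, ?)", read_external(source)
    )


conn = sqlite3.connect(":memory:")
create_table_as_select(conn, hadoop_file)

# The source file gains a row after the import has finished.
hadoop_file.seek(0, io.SEEK_END)
hadoop_file.write("/checkout|2013-06-02|10.0.0.3\n")

external_count = len(read_external(hadoop_file))
imported_count = conn.execute("SELECT COUNT(*) FROM ClickStream_PDW").fetchone()[0]
```

After the extra row is appended, the external read sees three rows while the imported table still holds two, which mirrors the trade-off described above: the copy is faster to query but goes stale until it is rebuilt.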

Querying the Data

After data is imported into a table in PDW, users can perform ordinary T-SQL queries on it, as shown in the three examples in Figure 9.

Figure 9. Examples of T-SQL queries performed on data imported to a SQL Server 2012 Parallel Data Warehouse table

Pushing Data to Apache Hadoop from PDW

Finally, PDW administrators also have the option of migrating data from PDW to an Apache Hadoop source. To achieve this, a CETAS statement is used, as in the following example:

    CREATE EXTERNAL TABLE ClickStream (url, event_date, user_IP)
    WITH (LOCATION = 'hdfs://myhadoop:5000/users/outputdir',
          FORMAT_OPTIONS (FIELD_TERMINATOR = ' '))
    AS SELECT url, event_date, user_IP
    FROM ClickStream_PDW

Roadmap for PolyBase

Currently, PolyBase is in phase 1 of a multi-phase rollout. Phase 1 allows data to be imported directly from and exported directly to HDFS on Apache Hadoop. Because MapReduce is bypassed and parallel processing is used, performance for import and export operations is normally optimized.

Phase 2 goes beyond integrating Apache Hadoop data into PDW and will move toward integrating the processing power of Apache Hadoop clusters into PDW queries. This next phase will include a PDW query optimizer that, for all queries of both SQL and Apache Hadoop data, makes a cost-based decision about when to process queries with SQL and when to push queries onto HDFS data as MapReduce jobs.

The goals of PolyBase phase 3 have not been finalized, but Microsoft has publicly stated that it is considering compatibility with Apache Hadoop MapReduce 2.0 (YARN) and more efficient alternatives to MapReduce. No dates have been given for the release of PolyBase phase 2 or phase 3.

Besides this roadmap for planned functionality in PolyBase, Microsoft has occasionally hinted that the technology will eventually be integrated into its SQL Server product, perhaps as soon as the next release (SQL Server 2014).
ETL in PDW

The Microsoft specifications for PDW do not include any ETL server, such as a dedicated instance of SQL Server loaded with SSIS. Both Dell and HP include SQL Server tools installed on the control node, but it is expected that many firms will use a pre-existing ETL server to connect to PDW. Using SSIS packages to import data is sensible if these packages are already created. It should be noted, however, that in PDW, ordinary T-SQL queries offer much better performance as a way to import data. 15

Microsoft's Claims about PDW

This paper focuses on Microsoft's comprehensive data strategy and how the various components of that strategy might work together. Although Microsoft makes claims about PDW that relate to its value and its performance, these claims do not relate to its big data strategy. One important claim that Microsoft is making about PDW, however, does relate to its comprehensive data strategy: that PDW integrates Apache Hadoop data with traditional relational data. We will look at two representative examples.

Claim: PolyBase for PDW provides seamless integration of Apache Hadoop data with the data warehouse in a single query. 16

This claim essentially states that a single query executed against PDW will return both Apache Hadoop data and SQL data. The implication of the claim, along with the word seamless, is that all Apache Hadoop data will easily be brought into the SQL world and made accessible to all users through ordinary T-SQL statements.

The claim can be construed as true if it is limited to describing the availability of data that has already been imported from Apache Hadoop, but it is essentially misleading in describing this process as seamless. In truth, the only Apache Hadoop data that can be queried through SQL statements is data that an administrator has located and made the effort to import with CREATE EXTERNAL TABLE or CTAS statements. In addition, the only Apache Hadoop data that is capable of being imported is data that is semi-structured with delimiters such as commas. Most Apache Hadoop data, however, is not structured at all.

Although importing Apache Hadoop data into a SQL table is certainly a useful capability, this is not a particularly common use case for PolyBase. It is useful only for data that can fit easily into a table (such as log files) and whose location within the Apache Hadoop cluster is known. In the words of Yale researcher Daniel Abadi, [PolyBase lets you] dynamically get at data in Hadoop/HDFS that could theoretically have been stored in the DBMS all along, but for some reason is being stored in Hadoop instead of the DBMS. 17

It should also be noted that other rival technologies offer a more seamless integration of SQL with Apache Hadoop, such as Hadapt Adaptive Analytical Platform* and the Hortonworks Stinger initiative.

Claim: HDFS Bridge in PolyBase enable[s] direct communication between HDFS data nodes and PDW compute nodes. 18

This particular claim is more conservative than the last.
It states merely that instances of SQL Server in PDW can communicate directly with data nodes in HDFS through a PDW component called the HDFS Bridge. The implication of the claim is that IT personnel do not need to use additional tools (such as Hive) or write additional MapReduce scripts to import data from Apache Hadoop to PDW or export data from PDW to Apache Hadoop. SQL communicates with Apache Hadoop directly.

The claim offers a reasonable description of what PolyBase can do, and if anything, it might sell the technology a little short. PolyBase doesn't merely allow users to bypass MapReduce and import and export data directly; it also allows PDW to use parallel processing when it performs queries on an Apache Hadoop cluster.

On the other hand, the claim also hints at the lack of advanced integration between PDW and Apache Hadoop. The two technologies are connected only through a bridge, so importing and exporting data is required before data can be accessed from one system to another. Other companies are indeed working on solutions that dispense with the need for such a bridge. When assessing Microsoft's comprehensive data vision, therefore, it's important to recognize that PolyBase does not represent the sole cutting edge in the integration of SQL and Apache Hadoop.

Business Intelligence and Analytics

BI typically refers to tools used to collect, analyze, and view enterprise data for the purpose of meeting business goals. These three functions of collecting, analyzing, and viewing data for Microsoft have traditionally been filled by SSIS, SQL Server Analysis Services (SSAS), and SQL Server Reporting Services (SSRS), respectively. Microsoft BI has traditionally relied heavily on IT personnel, for example, to create packages in SSIS for importing data, to develop online analytic processing (OLAP) cubes in SSAS, and then to build reports in SSRS that are finally delivered to end-users.

In the big data era, however, Microsoft has been expanding its vision of BI to include what it calls managed self-service BI. In this new vision, IT manages access to data sources, and end-users connect to these data sources as needed with client tools, most notably Excel. Users import data as tables into Excel and then shape and visualize the data as needed. Possible sources of enterprise data can still include databases, but they also now include Apache Hadoop and other sources, such as web pages and Open Data Protocol (OData) feeds. (OData is a data access protocol released under the Microsoft Open Specification Promise.) Excel in particular is able to achieve high performance when handling data sets and processing visualizations because it uses the xVelocity in-memory analytics engine for these purposes. (This in-memory engine was first available only in SQL Server 2008 R2.)

The next section describes some new Microsoft BI features available in Excel and some other tools that users can employ to connect to and manipulate big data.

Apache Hive* ODBC Driver

The Hive ODBC Driver is currently a critical piece of software in Microsoft's all-data strategy. This driver allows Apache Hadoop data sets to be imported into SQL Server, Excel, and Analysis Services through HiveQL queries. Unlike PolyBase, which is available only in PDW, the Hive ODBC Driver connects to Apache Hadoop by allowing HiveQL queries to be translated to MapReduce jobs. Performance is much lower than with PolyBase. (Because PolyBase provides better functionality than the Hive ODBC Driver, there is no such driver for PolyBase.)

The Hive ODBC Driver is central to Microsoft big data because the company's strategy does not provide any BI tools specifically for Apache Hadoop. Microsoft's goal with big data and BI is merely to provide a method to import Apache Hadoop data into well-known Microsoft tools, where this data can be shaped, analyzed, and visualized just like any other data can. Figure 10 shows how the Hive ODBC driver (labeled ODBC for Hive) is used to connect Windows Azure HDInsight Service to Excel, SQL Server, and Analysis Services.

Figure 10. Connection of Windows Azure HDInsight Service to Excel, SQL Server, and Analysis Services through the Hive ODBC driver

PowerPivot

The PowerPivot add-in for Excel first appeared in Excel 2010, allowing users to load large amounts of highly compressed data into Excel from different sources, create relationships within that data, and then perform analysis on the data. In Excel 2013, much of this functionality is now built directly in. Without installing the PowerPivot add-in, one can already import large data sets (millions of rows) from multiple data sources, create relationships between data from different sources and between multiple tables in a PivotTable, create implicit calculated fields, and manage data connections.

In Excel 2013, data is also now automatically loaded into the xVelocity in-memory analytics engine even before the PowerPivot add-in is installed. When the PowerPivot add-in is installed in Excel 2013, more advanced modeling capabilities become available, such as the ability to filter and rename data as it is imported, to define custom calculated fields throughout a workbook, to define key performance indicators (KPIs) to use in PivotTables, and to use the Data Analysis Expressions (DAX) expression language to create advanced formulas. Data imported into PowerPivot can originate from databases, like SQL Server, IBM DB2*, and Oracle, or from other types of data sources, like Apache Hadoop, OData feeds, reporting services reports, and text files. 19 Figure 11 shows the functions available in the PowerPivot ribbon in Excel.

Figure 11. Excel 2013 Power Pivot ribbon

Power Query

Power Query is a new tool whose name has recently been updated from its preview name, Data Explorer. Power Query allows users to query external sources of data, such as the Internet in general or HDInsight, and import detected tabular data sets. Once imported, the data in the table can be modified, combined with other data, analyzed, and visualized by using other tools.

Figure 12 shows how Power Query can be used to import data from HDInsight and other external sources. Figure 13 shows a result when Power Query is used to perform an online search for most populous metropolitan areas in North America. When the data set is selected in the right column, a table containing the data set is automatically created.

Figure 12. Importing data from HDInsight to Excel 2013 Power Query

Figure 13. Excel 2013 Power Query results of an online search for most populous metropolitan areas in North America

Power View

Power View is a visualization tool that first appeared in SharePoint 2010 and that is now also available in SQL Server 2012 SP1 and Excel. Power View allows users to create interactive charts and maps from tabular data in Excel and then add them to a view or dashboard, as shown in Figure 14.

Figure 14. Excel 2013 Power View output of most populous metropolitan areas in North America

Power Map

Power Map is a new visualization tool that until recently was known by its preview name, GeoFlow. Power Map provides 3D geographical visualizations that are superimposed on a globe. Such data can range from remote sensor output to data from Twitter. An important constraint, however, is that Power Map can work only with data preformatted in a table and cannot work with live or streaming data. Figure 15 shows an example of a visualization created in Power Map.

Figure 15. Example of a visualization created in Excel 2013 Power Map

Microsoft's Claims about BI

Microsoft's claims about its BI tools within the context of its big data strategy are essentially variations of one point, that its BI tools bring big data to the masses. These claims can be mostly accurate or mostly misleading, depending on how they are phrased.

Claim: HDInsight democratizes the power of big data BI. 20

This claim explicitly states that HDInsight itself, and not some other component in the Microsoft big data strategy, is democratizing big data BI. The implication of the claim is that by opting for HDInsight over another Apache Hadoop solution, firms will have an advantage in their ability to derive valuable insights from their data. In truth, Microsoft's big data BI strategy is to bring Apache Hadoop data into its suite of existing BI tools, and these tools are almost completely agnostic about the particular source of the Apache Hadoop data. Microsoft does not have any BI solution for HDInsight in particular.
It's true that in Microsoft Excel, some might consider that importing data from an HDInsight account is easier than doing so from a generic Apache Hadoop file, but this difference at the moment is negligible. Moreover, the principal interface between Excel and Apache Hadoop data, the Hive ODBC Driver, is backend agnostic, meaning that it could draw its data from a competing Apache Hadoop distribution as easily as from HDInsight.

Over time, it is plausible that Microsoft will continue to develop HDInsight and Excel in a way that optimizes this connection far more, but for now, it is not accurate to suggest that HDInsight itself lowers the barrier to entry for big data BI.

Claim: [Microsoft lets you] analyze big data with familiar tools. 21

If Microsoft Excel can be considered a familiar tool, then this claim is accurate. In Excel, information workers can perform a HiveQL query to import over a million rows of Apache Hadoop data, clean the data so that it fits into a table, and then analyze this data with advanced tools such as DAX statements. (Note that professional data analysts can also import Apache Hadoop data into the tools they are used to, such as SSAS, and perform the same analytics on this data as they could with any data that is originally in a static, tabular format.)

However, there are some important caveats to keep in mind about using familiar tools to analyze big data. First, one can't import all Apache Hadoop data into Excel (or SSAS). Users can only import data that lends itself to being shaped into a table, such as comma-delimited files and other forms of semi-structured data. Microsoft is in fact heavily promoting semi-structured data as the type of Apache Hadoop data it can handle with its existing tools, but semi-structured data represents only a small percentage of the data that is kept in Apache Hadoop clusters.

Second, the fact that one can use Excel to perform analytics on big data does not mean that this task is in any way easy to perform. The ability to import the useful data through HiveQL, fashion this data into a clean table filled with the right information, and then perform the right analytics in a way that yields valuable insights is a set of skills reserved for specialists such as data analysts familiar with tools such as the DAX scripting language.
It is true that Microsoft has lowered the barrier to entry for reading semi-structured data stored in Apache Hadoop and especially for creating visualizations of tabular data. It is likely that in future releases of Excel and HDInsight, Microsoft will continue to make these processes gradually easier. However, it is unlikely that Microsoft will succeed in bringing true analytics (as opposed to mere visualizations) to the masses, whether for structured data or unstructured data. True data analysis that is capable of revealing valuable and non-obvious insights, after all, is a discipline that requires specialized mathematical and statistical skills that go beyond simple familiarity with a given scripting interface or software tool.

Conclusion

The main products Microsoft includes in its all-data vision (HDInsight, PDW, and Excel) make up a compelling and comprehensive set of features, albeit with some significant limitations. Even with these limitations, however, this vision offers a unique take on big data that is not available through other vendors.

On the positive side, Microsoft offers the many firms that are already heavily invested in Microsoft products a way to ease into big data with minimal adjustment. For example, HDInsight will be manageable from System Center and increasingly integrated into Active Directory Domain Services, reducing administrative overhead compared to other Apache Hadoop solutions. Furthermore, companies planning to deploy future releases of SQL Server will find that this product is likely to include PolyBase, and by extension, a T-SQL connection to Apache Hadoop. Existing BI expertise in Excel, meanwhile, can be used by connecting to both Apache Hadoop and relational data sources. Finally, for the many organizations already moving their servers or data into Windows Azure, Windows Azure HDInsight Service offers an attractive option because it can connect directly to data stored in Windows Azure blob storage.
Another legitimate advantage of Microsoft's vision is ease of implementation and, to a lesser degree, ease of use. Spinning up a data cluster in HDInsight Service is decidedly easier than creating a physical Apache Hadoop cluster on premises, even if uploading big data sets to the cloud can be extremely time-consuming. For ease of use, the .NET SDK and JavaScript libraries for HDInsight, as well as the interactive JavaScript console, do make programming for and interacting with Apache Hadoop somewhat easier than with other Apache Hadoop distributions.

All of this said, although the components of Microsoft's big data offerings are compatible with each other, they are not truly integrated into a single solution and do not yet achieve any significant synergistic effects when used together.

Another significant limitation to Microsoft's big data strategy is that while Apache Hadoop is rapidly improving in performance, extensibility, and ease of use, Microsoft has not yet proven that it can keep pace with these changes. By the time HDInsight is officially released, it might in fact already be outdated, if not far surpassed, by superior Apache Hadoop-based alternatives.

The BI portion of Microsoft's all-data vision does not yet fully live up to claims of democratizing big data. Microsoft's solution manages to simplify relatively easy portions of dealing with big data, such as server-cluster installation or semi-structured data visualization. Benefits such as these provide evolutionary business value but still leave unaddressed the fundamentally difficult aspects of working with big data, such as crafting queries for unstructured data, sanitizing data for visualization, and implementing effective machine-learning algorithms. Until Microsoft finds ways to help non-specialist information workers ask the right questions of any kind of data, the company will not truly live up to its claim of bringing big data to the masses. Today, what Microsoft is truly offering is merely the promise of an all-data solution, a promise that might or might not be realized soon.
None of these limitations can negate the fact that the unusual breadth of Microsoft's product line, which spans Apache Hadoop, relational databases, data warehousing, spreadsheets, the cloud, business intelligence, and server administration, makes the company a unique contender among big data vendors. Microsoft has more of the components for a comprehensive data solution either in place or credibly maturing than any other company. Moreover, Microsoft provides a vision of data management and analysis that could be revolutionary if fully realized.

However, it is important to underscore that Microsoft's all-data vision is currently just that: a vision. Those waiting for the Microsoft all-data solution will have to be patient and see whether it matures into the truly comprehensive, integrated, and synergistic suite of data management and analysis products currently promised. But perhaps the most damaging critique one can level against Microsoft's all-data solution is that it is currently little more than a marketing vision (though a compelling one). In reality, HDInsight is now functional, but only as a preview, and the other main components of the vision, PDW and Excel, are essentially pre-existing products that have been repositioned as critical components of a grand big data marketing strategy.

Notes

1 Microsoft. Microsoft Big Data.
2 Microsoft. From Data to Insights with Microsoft Big Data.
3 Microsoft. What Version of Hadoop Is in Windows Azure HDInsight?
4 Apache. Hadoop Releases.
5 Hortonworks. Hortonworks Data Platform (HDP).
6 Hortonworks. HDP for Windows.
7 Microsoft. Microsoft Big Data.
8 Windows Azure Storage Team Blog. Windows Azure Storage BUILD Talk - What's Coming, Best Practices and Internals.
9 Microsoft. Microsoft Big Data.
10 Microsoft. Microsoft Big Data.
11 Microsoft. Microsoft Big Data.
12 Value Prism Consulting. Microsoft's SQL Server Parallel Data Warehouse Provides High Performance and Great Value. March
13 Value Prism Consulting. Microsoft's SQL Server Parallel Data Warehouse Provides High Performance and Great Value. March
14 For more information about columnstore indexing, see For more information about other technologies in xVelocity, see
15 Data Warehouse Junkie. Rock Your Data with SQL Server 2012 Parallel Data Warehouse (PDW) POC Experiences. June
16 Microsoft. Appliance: Parallel Data Warehouse (PDW)
17 Monash Research. SQL-Hadoop Architectures Compared. June
18 Microsoft. PolyBase
19 The complete list of supported data sources for PowerPivot can be found at
20 Microsoft. Business Insights Newsletter Article. December
21 Microsoft. Microsoft Big Data.

The analysis in this document was done by Prowess Consulting and derived from work done with Intel. Results have been simulated and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Prowess, the Prowess logo, and SmartDeploy are trademarks of Prowess Consulting, LLC. Copyright 2014 Prowess Consulting, LLC. All rights reserved. Other trademarks are the property of their respective owners.