A Platform for Big Data Analytics on Distributed Scale-out Storage System
|
|
|
- Junior Lang
- 10 years ago
- Views:
Transcription
1 A Platform for Big Data Analytics on Distributed Scale-out Storage System Kyar Nyo Aye* Software Department, Computer University (Thaton), The Union of Myanmar *Corresponding author Thandar Thein Hardward Department, University of Computer Studies (Yangon), The Union of Myanmar Biographical notes: Kyar Nyo Aye is a tutor of Software Department at Computer University (Thaton). She got the Degree of Bachelor of Computer Science (B.C.Sc), the Degree of Bachelor of Computer Science (Honours), and the Degree of Master of Computer Science in 2004, 2005, and She received her Ph.D in Her research interests include information retrieval, distributed databases, distributed systems, cloud computing, mobile computing, big data analytics and big data technologies. Thandar Thein received her M.Sc. (computer science) and Ph.D. degrees in 1996 and 2004, respectively from University of Computer Studies, Yangon (UCSY), Myanmar. She did post doctorate research in Korea Aerospace University. She is currently a professor of UCSY. Her research interests include cloud computing, mobile cloud computing, big data, digital forensic, security engineering, and network security and survivability.
2 Abstract Big data analytics is the process of examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations and other useful information. Existing big data platforms such as data warehouses cannot scale to big data volumes, cannot handle mixed workloads, and cannot respond to queries quickly. Hadoop-based platform based on distributed scale-out storage system emerges to deal with big data. In Hadoop NameNode is used to store metadata in a single system s memory, which is a performance bottleneck for scale-out. Gluster file system has no performance bottlenecks related to metadata because it uses an elastic hashing algorithm to place data across the nodes and it runs across all of those nodes. To achieve massive performance, scalability and fault tolerance for big data analytics, a big data platform is proposed. The proposed big data platform consists of big data storage and big data processing. For big data storage, Gluster file system is used and Hadoop MapReduce is applied for big data processing. The Hadoop big data platform and the proposed big data platform are implemented on commodity linux virtual machines clusters and performance evaluations are conducted. According to the evaluation analysis, the proposed big data platform provides better scalability, fault tolerance, and faster query response time than the Hadoop platform. Keywords big data, big data analytics, big data platform, Hadoop MapReduce, Gluster file system, Apache Pig, Apache Hive, Jaql I.INTRODUCTION Today, information is generated continuously around the globe. Almost every growing organization wants to automate most of its business processes and is using IT to support every conceivable business function. This is resulting into huge amount of data being generated in the form of transactions and interactions. Web has become an important interface for interactions with suppliers and customers generating the huge amount of data in the form of s etc. Besides this, there is a huge amount of data emitted automatically in the form of logs like network logs and web server logs. Various Telecom Service Providers get huge amount of data in the form of conversations and Call Data Records. Various Social N/W Sites have started getting TBs of data every day in the form of tweets, blogs, comments, photos and videos etc. Facebook generates 4TBs of compressed data every day. Web Companies like these get huge amount of click stream data generated daily as well. Hospitals have data about the patients, their diseases and the data generated by various medical devices as well. Sensors used in various machines used for production keep generating so much of event data in seconds. Almost every sector like transport, finance is seeing a tsunami of data. Now the important question that arises at this point of time is how do we store and process such huge amount of data most of which is Semi structured or Unstructured. There is a highlevel categorization of big data platforms to store and process them in a scalable, fault tolerant and efficient manner [13]. The first category includes massively parallel processing or MPP Data warehouses that are designed to store huge amount of structured data across a cluster of servers and perform parallel computations over it. Most of these solutions follow shared nothing architecture which means that every node will have a dedicated disk, memory and processor. All the nodes are connected via high speed networks. As they are designed to hold structured data so there is a need to extract the structure from the data using an ETL tool and populate these data sources with the structured data. These MPP Data Warehouses include: MPP Databases: these are generally the distributed systems designed to run on a cluster of commodity servers. E.g. Aster ncluster, Greenplum, DATAllegro, IBM DB2, Kognitio WX2, Teradata. Appliances: a purpose-built machine with preconfigured MPP hardware and software designed for analytical processing. E.g. Oracle Optimized Warehouse, Teradata machines, Netezza Performance Server and Sun s Data Warehousing Appliance. Columnar Databases: they store data in columns instead of rows, allowing greater compression and faster query performance. E.g. Sybase IQ, Vertica, InfoBright Data Warehouse, ParAccel. Another category includes distributed file systems like Hadoop to store huge unstructured data and perform MapReduce computations on it over a cluster built of commodity hardware. Pavlo et al. [1] described and compared MapReduce paradigm and parallel DBMSs for large scale data analysis and defined a benchmark consisting of a collection of tasks to be run on an open source version of MR as well as on two parallel DBMSs. Hadoop is a popular open-source mapreduce implementation which is being used in companies like Yahoo, Facebook etc. to store and process extremely large data sets on commodity hardware. However, in Hadoop the NameNode can become a performance bottleneck because it keeps the directory tree of all files in the Hadoop Distributed File System. The architecture within Gluster does not depend on metadata in any way. Therefore, Gluster has no performance bottlenecks and no inconsistency risks related to metadata. In addition, using Hadoop was not easy for end users, especially for those users who were not familiar with MapReduce. The MapReduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. Hadoop lacked the expressiveness of popular query languages like SQL and as a result users ended up spending hours to write programs for even simple analysis. In order to analyze this data more productively, the query capabilities of Hadoop need to be improved. So, several application development languages have emerged to make it easier to write MapReduce programs in Hadoop and that run on top of Hadoop. Among them, Hive, Pig, and Jaql are popular. The aim of this paper is to propose big data platform that is built upon open source and built on Hadoop MapReduce,
3 Gluster File System, Apache Pig, Apache Hive and Jaql. The rest of the paper is organized as follows: In section 2, we explain big data concepts and technologies such as Big Data and Big Data Analytics. In section 3 we introduce our proposed big data platform and performance evaluations are conducted in section 4. Then vendor products for big data analytics are explained in section 5 and conclusion is described in section 6. II.BIG DATA CONCEPTS AND TECHNOLOGIES This section provides an overview of big data, big data Analytics, Hadoop and MapReduce framework, Apache Pig, Apache Hive, Jaql, Gluster File System and big data platform. Due to space constraints, some aspects are explained in a highly simplified manner. A detailed description of them can be found in [3, 9, 10]. A. Big Data Big Data are data sets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing. There are three characteristics of Big Data: volume, variety, and velocity. Volume: Volume is the first and most notorious feature. It refers to the amount of data to be handled. The sheer volume of data being stored today is exploding. In the year 2000, 800,000 petabytes (PB) of data were stored in the world. This number is expected to reach 35 zettabytes (ZB) by Organizations that don t know how to manage massive volumes of data are overwhelmed by it. But the opportunity exists, with the right technology platform, to analyze almost all of the data to gain better insights. Variety: Variety represents all types of data. With the explosion of sensors, and smart devices, as well as social collaboration technologies, data in an enterprise has become complex, because it includes not only traditional relational data, but also raw, semistructured, and unstructured data. To capitalize on the Big Data opportunity, enterprises must be able to analyze all types of data, both relational and nonrelational: text, sensor data, audio, video, transactional, and more. Velocity: A conventional understanding of velocity typically considers how quickly the data is arriving and stored, and its associated rates of retrieval. However, today the term velocity is defined to data in motion: the speed at which the data is flowing. More and more of the data being produced today have a very short shelflife, so organizations must be able to analyze this data in near real time if they hope to find insights in this data. There are two types of big data: data at rest (e.g. collection of what has streamed, web logs, s, social media, unstructured documents and structured data from disparate system) and data in motion (e.g. twitter/facebook comments, stock market data and sensor data). So dealing effectively with Big Data requires performing analytics against the volume and variety of data while it is still in motion, not just after it is at rest. B. Big Data Analytics Big data analytics is the application of advanced analytic techniques to very big data sets. Advanced analytics is a collection of techniques and tool types, including predictive analytics, data mining, statistical analysis, complex SQL, data visualization, artificial intelligence, natural language processing, and database methods that support analytics (such as MapReduce, in-database analytics, in-memory database, columnar data stores). There are three approaches for big data analytics: direct analytics over MPP DW, indirect analytics over Hadoop and direct analytics over Hadoop. Direct analytics over MPP DW: The first approach for big data analytics is using a BI tool directly over any of the MPP DW. For any analytical request by the user the BI tool will send SQL queries to these DWs. These DWs will execute the queries in a parallel manner across the cluster and return the data to BI tool for further analytics. Indirect analytics over Hadoop: The second approach is indirect analytics over Hadoop which processes, transforms and structures the data inside Hadoop and then exports the structured data into RDBMS. The BI tool will work with the RDBMS to provide the analytics. Direct analytics over Hadoop: The last approach is performing analytics directly over Hadoop. In this case all the queries will be executed as MR jobs over big unstructured data placed into Hadoop. C. Hadoop and MapReduce Framework Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software s ability to detect and handle failures at the application layer. Hadoop enables a computing solution that is: Scalable: New nodes can be added as needed and added without needing to change data formats, how data is loaded, how jobs are written, or the applications on top. Cost effective: Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all data. Flexible: Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide. Fault tolerant: When a node fails, the system redirects
4 work to another location of the data and continues processing. A MapReduce framework typically divides the input dataset into independent tasks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the jobs are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and reexecuting the failed tasks [15]. Hadoop is supplemented by an ecosystem of Apache projects, such as Pig and Hive, that extend the value of Hadoop and improves its usability. Figure 1 shows mapreduce data flow with multiple reduce tasks. Fig. 1. MapReduce Data Flow with Multiple Reduce Tasks D. High Level Query Languages There are three high level query languages for big data analytics: apache pig, apache hive and jaql. Apache Pig Apache Pig [4] is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig is made up of two components: the first is the language itself, which is called PigLatin, and the second is a runtime environment where PigLatin programs are executed. The Pig runtime environment translates the program into a set of map and reduce tasks and runs them. This greatly simplifies the work associated with the analysis of large amounts of data and lets the developer focus on the analysis of the data rather than on the individual map and reduce tasks. Apache Hive Apache Hive [2] is an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language - HiveQL, which are compiled into mapreduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug in custom map-reduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. Hive also includes a system catalog - Metastore that contains schemas and statistics, which are useful in data exploration, query optimization and query compilation. Jaql Jaql [7] is a functional, declarative query language that is designed to process large data sets. For parallelism, Jaql rewrites high-level queries into low-level queries consisting of MapReduce jobs. Jaql is primarily a query language for JavaScript Object Notation (JSON). JSON is the popular data interchange format because it is easy for humans to read, and because of its structure, it is easy for applications to parse or generate. Both Jaql and JSON are record-oriented models, and thus fit together perfectly. JSON is not the only format that Jaql supports, Jaql is extremely flexible and can support many semistructured data sources such as XML, CSV, flat files and more. E. Gluster File System GlusterFS is a scalable open source clustered file system that offers a global namespace, distributed front end, and scales to hundreds of petabytes without difficulty. It is also a software-only, highly available, scalable, centrally managed storage pool for unstructured data. It is also scale-out file storage software for NAS, object, big data. By leveraging commodity hardware, Gluster also offers extraordinary cost advantages benefits that are unmatched in the industry. There are many advantages of Gluster over any other file systems. These advantages are: It is faster for each individual operation because it calculates metadata using algorithms and that approach is faster than retrieving metadata from any storage media. It is faster for large and growing individual systems because there is never any contention for any single instance of metadata stored at only one location. It is faster and achieves true linear scaling for distributed deployments because each node is independent in its algorithmic handling of its own metadata, eliminating the need to synchronize metadata. It is safer in distributed deployments because it eliminates all scenarios of risk which are derived from out-of-synch metadata [5]. Fig. 2. Gluster File System Architecture Both performance and capacity can be scaled out linearly in Gluster by employing three fundamental techniques:
5 The elimination of metadata Effective distribution of data to achieve scalability and reliability The use of parallelism to maximize performance via a fully distributed architecture Figure 2 describes the Gluster file system architecture. F. Big Data Platform Big data platform cannot just be a platform for processing data; it has to be a platform for analyzing that data to extract insight from an immense volume, variety, and velocity of that data. The main components in the big data platform provide: Deep analytics: a fully parallel, extensive and extensible toolbox full of advanced and novel statistical and data mining capabilities High agility: the ability to create temporary analytics environments in an end-user driven, yet secure and scalable environment to deliver new and novel insights to the operational business Massive scalability: the ability to scale analytics and sandboxes to previously unknown scales while leveraging previously untapped data potential Low latency: the ability to instantly act based on these advanced analytics in the operational, production environments [14]. a runtime component for Hadoop. The Jaql compiler automatically rewrites Jaql scripts so they can run in parallel on Hadoop. Processing layer: The jobtracker coordinates all these MapReduce jobs by scheduling tasks to run on tasktrackers. The tasktrackers run map tasks and reduce tasks. Interface layer: The file system function calls flow from MapReduce jobs to the Gluster java library through the FUSE mount. These file system calls are translated into POSIX file system calls. Storage layer: The Gluster storage pool is a trusted network of storage servers which consist of one or more bricks. A brick is the GlusterFS basic unit of storage, represented by an export directory. III.PROPOSED BIG DATA PLATFORM The proposed big data platform performs large-scale data analysis by using MapReduce framework on unstructured data stored in GlusterFS over distributed scale-out storage system. GlusterFS can provide these features: scalability to petabytes and beyond, affordability (use of commodity hardware), flexibility (deploy in any environment), linearly scalable performance, high availability, and superior storage economics. By combining these advantages of GlusterFS with parallel data processing, schema free processing and simplicity of MapReduce programming model, the proposed big data platform can perform large scale data analysis efficiently and effectively. The proposed big data platform is shown in Figure 3. The proposed big data platform consists of four layers: application layer, processing layer, interface layer and storage layer. Application layer: Multiple GlusterFS clients use high level query languages such as hive, pig, and jaql to submit analytical jobs. These jobs are compiled into MapReduce jobs. Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. In Hive, all commands and queries go to the Driver, which compiles the input, optimize the computation required, and executes the required steps, usually with MapReduce jobs. The Metastore is a separate relational database where Hive persists table schemas and other system metadata. Jaql consists of a scripting language and compiler, as well as Fig. 3. Proposed Big Data Platform A. Big Data Analytics on Proposed Platform The proposed platform can handle any data type such as call data records, web clickstreams, network logs, and so on. Hadoop MapReduce processes these data that are stored in GlusterFS on commodity servers to extract useful information for users. Users can use high level query languages such as hive, pig and jaql to get analytical results. Figure 4 describes the conceptual architecture of big data analytics on proposed platform.
6 IV.PERFORMANCE EVALUATION Fig. 4. Conceptual Architecture of Big Data Analytics on the Proposed Platform B. Gluster File System Server Volumes There are seven types of GlusterFS server volumes. These are: Distributed volume: It randomly distributes files throughout the bricks in the volume. It can be used in environments where the requirement is to scale storage and the redundancy is either not important or is provided by other hardware or software layers. Replicated volume: It creates copies of files across multiple bricks in the volume. It can be used in environments where high-availability and highreliability are critical. Striped volume: It stripes data across bricks in the volume. For best results, it should be used only in high concurrency environments accessing very large files. Distributed Replicated volume: It distributes files across replicated bricks in the volume. It can be used in environments where the requirement is to scale storage and high-reliability is critical. Distributed Striped volume: It stripes files across two or more nodes in the cluster. It should be used in environments where the requirement is to scale storage and in high concurrency environments accessing very large files is critical. Striped Replicated volume: It stripes data across replicated bricks in the cluster. It should be used in highly concurrent environments where there is parallel access of very large files and performance is critical. Distributed Striped Replicated Volume: It distributes striped data across replicated bricks in the cluster. It should be used in highly concurrent environments where there is parallel access of very large files and performance is critical. Enhanced hadoop gluster connector can support MapReduce workloads on all these volume types. To achieve linear scalability and high performance for big data analytics, striped replicated volume and distributed striped replicated volume are the best storage options. We have evaluated the performance of two big data platforms on three commodity Linux clusters first cluster with two virtual machines (testbed 1), second cluster with three virtual machines (testbed 2), and third cluster with four virtual machines (testbed 3). The VMs are interconnected via a 1-gigabit Ethernet. The host machine runs Windows 7 Ultimate and has Intel Core i7-3.40ghz processor, 4GB physical memory, and 950-GB disk. As software components, Hadoop , Gluster 3.4.0, Hive 0.9.0, Pig and Jaql are used. Table 1 shows the experimental setup to evaluate the query performance of two big data platforms. TABLE I. EXPERIMENTAL SETUP FOR PERFORMANCE EVALUATIONS TABLE II. SPECIFICATION OF TESTBED 1 The parameters of testbed 1, testbed 2, and testbed 3 are shown in Table 2, Table 3 and Table 4 respectively.
7 TABLE III. SPECIFICATION OF TESTBED 2 Striped replicated volume is used for storage in testbed 1 and testbed 3 and distributed striped replicated volume is used in testbed 2. To create a striped replicated volume in testbed 1: # gluster volume create populationvolume stripe 2 replica 2 server1:/exp1 server2:/exp2 server1:/exp3 server2:/exp4 The striped replicated volume used in testbed 1 is shown in Figure 5. TABLE IV. SPECIFICATION OF TESTBED 3 Fig. 5. Striped Replicated Volume for Testbed 1 To create a distributed striped replicated volume in testbed 2: # gluster volume create populationvolume stripe 2 replica 2 server1:/exp1 server2:/exp2 server3:/exp3 server1:/exp4 server2:/exp5 server3:/exp6 server1:/exp7 server2:/exp8 The distributed striped replicated volume used in testbed 2 is described in Figure 6. US census dataset [16] is used to evaluate the performance of two big data platforms. The data set consists of 331 tables. Population table is used to evaluate the query performance of two big data platforms. It consists of records for 52 states (50 US states, the District of Columbia, and Puerto Rico). Table 5 describes the data dictionary of population table. TABLE V. DATA DICTIONARY OF POPULATION TABLE Fig. 6. Distributed Striped Replicated Volume for Testbed 2 To create a striped replicated volume in testbed 3: # gluster volume create populationvolume stripe 2 replica 2 server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4 The striped replicated volume used in testbed 3 is shown in Figure 7. Fig. 7. Striped Replicated Volume for Testbed 3
8 A. Sample Analytical Workloads Four queries are used as sample analytical workloads for performance evaluation of two big data platforms. Figure 8 shows HiveQL, PigLatin, and Jaql for the first query. The HiveQL (Hive Query Language) is hive> create table population (ID int, FILEID string, STUSAB string, CHARITER string, CIFSN string, LOGRECNO string, P int) row format delimited fields terminated by '\,' stored as textfile; hive> load data inpath '/user/root/population.csv' overwrite into table population; hive> select * from population where P > 30000; The PigLatin is grunt> population = load '/user/root/population.csv' using PigStorage(',') as (ID: int, FILEID: chararray, STUSAB: chararray, CHARITER: chararray, CIFSN: chararray, LOGRECNO: chararray, P : int); grunt> result = filter population by P > 30000; grunt> dump result; The Jaql is jaql> $population= read(del("/user/root/population.csv", { schema: schema { ID: long, FILEID: string, STUSAB: string, CHARITER: string, CIFSN: string, LOGRECNO: string, P : long} })); jaql> $population -> filter $.P > 30000; Fig. 8. HiveQL, PigLatin, and Jaql for the First Query The first section in figure 8 shows HiveQL of creating a population table, loading the file into the table, and finding the records where population is greater than The second section describes a pig program that takes a file composed of population data, selects only those records whose population is greater than 30000, and displays the result. The last section shows a Jaql sample that finds the records where population is greater than Figure 9 illustrates HiveQL, PigLatin, and Jaql for the second query. The HiveQL is hive> select count(*) from population; The PigLatin is grunt> grouped = group population all; grunt>result = foreach grouped generate COUNT (population); grunt> dump result; The Jaql is Jaql> $population -> group into count($); Fig. 9. HiveQL, PigLatin, and Jaql for the Second Query The first and last sections in figure 9 show queries to find the number of records in population table using Hive and Jaql respectively. The second section illustrates a pig program that groups the population, and displays the number of records in that group. Figure 10 describes HiveQL, PigLatin, and Jaql for the third query. The HiveQL is hive> select STUSAB, sum(p ) from population group by STUSAB; The PigLatin is grunt> grouped = group population by STUSAB; grunt> result = foreach grouped generate group, SUM(population.P ); grunt> dump result; The Jaql is jaql>$population -> group by $STUSAB={$.STUSAB} into {$STUSAB, total: sum($[*].p )}; Fig. 10. HiveQL, PigLatin, and Jaql for the Third Query The first section in figure 10 shows HiveQL of finding the total population for each state. The second section describes a pig program that groups the population by the state, and displays the sum of the number of population for each state. The last section shows a Jaql example that finds the total population for each state. Figure 11 demonstrates HiveQL, PigLatin, and Jaql for the fourth query. The HiveQL is hive> select STUSAB, count(*) from population group by STUSAB; The PigLatin is grunt> grouped = group population by STUSAB; grunt> result = foreach grouped generate group, COUNT(population); grunt> dump result; The Jaql is jaql>$population -> group by $STUSAB={$.STUSAB} into {$STUSAB, count: count($)}; Fig. 11. HiveQL, PigLatin, and Jaql for the Fourth Query The first and last sections in figure 11 show queries to find the number of records for each state using Hive and Jaql respectively. The second section illustrates a pig program that groups the population by the state, and displays the number of records for each state. B. Experimental Results In this paper, the Hadoop big data platform (MapReduce and HDFS) and the proposed big data platform (MapReduce and GlusterFS) are implemented on three testbeds and performance evaluations are conducted with four queries on different record sizes. Figure 12 shows the query execution time for query 1 on testbed 1 for various states. The data sizes range from records for 10 states to records for 52 states. Hive provides the fastest query execution time and Pig provides the slowest query execution time on both platforms. There are no significant differences in query execution time between the two platforms. Fig. 12. Query 1 s Execution Time on Testbed 1
9 Figure 13 displays the query execution time for query 2 on testbed 1 for various states. Between 10 states and 52 states Pig s query execution time and Jaql s query execution time fluctuate on Hadoop platform. The proposed platform provides more stable query execution time than the Hadoop platform. provides slightly faster query execution time in three query languages than the Hadoop platform. Figure 17 displays the query execution time for query 2 on testbed 2 for various states. Although there are significant differences in Pig s query execution time and Jaql s query execution time, Hive s query execution time has slight gap between the two platforms. Fig. 13. Query 2 s Execution Time on Testbed 1 Figure 14 illustrates the query 3 s execution time on testbed 1, measured in seconds over a range from records for 10 states to records for 52 states. There is a greater difference in Pig s query execution time between the two platforms. Hive s query execution time and Jaql s query exection time have no significant differences between the two platforms. Fig. 17. Query 2 s Execution Time on Testbed 2 Figure 18 shows the query execution time for query 3 on testbed 2 for various states. Between 10 states and 52 states Pig s query execution time fluctuates dramatically, hitting a peak of over 110 seconds on Hadoop platform. The proposed platform provides more stable query execution time than the Hadoop platform. Fig. 14. Query 3 s Execution Time on Testbed 1 According to Figure 12 to 15, the proposed platform provides the faster query execution time than the Hadoop platform in three query languages. Fig. 18. Query 3 s Execution Time on Testbed 2 Fig. 15. Query 4 s Execution Time on Testbed 1 Fig. 19. Query 4 s Execution Time on Testbed 2 Figure 19 illustrates the query execution time for query 4 on testbed 2 for various states. According to Figure 19, there is fluctuation in Pig s query execution time on the Hadoop platform and Pig gives significant difference in query execution time between the two platforms. Fig. 16. Query 1 s Execution Time on Testbed 2 The query execution time for query 1 on testbed 2 for various states is plotted in Figure 16. The proposed platform Fig. 20. Query 1 s Execution Time on Testbed 3
10 The query execution time for query 1 on testbed 3 for various states is shown in Figure 20. There is wild fluctuation in Pig s query execution time on the Hadoop platform, but the trend is upward. Hive s query execution time and Jaql s query exection time have slight differences between the two platforms. Figure 21 describes the query execution time for query 2 on testbed 3 for various states. The proposed platform provides faster query execution time in three query languages than the Hadoop platform. Therefore, the proposed big data platform can support large scale data analysis efficiently and effectively. Table 6 describes the comparisons of two big data platforms from various aspects. TABLE VI. COMPARISON OF TWO BIG DATA PLATFORMS V.VENDOR PRODUCTS FOR BIG DATA ANALYTICS Fig. 21. Query 2 s Execution Time on Testbed 3 Fig. 22. Query 3 s Execution Time on Testbed 3 Figure 22 displays the query execution time for query 3 on testbed 3 for various states. The most striking feature is that Pig s query execution time fluctuates on the Hadoop platform from 10 states to 52 states. There are significant differences in query execution time between the two platforms. The query execution time for query 4 on testbed 3 for various states is described in Figure 23. Pig and Jaql have the greater differences in query execution time between the two platforms. The proposed platform provides faster query execution time in three query languages than the Hadoop platform. There are many vendor products to consider for big data analytics. In this paper, we discuss two products IBM big data platform and Splunk. IBM [3] offers a platform for big data including IBM InfoSphere Biginsights and IBM InfoSphere Streams. IBM InfoSphere Biginsights represents a fast, robust, and easy-to-use platform for analytics on Big Data at rest. IBM InfoSphere Streams is a powerful analytic computing platform that delivers a platform for analyzing data in real time with micro-latency. Splunk [17] is a generalpurpose search, analysis and reporting engine for time-series text data, typically machine data. It provides an approach to machine data processing on a large scale, based on the MapReduce model. Table 5 describes the comparisons of proposed platform with vendor products. TABLE VII. COMPARISON OF PROPOSED PLATFORM WITH VENDOR PRODUCTS VI.CONCLUSION Fig. 23. Query 4 s Execution Time on Testbed 3 As a result of experiments, we can conclude that Hive provides the fastest query execution time and Pig provides the slowest query execution time on both platforms. However, Pig and Jaql have the greater differences in query execution time between the two platforms. Experimental results prove that three query languages can provide faster query execution time on the proposed platform than the Hadoop platform. Big data is a growing problem for corporations as a result of sheer data volume along with radical changes in the types of data being stored and analyzed, and its characteristics. The main challenges of big data analytics are performance, scalability and fault tolerance. To address these challenges, many vendors have developed big data platforms. In this paper, a big data platform for large scale data analysis by using Hadoop MapReduce framework and GlusterFS over scale-out storage system is proposed. The proposed big data platform solves volume and variety issues of big data and only supports batch processing. Therefore it is necessary to address
11 velocity issue of big data and to support real-time processing. A solution to this can be achieved by adding Complex Event Processing (CEP) techniques to the proposed platform. In addition, the proposed platform does not consider visualization aspects and this can be solved by using visualization tools on the proposed platform. Moreover, developing the proposed big data platform requires downloading, configuring, and testing the individual open source projects such as Hadoop, GlusterFS, Pig, Hive and Jaql. The proposed platform should be deployed on Amazon Elastic Compute Cloud (EC2) instances to support cloud computing infrastructure. REFERENCES [1] A. Pavlo, E. Paulson and A. Rasin, A Comparison of Approaches to Large-Scale Data Analysis, in Proceedings of the 35 th SIGMOD International Conference on Management of Data, ACM, [2] A. Thusoo, et al., "Hive - a petabyte scale data warehouse using Hadoop", in Proceedings of the IEEE 26th International Conference on Data Engineering, ICDE'10, pp , [3] C. Eaton, T. Deutsch, D. Deroos, G. Lapis and P. Zikopoulos, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill, [4] C.Olston, B.Reed, U.Srivastava, R.Kumar, and A.Tomkins, Pig latin: a not-so-foreign language for data processing, in Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages ACM, [5] Gluster, Gluster File System Architecture, August [6] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, vol. 51, no. 1, pp , January [7] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.- C. Kanne, F. Özcan, and E. J. Shekita, Jaql: A scripting language for large scale semistructured data analysis, PVLDB, 4(12): , [8] K. Shvachko, et al., "The Hadoop Distributed File System", Proc. IEEE 26th Symposium on Mass Storage Systems and Technologies, MSST'10, pp. 1-10, [9] P. Carter, Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO, IDC, September [10] P. Russom, Big Data Analytics, TDWI best practices report, fourth quarter [11] S. Agrawal, The Next Generation of Hadoop Map-Reduce,Apache Hadoop Summit India [12] T. White, Hadoop: The Definitive Guide, O'Reilly and Yahoo! Press, [13] [14] possible [15] [16] [17] [18] HDFS Architecture Guide, design.html. [19] Languages/Comparison_Of_High_Level_Data_Query_Languages.pdf
Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing
Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing Wayne W. Eckerson Director of Research, TechTarget Founder, BI Leadership Forum Business Analytics
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Manifest for Big Data Pig, Hive & Jaql
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
BIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
Big Data and Hadoop with components like Flume, Pig, Hive and Jaql
Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.
In-Memory Analytics for Big Data
In-Memory Analytics for Big Data Game-changing technology for faster, better insights WHITE PAPER SAS White Paper Table of Contents Introduction: A New Breed of Analytics... 1 SAS In-Memory Overview...
Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 7, July 2014, pg.759
Internals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
Navigating the Big Data infrastructure layer Helena Schwenk
mwd a d v i s o r s Navigating the Big Data infrastructure layer Helena Schwenk A special report prepared for Actuate May 2013 This report is the second in a series of four and focuses principally on explaining
Big Data Too Big To Ignore
Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction
Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
Hadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer
Data Refinery with Big Data Aspects
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data
Luncheon Webinar Series May 13, 2013
Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK OVERVIEW ON BIG DATA SYSTEMATIC TOOLS MR. SACHIN D. CHAVHAN 1, PROF. S. A. BHURA
Implement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
Big Data Weather Analytics Using Hadoop
Big Data Weather Analytics Using Hadoop Veershetty Dagade #1 Mahesh Lagali #2 Supriya Avadhani #3 Priya Kalekar #4 Professor, Computer science and Engineering Department, Jain College of Engineering, Belgaum,
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
Hadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
Big Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
BIG DATA APPLIANCES. July 23, TDWI. R Sathyanarayana. Enterprise Information Management & Analytics Practice EMC Consulting
BIG DATA APPLIANCES July 23, TDWI R Sathyanarayana Enterprise Information Management & Analytics Practice EMC Consulting 1 Big data are datasets that grow so large that they become awkward to work with
W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
American International Journal of Research in Science, Technology, Engineering & Mathematics
American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629
Big Data and Its Impact on the Data Warehousing Architecture
Big Data and Its Impact on the Data Warehousing Architecture Sponsored by SAP Speaker: Wayne Eckerson, Director of Research, TechTarget Wayne Eckerson: Hi my name is Wayne Eckerson, I am Director of Research
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
Big Data Explained. An introduction to Big Data Science.
Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
Enhancing Massive Data Analytics with the Hadoop Ecosystem
www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3, Issue 11 November, 2014 Page No. 9061-9065 Enhancing Massive Data Analytics with the Hadoop Ecosystem Misha
IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look
IBM BigInsights Has Potential If It Lives Up To Its Promise By Prakash Sukumar, Principal Consultant at iolap, Inc. IBM released Hadoop-based InfoSphere BigInsights in May 2013. There are already Hadoop-based
Microsoft Big Data. Solution Brief
Microsoft Big Data Solution Brief Contents Introduction... 2 The Microsoft Big Data Solution... 3 Key Benefits... 3 Immersive Insight, Wherever You Are... 3 Connecting with the World s Data... 3 Any Data,
Fault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ
End to End Solution to Accelerate Data Warehouse Optimization Franco Flore Alliance Sales Director - APJ Big Data Is Driving Key Business Initiatives Increase profitability, innovation, customer satisfaction,
Keywords Big Data, NoSQL, Relational Databases, Decision Making using Big Data, Hadoop
Volume 4, Issue 1, January 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Transitioning
Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
Apache Hadoop: The Big Data Refinery
Architecting the Future of Big Data Whitepaper Apache Hadoop: The Big Data Refinery Introduction Big data has become an extremely popular term, due to the well-documented explosion in the amount of data
<Insert Picture Here> Big Data
Big Data Kevin Kalmbach Principal Sales Consultant, Public Sector Engineered Systems Program Agenda What is Big Data and why it is important? What is your Big
Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science
A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org
Advanced In-Database Analytics
Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??
Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
Are You Ready for Big Data?
Are You Ready for Big Data? Jim Gallo National Director, Business Analytics February 11, 2013 Agenda What is Big Data? How do you leverage Big Data in your company? How do you prepare for a Big Data initiative?
An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
IBM Big Data Platform
Mike Winer IBM Information Management IBM Big Data Platform The big data opportunity Extracting insight from an immense volume, variety and velocity of data, in a timely and cost-effective manner. Variety:
Getting Started with Hadoop. Raanan Dagan Paul Tibaldi
Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop
An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov
An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research
Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard
Hadoop and Relational base The Best of Both Worlds for Analytics Greg Battas Hewlett Packard The Evolution of Analytics Mainframe EDW Proprietary MPP Unix SMP MPP Appliance Hadoop? Questions Is Hadoop
Tap into Hadoop and Other No SQL Sources
Tap into Hadoop and Other No SQL Sources Presented by: Trishla Maru What is Big Data really? The Three Vs of Big Data According to Gartner Volume Volume Orders of magnitude bigger than conventional data
Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
CIO Guide How to Use Hadoop with Your SAP Software Landscape
SAP Solutions CIO Guide How to Use with Your SAP Software Landscape February 2013 Table of Contents 3 Executive Summary 4 Introduction and Scope 6 Big Data: A Definition A Conventional Disk-Based RDBMs
Approaches for parallel data loading and data querying
78 Approaches for parallel data loading and data querying Approaches for parallel data loading and data querying Vlad DIACONITA The Bucharest Academy of Economic Studies [email protected] This paper
Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014
Forecast of Big Data Trends Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014 Big Data transforms Business 2 Data created every minute Source http://mashable.com/2012/06/22/data-created-every-minute/
Oracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
CitusDB Architecture for Real-Time Big Data
CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing
Executive Summary... 2 Introduction... 3. Defining Big Data... 3. The Importance of Big Data... 4 Building a Big Data Platform...
Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure Requirements... 5 Solution Spectrum... 6 Oracle s Big Data
Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum
Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All
Click Stream Data Analysis Using Hadoop
Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna
<Insert Picture Here> Oracle and/or Hadoop And what you need to know
Oracle and/or Hadoop And what you need to know Jean-Pierre Dijcks Data Warehouse Product Management Agenda Business Context An overview of Hadoop and/or MapReduce Choices, choices,
AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW
AGENDA What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story Hadoop PDW Our BIG DATA Roadmap BIG DATA? Volume 59% growth in annual WW information 1.2M Zetabytes (10 21 bytes) this
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel
A Next-Generation Analytics Ecosystem for Big Data Colin White, BI Research September 2012 Sponsored by ParAccel BIG DATA IS BIG NEWS The value of big data lies in the business analytics that can be generated
Using Big Data for Smarter Decision Making. Colin White, BI Research July 2011 Sponsored by IBM
Using Big Data for Smarter Decision Making Colin White, BI Research July 2011 Sponsored by IBM USING BIG DATA FOR SMARTER DECISION MAKING To increase competitiveness, 83% of CIOs have visionary plans that
Application Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld
Tapping into Hadoop and NoSQL Data Sources in MicroStrategy Presented by: Trishla Maru Agenda Big Data Overview All About Hadoop What is Hadoop? How does MicroStrategy connects to Hadoop? Customer Case
Virtualizing Apache Hadoop. June, 2012
June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING
Agile Business Intelligence Data Lake Architecture
Agile Business Intelligence Data Lake Architecture TABLE OF CONTENTS Introduction... 2 Data Lake Architecture... 2 Step 1 Extract From Source Data... 5 Step 2 Register And Catalogue Data Sets... 5 Step
Accelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata
BIG DATA: FROM HYPE TO REALITY Leandro Ruiz Presales Partner for C&LA Teradata Evolution in The Use of Information Action s ACTIVATING MAKE it happen! Insights OPERATIONALIZING WHAT IS happening now? PREDICTING
Scala Storage Scale-Out Clustered Storage White Paper
White Paper Scala Storage Scale-Out Clustered Storage White Paper Chapter 1 Introduction... 3 Capacity - Explosive Growth of Unstructured Data... 3 Performance - Cluster Computing... 3 Chapter 2 Current
Exploiting Data at Rest and Data in Motion with a Big Data Platform
Exploiting Data at Rest and Data in Motion with a Big Data Platform Sarah Brader, [email protected] What is Big Data? Where does it come from? 12+ TBs of tweet data every day 30 billion RFID tags
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
International Journal of Innovative Research in Computer and Communication Engineering
FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an
Using distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014
5 Keys to Unlocking the Big Data Analytics Puzzle Anurag Tandon Director, Product Marketing March 26, 2014 1 A Little About Us A global footprint. A proven innovator. A leader in enterprise analytics for
NoSQL for SQL Professionals William McKnight
NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to
Scalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer [email protected] Assistants: Henri Terho and Antti
Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges
Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Prerita Gupta Research Scholar, DAV College, Chandigarh Dr. Harmunish Taneja Department of Computer Science and
ANALYTICS BUILT FOR INTERNET OF THINGS
ANALYTICS BUILT FOR INTERNET OF THINGS Big Data Reporting is Out, Actionable Insights are In In recent years, it has become clear that data in itself has little relevance, it is the analysis of it that
Big Data and Apache Hadoop Adoption:
Expert Reference Series of White Papers Big Data and Apache Hadoop Adoption: Key Challenges and Rewards 1-800-COURSES www.globalknowledge.com Big Data and Apache Hadoop Adoption: Key Challenges and Rewards
Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc. 2011 2014. All Rights Reserved
Hortonworks & SAS Analytics everywhere. Page 1 A change in focus. A shift in Advertising From mass branding A shift in Financial Services From Educated Investing A shift in Healthcare From mass treatment
