CACSS: TOWARDS A GENERIC CLOUD STORAGE SERVICE
Yang Li, Li Guo and Yike Guo*
Department of Computing, Imperial College London, UK
{yl4709,liguo,yg}@doc.ic.ac.uk
* Please direct your enquiries to the corresponding author, Professor Yike Guo.

Keywords: Cloud Computing, Cloud Storage, Amazon S3, CACSS

Abstract: The advent of the cloud era has yielded new ways of storing, accessing and managing data. Cloud storage services enable the storage of data in an inexpensive, secure, fast, reliable and highly scalable manner over the internet. Although giant providers such as Amazon and Google have made a great success of their services, many enterprises and scientists are still unable to make the transition into the cloud environment due to often insurmountable issues of privacy, data protection and vendor lock-in. These issues demand that it be possible for anyone to set up or build their own storage solutions, independent of commercially available services. However, the question persists as to how to provide an effective cloud storage service with regard to system architecture, resource management mechanisms, data reliability and durability, as well as proper pricing models. The aim of this research is to present an in-depth understanding and analysis of the key features of generic cloud storage services, and of how such services should be constructed and provided. This is achieved through the demonstration of design rationales and the implementation details of a real cloud storage system (CACSS). The method by which different technologies can be combined to provide a single high-performance, highly scalable and reliable cloud storage system is also detailed. This research serves as a knowledge source for inexperienced cloud providers, giving them the capability of swiftly setting up their own cloud storage services.

1. INTRODUCTION

For many IT professionals, researchers and computer owners, finding enough storage space to hold data is a real challenge. Some people invest in ever larger external hard drives, storing files locally in order to deal with their data growth. However, protecting against the many unexpected things that can happen to computers or hard drives is not a trivial task. Similar issues exist for a number of large and geographically diverse enterprises: they store and process their rapidly growing data in several data centres and may adopt multiple storage systems. Finding and managing files in this increasingly large sea of data is extremely difficult. Although modern distributed computing systems provide ways of accessing large amounts of computing power and storage, they are not always easy for non-experts to use. There is also a lack of support for user-defined file metadata in these systems. This has led most users to store a lot of data that strongly relates to the file content in relational databases. This separation of the file data and semantic metadata can create issues of scalability and interoperability, and may furthermore result in low performance. Unlike local storage, cloud storage relieves end users of the task of constantly upgrading their storage devices. Cloud storage services enable inexpensive, secure, fast, reliable and highly scalable data storage solutions over the internet. Leading cloud storage vendors, such as (Amazon) and (Google), provide clients with highly available, low-cost, pay-as-you-go cloud storage services with no upfront cost.
Recently, Amazon announced that it currently holds more than 566 billion objects and that it processes more than 370,000 requests per second at peak times (Barr, 2011). Even so, many enterprises and scientists are still unable to shift into the cloud environment due to privacy, data protection and vendor lock-in issues. In addition, the high read/write bandwidths that are demanded by I/O intensive operations, which occur in many
different scenarios, cannot be satisfied by current internet connections. These reasons provide an incentive for organisations to set up or build their own storage solutions, which are independent of commercially available services and meet their individual requirements. However, knowledge of how to provide an effective cloud storage service with regard to system architecture, resource management mechanisms, data reliability and durability, as well as proper pricing models, remains largely untapped. In order to reveal the knowledge behind cloud storage services and thereby offer a generic solution, we present CACSS, a generic computational and adaptive cloud storage system that adapts existing storage technologies to provide efficient and scalable services. Through a demonstration of CACSS, full details can be given of how a proper cloud storage service can be constructed, covering its design rationale, system architecture and implementation. We also show that our system not only supports mainstream cloud storage features such as scalability, sufficient replication and versioning, but also supports large-scale metadata operations such as metadata searching. These features provide a generic basis that enables potential storage service providers, both internet-wide and intranet-wide, to set up their own private S3. Furthermore, CACSS is designed for petabyte scale and can be deployed across multiple data centres in different regions around the world. We describe in detail how we manage the separation of file content and file metadata, including user-defined metadata. We also demonstrate the implementation of multi-region support for this cloud storage system. This paper demonstrates how different technologies can be combined in order to provide a single, highly capable generic solution.

2. PROBLEM ANALYSIS

2.1 Generic Cloud Storage Features

A cloud storage system should be designed and implemented with consideration of several generic features:

High scalability and performance: storage demand has increased exponentially in recent years, and as such it is necessary for cloud storage systems to be able to seamlessly and rapidly scale their storage capability for both file content and file metadata. Traditionally, metadata and file data are managed and stored by the same file system, and most of the time on the same device. In some modern distributed file systems, to improve scalability and performance, metadata is stored and managed separately by one or more metadata servers (Gibson and Van Meter, 2000). However, many of these still suffer from bottlenecks at high concurrent access rates (Carns et al., 2009). The metadata of a petabyte-scale file system could contain billions of records or more, consuming terabytes of space. Therefore, efficient metadata management is crucial for the overall performance of a cloud storage system.

Data durability: a far more common event than hardware failure or a disaster scenario is end-user error, by which data is unintentionally deleted or overwritten. This demands that cloud storage systems have sophisticated replication, versioning and recovery mechanisms to restore data.

Support of various pricing models: the traditional pricing model for software has been a one-time payment for unlimited use. Cloud pricing models, such as pay-as-you-go and pay-monthly, are very similar to the usage charges of utility companies.
In order to accommodate this, it is necessary for cloud storage systems to have an efficient monitoring framework to track all kinds of resource usage, including network data transfer, I/O requests, the amount of data stored (file content and file metadata) and the resources consumed by various computations (a minimal metering sketch is given after this list).

Security models: various security models need to be implemented to ensure that documents and files are accessible at the correct time, from the correct location and by the right person, providing adequate and accurate security control over the data without compromising performance.

Interoperability: no specific standards that enable interoperability between cloud storage vendors currently exist, and this has created problems in moving data from one cloud to another. Similar issues arise in converting existing applications that are built on traditional file systems to the cloud. The ideal cloud storage system must offer a level of abstraction, portability and ease of use that enables storage services to be consumed with minimal support and development overhead.
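As a rough illustration of the kind of metering such a monitoring framework has to feed into these pricing models, the sketch below aggregates the three usage dimensions named above and turns them into a pay-as-you-go bill. It is not part of CACSS; the class, field and parameter names, and the rates passed in, are hypothetical.

```java
// Illustrative only: a pay-as-you-go metering record. The pricing dimensions
// follow the text above; the rates supplied by the caller are hypothetical.
public class UsageMeter {
    private long byteHoursStored;   // file content + metadata, integrated over time
    private long bytesTransferred;  // network data transfer out
    private long requestCount;      // I/O requests

    public void recordStorage(long bytes, long hours) { byteHoursStored += bytes * hours; }

    public void recordRequest(long bytesOut) {
        requestCount++;
        bytesTransferred += bytesOut;
    }

    // Assumed prices: per GB-month stored, per GB transferred, per 1,000 requests.
    public double bill(double perGbMonth, double perGbTransfer, double per1000Requests) {
        final double GB = 1024.0 * 1024 * 1024;
        double gbMonths = byteHoursStored / GB / (24 * 30);
        return gbMonths * perGbMonth
                + (bytesTransferred / GB) * perGbTransfer
                + (requestCount / 1000.0) * per1000Requests;
    }
}
```

In CACSS itself this role is played by the request logging described in Section 4.3 and by the billing records kept by the region controller described in Section 5.1.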
2.2 Cloud Storage for Different Domains

Enterprises and scientists use cloud storage services for various purposes, and their files come in different sizes and formats. Some use cloud storage for large video and audio files, and some use it for storing relatively small files that are large in quantity; the variety and range is vast. These different purposes give rise to a significant diversity in the patterns of access to stored files. The nature of the stored files, in terms of features such as size and format, and the way in which these files are accessed, are the main factors that influence the quality of the cloud storage services eventually delivered to the end users. Some common domains for future cloud storage are as follows:

Computational storage: many applications in science and enterprise are becoming increasingly demanding in terms of their computational and data requirements. Some applications store terabytes of data and carry out intensive I/O operations. Examples include bioinformatics analysis and log processing applications. Improving the overall performance of such applications often demands a cloud storage system that can bring computational power closer to the data. The Amazon Elastic MapReduce service (Amazon) adapts a hosted Hadoop framework running on the Amazon EC2 infrastructure in order to provide an on-demand service for setting up computational tasks and processing data stored on Amazon S3.

Small file storage: some large online e-commerce companies and social networking websites store enormous numbers of small files; these are mostly image files, and their numbers are constantly growing. Every second, some of these files are requested at a high rate by public users. As the metadata of a small file may well occupy more space than the file content itself, highly concurrent access to small files may cause excessive and redundant I/O operations due to metadata lookups (Beaver et al., 2010). This scenario can sometimes create a bottleneck.

Metadata operation intensive storage: metadata is the information that describes data files. Common metadata attributes include the time of an event, the author's name, a GPS location and captions. Many scientists record information about their experimental configuration, such as temperature, humidity and other data attributes; for many, metadata has become an integral part of storage. Accurate identification of and adequate query support for this metadata would bring additional value to the stored files and ensure that analyses and computations are carried out correctly and efficiently. Unfortunately, most storage systems are not capable of efficiently searching file metadata, especially user-defined metadata at large scale. Leading cloud storage providers such as Amazon S3, Cloud Files (Rackspace) and Google Cloud Storage (Google) offer cloud storage services for object data combined with a key-value style of user-defined metadata. This metadata can be retrieved and used by user applications. However, none of these cloud storage providers yet supports any object search or query based on the object metadata. Some research has already been done on the topic of metadata indexing and searching services (Singh et al., 2003, Leung et al., 2009).

The present study addresses only some of the common domains to which a cloud storage system can be applied; indeed, there are more domains yet to be studied. In general, it is not easy to make a system that can work for absolutely any domain.
However, we can learn from the generic features and the challenges that exist in the domains already studied, and eventually build towards a more adaptive and generic cloud storage system that can serve different purposes without too much effort.

3. BACKGROUND

Amazon Simple Storage Service (Amazon S3) is an online storage service that aims to provide reliable and excellent performance at a low cost. However, neither its architecture nor its implementation has yet been made public. As such, it is not available for extension in order to develop the capability of creating private clouds of any size. Amazon S3 is the de facto standard for bucket-object oriented storage services. Successive cloud storage vendors, such as (Rackspace) and (Google), all adopt S3's style of bucket-object oriented interface. This style hides all the complexities of using distributed file systems, and it has proven to be a success (Barr, 2011). It simply allows users to use the storage service from a higher level: an object contains file content and file metadata, and it is associated with a client-assigned
key; a bucket, a basic container for holding objects, together with a key, uniquely identifies an object.

4. CACSS DESIGN

Following the earlier discussion of the key characteristics of a generic private cloud storage system, we now present in detail the design rationale of CACSS. At a conceptual level, the architecture of CACSS has five main components: the access interface, which provides a unique entry point to the whole storage system; the metadata management service, which manages the object metadata; the object operation management service, which handles a wide range of object operation requests; the metadata storage space, which stores all of the object metadata; and the object data storage space, which stores all of the object content data (Figure 1).

Figure 1: CACSS Architecture

4.1 Access Interface

CACSS offers a web-based interface for managing storage space and searching for objects. The current implementation supports Amazon's S3 REST API, the prevailing standard commercial storage cloud interface.

4.2 Identity and Access Management Service

IAM is a separate service that provides authorization and access control over various resources. It offers sub-user and group management and precise permission control over which operations a user can perform and under what conditions such operations can be carried out.

4.3 Metadata Management

To achieve high performance in metadata access and operation, CACSS's object metadata and content are completely separated. Each object's metadata, including system metadata such as size, last modified date and object format, together with user-defined metadata, is stored as a collection of blocks addressed by an index in CACSS's Metadata Storage Space (MSS). MSS keeps all of the collection data sorted lexicographically by index. Each block is akin to a matrix with exactly two columns and an unlimited number of rows. The values of the elements in the first and second columns are block quantifiers and block targets, respectively. All of the block quantifiers have unique values in each block:

Block A = (a_{i,j}), 1 ≤ i ≤ m, 1 ≤ j ≤ 2, such that for any k, s ≤ m with k ≠ s, a_{k,1} ≠ a_{s,1}.

For example, each index maps to such a collection of (block quantifier, block target) pairs.

The Metadata Management Service (MMS) manages the way in which an object's metadata is stored. In such a system, a client consults the CACSS MMS, which is responsible for maintaining the storage system namespace, and then receives the information specifying the location of the file contents. This allows multiple versions of an object to exist. MMS handles requests as follows. First, it checks whether a request contains an access key and a signed secret key. If it does, CACSS consults IAM and MSS to verify whether the user has permission to perform the operation. If they do have permission, the request is authorized to continue; if they do not, error information is returned. If a request does not contain an access key or a signed secret key, MMS is looked up to verify whether the requested bucket or object is set as publicly available to everyone. If it is set as public, then the request continues to the next step. All requests are logged, both successful and failed. The logging data can be used by both the service provider and storage users for billing, analysis and diagnostic purposes. Differing from traditional storage systems that limit the file metadata which can be stored and accessed, MMS makes metadata more adaptive and
comprehensive. Additional data regarding file and user-defined metadata can be added to the metadata storage, and these data can be accessed and used on demand by users or computational tasks at any time. Searching via metadata is another key feature of CACSS.

Buckets: To reduce interoperability issues, CACSS adopts the de facto industry standard of buckets as basic containers for holding objects. Unlike some traditional file systems, in which only a limited number of files can be stored in a directory, there is no limit to the number of objects that can be stored in a CACSS bucket. CACSS has a global namespace: bucket names are unique, and each individual bucket's name is used as the index in MSS. We use various block quantifiers and block targets to store a variety of information, such as the properties of a bucket or an object, permissions and access lists for a particular user, and other user-defined metadata. For example, for a bucket named bucket1, an index bucket1 should exist, which maps to a collection of such property, permission and user-defined metadata entries.

Objects: The index of each object is a string composed of the bucket name together with the assigned object key. As a result of this naming scheme, objects of the same bucket are naturally close together in MSS; this improves the performance of concurrent metadata access to objects of the same bucket. For example, for an object with the user-assigned key object/key.pdf in bucket bucket1, an index of bucket1-object/key.pdf should exist, which maps to a collection holding the object's properties (such as its storage location and format) and permission entries.

Object Versioning: When the versioning setting is enabled for a bucket, each object key is mapped to a core object record. Each core object record holds a list of version IDs that map to individual versions of that object. For example, for an object with a predefined key object/paper.pdf in bucket versionbucket, an index of "versionbucket-object/paper.pdf" should exist, which maps to the core record listing the object's version IDs. Similarly, the object's version record with row key versionbucket-object/paper.pdf-uuid1 maps to collection data such as pp:loc → nfs://cluster1//uuid..., together with pp:type, pp:replicas, the version number, and permission entries such as pm:userid2 → READ.

Object Data Management: CACSS stores all the unstructured data, such as file content, in the Object Data Storage Space (ODSS). ODSS is intentionally designed to provide an adaptive storage infrastructure that can store unlimited amounts of data and does not depend on particular underlying storage devices or file systems. Storage service vendors are able to compose one or multiple types of storage devices or systems together to create their own featured cloud storage system based on their expertise and their requirements in terms of availability, performance, complexity, durability and reliability. Such an implementation could be as simple as NFS (Sandberg et al., 1985), or as sophisticated as HDFS (Borthakur, 2007), PVFS (Carns et al., 2000) or Lustre (Schwan, 2003). CACSS's File Operation Management Service (FOMS) implements all of ODSS's underlying file system APIs, so that it can handle a wide range of file operation requests to the ODSS. FOMS works like an adapter that handles the architectural differences between different storage devices and file systems. It works closely with MMS to maintain the whole namespace of CACSS.
FOMS also possesses the capability of utilising all the available resources by applying various object allocation strategies, based on factors such as object types, sizes and even previous usage patterns.
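To make the adapter role of FOMS concrete, here is a minimal sketch; the interface name, method signatures and location URIs are hypothetical and are not taken from the CACSS implementation.

```java
import java.io.InputStream;

// Hypothetical FOMS-style adapter: one interface, many underlying file systems.
public interface ObjectStore {
    // Writes the content under a generated UUID and returns its location URI,
    // e.g. "hdfs://cluster1/<uuid>" or "nfs://server1/<uuid>", which MMS then
    // records as the object's pp:loc metadata entry.
    String write(String uuid, InputStream content) throws Exception;

    InputStream read(String locationUri) throws Exception;

    void delete(String locationUri) throws Exception;
}
```

An HDFS-backed implementation, an NFS-backed implementation and so on would each translate these calls into their own client API, while the object allocation strategy decides which implementation receives a given object based on its type, size and usage pattern.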
5. IMPLEMENTATION

After considerable research and experimentation, we chose HBase as the foundational MSS storage for all object metadata. HBase is highly scalable and delivers fast random data retrieval. Its column-oriented design confers exceptional flexibility in the storing of data. We chose the Hadoop Distributed File System (HDFS) as the foundational storage technology for storing object data in ODSS. Hadoop also supports the MapReduce framework (Apache), which can be used for executing computation tasks within the storage infrastructure. Although there is a single point of failure at the NameNode in HDFS's original design, much research has been carried out in order to build a highly available version of the HDFS NameNode, such as AvatarNode (Borthakur, 2010). Every file and block in HDFS is represented as an object in the NameNode's memory, each of which occupies about 150 bytes. Therefore the total memory available on the NameNode dictates the limit on the number of files that can be stored in an HDFS cluster; at roughly 150 bytes per object, for example, a NameNode with tens of gigabytes of heap can track on the order of a few hundred million files and blocks. By separating object metadata and object data, CACSS is able to construct an adaptive storage infrastructure that can store unlimited amounts of data using multiple HDFS clusters, whilst still exposing a single logical data store to the users (Figure 3).

5.1 Multi-region support

The design and architecture of CACSS are based on the principles of scalability, performance, data durability and reliability. Scalability is considered in various aspects, including the overall capacity of multi-region file metadata and file storage, as well as the throughput of the system. From an implementation perspective, CACSS consists of a region controller and multiple regions (Figure 2). A Tomcat cluster is used as the application server layer in each region. It is easy to achieve high scalability, load balancing and high availability by using a Tomcat cluster and configuring it with other technologies such as HAProxy and Nginx (Doclo, 2011, Mulesoft). The region controller has a MySQL cluster for storing various data such as user account information and billing and invoice details. A bucket can be created in one of the regions. At the same time, a DNS A record is inserted into the DNS server. This mapping ensures that clients will send a hosted-style access request for the bucket and the object to the correct region. Each region consists of a Tomcat cluster, an HBase cluster and a set of HDFS clusters. The object data is stored in one of the HDFS clusters in the region. The object key and metadata are stored in the region's HBase cluster. Any access to a bucket or object requires access rights to be checked. In CACSS, each request goes through its region first; if the requested bucket or object is set to be public, there is no need to communicate with the region controller. If it is not set as public, the region consults the region controller to perform the permission check before making a response. The region controller, which includes a MySQL cluster, keeps records of all the requests and maintains user accounts and billing information. A DNS system (such as Amazon Route 53 (Amazon)) serves to map the bucket name to the corresponding region's Tomcat cluster IP. The region controller can also connect to an existing IAM service to provide more sophisticated user and group management.

Figure 2: Implementation of CACSS

Figure 3: Implementation of multi-region HDFS clusters for storing buckets and contents of objects
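As a concrete illustration of how the MSS index/block model described in Section 4.3 maps onto HBase, the sketch below (using the HBase 0.9x-era client API) writes one object's metadata row and lists a bucket's objects with a prefix scan. The table name cacss_mss, the column-family names pp and pm, and the example values are assumptions for illustration only; the paper does not publish the exact schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class MssExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "cacss_mss");   // assumed table name

        // Row key = bucket name + object key, so a bucket's objects sort together.
        Put put = new Put(Bytes.toBytes("bucket1-object/key.pdf"));
        put.add(Bytes.toBytes("pp"), Bytes.toBytes("loc"),
                Bytes.toBytes("hdfs://cluster1/0f3c2a7e-uuid"));   // content location
        put.add(Bytes.toBytes("pp"), Bytes.toBytes("size"), Bytes.toBytes("1048576"));
        put.add(Bytes.toBytes("pm"), Bytes.toBytes("userid2"), Bytes.toBytes("READ"));
        table.put(put);

        // "List all objects in bucket1": a lexicographic prefix scan over row keys.
        Scan scan = new Scan(Bytes.toBytes("bucket1-"), Bytes.toBytes("bucket1."));
        ResultScanner results = table.getScanner(scan);
        for (Result r : results) {
            System.out.println(Bytes.toString(r.getRow()));
        }
        results.close();
        table.close();
    }
}
```

Because HBase keeps rows sorted by key, the cost of such a per-bucket listing grows roughly linearly with the number of objects in the bucket, which is consistent with the List All Objects behaviour reported in Section 6.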
CACSS also adopts other useful features of HDFS, such as no explicit limit on the size of a single file and no limit on the number of files in a directory. In CACSS, most objects are stored in a flat structure in HDFS. Each object's file name under HDFS is a generated UUID, which ensures uniqueness. The implementation of CACSS does not need to rely solely on HDFS. The complete separation of file metadata from file content enables CACSS to adapt to one or even multiple file systems, such as GPFS or Lustre. It is now deployed as a service under the IC-Cloud platform (Guo and Guo, 2011), and is expected to work with a variety of distributed file systems through POSIX or their APIs without much effort.

6. EXPERIMENTS

We performed our experiments on top of Amazon EC2 instances, to enable the comparison of CACSS and Amazon S3 under similar hardware and network environments. We used (JetS3t), an open source Java S3 library, configured with our experiment code, to evaluate the performance of CACSS. We used one m2.xlarge instance, with 17.1GB of memory and 6.5 EC2 Compute Units, to run MySQL, the HDFS NameNode, the HBase HMaster and Tomcat with the CACSS application. Three m1.large instances, each with 7.5GB of memory and 4 EC2 Compute Units, ran the HDFS DataNodes and HBase RegionServers. Each of these instances was attached to 100GB of storage volumes. Another two m1.large instances were configured with the same experiment code but different S3 endpoints. We refer to these two instances as the S3 test node and the CACSS test node. To evaluate the performance of CACSS, we ran a series of experiments on both Amazon S3 and CACSS. The performance of Amazon EC2 and S3 has been evaluated previously by (Garfinkel, 2007), and a similar method was adopted here to evaluate the overall throughput of CACSS. Figure 4 and Figure 5 illustrate, respectively, the write and read throughputs from Amazon EC2 to Amazon S3 and from EC2 to CACSS, based on our experiments. Each graph contains traces of observed bandwidths for transactions of 1 Kbyte, 1 Mbyte, 100 Mbyte and 1 Gbyte. Both Amazon S3 and CACSS perform better with larger transaction sizes, because smaller files experience proportionally more transaction overhead. For files larger than 1 Mbyte, the average transaction speed of CACSS is higher than that of Amazon S3; this is probably due to underlying hardware differences between Amazon EC2 and Amazon S3, such as hard drive RPM and RAID levels. Amazon S3's List Objects operation only supports a maximum of 1000 objects to be returned at a time, so we could not properly evaluate its object metadata service performance. However, we were able to run some tests to see how CACSS's metadata management performs. We ran a List All Objects operation after every 1000 Put Object operations. All of the operations were targeted at the same bucket. Each Put Object was done using an empty file, because we were only interested in the performance of metadata access in this experiment. Figure 6 shows a scatter graph of the response time of each Put Object, with respect to the total number of objects in the bucket. The result shows an average response time of … s and a variance of … s for each Put Object operation. This indicates that the response time is essentially constant no matter how many objects are stored in the bucket. Figure 7 illustrates the response time of each List All Objects operation with respect to the total number of objects contained in the bucket.
There are several peaks in the graph, which have been marked with red circles. These peaks were caused by sudden network latency between Amazon EC2 instances at those times. Otherwise, the overall result shows a linear relation between the response time and the total number of objects.

Figure 4: Cumulative Distribution Function (CDF) plots for writing transactions from EC2 to Amazon S3 and CACSS of various sizes.
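The throughput and metadata figures above were collected by driving JetS3t against two different S3 endpoints (Amazon S3 and CACSS). The following is a minimal sketch of such a timing loop against the CACSS endpoint; it targets the 0.8-era JetS3t API, and the endpoint host, bucket name and credentials are placeholders rather than values from the paper.

```java
import java.io.ByteArrayInputStream;
import org.jets3t.service.Jets3tProperties;
import org.jets3t.service.impl.rest.httpclient.RestS3Service;
import org.jets3t.service.model.S3Bucket;
import org.jets3t.service.model.S3Object;
import org.jets3t.service.security.AWSCredentials;

public class PutTimingTest {
    public static void main(String[] args) throws Exception {
        // Point JetS3t at the CACSS endpoint instead of Amazon S3 (placeholder host).
        Jets3tProperties props = Jets3tProperties.getInstance("jets3t.properties");
        props.setProperty("s3service.s3-endpoint", "cacss.example.com");
        props.setProperty("s3service.https-only", "false");

        AWSCredentials creds = new AWSCredentials("ACCESS_KEY", "SECRET_KEY");
        RestS3Service service = new RestS3Service(creds, "cacss-benchmark", null, props);

        S3Bucket bucket = service.getOrCreateBucket("benchmark-bucket");
        byte[] payload = new byte[1024];   // 1 Kbyte transaction, as in the experiments

        for (int i = 0; i < 1000; i++) {
            S3Object object = new S3Object("obj-" + i);
            object.setDataInputStream(new ByteArrayInputStream(payload));
            object.setContentLength(payload.length);
            long start = System.nanoTime();
            service.putObject(bucket, object);
            System.out.println("put " + i + " took " + (System.nanoTime() - start) / 1e6 + " ms");
        }
    }
}
```

Switching the s3service.s3-endpoint property back to s3.amazonaws.com runs the identical loop against Amazon S3, which is what makes the two traces directly comparable.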
7. RELATED WORK AND DISCUSSION

Figure 5: CDF plots for reading transactions from EC2 to Amazon S3 and CACSS of various sizes.

Figure 6: Put Object requests

Figure 7: List All Objects requests

Besides Amazon S3, there have been quite a few efforts in cloud storage services, including the following. Walrus (Nurmi et al., 2009) is a storage service included with Eucalyptus that is interface-compatible with Amazon S3. The open source version of Walrus does not support data replication services. It also does not fully address how file metadata is managed and stored. The OpenStack (Openstack) project has an object storage component called Swift, which is an open source storage system for redundant and scalable object storage. However, it does not support object versioning at present. The metadata of each file is stored in the file's extended attributes in the underlying file system. This could potentially create performance issues with a large number of metadata accesses. pWalrus (Abe and Gibson, 2010) is a storage service layer that integrates parallel file systems into cloud storage and enables data to be accessed through an S3 interface. pWalrus stores most object metadata as file attributes. Access control lists, object content hashes (MD5) and other object metadata are kept in .walrus files. If a huge number of objects are stored under the same bucket, pWalrus may be inefficient in searching files based on certain metadata criteria; this factor can cause bottlenecks in metadata access. Cumulus (Bresnahan et al., 2010) is an open source cloud storage system that implements the S3 interface. It adapts existing storage implementations to provide efficient data access interfaces that are compatible with S3. However, details of metadata organisation and versioning support are not fully addressed. The Hadoop Distributed File System (HDFS) (Borthakur, 2007) is a distributed, reliable, scalable and open source file system, written in Java. HDFS achieves reliability by replicating data blocks and distributing them across multiple machines. HBase (HBase) is an open source, non-relational, versioned, column-oriented distributed database that runs on top of HDFS. It is designed to provide fast, real-time read/write data access. Some research has already been done to evaluate the performance of HBase (Carstoiu et al., 2010; Khetrapal and Ganesh, 2006). Table 1 shows a comparison of the features supported by Amazon S3, Google Cloud Storage and CACSS. Many features are shared, with CACSS having the additional capability of metadata and user-defined data searching.
Table 1: Feature comparison of Amazon S3, Google Cloud Storage and CACSS

Feature | Amazon S3 | Google Cloud Storage | CACSS
Bucket region support | Yes | Yes | Yes
Security control | ACL; access key with signature; policy | ACL; OAuth access | ACL; access key with signature
Large file upload | Multi-part upload | Resumable upload | Multi-part upload
Object immutable | Yes | Yes | Yes
Host static website | Yes | No | Yes
Versioning | Yes | No | Yes
Access logs | Yes | Yes | Yes
Random read support | Yes | Yes | Yes
Random write support | No | No | No
Search support | Object key only | Object key only | Object key, system metadata and user-defined data
Pricing | Storage space, network usage and number of requests | Storage space, network usage and number of requests | N/A
SLA | 99.9% uptime and …% durability | 99.9% uptime | N/A

8. CONCLUSIONS AND FUTURE WORK

We have presented the design and implementation of CACSS, a cloud storage system based on the generic principles of scalability, performance, data durability and reliability. CACSS not only enables users to access their data through the S3 interface, the de facto industry standard, but also provides support for additional features that make the storage service more comprehensive. These features include user-defined metadata and object metadata searching. The storage model we propose offers service providers considerable advantage in combining existing technologies to compose a single customised cloud storage system. Furthermore, CACSS's performance was found to be comparable to that of Amazon S3 in formal tests, with similar read/write capabilities. Although other features were difficult to compare directly, CACSS's performance was highly adequate in terms of Put Object and List All Objects requests. There are several directions for potential future study. In any science or enterprise environment, data experiences rapid growth. Some of these data are very active, in that they need to be retrieved very frequently; conversely, much other data is hardly accessed. In light of this, a potentially useful research direction is the extension of the current work to enable a more application-aware cloud storage service. This would include a data movement framework that enables users or applications to automatically or manually archive inactive data to lower-cost storage space, whilst promoting highly active data to higher-performance storage space. This should raise capacity-on-demand cloud storage services to a higher level, towards performance-on-demand cloud storage services. Further study could also address the issue that current cloud storage systems do not offer much support for in-house data-centric services, such as processing and converting existing stored data on top of the storage infrastructure. This drives users into an inefficient workflow pattern of downloading data from cloud storage, dealing with the data externally, and then uploading the processed data back to cloud storage. Although parallel computing (Kumar, 2002) and MapReduce (Dean and Ghemawat, 2008) provide ways of effectively executing large-scale distributed computational tasks, these implementations are not always easy for non-experts to set up or use. As a possible solution to this lack of in-house services, a cloud storage system could be developed with a framework that enables different storage-oriented services and operations to be plugged in. This framework should implement sophisticated data and computing resource sharing mechanisms in order to allow stored data to be used by those plugged-in services in a controlled and safe manner.
In such a way, it is possible to create a cloud storage environment for a user community. For example, a user can publish an application, describing the specific computational task, in an application repository; another user is then able to run the same task, with little or no knowledge about the computation, simply by specifying which published application to execute.

REFERENCES

ABE, Y. & GIBSON, G. 2010. pWalrus: Towards better integration of parallel file systems into cloud storage. IEEE, 1-7.
AMAZON. Amazon Elastic MapReduce [Online]. Available:
AMAZON. Amazon Simple Storage Service (S3) [Online]. Available:
AMAZON. Route 53 [Online]. Available:
APACHE. Hadoop MapReduce [Online]. Available:
BARR, J. 2011. Available from: n-s3-566-billion-objects requestssecond-and-hiring.html.
BEAVER, D., KUMAR, S., LI, H. C., SOBEL, J. & VAJGEL, P. 2010. Finding a needle in Haystack: Facebook's photo storage. Proc. 9th USENIX OSDI.
BORTHAKUR, D. 2007. The Hadoop Distributed File System: Architecture and design. Hadoop Project Website.
BORTHAKUR, D. 2010. Hadoop AvatarNode high availability. Available from: doop-namenode-high-availability.html.
BRESNAHAN, J., KEAHEY, K., FREEMAN, T. & LABISSONIERE, D. 2010. Cumulus: an open source storage cloud for science. SC10 Poster.
CARNS, P., LANG, S., ROSS, R., VILAYANNUR, M., KUNKEL, J. & LUDWIG, T. 2009. Small file access in parallel file systems. IEEE.
CARNS, P. H., LIGON III, W. B., ROSS, R. B. & THAKUR, R. 2000. PVFS: A parallel file system for Linux clusters. USENIX Association.
CARSTOIU, D., CERNIAN, A. & OLTEANU, A. 2010. Hadoop HBase performance evaluation. International Conference on New Trends in Information Science and Service Science (NISS), May 2010.
DEAN, J. & GHEMAWAT, S. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM, 51.
DOCLO, L. 2011. Clustering Tomcat Servers with High Availability and Disaster Fallback. Available from:
GARFINKEL, S. L. 2007. An evaluation of Amazon's grid computing services: EC2, S3, and SQS. Citeseer.
GIBSON, G. A. & VAN METER, R. 2000. Network attached storage architecture. Communications of the ACM, 43.
GOOGLE. Google Cloud Storage Service [Online]. Available:
GUO, Y.-K. & GUO, L. 2011. IC Cloud: Enabling compositional cloud. International Journal of Automation and Computing, 8.
HBASE. Apache HBase [Online]. Available:
JETS3T. JetS3t [Online]. Available:
KHETRAPAL, A. & GANESH, V. 2006. HBase and Hypertable for large scale distributed storage systems. Dept. of Computer Science, Purdue University.
KUMAR, V. 2002. Introduction to parallel computing. Addison-Wesley Longman Publishing Co., Inc.
LEUNG, A. W., SHAO, M., BISSON, T., PASUPATHY, S. & MILLER, E. L. 2009. Spyglass: fast, scalable metadata search for large-scale storage systems. Proceedings of the 7th Conference on File and Storage Technologies. San Francisco, California: USENIX Association.
MULESOFT. Tomcat Clustering - A Step By Step Guide. Available from:
NURMI, D., WOLSKI, R., GRZEGORCZYK, C., OBERTELLI, G., SOMAN, S., YOUSEFF, L. & ZAGORODNOV, D. 2009. The Eucalyptus open-source cloud-computing system. IEEE.
OPENSTACK. OpenStack [Online]. Available:
RACKSPACE. Cloud Files [Online]. Available:
SANDBERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, D. & LYON, B. 1985. Design and implementation of the Sun network filesystem.
SCHWAN, P. 2003. Lustre: Building a file system for 1000-node clusters.
SINGH, G., BHARATHI, S., CHERVENAK, A., DEELMAN, E., KESSELMAN, C., MANOHAR, M., PATIL, S. & PEARLMAN, L. 2003. A metadata catalog service for data intensive applications. IEEE.
