Managing Data through Big Data: A Review

Harsimran Singh Anand
Assistant Professor, PG Dept of Computer Science & IT, DAV College, Amritsar
Email id: harsimran_anand@yahoo.com

ABSTRACT

Big Data applies to information that can't be processed or analyzed using traditional processes or tools. For decades, companies have been making business decisions based on transactional data stored in relational databases. However, there is a potential treasure of non-traditional data derived from various sources such as social media, emails, online surveys etc. that can be mined for useful information. With the decrease in the cost of storage and computation, it has become possible for enterprises to use this data to reap maximum benefits. This paper aims at presenting an insight into the vast paradigm of Big Data.

Indexed Terms: Big Data, benefits, characteristics, definition, problems, unstructured data

I. INTRODUCTION

We are awash in a flood of data today. In a broad range of application areas, data is being collected at unprecedented scale. According to the IBM Big Data Flood Infographic study, 100 terabytes of data are uploaded daily to Facebook, and this, together with heavy activity on other social networks, leads to an estimate of 35 zettabytes of data generated annually by 2020 [2]. This growth constitutes the Big Data phenomenon: a technological phenomenon brought about by the rapid rate of data growth and by parallel advances in technology, which have given rise to an ecosystem of software and hardware products that enable users to analyse this data and produce new and more granular levels of insight. The term Big Data was first introduced to the computing world by Roger Magoulas of O'Reilly Media in 2005 to describe a great amount of data that traditional data management techniques cannot manage and process because of its complexity and size.
Big Data refers to huge data sets that are orders of magnitude larger (volume); more diverse, including structured, semi-structured, and unstructured data (variety); and arriving faster (velocity) than you or your organization has had to deal with before. The concept behind Big Data is based on the fact that the datasets are so large that typical database systems are not able to store and analyze them. The datasets are large because the data is no longer only traditional structured data, but data from many new sources, including email, social media, and Internet-accessible sensors [3]. In the past few years, Big Data has demonstrated the capacity to make more informed and timely predictions of market trends, save money, boost efficiency and improve decision-making in fields as disparate as traffic control, weather forecasting, disaster prevention, finance, fraud control, education, business transactions, national security, and health care. According to a survey by TCS of 1,217 companies in nine countries in four regions of the world (U.S., Europe, Asia-Pacific and Latin America) in late December 2012 and January 2013, a little more than half (643) said they had undertaken Big Data initiatives in 2012 [1].

IJAFRSE and ICCICT 2015 All Rights Reserved

II. DEFINITION OF BIG DATA

At present, the industry does not have a unified definition of Big Data. Various parties have defined it in differing ways. According to McKinsey, Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse. IDC defines Big Data technologies as a new generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and analysis. According to O'Reilly, Big Data is data that exceeds the processing capacity of conventional database systems: the data is too big, moves too fast, or does not fit the structures of existing database architectures, and to gain value from it there must be an alternative way to process it. According to Wikipedia, Big Data usually includes datasets with sizes beyond the capability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. According to Gartner, Big Data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization.

In a nutshell, Big Data describes volumes of structured and unstructured data so massive that they are very difficult to process using traditional databases and software technologies.

III. CHARACTERISTICS OF BIG DATA

The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44x between 2009 and 2020 [4]. However, volume of data is not the only characteristic that matters.
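As a rough cross-check of the two growth figures quoted above, the short Python sketch below assumes that "40% per year" means annual compounding over the 11 years from 2009 to 2020. That reading is an assumption made here for illustration, not something stated in the cited source, and the variable names are likewise illustrative.

```python
# Assumption: growth compounds once per year over the 11 years from 2009 to 2020.
years = 2020 - 2009
annual_growth = 0.40  # "40% per year", as quoted in the text

# Total growth factor implied by 40% annual growth over the period.
compounded_factor = (1 + annual_growth) ** years

# Annual growth rate implied by a 44x increase over the same period.
implied_rate = 44 ** (1 / years) - 1

print(f"40% per year over {years} years gives about {compounded_factor:.1f}x")
print(f"44x over {years} years implies about {implied_rate:.1%} per year")
```

Under this assumption, 40% annual growth compounds to roughly 40x, and a 44x increase corresponds to roughly 41% per year, so the two quoted figures are broadly consistent with each other.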
Big Data has four main characteristics: Volume, Velocity, Variety, and Value, commonly referred to as the 4Vs, which reference the huge amount of data, fast processing speed, various data types, and low value density.

A. Volume

Volume is synonymous with the word "big" in the term Big Data. Volume is a relative term: some smaller-sized organizations are likely to have mere gigabytes or terabytes of data storage, as opposed to the petabytes or exabytes of data that big global enterprises have. Several factors contribute to the increase in data volume, such as the storage of transactional data, data obtained from live streaming, and data collected from sensors. Twitter alone generates more than 7 terabytes of data every day, Facebook 10 terabytes, and some enterprises generate terabytes of data every hour of every day of the year [5]. Organizations that use the right technology can analyze this data and use it to their maximum benefit.

B. Variety

Variety refers to the number of sources from which data can be obtained. With the explosion of smart devices, sensors and social networking technologies, the data collected by an organization is not just traditional structured data but also semi-structured and unstructured data. Structured data is organized into a relational schema, which can be queried simply to arrive at usable information. Semi-structured data does not conform to an explicit and fixed schema. It is self-describing and contains tags or other markers to enforce hierarchies of records and fields within the data. Examples include weblogs and social media feeds. Finally, unstructured data consists of formats that cannot easily be indexed into relational tables for analysis or querying. Examples include images, audio and video files. To capitalize on the benefits of Big Data, companies need to integrate and analyze data from complex traditional and non-traditional sources of information, including the companies' internal and external data.

C. Velocity

Velocity is conventionally defined as the speed at which data is generated and needs to be handled. More appropriately, velocity refers to the speed at which data is flowing. Across the means of collecting social information, new information is being added to the database at rates ranging from as slow as every hour or so to as fast as thousands of events per second, which has made it impossible for traditional systems to keep up.

D. Value

The economic value of different data varies significantly. Because of its magnitude, Big Data's value density per unit of data is constantly falling. However, the overall value of the data is increasing. There is good information hidden within a larger body of non-traditional data, and the challenge is to identify what is valuable and then to transform and extract that data for analysis.
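To make the three data types described under Variety more concrete, the following Python sketch contrasts them on a tiny scale. All records, field names and values are hypothetical and invented for illustration; they are not drawn from any source cited in this paper.

```python
import json

# Structured: every row follows one fixed, externally declared schema.
SCHEMA = ("customer_id", "order_date", "amount")
rows = [("C-1001", "2015-01-12", 249.99), ("C-1002", "2015-01-13", 18.50)]

# Semi-structured: each record names its own fields, and the fields may differ.
feed = [
    '{"user": "reader_1", "text": "Great service!", "tags": ["review"]}',
    '{"user": "reader_2", "text": "Where is my order?", "geo": {"country": "IN"}}',
]
records = [json.loads(line) for line in feed]
fields_per_record = [sorted(record) for record in records]

# Unstructured: raw bytes (a stand-in for an image file) expose no fields at all.
image_bytes = bytes([0xFF, 0xD8, 0xFF, 0xE0])

print(dict(zip(SCHEMA, rows[0])))
print(fields_per_record)
print(len(image_bytes), "bytes, no field names")
```

The structured rows can be read only against the shared schema, each semi-structured record carries its own (possibly different) set of field names, and the unstructured bytes offer nothing that a relational query could address directly.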
Big Data is even compared to gold and oil, indicating that it contains unlimited commercial value [8].

IV. BENEFITS OF BIG DATA

With unstructured data dominating the world of data, the way to exploit it is only now becoming clear. Information proliferation is playing a vital role in leveraging the opportunities presented by the data. Within an organization, it is quite difficult for business leaders to rely solely on experience (or pure intuition) to make decisions. They need to rely on good data services for their decisions. By placing data at the heart of business operations to provide access to new insights, organizations will be able to compete more effectively.

The industry opportunities presented by the plethora of data are plenty. For years, organizations have captured transactional structured data and used batch processes to place summaries of the data into traditional relational databases. In recent years, new technologies with lower costs have enabled improvements in data capture, data storage and data analysis. Organizations can now capture more data from many more sources and types (blogs, social media feeds, audio and video files). Decisions that previously were based on guesswork, or on painstakingly constructed models of reality, can now be made based on the data itself. Big Data integrates both structured and unstructured data. The analysis can be done in real time or close to real time, acting on full datasets rather than summarized elements. The underlying cost of the infrastructure to power the analysis of data has fallen dramatically, making it economic to mine the information. Like traditional analytics, it can also support internal business decisions.

The technologies and concepts behind Big Data allow organizations to achieve a variety of objectives. When Big Data is distilled and analyzed in combination with traditional enterprise data, enterprises can develop a more thorough and insightful understanding of their business, which can lead to enhanced productivity, a stronger competitive position and greater innovation, all of which can have a significant impact on the bottom line. The competitive pressure on organizations has increased to the point where most traditional strategies offer only marginal benefits. Big Data has the potential to provide new forms of competitive advantage for organizations.

Big Data software packages provide a rich set of tools and options with which an individual can map the entire data landscape across the company and so analyze the threats faced internally. This is considered one of the main advantages, as it helps keep the data safe: the individual can detect potentially sensitive information that is not appropriately protected and make sure it is stored according to regulatory requirements.

Some of the areas where Big Data is quite beneficial are stated below.

It is widely believed that the use of information technology can reduce the cost of healthcare while improving its quality. The use of in-home monitoring devices to measure vital signs and monitor progress is just one way that sensor data can be used to improve patient health and reduce both office visits and hospital admissions.

Scientific research has been revolutionized by Big Data. The Sloan Digital Sky Survey has today become a central resource for astronomers the world over.
The field of astronomy is being transformed from one where taking pictures of the sky was a large part of an astronomer's job to one where the pictures are all in a database already and the astronomer's task is to find interesting objects and phenomena in the database [6].

Big Data helps retailers know who buys their products. Social media and web log files from their e-commerce sites can help them understand who didn't buy and why they chose not to. This can enable much more effective micro customer segmentation and targeted marketing campaigns, as well as improve supply chain efficiencies through more accurate demand planning.

Finally, social media sites like Facebook and LinkedIn simply wouldn't exist without Big Data. Their business model requires a personalized experience on the web, which can only be delivered by capturing and using all the available data about a user or member [4].

V. PROBLEMS OF BIG DATA

While the potential benefits of Big Data are real and significant, and some initial successes have already been achieved, there remain many technical challenges that must be addressed to fully realize this potential.

The sheer volume of the data poses a major challenge. In this internet-savvy world, more and more IT companies need to store and analyze ever-growing data, such as search logs, crawled web content, and click streams, usually in the range of petabytes, collected from a variety of web services. However, web data sets are usually non-relational or less structured, and processing such semi-structured data sets at large scale poses another challenge.

Huge data volumes and centralized storage slow down the speed and response of Big Data systems. Traditional DBMSs are not suitable for processing extremely large-scale data. A single server cannot handle the ever-increasing volume of data, and this acts as a serious performance bottleneck. Simple distributed file systems cannot satisfy service providers like Google, Yahoo!, Microsoft and Amazon. When processing a query in Big Data, speed is a significant demand. However, the process may take time because, in most cases, it cannot traverse all the related data in the whole database in a short time [7]. With centralized data storage and indexing, performance on tasks such as importing and exporting large amounts of data, statistical analysis, retrieval, and queries declines sharply as data volume grows, all the more so in statistics and query scenarios that require real-time responses [8].

Data security in Big Data is another area of concern. A security breach affecting Big Data would result in even more serious legal repercussions and reputational damage than at present. Unlike traditional security methods, security in Big Data is mainly a question of how to carry out data mining without exposing users' sensitive information. Only users with the right privileges and permissions should be able to see and access the data. Since large amounts of unstructured data may require different storage and access mechanisms, a unified security access control mechanism for multi-source and multi-type data has yet to be constructed and become available. Because Big Data means more sensitive data is put together, it's more attractive to potential hackers. There should also be effective backup and redundancy mechanisms for the massive volume of structured and unstructured data, so that data is not lost under any circumstances.

By using online Big Data applications, a lot of companies can greatly reduce their IT cost.
This involves massive use of third-party services and infrastructures that host important data or perform critical operations. Hence privacy of data becomes critical. Besides, current privacy protection technologies are mainly based on static data sets, while data is always changing dynamically, including changes in data patterns, variation of attributes and the addition of new data. Thus, it is a challenge to implement effective privacy protection in these complex circumstances. Data privacy is a liability, so companies must remain on the defensive about privacy.

VI. CONCLUSION

There is no doubt that Big Data is the hot frontier of today's information technology development. The amount of data currently generated by the various activities of society has never been so big, and it is being generated at an ever-increasing speed. Through better analysis of the large volumes of data that are becoming available, there is the potential for making faster advances in several disciplines and improving the profitability and success of many enterprises. However, in order to fully reap the benefits of Big Data, the challenges stated above need to be handled.

VII. REFERENCES

[1] The Emerging Big Returns on Big Data, A TCS 2013 Global Trend Study.
[2] Elena Geanina Ularu, Florina Camelia Puican, Anca Apostu, Manole Velicanu, Perspectives on Big Data and Big Data Analytics, Database Systems Journal, Volume 3, No. 4,
[3] Bernice M Purcell, Big Data Using Cloud Computing, OC
[4] Big Data for the Enterprise, An Oracle White Paper, June
[5] Chris Eaton, Dirk Deroos, Tom Deutsch, George Lapis, Paul Zikopoulos, Understanding Big Data.
[6] Challenges and Opportunities with Big Data, A Community White Paper Developed by Leading Researchers Across United States,
[7] Changqing Ji, Yu Li, Wenming Qiu, Uchechukwu Awada, Keqiu Li, Big Data Processing in Cloud Computing Environments, 2012 International Symposium on Pervasive Systems, Algorithms and Networks.
[8]