Web Intelligence on Big Data in Today's Life

Updesh Kumar Jaiswal
I.M.S. Engineering College, Ghaziabad, U.P., India
updesh1984@gmail.com

Abhishek Gupta
I.M.S. Engineering College, Ghaziabad, U.P., India
abhishekftp@yahoo.com

ABSTRACT
The future will bring a reverse Turing test, in which a machine has to determine whether it is interacting with a human or with another machine. Big data means very large volumes of data, with millions of transactions occurring per day. In order to use and handle these petabytes of data, we use web intelligence. Understanding human behavior and interest is helpful for better and more efficient searching. In this paper we propose a system that thinks like a human, acts like a human, thinks rationally and acts rationally. With the help of our proposed system, searching for information will follow the user's choice and interest, and useful documents on the web will be found in comparatively less time. Our proposed system will also help in showing only those advertisements for products that the user likes and wants to buy.

General Terms
Algorithm, Artificial Intelligence, DBMS, Data Mining, Information Technology and Web Intelligence.

Keywords
Artificial Intelligence, Big Data, Data, Data Mining, Information, Searching, Web and Web Intelligence.

1. INTRODUCTION
Over the years there has been a lot of talk about Kryder's law, which states that the doubling of the storage capacity of computer hardware roughly every 18 months has driven technological advancement [1]. The graph in figure 1 shows hard-drive capacity increasing year by year according to Kryder's assumption. The number of bits actually stored on hard drives has kept pace with Kryder's assumption, or even exceeded it, and handling these volumes has become very difficult.

Fig 1: Assumption of Kryder's Law

Closely related observations were made by Moore, who proposed Moore's law [2]: over the chronological record of computing hardware, the number of transistors on integrated circuits doubles approximately every two years. His prediction has proven accurate, and the law is now used in the semiconductor industry to guide long-term planning and to set targets for research and development. A plot of CPU transistor counts against dates of introduction corresponds to exponential growth, with the transistor count doubling every two years.

Fig 2: Assumption of Moore's Law

Due to this massive growth of data on the web, typical large enterprises such as Facebook, Twitter, Google, large banks, retail companies, insurance companies and hospitals have many servers in which petabytes of data are stored and millions of transactions take place per day.
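Both laws describe compound doubling over a fixed period, which can be computed directly. The short sketch below (Python) is purely illustrative; the starting values and doubling periods are assumed round numbers, not measurements taken from the figures.

    # Illustrative only: project capacity under a fixed doubling period.
    def projected(initial, years, doubling_period):
        """Value after `years` if it doubles every `doubling_period` years."""
        return initial * 2 ** (years / doubling_period)

    # Assumed starting points: a 100 GB drive doubling every 1.5 years
    # (Kryder-style) and 1e8 transistors doubling every 2 years (Moore-style).
    print(projected(100, 10, 1.5))   # ~10,159 GB of disk after 10 years
    print(projected(1e8, 10, 2.0))   # ~3.2e9 transistors after 10 years

Ten years of such doubling turns a modest drive into tens of terabytes, which is exactly the kind of growth that makes the data volumes discussed below so hard to handle.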
As a result, there is a need for a technology that large enterprises can use to handle and process these millions of transactions without delay.

1.1 Big Data
Data are values of quantitative or qualitative variables that represent a set of items. Big data is the collection of large and complex data sets that are difficult to process using on-hand database tools. Some examples of big data are:
- data of social networking web sites (e.g. Facebook.com, Orkut.com),
- data of the banking sector,
- debit and credit card based data,
- e-commerce based data,
- study material data,
- the large number of text documents, images, audio and video files available on the web.

In other words, big data is the collection of transactions, interactions and observations. Figure 3 shows the size of big data growing with the increasing complexity and variety of the data. The collection of data from ERP systems, CRM systems and the web forms big data, which requires petabytes of storage capacity.

Fig 3: Storage capacity required for big data

It is therefore very difficult to process big data using on-hand database management tools and traditional data processing applications. The challenges include capture, storage, search, sharing, transfer, analysis and visualization. Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages; it requires massively parallel software running on hundreds or even thousands of servers. The images and text of Wikipedia are a simple example of big data, terabytes in size, and these data are hard to handle because a large number of edits and updates take place frequently. Figure 4 shows the large number of Wikipedia edits to text and images, as counted by IBM.

Fig 4: A visualization created by IBM of Wikipedia edits

There are many techniques used by enterprises to handle databases, such as traditional SQL, MySQL and Excel; they normally use simple queries to store and explore the data. But such queries are not efficient with big data, where petabytes of data are involved. Google, Facebook, Amazon, LinkedIn, eBay and many others do not use traditional databases for handling big data. This raises two questions: first, why are they not using traditional databases, and second, if not, what are they using to handle big data? The answer to the first question is that traditional database techniques are not efficient for big data: they do not respond as fast as required, and if the response speed is too slow for data analysis and result output, the analysis is of little use. The answer to the second question is that they use massive parallelism and the MapReduce algorithm to handle big data.
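Sections 1.1.1 and 1.1.2 describe these two techniques in turn. As a concrete preview, the sketch below runs the classic word-count example in the map/reduce style on a single machine; it is illustrative only (Python, with made-up documents), since a real deployment would distribute the same two phases across many servers.

    from collections import defaultdict

    def mapper(doc_id, text):
        # Map(k1, v1): emit one (word, 1) pair for every word in a document.
        return [(word, 1) for word in text.split()]

    def reducer(word, counts):
        # reduce(k2, [v2, ...]): combine all values collected for one key.
        return word, sum(counts)

    documents = {1: "big data on the web", 2: "web intelligence on big data"}

    groups = defaultdict(list)           # shuffle phase: group pairs by key
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            groups[key].append(value)

    result = dict(reducer(key, values) for key, values in groups.items())
    print(result)   # {'big': 2, 'data': 2, 'on': 2, 'the': 1, 'web': 2, ...}

Because each mapper call is independent of the others, the first loop is exactly the part that can be spread across thousands of machines.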
1.1.1 Massive Parallelism
Massive parallelism refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel. For example, in grid computing the processing power is increased by opportunistically using machines across diverse administrative domains whenever a computer is available [3]. Another example is the computer cluster, where a large number of processors are used in close proximity to each other; in such a centralized system, flexibility and speed become very important [4]. Massively Parallel Processor Arrays (MPPAs) are a type of integrated circuit with an array of hundreds or thousands of CPUs and RAMs. These processors pass work to one another through reconfigurable interconnect channels. By keeping a large number of processors working in parallel, an MPPA chip can accomplish more demanding tasks than conventional chips. MPPAs are based on a software parallel programming model for developing high-performance embedded system applications [6].

1.1.2 MapReduce Algorithm
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster [6, 10]. MapReduce libraries have been written in many programming languages, with different levels of optimization; a popular open-source implementation is Apache Hadoop. The name MapReduce originally referred to a proprietary Google technology but has since been generalized. MapReduce allows distributed processing of the map and reduce operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source.

    mappers:  take in (k1, v1) pairs and emit (k2, v2) pairs
              (k2, v2) <- map(k1, v1)
    reducers: receive all pairs for some k2 and combine them in some manner
              (k2, f(v2, ...)) <- reduce(k2, [v2, ...])

Fig 5: Flow diagram of the MapReduce algorithm

1.2 Shortcomings in the Existing System
The existing system has several difficulties in the following areas:
- storage,
- collection,
- searching,
- management,
- retrieval of data and information from the web.

Through machine login, a hacker can make a server on the web very busy: machine login means that a hacker develops programs that overload the server with multiple automated accesses. In searching for data and information on the web, the user's choice is generally not taken into consideration. The output of searching is often inefficient, because less useful data or information may be displayed first while useful information may only appear on the last page of the search results. Moreover, existing systems are less secure and do not take any feedback from users after the searching process is complete. The searching techniques used by many websites are limited to traditional SQL queries, which are not responsive and efficient enough for big data, where petabytes of data are involved.

1.3 Web Intelligence
A system of interlinked hypertext documents accessed via the internet is known as the web, or World Wide Web. The term intelligence refers to the capacity to learn and to solve real-life problems. Artificial intelligence is a branch of computer science in which we study and develop intelligent machines and software. A web search engine is a software system that is designed to search for data and information on the web.
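The heart of such a system is an index from words to the documents that contain them. The sketch below shows that idea at toy scale, with an in-memory inverted index over three made-up "pages"; it is an assumed illustration, not how any particular engine is built, and real engines add crawling, ranking and distribution on top.

    from collections import defaultdict

    pages = {
        "page1": "red rose flowers as a cheap gift",
        "page2": "history of the world wide web",
        "page3": "web intelligence on big data",
    }

    index = defaultdict(set)             # word -> IDs of pages containing it
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)

    def search(query):
        """Return pages containing every query word (boolean AND search)."""
        words = query.lower().split()
        if not words:
            return []
        return sorted(set.intersection(*(index[w] for w in words)))

    print(search("web data"))            # -> ['page3']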
Web intelligence is an area of study and research concerned with applying artificial intelligence and information technology on the web in order to create the next generation of internet-based products, services and frameworks. In short, we can say that it is the study of artificial intelligence and information technology applied to the web [7].

Fig 6: Web Intelligence is a combination of IT and AI

Watson, a system built by IBM, is a very good example of web intelligence on big data [5]. IBM applied advanced natural language processing and machine learning techniques to build a computing system (Watson) capable of open-domain question answering.

Fig 7: The actual vision of Watson

Watson was first introduced on the American television game show Jeopardy, as shown in figure 8. The show consists of a quiz competition in which contestants are presented with general-knowledge clues in the form of answers, and they have to respond in the form of a question to each clue. Watson was the winner of this game show.

Fig 8: Watson and other participants in Jeopardy

Another good example of web intelligence is face recognition on the web. A face recognition system identifies a person when the cursor is placed over that person's face.

Fig 9: Face recognition on the web

Other uses of web intelligence, at present and in the future, include online advertisement, predicting the interest of the user, gauging consumer sentiment, predicting the behavior of the user, detecting adverse events and predicting their impact, and recognizing places and buildings.

1.3.1 Machine Learning
Machine learning is a branch of artificial intelligence concerned with the study and construction of systems that can learn from data [8]. For example, a machine learning system could be trained on email messages to learn to distinguish between promotional and private messages; after learning, it can be used to classify new email messages into promotional and private folders. Another example of machine learning is in web searching, where the goal is to predict what a person actually wants. If a person types "rose", the machine predicts from past data that the person wants general information about the flower, so it provides search results with general information about roses; but if the person types "red rose", the machine understands from previous data that the person is looking to buy roses, and it provides results pointing to online shops.

Representation and generalization are the two main components of machine learning [8]. The representation of data instances, and of the functions evaluated on these instances, is part of every machine learning system. Generalization is the property that the system will perform well on unseen data instances; the conditions under which this can be guaranteed are studied in the subfield of computational learning theory. There is a wide variety of machine learning tasks and successful applications, for example optical image recognition, in which images and structures are recognized automatically based on previously available instances.
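A minimal sketch of the email example above, assuming the scikit-learn library and a handful of made-up training messages (a real system would of course need far more data):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Made-up training messages for the two folders.
    train_messages = [
        "huge discount sale buy now",         # promotional
        "limited offer free shipping",        # promotional
        "are we still meeting for lunch",     # private
        "here are the photos from the trip",  # private
    ]
    train_labels = ["promotional", "promotional", "private", "private"]

    vectorizer = CountVectorizer()                       # bag-of-words counts
    features = vectorizer.fit_transform(train_messages)

    model = MultinomialNB().fit(features, train_labels)  # learn word statistics

    new_message = ["exclusive offer just for you buy today"]
    print(model.predict(vectorizer.transform(new_message)))  # ['promotional']

The representation-plus-generalization pattern described above is visible here: the vectorizer fixes the representation of each message, and the classifier is judged by how well it labels messages it has never seen.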
1.3.2 Data Mining as Knowledge Discovery
Data mining can be seen as a result of the natural evolution of information technology. It is a computational process of discovering interesting patterns in data sets with the help of methods from artificial intelligence, machine learning and database systems [9]. Data mining has attracted great attention in the information industry in recent years, due to the wide availability of huge amounts of data in databases, data warehouses and other information repositories such as web data, and the need to turn such data into useful information and knowledge. The information analyzed and the knowledge gained can be used for applications such as market forecasting, fraud detection and customer retention. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Simply put, data mining refers to extracting, or "mining", knowledge from large amounts of data; it is the process of discovering interesting knowledge in large volumes of data.

2. OUR PROPOSED APPROACH AND SYSTEM
We are proposing a system:
- that thinks like a human,
- that thinks rationally,
- that acts like a human,
- that acts rationally.

The proposed system can be achieved with the help of web intelligence, machine learning and data mining. Such a system may result in better and more efficient outcomes.

3. APPLICATIONS OF OUR PROPOSED APPROACH AND SYSTEM
Our proposed system will have the following applications and outcomes:
- The required documents and information will be found in comparatively very little time.
- Data storage will not be traditional; it will be far more advanced, using web intelligence techniques.
- A login form will be available for using the system, which will help in monitoring and controlling each and every user.
- A CAPTCHA figure will be generated at the time of login, which prevents machine login on the server (a minimal sketch follows this list).
- The system will be more secure than general systems.
- The outcomes of searching will be more appropriate and optimized, because less useful information will come after the most useful information.
- Searching will follow the user's choice and interest, i.e. the user's choice and interest will also be taken into consideration.
- The system will show the user or client only those advertisements for products the user likes or intends to buy.
- Feedback on the search results may also help to improve the efficiency of the system and our understanding of the user's nature.
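As referenced in the list above, the login step relies on a challenge that humans can pass but simple scripts cannot. A minimal text-based stand-in for an image CAPTCHA might look like the following; this is purely an illustration of the idea, not the paper's implementation, and a production system would render a distorted image instead.

    import random

    def make_challenge():
        # Generate a simple question that a human at the keyboard can answer.
        a, b = random.randint(1, 9), random.randint(1, 9)
        return f"What is {a} + {b}?", a + b

    question, answer = make_challenge()
    # The server stores `answer` in the session and displays `question`
    # on the login form; automated "machine login" attempts that cannot
    # solve the challenge are rejected.
    reply = input(question + " ")
    print("login allowed" if reply.strip() == str(answer) else "login blocked")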
4. IMPLEMENTATION DETAILS OF MACHINE LEARNING IN OUR DATABASE
Suppose a user is searching for some information using keywords like flower, red, gift and cheap, or a combination of these keywords. The web now has to decide whether the user is a surfer or a shopper, and also which advertisement to show to the user. For solving these problems, learning from past experience is very important. For this purpose we collected all the historical data related to searches for flowers on the web. In Table 1, the column names represent the keywords that users typed, and the last column indicates whether the user finally bought the product. In Table 1, the symbol Y means yes and N means no.

Table 1: Data of keywords used in searching operations

    Red   Flower   Gift   Cheap   BUY?
    N     N        Y      Y       Y
    Y     N        N      Y       N
    Y     Y        Y      N       N
    Y     Y        N      Y       Y
    N     Y        Y      N       N
    N     Y        N      Y       Y

From this experiment we observe, for example, that the user who used the keywords red, flower and gift but not cheap (row 3 of Table 1) did not buy anything, while the user who used the keywords red, flower and cheap (row 4 of Table 1) bought the product, and so on.

4.1 Result
By collecting and using past historical data, our system is able to decide whether a user is a surfer or a shopper, and likewise whether or not to show advertisements to the user.
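To show how a learner can use Table 1, the sketch below trains a small decision tree on its six rows and then classifies a new query. It assumes the scikit-learn library, and with only six examples it is an illustration of the idea rather than a reliable model.

    from sklearn.tree import DecisionTreeClassifier

    # Feature columns: Red, Flower, Gift, Cheap (1 = keyword used, 0 = not).
    X = [
        [0, 0, 1, 1],   # row 1: bought
        [1, 0, 0, 1],   # row 2: did not buy
        [1, 1, 1, 0],   # row 3: did not buy
        [1, 1, 0, 1],   # row 4: bought
        [0, 1, 1, 0],   # row 5: did not buy
        [0, 1, 0, 1],   # row 6: bought
    ]
    y = [1, 0, 0, 1, 0, 1]   # BUY? column (1 = Y, 0 = N)

    model = DecisionTreeClassifier(random_state=0).fit(X, y)

    # A new user searching "red flower cheap" is classified as a shopper,
    # so shopping advertisements can be shown.
    print(model.predict([[1, 1, 0, 1]]))   # -> [1]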
5. CONCLUSION
The following conclusions can be drawn from the above discussion:
- Handling big data with web intelligence is the main bottleneck at present, and there is a huge need to design and implement suitable algorithms, artificial intelligence techniques and data mining techniques for big data. Web applications can use these techniques for predictive intelligence.
- Understanding human behavior can make searching better, more efficient and more optimized. This includes showing online advertisements to interested consumers who intend to buy a product, which is a basic goal of many enterprises.
- Detecting adverse events and predicting their impact can be done without human help, with the help of web intelligence and machine learning.
- Building more intelligent public services, such as water, energy and health facilities, is another field where these techniques can do great work.
- Machine learning, on the other hand, has a strong influence on local search engines as well as broad search techniques, by learning from previous data.
- Data mining helps to explore vast data and find interesting patterns that yield useful information. Enterprise managers can analyze the results of data mining techniques and then use them to plan the company's future schedules.

Web intelligence and big data present many research opportunities at present. Due to advances in technology, it is now easy to implement such web techniques in the real world. These applications are intended for researchers who are willing to work in this area.
6. REFERENCES
[1] "2007 George E. Pake Prize Recipient." American Physical Society, 2007.
[2] Moore, Gordon E. 1965. "Cramming More Components onto Integrated Circuits." Electronics Magazine. Retrieved 2006-11-11.
[3] "Data, Data Everywhere." The Economist, 25 February 2010. Retrieved 9 December 2012.
[4] Akkaya, Kemal, and Younis, Mohamed. 2005. "A Survey on Routing Protocols for Wireless Sensor Networks." Computer Science Bibliography.
[5] Knight, Will. "IBM Creates World's Most Powerful Computer." NewScientist.com news service, June 2007.
[6] Fernandez de Vega, Francisco. 2010. Parallel and Distributed Computational Intelligence. ISBN 3-642-10674-9, pp. 65-68.
[7] Zhong, Ning; Liu, Jiming; Yao, Y.Y.; and Ohsuga, S. 2000. "Web Intelligence (WI)." In Proceedings of the 24th Annual International Computer Software and Applications Conference (COMPSAC 2000), p. 469. doi:10.1109/CMPSAC.2000.884768. ISBN 0-7695-0792-1.
[8] Wernick, Yang, Brankov, Yourganov, and Strother. "Machine Learning in Medical Imaging." IEEE Signal Processing Magazine, vol. 27, no. 4, July 2010, pp. 25-38.
[9] "Data Mining Curriculum." ACM SIGKDD, 2006-04-30. Retrieved 2011-10-28.
[10] Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., and Cayirci, E. 2002. "Wireless Sensor Networks: A Survey." Elsevier Science B.V.