A Rank Based Parametric Query Search to Identify Efficient Public Cloud Services




Ramandeep Kaur 1, Maninder Singh 2
1, 2 Lovely Professional University, Department of CSE/IT, Phagwara, Punjab, India

Abstract: In a public cloud architecture, one of the major requirements is to search the cloud based on a user query. The presented work is inspired by the concepts of the web crawler and the search engine; here, however, the search is performed over cloud services, and the ranking is based on efficiency and reliability factors. The work is divided into two main stages. In the first stage, the user query is parsed by the cloud search engine, which performs a keyword-based extraction process; in the second stage, a ranking algorithm is applied based on significant parameters. These parameters are the user visit count and user interest; the response time and availability factors are also analyzed. The presented work returns a list of clouds ranked with respect to reliability and efficiency.

Keywords: Crawler, Optimization, Cloud, Prioritization, Ranking

1. INTRODUCTION

Cloud service is the new trend of computing, where readily available computing resources are exposed as a service. These resources are generally offered on pay-as-you-go plans and have therefore become attractive to cost-conscious customers. Apart from cost, cloud services also address the growing concerns about carbon emissions and environmental impact, since the cloud advocates better management of resources. We see a prospering trend of offloading previously in-house service systems to the cloud, driven primarily by cost and the maintenance burden. Such a move allows businesses to focus on their core competencies rather than burden themselves with back-office operations.

1.1 Search

Building a cloud-based search engine is a challenging task. Such an engine must index tens to hundreds of cloud services involving a comparable number of distinct terms, and must answer large numbers of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been conducted on them. Furthermore, due to rapid advances in technology and web proliferation, creating a web search engine today is very different from three years ago. Figure 1 shows a typical search engine architecture.

Figure 1: Typical Search Engine Architecture

There are differences in the ways various search engines work, but they all perform three basic tasks:
1. They search the Internet, or select pieces of the Internet, based on important words.
2. They keep an index of the words they find, and where they find them.
3. They allow users to look for words or combinations of words found in that index.

A search engine finds information for its database by accepting listings sent in by authors who want exposure, or by getting the information from its "web crawlers", "spiders", or "robots": programs that roam the Internet storing links to and information about each page they visit. A web crawler is a program that downloads and stores web pages, often for a web search engine. Roughly, a crawler starts off by placing an initial set of URLs, S0, in a queue, where all URLs to be retrieved are kept and prioritized. From this queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs from the downloaded page, and puts the new URLs in the queue. This process is repeated until the crawler decides to stop.
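The queue-driven loop described above is straightforward to sketch. The following is a minimal, illustrative crawler in Python; the seed URLs, the page budget used as a stopping condition, the `requests` library, and the regular-expression link extractor are our own assumptions, not details given in the paper.

```python
# Minimal sketch of the queue-driven crawl loop described above.
# Assumptions (not from the paper): requests for HTTP fetching, a regex
# link extractor, and a fixed page budget as the stopping condition.
import re
from collections import deque
from urllib.parse import urljoin

import requests

LINK_RE = re.compile(r'href="([^"#]+)"')

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)              # frontier of URLs to retrieve
    seen = set(seed_urls)                 # guard against re-crawling
    pages = {}                            # url -> downloaded HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()             # take the next URL (FIFO order)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue                      # skip unreachable pages
        pages[url] = html                 # store the page for later use
        for link in LINK_RE.findall(html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)        # new URLs go to the back of the queue
                queue.append(absolute)
    return pages
```

A real crawler would add politeness delays, robots.txt handling, and a smarter frontier ordering; the FIFO queue here is only the simplest instance of the scheme the text describes.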
Collected pages are later used for other applications, such as a web search engine or a web cache. The most important measures for a search engine are search performance, the quality of the results, and the ability to crawl and index the web efficiently. The primary goal is to provide high-quality search results over a rapidly growing World Wide Web. Some of the efficient and recommended search engines are Google, Yahoo and Teoma, which share some common features and are standardized to some extent. Web crawlers are also known as spiders, robots, worms, etc. Crawlers are automated programs that follow the links found on web pages.

1.2 URL Server

There is a URL server that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the store server, which compresses the pages and stores them in a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page. The indexer and the sorter perform the indexing function. The indexer performs a number of tasks: it reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. A hit records the word, its position in the document, an approximation of its font size, and its capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, as well as the text of the link.
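The hit-and-barrel structure described above can be made concrete with a short sketch. The tokenizer, the stop-word list, and the choice of hashing words into a fixed number of barrels below are our own simplifying assumptions; the paper only names the concepts.

```python
# Illustrative sketch of the indexing step: parse a document into "hits"
# (word, position, simple display attributes) and distribute them into
# barrels to form a partially sorted forward index.
# Tokenizer, stop-word list, and barrel count are assumptions.
import re
from collections import namedtuple

STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in"}
NUM_BARRELS = 4

Hit = namedtuple("Hit", ["word", "position", "capitalized"])

def parse_hits(text):
    """Convert one document into a list of hits, excluding stop words."""
    hits = []
    for position, token in enumerate(re.findall(r"[A-Za-z]+", text)):
        word = token.lower()
        if word in STOP_WORDS:
            continue
        hits.append(Hit(word, position, token[0].isupper()))
    return hits

def build_forward_index(docs):
    """docs: dict of doc_id -> text. Returns barrels of (doc_id, hit).
    Within a run, each word always lands in the same barrel."""
    barrels = [[] for _ in range(NUM_BARRELS)]
    for doc_id, text in docs.items():
        for hit in parse_hits(text):
            barrels[hash(hit.word) % NUM_BARRELS].append((doc_id, hit))
    return barrels
```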
2. REVIEW OF LITERATURE

One of the major requirements on the web is the selection of the best service and service provider. When we talk about cloud services, the problem is more specific and parametric. Many researchers have worked in this direction; their work is summarized in this section.

Hussein Issa (2010) presented the problem of duplicate records and their detection, and addressed one type of record in particular that is of great interest in the business world: duplicate payments. An application of record-matching techniques to the database of a telecommunication company is used as an illustration. He concludes that duplicate payments, which can be defined as multiple representations of the same real-world object or entity, have a serious effect on the quality of audit and fraud detection systems. They can signify the presence of fraud, systematic errors arising from incompatibilities between database systems, or simply human error. There is a plethora of cases in the literature showing the effect of duplicate payments on organizations and the amount of money lost because of them [3].

Brett J. Peterson presented multiple methods to find and eliminate erroneous duplicates using SAS, including a macro. A proactive approach, including a weekly production job that alerts clinical study team members of duplicates to be reconciled, is also discussed [4].

K. Küspert et al. (2008) found that detecting and deleting duplicates in the extended NF2 data model is a complex task, both in terms of what needs to be offered to the user and in terms of implementation. Having introduced the notion of uniqueness for tables, sets and lists in the extended NF2 model, they introduced three ordering relations for complex objects; the definition that uses the cardinality of sets and repeated minima as its ordering criterion was picked as the basis for their discussion of existing and newly created criteria [5].
Bo Hong and Demyn Plantenberg (2008) included an analysis of selected real-world data sets aimed at demonstrating the space-saving potential of coalescing duplicate data. Their results show that Duplicate Data Elimination (DDE) can reduce storage consumption by up to 80% in some application environments. The analysis explores several additional factors, such as the impact of varying file block size and the contribution of whole-file duplication to the net savings [6].

Tak W. Yan and Hector Garcia-Molina (2007) proposed a Duplicate Removal Module (DRM) for an information dissemination system. The removal of duplicates operates on a per-user, per-document basis: each document read by a user generates a request, or a duplicate constraint. In wide-area environments, the number of constraints handled is very large. They considered the implementation of a DRM, examining alternative algorithms and data structures, and presented a performance evaluation of the alternatives to answer important design questions [7].

In 2009, Georgia Koutrika presented a data cloud in which cloud search is performed on the basis of a query summarization approach. This is a structural work in which keyword extraction and summarization are performed, and on that basis the navigation and visualization of the data are suggested. The implemented work assigns tags to different kinds of keywords, and based on these tags a query refinement is performed. Finally, a flexible search over the database is performed to derive the outcome. The result analysis considers the effectiveness and efficiency of the cloud services [8].

In 2012, Cengiz Orencik presented a rank-based keyword search on the data cloud. In this work, document retrieval is performed on the cloud server based on keyword analysis, and the information search is performed relative to the defined information. The work operates on encrypted data, which improves the security and reliability of retrieval; on this basis a secure protocol called Private Information Retrieval is suggested. The system executes the query and presents the final results on the basis of parametric ranking, achieving efficient computation and communication for the requirement analysis [9].

In 2012, Mathew J. Wilson performed a work on web search engines for the keyword cloud. Here, clouds are represented by tags called metadata. The metadata describes a cloud with parameters for its security, efficiency and reliability criteria; on this basis, keyword matching is performed against the keywords of the different clouds. The work includes a learning stage in the keyword extraction, and a comparative analysis is performed to extract the related cloud services from the system [10].

In 2011, Ju-Chiang Wang presented a content-oriented, tag-based search for database search. A music database is selected for the query analysis, and multiple levels of preference are defined for the desired clouds. The query performed by the user is analyzed and divided into different colors, or levels, to perform effective content-based retrieval; on this basis, music retrieval is proposed. A probabilistic fusion model is defined based on a Gaussian mixture model and a multinomial mixture model. The authors evaluated the proposed system for the effectiveness of the user query and the related results [11].

In 2011, Venkateshprasanna H. M. presented a work on enterprise search over the tag cloud. A tag is information based on keyword classification; it provides a categorization of the cloud based on its role in the business environment. On the basis of this information, a knowledge criterion is defined with respect to the enterprise system. A novel approach is suggested based on the automated selection of the cloud for enterprise query systems. The presented system is content-based and integrated with the search system [12].

3. RESEARCH METHODOLOGY

The presented work performs a search on the public cloud, integrating the concept of a search engine with crawling. The work includes the creation of a database representing the list of available clouds. When a user query is issued, the search is performed over all the available services, which are then listed as the query results. The proposed query-based architecture is shown in figure 2.

The proposed work searches a user's keyword-oriented query over the listed metadata of the available cloud services. The query search engine first separates the keywords and prioritizes them. From these keywords, the metadata words that are present in any cloud are extracted. Once query filtration is done, we obtain the exact query that is passed to the cloud-based search engine to perform the cloud service extraction. Because the web document collection is a large database, the work also aims to speed up the process: instead of comparing whole documents, we use document summaries, generated with a feature-based text categorization approach. After the query analysis, a search over the metadata is performed to identify the related cloud services.

Figure 2: Proposed Cloud Search Architecture

As the final stage, indexing is performed. The indexing mechanism considered here is based on user cloud visits and cloud recommendations. According to these parameters, a ranking score is generated, and the extracted cloud services are listed in order of that score.
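The paper names the ranking parameters (user visit count, user interest, response time, availability) but does not state the ranking formula itself. The weighted score below is therefore only an illustrative sketch: the weights, the normalization, and the record fields are all our own assumptions.

```python
# Illustrative ranking sketch over the four parameters named in the paper
# (visit count, user interest, response time, availability). The weights,
# normalization, and record fields are assumptions, not the paper's formula.
def rank_clouds(services, weights=(0.3, 0.3, 0.2, 0.2)):
    """services: list of dicts with keys
    'name', 'visits', 'interest', 'response_time_ms', 'availability'."""
    w_visits, w_interest, w_response, w_avail = weights
    max_visits = max(s["visits"] for s in services) or 1
    max_rt = max(s["response_time_ms"] for s in services) or 1

    def score(s):
        return (w_visits * s["visits"] / max_visits
                + w_interest * s["interest"]               # assumed in [0, 1]
                + w_response * (1 - s["response_time_ms"] / max_rt)
                + w_avail * s["availability"])             # assumed in [0, 1]

    return sorted(services, key=score, reverse=True)

# Example: rank two hypothetical cloud services.
ranked = rank_clouds([
    {"name": "cloudA", "visits": 120, "interest": 0.8,
     "response_time_ms": 250, "availability": 0.99},
    {"name": "cloudB", "visits": 300, "interest": 0.5,
     "response_time_ms": 400, "availability": 0.95},
])
print([s["name"] for s in ranked])
```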
The flow chart of the proposed work is shown in figure 3. The basic steps of the proposed work are listed below.

3.1 Cloud Service Analysis

SEO-style analysis by itself is not enough to place a service at the top of the search results. In the case of cloud-based search, we need integration with all the cloud services and the extraction of metadata from them; this gives an overview of each cloud service's potential. Based on this keyword analysis, the cloud services are selected for use in the system.

3.2 Web Crawler

The web crawler can be used to crawl a whole site on the Inter-/Intranet. You specify a start URL, and the crawler follows all links found in that HTML page. A site can be seen as a tree structure: the root is the start URL, all links in that root HTML page are direct children of the root, and subsequent links are children of the previous children. The crawler simply sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on links. Web crawling can thus be regarded as processing items in a queue: when the crawler visits a web page, it extracts links to other web pages, puts those URLs at the end of the queue, and continues crawling from the URL it removes from the front of the queue.

3.3 URL Filtration

The focused crawler has three main components: a classifier, which makes relevance judgments on crawled pages to decide on link expansion; a distiller, which determines a measure of centrality of crawled pages to set visit priorities; and a crawler with dynamically reconfigurable priority controls, governed by the classifier and distiller.
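A minimal sketch of how the classifier and distiller could govern crawl priority follows. Both scoring functions are stubs, and the priority-queue arrangement is our own assumption about one plausible implementation, not a design given in the paper.

```python
# Sketch of a focused-crawler frontier: the classifier's relevance score
# and the distiller's centrality score jointly set the visit priority.
# Both scorers are stubs; the combination rule is an assumption.
import heapq

def classifier_relevance(url):
    """Stub: relevance judgment on a page/URL, in [0, 1]."""
    return 1.0 if "cloud" in url else 0.2

def distiller_centrality(link_count):
    """Stub: centrality estimate, here based only on in-link count."""
    return min(link_count / 10.0, 1.0)

class FocusedFrontier:
    def __init__(self):
        self._heap = []  # min-heap; negate priority to pop best first

    def add(self, url, link_count):
        priority = classifier_relevance(url) + distiller_centrality(link_count)
        heapq.heappush(self._heap, (-priority, url))

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None

frontier = FocusedFrontier()
frontier.add("http://example.com/cloud-service", link_count=5)
frontier.add("http://example.com/blog", link_count=2)
print(frontier.next_url())  # the cloud-related page is visited first
```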

3.4 Indexing

An indexer is a program that reads the pages downloaded by the spiders. Indexing web content is a challenging task, assuming an average of 1000 words per web page and billions of such pages. Indexes are used for searching by keywords; indexing starts with parsing the website content using a parser. Any parser designed to run on the entire web must handle a huge array of possible errors. The parser extracts the relevant information from a web page by excluding certain common words (such as a, an, the, also known as stop words), HTML tags, JavaScript, and other unwanted characters.

3.5 Keyword Analysis

Keyword research and analysis form the basis of a website that needs heavy traffic from search engines. The purpose of this step is to finalize the effective keywords or key phrases with which a website, or here a cloud service, should be marketed. Identifying the phrases that drive high-quality traffic from the leading search engines to the site is the measure of success for this step.

3.6 Meta Tag Analysis

A meta tag is created in order to give keywords, description details, and other related information to spiders. It is invisible on the rendered page and can be seen only by viewing the page source.

3.7 Cloud Search

The final step is to analyze all the retrieved cloud services, matching the results of the meta tag analysis, the meta tags themselves, and the contents, in order to detect the matched pages.
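As a sketch of this matching step, the snippet below scores each cloud service by the overlap between the filtered query keywords and the service's metadata tags. The tag representation and the Jaccard-style overlap measure are our own assumptions; the paper does not specify the matching function.

```python
# Sketch of the cloud search step: match filtered query keywords against
# each service's metadata tags and keep the services with any overlap.
# The tag representation and overlap measure are assumptions.
def match_clouds(query_keywords, cloud_metadata):
    """cloud_metadata: dict of service name -> set of metadata tags.
    Returns (service, overlap score) pairs, best match first."""
    query = {k.lower() for k in query_keywords}
    results = []
    for service, tags in cloud_metadata.items():
        tags = {t.lower() for t in tags}
        overlap = len(query & tags) / len(query | tags)  # Jaccard overlap
        if overlap > 0:
            results.append((service, overlap))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Example with two hypothetical services.
print(match_clouds(
    ["storage", "backup"],
    {"cloudA": {"storage", "compute"}, "cloudB": {"backup", "storage"}},
))
```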

3.8 Ranking

After the matching, the ranking algorithm is applied to index the searched pages.

4. CONCLUSION

The presented work is a query-based cloud searching and indexing process performed on the public cloud. It performs a ranked search in which the ranking is based on user interest and search history, and it identifies efficient and reliable cloud services for the user.

Figure 3: Proposed Flowchart

REFERENCES
[1] http://www.emeraldinsight.com/authors/guides/write/literature.htm
[2] http://en.wikipedia.org/wiki/literature_review
[3] Hussein Issa, Rutgers Business School, Rutgers University, "Application of Duplicate Records Detection Techniques to Duplicate Payments in a Real Business Environment".
[4] Brett J. Peterson, Medtronic Inc., Minneapolis, MN, "Finding a Duplicate in a Haystack".
[5] K. Küspert, G. Saake, L. Wegner, "Duplicate Detection and Deletion in the Extended NF2 Data Model", IBM Heidelberg Scientific Center, Heidelberg / TU Braunschweig / GhK Kassel, West Germany.
[6] Bo Hong, Demyn Plantenberg, Darrell D. E. Long, Miriam Sivan-Zimet, "Duplicate Data Elimination in a SAN File System", Univ. of California, Santa Cruz / IBM Almaden Research Center.
[7] Tak W. Yan, Hector Garcia-Molina, "Duplicate Removal in Information Dissemination", Department of Computer Science, Stanford University, Stanford, CA.
[8] Georgia Koutrika, "CourseCloud: Summarizing and Refining Keyword Searches over Structured Data", EDBT 2009, March 24-26, 2009, Saint Petersburg, Russia, pp. 1132-1135.
[9] Cengiz Orencik, Erkay Savas, "Efficient and Secure Ranked Multi-Keyword Search on Encrypted Cloud Data", PAIS 2012, March 30, 2012, Berlin, Germany, ACM, pp. 186-195.
[10] Mathew J. Wilson, Max L. Wilson, "Tag Clouds and Keyword Clouds: Evaluating Zero-Interaction Benefits", CHI 2011, May 7-12, 2011, Vancouver, BC, Canada, pp. 2383-2388.
[11] Ju-Chiang Wang, Yu-Chin Shih, Meng-Sung Wu, Hsin-Min Wang, Shyh-Kang Jeng, "Colorizing Tags in Tag Cloud: A Novel Query-by-Tag Music Search System", MM '11, November 28 - December 1, 2011, Scottsdale, Arizona, USA, ACM, pp. 293-302.
[12] Venkateshprasanna H. M., Rujuswami D. Gandhi, Kavi Mahesh, J. K. Suresh, "Enterprise Search through Automatic Synthesis of Tag Clouds", COMPUTE '11, March 25-26, 2011, Bangalore, India.

AUTHOR

Ramandeep Kaur received the B.Tech. degree in Information Technology from Lovely Professional University in 2012 and is pursuing an M.Tech. in Information Technology at the same university.