NoSQL: Robust and Efficient Data Management on Deduplication Process by Using a Mobile Application




NoSQL: Robust and Efficient Data Management on Deduplication Process by Using a Mobile Application

Hemn B. Abdalla, Jinzhao Lin, Guoquan Li

Abstract: Information technology today carries an enormous responsibility to deliver efficient data management in all kinds of industries and online service-oriented companies. Organizations spend large amounts of money every day on online data storage and maintenance, and nobody has the work hours to review the data, enter it into new software, and then hire someone to maintain it. This paper focuses on deduplication of data storage, an economical process for keeping data unique and speeding data access. The deduplication step examines the various content and customer data to avoid storing duplicates, which yields fast data access and secure data storage. We use a clustering method for rapid retrieval of data from the database, and we apply a mapping technique for categorizing and linking data to access information. For data processing we use MongoDB, a robust NoSQL document database, in order to apply the new methods and algorithms to complex, large-scale data.

Keywords: MongoDB; deduplication; data security; clustering; mapping.

I. INTRODUCTION

In the past couple of years the number of mobile users has grown day by day, in every part of the world. We have experienced a significant shift in the way we access the internet: mobile devices have become the primary access point for internet usage. Previously, developers usually built mobile applications that saved data and files in relational stores such as MySQL and SQL Server. NoSQL systems, by contrast, provide data partitioning and replication as built-in features [3], [4], [5]. In this paper we therefore propose a new approach of collecting data in MongoDB through a mobile application. MongoDB is a document database, commonly classed as NoSQL, that provides the high performance that makes reads and writes fast.
In this storage system we apply the deduplication method to detect duplicate records as well as duplicate content; because the deduplication process admits only original data, we also obtain higher data security. As the volume and complexity of data grow rapidly, businesses need space to store stacks of candidate resumes and applications, and nobody has the work hours to review the data, enter it into new software, and then hire someone to maintain it. Many applicants are not appropriate for the open positions, but their information can be stored as passive records in our database: once loaded into our resume database, they can be retrieved at a later date. We provide a secure professional system that lets us select search criteria such as job applied for, date range, specific skills, and location, and immediately produce lists of all matching persons' contact details and resumes from our database. These can be made available to business representatives for critical reporting, assessment, and other data needs.

II. METHODOLOGY

A. Data Deduplication Process

Deduplication is one of the notable technologies in the current environment because of its ability to reduce storage costs. It comes in many flavors, and organizations need to understand each of them in order to choose the one that fits best. Deduplication can be applied to data, or to the content of the data, in primary storage, backup storage, cloud storage, or data in flight during replication, such as local area network and wide area network transfers. A deduplication system first chunks the data stream, then tries to find duplicate chunks among the already stored chunks, and stores only new chunks to disk [6]. The deduplication system performs all deduplication, provides the required quality, and captures the interactions between source data and result data.
Algorithm 1: Cryptographic hash values from the data chunks/blocks. Hash functions accelerate table or database lookup by detecting duplicate records in a large file, and a cryptographic hash function makes it easy to verify that some input data maps to a given hash value. The algorithm works on the file information, i.e. the chunk information: when a user stores data, we maintain two major concepts, file.chunk and file.file. The algorithm then checks whether duplication has occurred by comparing each file chunk's size as well as its metadata (it adjusts the target table so that sets of duplicates already in the target table are reduced to single records).
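A minimal sketch of Algorithm 1 is shown below. It splits a byte stream into fixed-size chunks, hashes each chunk with SHA-256, and stores only chunks whose hash has not been seen before. The toy chunk size and the in-memory HashSet index are our own illustrative assumptions, not the paper's exact implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ChunkDedup {
    static final int CHUNK_SIZE = 8;                 // bytes per chunk (toy value)
    final Set<String> knownHashes = new HashSet<>(); // index of already stored chunks
    final List<byte[]> storedChunks = new ArrayList<>();

    // SHA-256 of a chunk, rendered as a hex string.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Chunks the stream and writes only unseen chunks; returns how many were written.
    int store(byte[] stream) {
        int written = 0;
        for (int off = 0; off < stream.length; off += CHUNK_SIZE) {
            int len = Math.min(CHUNK_SIZE, stream.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(stream, off, chunk, 0, len);
            if (knownHashes.add(sha256Hex(chunk))) { // true only for unseen hashes
                storedChunks.add(chunk);
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        ChunkDedup dedup = new ChunkDedup();
        // Chunks: "ABCDEFGH" twice and "xxxxxxxx" once -> 2 unique chunks.
        byte[] data = "ABCDEFGHABCDEFGHxxxxxxxx".getBytes(StandardCharsets.UTF_8);
        int first = dedup.store(data);   // stores the 2 unique chunks
        int again = dedup.store(data);   // everything already stored
        System.out.println(first + " " + again); // prints "2 0"
    }
}
```

Storing the same stream twice writes nothing the second time, which is exactly the "only original data" policy described above.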

Figure 1. Flow chart of the code implementation for the deduplication process

B. Clustering

Algorithm 2: Hierarchical k-means. Clustering is the general task of grouping a set of objects; in this paper we apply the hierarchical k-means algorithm. The algorithm groups the data read from MongoDB into clusters: MongoDB supplies the general data, each data object is assigned to a cluster (group), and the step is repeated for every data point found, with the result stored back into MongoDB. In our project, for example, the resume data contains different domains such as Java, JSP, and PHP, and the data objects are clustered and stored by that field (the domain acts as the data point). Hierarchical k-means divides the dataset recursively into clusters; clustering analysis is an important technique [7]. The k-means algorithm is first run with k set to two to split the dataset into two subsets; each of the two subsets is then split again into two subsets with k set to two. The recursion terminates when the dataset has been split into single data points or a stop criterion is reached [1].

C. Mapping

In computing and warehouse management, storage mapping is the process of creating data-element or value mappings between two data models. Data mapping is used as a first step for a broad range of data integration tasks, including data transformation or data mediation between a data source and a destination [2]. Typical uses include:
1) Identification of data relationships as part of data lineage analysis.
2) Discovery of hidden sensitive data, such as the last four digits of a social security number hidden in another user ID, as part of a data masking or de-identification project.
3) Consolidation of multiple databases into a single database, identifying redundant columns of data for consolidation or elimination.

Figure 3. Mapping chart of our system
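The recursive two-means split of Algorithm 2 (Section II-B) can be sketched as follows. Real resume records would be feature vectors; the 1-D numeric values, the fixed iteration budget, and the minimum-cluster-size stop criterion are simplifying assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HierarchicalKMeans {

    // One Lloyd's-style 2-means pass; returns the two resulting groups.
    static List<List<Double>> twoMeans(List<Double> data) {
        double c0 = data.get(0), c1 = data.get(data.size() - 1); // initial centroids
        List<Double> g0 = null, g1 = null;
        for (int iter = 0; iter < 20; iter++) {                  // fixed iteration budget
            g0 = new ArrayList<>(); g1 = new ArrayList<>();
            for (double x : data) {
                if (Math.abs(x - c0) <= Math.abs(x - c1)) g0.add(x); else g1.add(x);
            }
            c0 = mean(g0, c0); c1 = mean(g1, c1);                // update centroids
        }
        return Arrays.asList(g0, g1);
    }

    static double mean(List<Double> xs, double fallback) {
        if (xs.isEmpty()) return fallback;
        double s = 0; for (double x : xs) s += x;
        return s / xs.size();
    }

    // Recursively split with k = 2 until clusters reach the stop criterion.
    static void split(List<Double> data, int minSize, List<List<Double>> out) {
        if (data.size() <= minSize) { out.add(data); return; }
        List<List<Double>> halves = twoMeans(data);
        if (halves.get(0).isEmpty() || halves.get(1).isEmpty()) { out.add(data); return; }
        split(halves.get(0), minSize, out);
        split(halves.get(1), minSize, out);
    }

    public static void main(String[] args) {
        // Three natural groups around 1, 10, and 50.
        List<Double> data = Arrays.asList(1.0, 1.1, 0.9, 10.0, 10.2, 9.8, 50.0, 49.5);
        List<List<Double>> clusters = new ArrayList<>();
        split(data, 3, clusters);
        System.out.println(clusters.size() + " clusters: " + clusters);
    }
}
```

Each recursion level only partitions the current subset once, which is the property behind the linear-time argument quoted below Figure 2.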
Figure 2. Hierarchical k-means clustering

Hierarchical k-means has an O(n) running time. Such a running time is possible because both the k-means algorithm and all operations concerning trees can be done in O(n); traversing a tree is always done via depth-first or breadth-first search.

Code implementation for the clustering process (storing a grouped file through the MongoDB Java driver's GridFS API):

GridFS gfsPhoto = new GridFS(db, i);                      // bucket for the grouped files
GridFSInputFile gfsFile = gfsPhoto.createFile(imageFile); // wrap the uploaded file
gfsFile.setFilename(dbFileName);
gfsFile.put("_id", data);
gfsFile.put("title", n);
gfsFile.setContentType("application/msword");
gfsFile.save();

D. Equations

Dijkstra's original variant finds the shortest path between two given nodes, but a more common variant fixes a single node as the "source" node and finds the shortest paths from the source to all other nodes in the graph, producing a shortest-path tree.

Algorithm 3: This algorithm performs the mapping between two data objects, so we apply Dijkstra's algorithm to retrieve data and find records with fast access, which is useful for a large-scale database. For example, on the resume data we apply mapping points (node points) such as current location, role, key skill, and domain; those points are used to query the database, the outcome is shown in table format, and this is very useful for fast retrieval of accurate data.

Math algorithm functions (map reduction):
Input: a reduced map G(V, E, f, R), where R is the output set.
Cluster the data points into subsets P:
  P = GetCluster(V, R, E)     // P = V/R, E = {A1, A2, ..., As}, where each Ai = [ai] is a subset
  GetReducedVertices(P)       // shortest map path
  GetEdges(P)                 // get all domain values
  Create the reduced map (graph) Gr = (Vr, Er, fr, Rr)
  for each Ai in P with |Ai| > 1 do

    GetResult()               // resume data

Figure 4. Flow chart of the hierarchical k-means clustering code

Figure 5. Data details in the MongoDB server

In Figure 5, on the MongoDB server, we used the db.collection.find() method to retrieve and display our data from a collection.

Figure 6. GridFS file details

E. MongoDB (NoSQL Database)

MongoDB has quickly grown into a popular database for web and mobile applications; as a NoSQL store it offers fast data access [8]. It is entirely appropriate for Node.JS applications, letting us write JavaScript across the client, the back end, and the database layer. MongoDB is used in many critical projects and products; in addition, the built-in support for location queries is a bonus that is hard to dismiss, making data faster to insert and retrieve. Its schemaless design is a good match for our constantly evolving data structures in web and mobile applications, even though it is relatively new to the database market. Data grows very fast, and users upload content generated from varied sources such as videos, texts, speeches, log files, and images. Our data is location based, and MongoDB has special built-in functions, so finding related data for particular locations is fast and accurate. Figure 6 shows how we save CV files, storing and retrieving files that exceed the BSON document size limit, i.e. large files.

III. ARCHITECTURE

Figure 7. Architecture system diagram

Figure 7 represents four processes:
Process 1: The user (mobile user), the real-world object, works to access the database or storage server.
Process 2: The data collected from the user, i.e. the general information, is stored in the database (MongoDB).
Process 3: Before the object data is stored, the deduplication process runs; it applies the new policy and technique to MongoDB.
Process 4: After the deduplication process completes, the original data is ready to be stored (now using mapping as well as clustering).
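The single-source shortest-path computation that Algorithm 3 (Section II-D) relies on can be sketched as below. The adjacency matrix, the sample graph, and the use of node indices to stand in for mapping points (location, skill, domain, and so on) are illustrative assumptions, not the paper's data.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class DijkstraSketch {

    // Single-source shortest-path distances over a weight matrix (-1 = no edge).
    static int[] shortestPaths(int[][] w, int source) {
        int n = w.length;
        int[] dist = new int[n];
        Arrays.fill(dist, Integer.MAX_VALUE);
        dist[source] = 0;
        // Priority queue of {node, tentative distance}, ordered by distance.
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> a[1] - b[1]);
        pq.add(new int[]{source, 0});
        while (!pq.isEmpty()) {
            int[] cur = pq.poll();
            int u = cur[0];
            if (cur[1] > dist[u]) continue;          // stale queue entry
            for (int v = 0; v < n; v++) {
                if (w[u][v] < 0) continue;           // no edge between u and v
                int alt = dist[u] + w[u][v];
                if (alt < dist[v]) {                 // found a shorter route to v
                    dist[v] = alt;
                    pq.add(new int[]{v, alt});
                }
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        // Node 0 is the source record; nodes 1..3 stand in for mapping points.
        int[][] w = {
            {-1,  4,  1, -1},
            { 4, -1,  2,  5},
            { 1,  2, -1,  8},
            {-1,  5,  8, -1}
        };
        System.out.println(Arrays.toString(shortestPaths(w, 0))); // prints "[0, 3, 1, 8]"
    }
}
```

The returned array is the shortest-path tree's distance vector; in the retrieval scenario it ranks how closely each mapping point relates to the source record.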
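The four architecture processes can be tied together in a toy end-to-end sketch: a user submits a CV, the deduplication step hashes it and stores the content only if it is new, and retrieval is keyed by the user's phone number. The in-memory maps stand in for MongoDB, and the sample phone numbers and CV text are hypothetical; this illustrates the flow, not the paper's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class UploadPipeline {
    // content hash -> stored CV (one copy per unique content; Process 4)
    final Map<String, String> cvByHash = new HashMap<>();
    // phone number -> content hash of that user's CV (Process 2)
    final Map<String, String> hashByPhone = new HashMap<>();

    static String sha256Hex(String s) {
        try {
            StringBuilder sb = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8)))
                sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) { throw new IllegalStateException(e); }
    }

    // Returns true if the CV content was new and therefore actually stored.
    boolean upload(String phone, String cv) {
        String h = sha256Hex(cv);                  // Process 3: dedup check by hash
        boolean isNew = !cvByHash.containsKey(h);
        cvByHash.putIfAbsent(h, cv);               // store original content only once
        hashByPhone.put(phone, h);                 // link the user to the content
        return isNew;
    }

    // Retrieval: look up a CV by the user's phone number (Process 1).
    String retrieve(String phone) {
        String h = hashByPhone.get(phone);
        return h == null ? null : cvByHash.get(h);
    }

    public static void main(String[] args) {
        UploadPipeline p = new UploadPipeline();
        System.out.println(p.upload("555-0100", "Java developer CV")); // prints "true"
        System.out.println(p.upload("555-0199", "Java developer CV")); // prints "false" (duplicate)
        System.out.println(p.retrieve("555-0199"));                    // prints "Java developer CV"
    }
}
```

Note that the second user still resolves to the single stored copy, matching the policy of Process 4 that only original data reaches storage.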

IV. OUTPUT AND EXPERIMENTAL SETUP

We implemented the robust and efficient data management with a deduplication process, without SQL queries, as a mobile application system using the Java language with JSP. The mobile application interface for our system was built with an open-source JavaScript library, and we used MongoDB to save our data details. All experiments were run on the Android system for the mobile application, and on a machine with an Intel Core i3 CPU @ 2.9 GHz and 4096 MB of main memory, running Windows 8.1 Pro.

Figure 9 represents the clustering, i.e. it explains the clustering of the resume data (resume dataset). We consider the resume database as a dataset with several subsets (cluster levels and cluster points; a cluster point is a grouped dataset). The graph output shows how many profiles exist per domain (for example, the MongoDB domain has 20 profiles); here MongoDB is one of the subsets (a cluster, or group of data).

Figure 10. Implementation system diagram

Figure 8. Our interface system

Figure 8 shows how users run our application. First, a user signs in with a user ID and password, after which the user can upload his or her CV details. Second, the data details are automatically saved to the MongoDB server while the deduplication algorithm runs to reduce costs and remove duplicate values. Third, the data from MongoDB is converted into groups (group objects) by clustering. Fourth, the mapping process creates value mappings between the two data models. Whenever users want to retrieve or find their documents, they enter their phone number and directly get their own CV files, or an administrator user can do so for them.

V. SCOPE

The scope of our project is a database with very efficient data access and output, together with the high security obtained through the deduplication technique.
This process reduces duplicated record entries as well as duplicate storage. We then apply the clustering method to the database: the clustering algorithm divides the database into sub-databases (subset values), and we apply the mapping methodology to map the resume data by various keywords (node values, e.g. domain name, experience, current location, etc.). Moreover, we use MongoDB, a document database with high performance and large storage capacity, which lets us store files of more than 10 MB.

VI. CONCLUSION

Users can access the database efficiently because we apply the deduplication method: only original content is shared or stored in the database (the target table is adjusted so that sets of duplicates already in it are reduced to single records). To find the data points we use a clustering technique; in this paper we applied hierarchical k-means clustering to the resume data, grouping files by domain (the domain name is a subset of the resume dataset). Grouping the data objects in the database, here the resume database, yields the high-performance data access that MongoDB (NoSQL) is known for. Finally, we apply the mapping technique to map the cluster subset values and access the resume data with Dijkstra's algorithm, obtaining fast retrieval of the data.

Figure 9. Clustering result in graph view

ACKNOWLEDGMENT

This research work is supported by the University Innovation Team Construction Plan of Chongqing and the National Science Foundation of China under Grant No. 61301124.

REFERENCES

[1] G. Eason, B. Noble, and I. N. Sneddon, "On certain integrals of Lipschitz-Hankel type involving products of Bessel functions," Phil. Trans. Roy. Soc. London, vol. A247, pp. 529-551, April 1955.
[2] M. W. Storer, K. Greenan, D. D. E. Long, and E. L. Miller, "Secure data deduplication," in Proc. StorageSS '08, Fairfax, Virginia, USA, October 31, 2008. ACM 978-1-60558-299-3/08/10.
[3] R. Cattell, "Scalable SQL and NoSQL data stores," SIGMOD Record.
[4] V. Benzaken, G. Castagna, K. Nguyen, and J. Siméon, "Static and dynamic semantics of NoSQL languages," SIGPLAN Not., vol. 48, no. 1, pp. 101-114, Jan. 2013.
[5] F. Cruz, F. Maia, M. Matos, R. Oliveira, J. Paulo, J. Pereira, and R. Vilaça, "MeT: Workload aware elasticity for NoSQL," in Proc. 8th ACM European Conference on Computer Systems (EuroSys '13), Prague, Czech Republic, 2013, pp. 183-196. New York, NY, USA: ACM. ISBN 978-1-4503-1994-2.
[6] J. Kaiser, D. Meister, A. Brinkmann, and S. Effert, "Design of an exact data deduplication cluster," 978-1-4673-1747-4/12, IEEE, 2012.
[7] J.-L. Hsu and H.-X. Yang, "A modified k-means algorithm for sequence clustering," 978-0-7695-3745-0/09, IEEE, 2009.
[8] D. Jayathilake, C. Sooriaarachchi, T. Gunawardena, B. Kulasuriya, and Thusitha, "A study into the capabilities of NoSQL databases in handling a highly heterogeneous tree," 978-1-4673-1975-1/12, IEEE, 2012.

ISBN: 978-1-61804-318-4