Lecture: Text Analytics. Accessing the Deep Web: A Survey. Marc Bux, Tobias Mühl
Accessing the Deep Web: A Survey (2007), by Bin He, Mitesh Patel, Zhen Zhang, and Kevin Chen-Chuan Chang, Computer Science Department, University of Illinois at Urbana-Champaign
The Deep Web: Web content that is not indexed by search engines. "While the surface Web has linked billions of static HTML pages, it is believed that a far more significant amount of information is 'hidden' in the deep Web, behind the query forms of searchable databases [...]. Such information may not be accessible through static URL links." (Accessing the Deep Web, He, Patel, Zhang, Chang)
The Deep Web: dynamically generated pages (forms, user input); login-protected pages; context-dependent pages; multimedia pages (e.g., Flash).
The 2000 Study: How large is the deep Web? Approx. 43,000 to 96,000 websites; approx. 7,500 TB of data; approx. 500 times larger than the surface Web.
The 2000 Study. Problems: limited to extrapolations of the size of the deep Web; relies on overlap analysis.
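Overlap analysis is commonly described as a capture-recapture style estimate: if two listings of sizes n_a and n_b share a known number of entries, the total population is estimated as n_a * n_b / overlap. The following is a minimal sketch under the assumption that both listings are independent random samples of the same population; the function name and the toy data are illustrative and are not taken from either study.

```python
def overlap_estimate(listing_a, listing_b):
    """Estimate the population size from two listings via their overlap."""
    a, b = set(listing_a), set(listing_b)
    shared = len(a & b)
    if shared == 0:
        raise ValueError("no overlap: the estimate is undefined")
    return len(a) * len(b) / shared


# Toy data (made up): two directories listing sites by numeric ID.
directory_a = range(0, 60)    # 60 sites
directory_b = range(40, 120)  # 80 sites, 20 of which are also in directory_a
estimate = overlap_estimate(directory_a, directory_b)
print(estimate)
```

The estimate is only as good as the independence assumption: if both listings favor the same popular sites, the overlap is inflated and the total is underestimated.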
The 2007 Study. IP sampling method: 2,230,124,544 possible IP addresses; take 1,000,000 of them at random as a representative sample.
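The sampling step can be sketched with Python's standard `ipaddress` module. This is an assumption-laden illustration, not the study's procedure: the slide does not say how the 2,230,124,544 addresses were selected, so the filter below (globally routable, non-multicast IPv4 addresses) will not reproduce that exact count, and `sample_ipv4` is a hypothetical helper name.

```python
import ipaddress
import random


def sample_ipv4(n, seed=None):
    """Draw n distinct IPv4 addresses uniformly at random from the filtered space."""
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        # Rejection sampling: skip non-global ranges (private, loopback,
        # link-local, reserved) and multicast addresses.
        if addr.is_global and not addr.is_multicast:
            chosen.add(addr)
    return sorted(chosen)


# A small sample for illustration; the study drew 1,000,000 addresses.
sample = sample_ipv4(1000, seed=42)
```

Rejection sampling keeps the draw uniform over the accepted addresses without having to enumerate the address space.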
IP Sampling Method. Technique: send HTTP requests to the 1,000,000 IPs (GNU tool: wget); download and analyze the Web pages; identify deep Web sites.
IP Sampling Method. Identifying deep Web sites: a deep Web site is a Web server that provides information maintained in one or more back-end Web databases; the databases are accessed through a form.
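Since access to the databases goes through a form, a first automated filter could look for HTML forms that accept a search term or a selection. The sketch below is a rough heuristic written for illustration only; it is not the detection procedure of the study, and the class name, function name, and the rule "no password field" are assumptions.

```python
from html.parser import HTMLParser


class FormScanner(HTMLParser):
    """Collect, for every <form>, the types of the controls it contains."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one list of control types per form
        self._current = None  # controls of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
        elif self._current is not None:
            if tag == "input":
                # An <input> without a type attribute defaults to a text field.
                self._current.append((attrs.get("type") or "text").lower())
            elif tag in ("select", "textarea"):
                self._current.append(tag)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def looks_like_query_interface(html):
    """Heuristic: some form takes text or a selection and asks for no password."""
    scanner = FormScanner()
    scanner.feed(html)
    scanner.close()
    return any(
        "password" not in controls
        and any(c in ("text", "search", "select", "textarea") for c in controls)
        for controls in scanner.forms
    )


search_page = '<form action="/search"><input type="text" name="title"><input type="submit"></form>'
login_page = '<form><input type="text" name="user"><input type="password" name="pw"></form>'
print(looks_like_query_interface(search_page), looks_like_query_interface(login_page))
```

A filter like this only narrows down candidates; deciding whether a form really fronts a back-end database still needs inspection of the results it returns.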
IP Sampling Method. Problems: virtual hosting; not all kinds of deep Web sites are taken into account.
Entrance to the Deep Web: The entrance is a query interface; forms for login, polling, registration, message posting, and site search are not counted as query interfaces. Depth is the number of operations needed to get from the root page to the query interface.
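Read as "minimum number of link hops from the root page", depth can be computed with a breadth-first search over a site's link graph. The sketch below assumes that interpretation; the link graph is a plain dictionary, and `interface_depth`, `has_query_interface`, and the toy site are hypothetical names and data.

```python
from collections import deque


def interface_depth(links, root, has_query_interface, max_depth=None):
    """Return the minimum number of hops from root to a page with a query
    interface, or None if no such page is found within max_depth."""
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        page, depth = queue.popleft()
        if has_query_interface(page):
            return depth
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links beyond the crawl limit
        for nxt in links.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None


# Toy site (made up): the search form sits two hops below the root page.
site = {"/": ["/about", "/products"], "/products": ["/products/search"]}
query_pages = {"/products/search"}
depth = interface_depth(site, "/", lambda page: page in query_pages)
print(depth)
```

Breadth-first order guarantees that the first query interface reached is also the shallowest one.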
Entrance to the Deep Web. Methods: 100,000 of the 1,000,000 IP samples were deep-crawled to depth 10. Findings: 94% of the Web databases appeared within depth 3; query interfaces are located shallowly.
Scale of the Deep Web. Methods: all 1,000,000 IP samples were crawled to depth 3; depth 3 is sufficient since the deep Web is located shallowly. Findings: 2,256 Web servers found in total; 126 deep Web sites with 190 Web databases and 406 query interfaces found.
Scale of the Deep Web. Extrapolation: 190 × (2,230,124,544 / 1,000,000) / 0.94 ≈ 450,000 databases. In a similar way, 307,000 deep Web sites and 1,258,000 query interfaces have been estimated.
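The extrapolation scales the count found in the sample up to the full address space and then corrects for the databases that lie deeper than depth 3. A small sketch of that arithmetic follows; the constant and function names are illustrative. Only the database estimate is reproduced here, because the slides do not state the depth-3 coverage used for sites and query interfaces.

```python
TOTAL_IPS = 2_230_124_544  # size of the IP address space considered
SAMPLE_SIZE = 1_000_000    # randomly sampled IPs
DATABASES_FOUND = 190      # Web databases found in the sample within depth 3
COVERAGE_DEPTH_3 = 0.94    # share of Web databases that appear within depth 3


def extrapolate(found, coverage):
    """Scale a sample count to the whole address space, corrected for crawl depth."""
    return found * (TOTAL_IPS / SAMPLE_SIZE) / coverage


estimated_databases = extrapolate(DATABASES_FOUND, COVERAGE_DEPTH_3)
print(f"{estimated_databases:,.0f}")  # roughly 450,000
```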
Structure of the Deep Web. Structured data: relationally represented in the form of attribute-value pairs (e.g., books on Amazon.com). Unstructured data: no specific order (e.g., CNN's recent news). The surface Web is mostly unstructured (HTML text).
Structure of the Deep Web. Methods: manual querying and inspection of the 190 databases found. Findings: 43 unstructured and 147 structured databases. Extrapolation: data in the deep Web is mostly structured (3.4:1 ratio).
Subject Diversity of the Deep Web. The surface Web consists of >80% commerce sites. Methods: manual categorization of the 190 databases found. Taxonomy: the 14 top-level categories of Yahoo.com. Findings: large diversity of subjects; even distribution between commercial and non-commercial Web databases.
[Figure: Distribution of databases over subject category. Bar chart, share of databases in percent (0% to 30%). Categories: Computers & Internet, Entertainment, Health, Business & Economy, News & Media, Recreation & Sports, Regional, Education, Science, Government, Society & Culture, Arts & Humanities, Reference, Others.]
Search Engines: How well do Google and others index the deep Web? 20 deep Web sites; searched with Google, Yahoo, and MSN.
Searching the Deep Web: Deep Web Directories. Online portal services supporting access to deep Web databases; they sort Web databases into different categories and enable online search in their categorized databases.
Searching the Deep Web: Deep Web Directories. Examples and their number of categorized databases: www.completeplanet.com (70,000+), www.lii.org (14,000), www.turbo10.com (2,300), www.invisible-web.net (1,000).
Searching the Deep Web: Deep Web Directories. Overall coverage is poor (<20%), considering that there are an estimated 450,000 Web databases. The deep Web grows too fast to allow manual categorization.
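The coverage figure follows from dividing each directory's database count by the estimated 450,000 Web databases. A short sketch of that arithmetic, using the directory counts from the slides (the dictionary layout and variable names are illustrative):

```python
ESTIMATED_DATABASES = 450_000

directories = {
    "completeplanet.com": 70_000,  # listed as "70,000+", so a lower bound
    "lii.org": 14_000,
    "turbo10.com": 2_300,
    "invisible-web.net": 1_000,
}

# Share of the estimated deep Web databases that each directory covers.
coverage = {name: count / ESTIMATED_DATABASES for name, count in directories.items()}
for name, share in coverage.items():
    print(f"{name}: {share:.1%}")
```

Even the largest directory stays below the 20% mark, and the counts cannot simply be added up because the directories may list the same databases.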
Searching the Deep Web: Future Search Engines. Traditional search engines fail in the deep Web: crawling (automated search and extraction) is limited; databases are updated too frequently to be indexed properly; search engines cannot exploit the databases' structure.
Searching the Deep Web: Future Search Engines. Better idea: a two-tiered search engine. Discovery: automated search for Web databases that suit the query, realized by crawling and indexing the databases' query interfaces; no information about the databases' internal data is used.
Searching the Deep Web: Future Search Engines. Forwarding: database-specific search in the discovered databases, using each database's query interface and internal structure.
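The two tiers can be illustrated with a toy sketch: discovery matches the query against an index built only from the databases' query interfaces, and forwarding hands the query to each selected database through its own interface. Everything below is hypothetical: the matching rule (attribute-name overlap), the function names, the database names, and the connector callables are assumptions made for illustration, not a design described on the slides.

```python
def discover(query_terms, interface_index):
    """Tier 1: rank databases by how many query terms match the attribute
    names of their query interface. No database content is consulted."""
    scores = {}
    for db, attributes in interface_index.items():
        hits = len(set(query_terms) & set(attributes))
        if hits:
            scores[db] = hits
    # Most matching attributes first; ties broken by name for stable output.
    return sorted(scores, key=lambda db: (-scores[db], db))


def forward(query, databases, connectors):
    """Tier 2: send the query to each discovered database via its own interface."""
    return {db: connectors[db](query) for db in databases}


# Hypothetical index of crawled query interfaces (attribute names per database).
interface_index = {
    "books-db": ["title", "author", "isbn"],
    "flights-db": ["from", "to", "date"],
    "library-db": ["title", "author"],
}
# Hypothetical connectors standing in for real form submissions.
connectors = {
    name: (lambda q, name=name: f"{name} results for {sorted(q.items())}")
    for name in interface_index
}

query = {"author": "Tolkien", "title": "The Hobbit"}
candidates = discover(query.keys(), interface_index)
results = forward(query, candidates, connectors)
```

Keeping discovery independent of database content means the index only has to be refreshed when a query interface changes, not every time the data behind it is updated.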
References: Bin He, Mitesh Patel, Zhen Zhang, Kevin Chen-Chuan Chang: Accessing the Deep Web: A Survey, 2007. Michael K. Bergman: The Deep Web: Surfacing Hidden Value, 2001.