HTML-aware tools for Web data extraction
Thesis presentation
Student: Xavier Azagra
Supervisor: Andreas Thor
Table of contents
- Introduction
- Data Extraction Process
- Data Extraction Tools
- Realized tests
- Future Work
Introduction
We are going to center our effort on HTML data extraction.
- HTML is the predominant markup language for Web pages
- A kind of semi-structured data
- Information following a nested structure
- Supported by the W3C (World Wide Web Consortium)
Introduction: Internet growth
- 168 million sites
- 1,400 million Internet users
Sources: Wikipedia, the free encyclopedia; May 2008 Web Server Survey, www.netcraft.com
Introduction: Purposes of Web data extraction
- Users: let the user access particular data from the Web
  - Economic issues (e.g. stock market, shopping comparison)
- Applications: get information from the Web to be used in other areas or by applications
  - Information retrieval (e.g. feeds, Web search engines)
[Diagram labels: Users, Applications, Query, Integration, Extraction, Web data source]
Data extraction process: Main problems
The Internet was designed as a source of data for human use. Problems appear when we want to extract data from HTML.
- Data not presented in HTML format:
  - Password-protected sites
  - Cookies
  - Session IDs
  - JavaScript
  - Dynamic content
- Deep resources:
  - Unlinked content
  - Contextual Web
  - Limited-access content
Data extraction process: Types of content
- Free text
  - Natural language texts
  - Patterns involving syntactic relations between words or semantic classes of words
- Structured text
  - Textual information following a predefined strict format
  - Use of the format description
- Semi-structured text
  - Between unstructured collections of textual documents and fully structured tuples of typed data
  - Extraction patterns are often based on tokens and delimiters
Data extraction process: Ways to perform data extraction
- Manual
  - Precise
  - Treats elements individually
- API
  - Specific Web sites
  - Limited specifications
  - Set of methods
- Wrapper
  - Independent of source
  - Manual: ad hoc code, … trivial, error-prone
  - Semiautomatic: support tool, GUI support, less error-prone
  - Automatic: machine-learning techniques, supervised learning
Data extraction process: HTML structure for data extraction
Before performing the extraction, HTML-aware tools turn the document into a parsing tree.
- Each node represents a tag
- The outermost tags of the hierarchy (those with no child tags) are the leaves
- Expressions are used to navigate through the whole hierarchy
- Maximum precision is found in the content of a leaf
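The parsing-tree idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not how any of the tools discussed here is implemented: the `Node`, `TreeBuilder` and `select` names, the path syntax and the sample page are all hypothetical, and the builder assumes well-formed HTML without void elements such as `<br>`.

```python
from html.parser import HTMLParser


class Node:
    """One tag of the parsing tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tag tree; assumes every opened tag is closed in order."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def select(node, path):
    """Follow a path such as 'html/body/table/tr[1]/td[0]' (0-based indexes)."""
    for step in path.split("/"):
        name, _, index = step.partition("[")
        position = int(index.rstrip("]")) if index else 0
        matches = [child for child in node.children if child.tag == name]
        node = matches[position]
    return node


# Hypothetical sample page: a header row and one data row.
PAGE = (
    "<html><body><table>"
    "<tr><td>Title</td><td>Price</td></tr>"
    "<tr><td>Web Data</td><td>25.00</td></tr>"
    "</table></body></html>"
)

builder = TreeBuilder()
builder.feed(PAGE)
leaf = select(builder.root, "html/body/table/tr[1]/td[1]")
print(leaf.text)  # prints 25.00
```

The path expression stops at a leaf (`td`), which is the finest unit the tree itself can address; anything more precise has to be done on the leaf's text.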
Data extraction process: HTML problems to extract data (I)
- Presentation of the data without following a structure
  - Logical, simple and organized content helps to achieve correct extractions
  - Unorganized content affects the HTML tree structure
- Badly constructed HTML source documents
  - Badly placed tags
  - Repeated tags, closed tags
- Nested data elements
  - Elements that nest data may differ from one element to the next
Data extraction process: HTML problems to extract data (II)
- Problems choosing the correct Web page as source example
  - The content structure can change depending on some factors
  - Example: result pages of Web search engines
- Problems using scripts or dynamic content
  - Hidden or changing information
  - Syntax different from HTML
  - JavaScript, PHP, AJAX or Flash
Data extraction tools: Taxonomy (I)
Data extraction tools: Taxonomy (II)
- Languages for wrapper development
  - Assist wrapper construction
  - Alternatives to general-purpose languages
- Ontology-based
  - Extraction relying directly on the data
- NLP-based
  - Based on syntactic and semantic constraints
- Wrapper induction
  - Rules derived from a given set of training examples
- Modeling-based
  - Try to locate in Web pages portions of data that implicitly conform to a structure
- HTML-aware
  - Rely on inherent structural features of HTML documents
Data extraction tools: Flow of data
- INPUT: URL (http://), data file
- Data extraction process: wrapper
- OUTPUT: XML, HTML, RSS/Atom, text, modules, CSV, email, JSON, XSL, Google Maps, Flash
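The output side of this flow can be illustrated with a small Python sketch that serializes the same extracted records to three of the listed formats. The record fields and the function names are assumptions made for the example; none of the tools is claimed to work this way.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def to_json(records):
    """Serialize the extracted records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize the records as CSV, using the first record's keys as header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_xml(records):
    """Serialize the records as a flat XML document."""
    root = ET.Element("records")
    for record in records:
        item = ET.SubElement(root, "record")
        for field, value in record.items():
            ET.SubElement(item, field).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical output of a wrapper run.
records = [{"title": "Web Data", "price": "25.00"}]

print(to_json(records))
print(to_csv(records))
print(to_xml(records))
```

Keeping the extracted data in a neutral structure (here a list of dictionaries) and converting at the end is one simple way to support several output formats from a single wrapper.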
Data extraction tools: Structure
- 10 HTML-aware tools
- Categorization of these tools using several criteria
- Test-bench scenarios
Data extraction tools: Used HTML-aware tools
- Dapper
- Robomaker
- RoadRunner
- XWRAP
- Lixto
- WebHarvest
- GoldSeeker
- WinTask
- Automation Anywhere
- Web Content Extractor
The selection covers:
- Commercial and non-commercial tools
- Shell and GUI support tools
- Screen-scraping and non-screen-scraping tools
- Linux and Windows tools
Data extraction tools: GUI
- Shell
  - Shell commands
  - Configuration files and coding
  - Input files
  - Tool: RoadRunner
- Integrated browser
  - Direct interaction between the tool and the navigation browser
  - Visualizes information about the Web elements
  - Tools: Lixto, Robomaker, Web Content Extractor
- Web browser
  - Loads JavaScript and dynamic content
  - Separation between the tool and the browser window
  - Tools: Automation Anywhere, WinTask
Data extraction tools: Resilience
- Capacity of a tool to continue working properly when changes occur in the pages it targets
- Common changes affect:
  - the data
  - the structure (elements added, erased or modified)
  - the visual design
  - the technologies used (new ones introduced: AJAX, PHP, JavaScript)
- The degree of resilience varies depending on the tool used
Data extraction tools: Adaptiveness
- Degree to which a wrapper built for pages of a specific Web source in a given application domain works properly with pages from another source in the same application domain
- Of all the categories in the taxonomy of Web data extraction tools, only the ontology-based tools fully feature the resilience and adaptiveness properties
Data extraction tools: Scripting and expressions
- The atomicity of the HTML parsing tree is found in a leaf (outer tag)
- There is a need to extract information in a more precise way
- Techniques:
  - Self-scripting syntax
  - Regular expressions
  - Patterns
  - Others: remove special characters, date formatting, text replacing
- Tools named for these techniques: WinTask, Web Content Extractor, GoldSeeker, Lixto, Robomaker
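As a rough illustration of this kind of post-processing on a leaf's text, the following Python sketch uses regular expressions and date parsing. The function names, the character whitelist and the date formats are assumptions chosen for the example; they do not reproduce the syntax of any of the tools.

```python
import re
from datetime import datetime


def remove_special_characters(text):
    """Drop everything except word characters, whitespace, '.', ',' and '-'."""
    return re.sub(r"[^\w\s.,-]", "", text)


def reformat_date(text, source_format="%d/%m/%Y", target_format="%Y-%m-%d"):
    """Parse a date in one format and write it in another."""
    return datetime.strptime(text, source_format).strftime(target_format)


def extract_number(text):
    """Return the first decimal number found in the text, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


print(remove_special_characters("Price: $25.00!"))  # prints Price 25.00
print(reformat_date("21/05/2008"))                  # prints 2008-05-21
print(extract_number("New price: $25.00"))          # prints 25.0
print("25,00".replace(",", "."))                    # plain text replacing, prints 25.00
```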
Data extraction tools: Input variables
- In some cases we need input variables to perform searches on the Internet:
  - eBay
  - Web search engines
  - YouTube
  - Amazon
- To extract data from the resulting pages, we need tool support
- Tools: Robomaker, Dapper, Lixto, WinTask
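A minimal sketch of what an input variable amounts to on the HTTP side: the search terms become query parameters of the result-page URL. The base URL and parameter names below are placeholders, not the real parameters of any of the sites mentioned.

```python
from urllib.parse import urlencode


def build_search_url(base_url, **variables):
    """Append the input variables to the URL as an encoded query string."""
    return f"{base_url}?{urlencode(variables)}"


# Hypothetical search endpoint and parameter names.
url = build_search_url("https://www.example.com/search", q="data extraction", page=2)
print(url)  # prints https://www.example.com/search?q=data+extraction&page=2
```

A tool with input-variable support lets the user change the value of `q` between runs and then applies the same extraction rules to whatever result page comes back.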
Data extraction tools: Input/Output formats
- Dapper: input HTML; output XML, RSS, HTML, modules, Atom feed, CSV, JSON, XSL, YAML, email
- Robomaker: input HTML; output RSS/Atom feed, REST Web service, Web clip
- RoadRunner: input HTML; output XML, HTML
- XWRAP: input HTML; output XML
- Lixto: input HTML; output XML
- WebHarvest: input HTML; output XML
- GoldSeeker: input HTML and documents; output text
- WinTask: input HTML and documents; output file, Excel, DB
- Automation Anywhere: input HTML and documents; output file, Excel, DB, EXE
- Web Content Extractor: input HTML; output file, Excel, DB, SQL script file, MySQL script file, HTML, XML, HTTP submit
Data extraction tools: General features (I)
Tool | Interface | Complexity | Resilience | Execution time | Free
Dapper | Internet browser | Low | Good | Very good |
Robomaker | Program GUI, Internet browser | Medium | Very good | Very good |
RoadRunner | Linux shell | Medium | Poor | Good | Yes, GNU GPL license
XWRAP | Internet browser | Medium | Good | Good |
Lixto | Program GUI, Internet browser | Medium | Good | Very good | Requires license
Data extraction tools: General features (II)
Tool | Interface | Complexity | Resilience | Execution time | Free
WebHarvest | Program GUI | High | Good | Good |
GoldSeeker | Internet browser | Medium | Good | Poor | GNU LGPL license
WinTask | Program GUI, Internet browser | Medium | Poor | Good |
Automation Anywhere | Program GUI, Internet browser | Low | Poor | Good |
Web Content Extractor | Program GUI, Internet browser | Low | Poor | Poor |
Data extraction tools: Advanced characteristics
[Table. Columns: Input variables; Scripts usage; …n static content pages; More than one page; JavaScript or dynamic content. Surviving text cells per row: Dapper: Good; Robomaker: Good; RoadRunner: Poor; XWRAP: Poor; Lixto: Good; WebHarvest: Poor; GoldSeeker: Poor; WinTask: By script, Good; Automation Anywhere: Good; Web Content Extractor: Good]
Realized tests: Methodology
[Diagram: Created/selected Web page -> Selected data -> Correct result; Selected tool -> Tool result; Compare (correct result, tool result) -> Test result]
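The comparison step of this methodology could look roughly like the following Python sketch. The field names, the three status labels and the whitespace-insensitive comparison are assumptions for illustration; the presentation does not specify how results were scored.

```python
def compare(correct_result, tool_result):
    """Compare the expected fields with what the tool extracted, field by field."""
    report = {}
    for field, correct_value in correct_result.items():
        tool_value = tool_result.get(field)
        if tool_value is None:
            report[field] = "missing"
        elif tool_value.strip() == correct_value.strip():
            report[field] = "correct"
        else:
            report[field] = "wrong"
    return report


# Hypothetical expected data and tool output for one test page.
correct_result = {"title": "Web Data", "price": "25.00", "format": "Paperback"}
tool_result = {"title": "Web Data ", "price": "52.00"}

print(compare(correct_result, tool_result))
# prints {'title': 'correct', 'price': 'wrong', 'format': 'missing'}
```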
Realized tests: Web search engines (I)
- One of the most used resources of the Web
- Use of input variables and dynamic result pages
- Yahoo! Search uses a live search input form
Realized tests: Web search engines (II)
[Results table. Columns: Google Search, Yahoo! Search, MS Live Search. Rows: Dapper, Robomaker, Lixto, WinTask, Automation Anywhere, Web Content Extractor]
Realized tests: eBay
- The most important auction site on the Internet
- Use of input variables and dynamic result pages
- Fields containing variable content
[Results table for eBay search. Rows: Dapper, Robomaker, Lixto, WinTask, Automation Anywhere, Web Content Extractor]
Realized tests: Dynamic content Web pages
- Pageflakes: AJAX-based start page
- Use of dynamic content and personalized user modules
[Results table. Rows: Dapper, Robomaker, Lixto, WinTask, Automation Anywhere, Web Content Extractor]
Realized tests: Resilience tests (I)
1- Obtain a result page of Amazon.com
2- Download the source page and related files
3- Upload it to a test server
4- Configure the tools to extract 4 fields: title, book format, new price and valuation
For each test:
- Make a modification to the source page
- Upload it to the test server
- Execute the tool and see if problems appear
Realized tests: Resilience tests (II)
- Deleting content
- Modifying CSS style tags
- Duplicating extracted data
- Changing the order of extracted data
Example: deleting content, erase td[0]
[Results table. Rows: Dapper, Robomaker, Lixto, Web Content Extractor]
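Why erasing `td[0]` is a useful resilience probe can be shown with a short Python sketch that contrasts a position-based rule with an attribute-based one. The markup, the `class` names and the helper functions are invented for the example and say nothing about how the tested tools locate data internally.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects [class attribute, text] for every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append([dict(attrs).get("class"), ""])

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1][1] += data


def cells_of(html):
    collector = CellCollector()
    collector.feed(html)
    return collector.cells


def by_position(html, index):
    """Position-based rule: take the text of the index-th cell."""
    return cells_of(html)[index][1]


def by_class(html, name):
    """Attribute-based rule: take the first cell with the given class."""
    for css_class, text in cells_of(html):
        if css_class == name:
            return text
    return None


# Hypothetical source row and the same row after erasing td[0].
ORIGINAL = '<tr><td class="title">Web Data</td><td class="price">25.00</td></tr>'
MODIFIED = '<tr><td class="price">25.00</td></tr>'

print(by_position(ORIGINAL, 0))     # prints Web Data
print(by_position(MODIFIED, 0))     # prints 25.00 (the "title" rule now returns the price)
print(by_class(MODIFIED, "price"))  # prints 25.00
print(by_class(MODIFIED, "title"))  # prints None
```

The attribute-based rule survives this particular change, but it would in turn be the fragile one under the "modifying CSS style tags" test, which is why several kinds of modification are needed to judge resilience.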
Realized tests: Precision tests (I)
A Web page of published books was designed. We extract data from the "Last published edition" column with a different precision each time:
- All the information of the row
- Date of the last publication
- Year of the last publication
- Last 2 digits of the year of the last publication
Realized tests: Precision tests (II)
Three different modifications were made to the source page, each with different characteristics, in order to:
- Extract data from formatted text
- Extract data using styled text (class attribute)
- Extract data from CSV-formatted text
Realized tests: Precision tests (III)
Example: extracting data from a CSV source
[Results table. Columns: All the information of the last published edition; Date of the last publication; Year of the last publication; Last 2 digits of the year of the last publication. Rows: Dapper, Robomaker, Lixto, WinTask, Automation Anywhere, Web Content Extractor]
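The four precision levels can be made concrete with a small Python sketch. The cell content, its comma-separated layout and the DD/MM/YYYY date format are assumptions; the actual test page is not reproduced in these slides.

```python
import csv
import io
import re

# Hypothetical content of one "Last published edition" cell in the CSV variant.
CELL = "Third edition,Example Press,21/05/2008"


def split_fields(cell):
    """Split the CSV-formatted cell into its fields."""
    return next(csv.reader(io.StringIO(cell)))


def precision_levels(cell):
    """Return the cell at the four levels of precision used in the tests."""
    fields = split_fields(cell)
    # Assumes exactly one field looks like DD/MM/YYYY.
    date = next(f for f in fields if re.fullmatch(r"\d{2}/\d{2}/\d{4}", f))
    year = date[-4:]
    return {
        "all": cell,
        "date": date,
        "year": year,
        "year_2_digits": year[-2:],
    }


print(precision_levels(CELL))
```

Each level needs something finer than tree navigation: the whole cell is a leaf, while the date, the year and its last two digits all require delimiter or pattern handling inside the leaf's text.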
Future work
- Given a Web source, determine which features a tool must provide; useful for finding the most suitable tool
- Testing with tools that have no visual GUI
- Produce a detailed document that contains all the work carried out
Thanks for your attention!