Curriculum Vitae. Valter Crescenzi. February 2012

Contact Info

Valter Crescenzi
Via della Vasca Navale, 79
I-00146 Rome, Italy
Tel. +39 06 5733 3535
e-mail: crescenz@dia.uniroma3.it

Current Position: Assistant Professor at Università degli Studi Roma Tre

Research Activities

Research Positions

- Assistant Professor at Facoltà di Ingegneria of Università degli Studi Roma Tre (since 2005).
- Junior Researcher at Dipartimento di Informatica ed Automazione of Università degli Studi Roma Tre (2003-2004).
- Research Fellow (project "Gestione dei dati per i processi decisionali: acquisizione, integrazione e presentazione") at Dipartimento di Informatica ed Automazione of Università degli Studi Roma Tre, under the supervision of Prof. Paolo Atzeni (2002-2003).

Education

- PhD in Computer Engineering, received in February 2002 from Dipartimento di Sistemistica of Università degli Studi di Roma La Sapienza. Dissertation: On Automatic Data Extraction from Large Websites [40]. Supervisors: Prof. Paolo Atzeni and Prof. Giansalvatore Mecca. The main results have been published in an international journal [2] and presented at international conferences [8].
- Computer Engineering degree (Laurea in Ingegneria Informatica) received in 1998 from Università degli Studi Roma Tre, with a thesis titled "Un riconoscitore di grammatiche formali con gestione delle eccezioni", under the supervision of Prof. Paolo Atzeni and Prof. Giansalvatore Mecca. The main results have been published in an international journal [1].

Research Topics

While working on his master's thesis, Valter Crescenzi developed an interest in research topics related to information extraction from web sources. Initially (1998-1999) he worked on the definition of a new formalism for the manual yet effective specification of wrappers, i.e. software modules able to extract structured information from unstructured web pages. He developed a formalism that aims at combining the advantages of declarative languages (such as grammars) and procedural languages (such as editing scripts) for expressing effective and precise extraction rules [1].

During his PhD studies (1999-2002), he investigated how to further raise the level of automation of wrapper production for large websites [24, 6, 25], whose pages are generally produced by querying an underlying database and embedding the query results into a fixed HTML template. Even if these websites contain a large number of pages, the pages can usually be classified into a relatively small number of classes [10], each composed of structurally similar pages.

These research activities produced results in two phases, from 1999 to 2002 and from 2002 to 2004:

- in the former phase (1999-2002), an innovative algorithm for inferring regular expressions was proposed: the algorithm is based on a progressive and comparative analysis of sample pages obeying a regular grammar picked from a family of grammars crafted on purpose [8, 26, 9];
- in the latter phase (2002-2004), the relationships between this algorithm, presented to the data extraction community, and the learning algorithms presented within the much more consolidated grammar inference community [41, 42] were clarified, with interesting results for both communities [27, 2].

Namely, it was clarified that many inference algorithms taking only positive samples as input (a paradigm known as identification in the limit [41]), which had been studied by researchers of the grammar inference community, were not useful as tools to produce wrappers. One of the goals of that community is to study how to learn expressive classes of languages, but the more expressive the class of languages inferred, the less likely is the availability of a representative and finite sample of pages [42]. Since a wrapper generation tool requires a non-expert user to provide these samples, they should be obtainable by randomly picking a small number of sample pages [27]. A class of languages identifiable in the limit (called Prefix Mark-Up Languages) has been proposed as a first example of a class of languages suitable for wrapper generation, and it has been formally studied [2].

The subsequent research (2004-2008) can be summarized in two main lines:

- the grammar inference algorithm has been refined [29, 13, 11] to deal with many structures frequently occurring on the Web;
- the class of languages identifiable in the limit has been expanded while maintaining the simplicity of the characteristic samples [4].

Further studies (2002-2008) aimed at scaling out the extraction process to cover many classes of pages from several large websites. Many research issues arise, including the effective crawling of sample pages within a website [12, 15, 3] and the classification of downloaded pages into classes suitable for automatic wrapping. Two kinds of approaches have been pursued: those based on the analysis of the regularities in the inner structure of pages [10, 14], and those based on the analysis of the regularities in the topology of large websites [30, 3, 34, 18].

Recently, this research line has been further expanded to the web scale [17, 31, 5], tackling the additional issues related to searching and retrieving websites publishing relevant information [36, 31], their integration [32], and the scalability of the overall approach [38]. In this context, the idea naturally arises of characterizing probabilistically the quality of the extracted information [37, 19, 20] and the accuracy of the involved sources [21, 33, 39, 22], even in the presence of copiers amongst them.

Most of these research activities have been carried out in the context of international and national research projects.

Participation in Research Projects

- international research project INTAS: Modeling and Management of Semi Structured Data for Dynamic World Wide Web Applications (1999-2000).
- national research project MURST (ex 40%) Data X: Gestione, Trasformazione e Scambio di Dati in Ambiente Web (1999-2000).
- FIRB-MIUR project MAIS: Multichannel adaptive information systems (2002-2006).
- European research project (Vfp) MOSES: MOdular and Scalable Environment for the Semantic web (2002-2006).
- national research project MIUR ECD: Tecnologie per arricchire e fornire accesso a contenuti (2002-2005).
- national research project (PRIN) WISDOM: Ricerca Intelligente su Web basata su Ontologie di Dominio (2004-2006).
- principal investigator of a project for building an industrial demonstrator of a web data extractor. The project was funded by progetto DOCUP Obiettivo 2 Regione Lazio Programma 2000-2006 sottomisura II.5.2 (2005-2007).
- national research project MIUR NGS: Nuove Tecnologie e Strumenti per l'Interrogazione di Servizi di Ricerca su Web (2007-2009).
- project MORNING: Metodologie e strumenti per analizzare dati da sorgenti del Social Web. FILAS-RS-2009-1132, funded by CUP F87I10000750007, POR FERS Lazio 2007/2013 Asse I Attività I.1 (2009-2012).
- national research project (PRIN) EASE: Identificazione, riconciliazione, estrazione e integrazione di Entità dal Web (2010-2012).

Other Collaborations

- From 1999 to March 2004 he took part in the design, creation, and management of the XML version of the online site of ACM SIGMOD Record. In particular, he worked on extracting data from web sources and converting them into XML format. The outcome of this initiative has been the subject of many scientific studies.
- Program Committee member of several national and international conferences: Workshop on Adaptive Text Extraction and Mining (ATEM 2003), Workshop on Adaptive Text Extraction and Mining (ATEM 2006), International Conference on Web Information Systems Engineering (WISE 2008), Sistemi Evoluti per Basi di Dati (SEBD 2012).
- External reviewer for many conferences, including SAC 2002, ACM SIGMOD 2003, ICWE 2004, VLDB 2004, ACM SIGMOD 2005, ICDE 2006, EDBT 2006, VLDB 2007.
- Reviewer for international journals such as Information Systems (Kluwer Publishers), Software: Practice and Experience (Wiley), Data and Knowledge Engineering (Elsevier), Journal of Intelligent Information Systems (Springer).
- Presenting author at these international conferences: SAC 2002 (Madrid, Spain), WebDB 2003 (San Diego, USA), ATEM 2003 (San Jose, USA), WEBIST 2005 (Miami, USA), ICDE 2006 Workshops (Atlanta, USA).
- Panelist at the Workshop on Adaptive Text Extraction and Mining (ATEM 2003).
- Co-founder of the academic spin-off Chi-Technologies s.r.l., a company in which Università degli Studi Roma Tre holds a stake, whose goal is the industrial exploitation of research results on automatic information extraction from the Web.

Teaching Experience

Institutional Teaching Activities

He has taught the following academic courses at Facoltà di Ingegneria, Università degli Studi Roma Tre:

- Sistemi Operativi II, 2003/2004, 2004/2005
- Programmazione Concorrente, 2005/2006, 2006/2007, 2007/2008, 2008/2009, 2009/2010, 2010/2011, and 2011/2012
- Elementi di Informatica, 2010/2011, 2011/2012
- Programmazione Orientata agli Oggetti, 2004/2005, 2005/2006, 2006/2007, 2007/2008

He has been a teaching assistant for the following courses at Facoltà di Ingegneria, Università degli Studi Roma Tre:

- Sistemi Operativi, academic year 2000/2001
- Sistemi Operativi 1, Sistemi Operativi 2, 2002/2003
- Programmazione Orientata agli Oggetti, 2002/2003
- Ingegneria del Software, 2003/2004
- Progetto di Sistemi Informatici, 2004/2005, 2005/2006, 2006/2007, and 2007/2008

He has taught in the following second-level master courses of Università degli Studi Roma Tre:

- Basi di Dati, Master Universitario in Economia e Tecnologia della Società dell'Informazione, academic years 2001/2002, 2002/2003, and 2003/2004
- Basi di Dati, Master Universitario in Governance, Sistema di Controllo e Auditing, academic years 2005/2006 and 2006/2007
- Programmazione orientata agli oggetti, Basi di dati ed XML, Metodi per lo sviluppo agile, Master Universitario in Governo dei Sistemi Informativi: sviluppo, gestione, monitoraggio, 2007/2008
- Basi di dati and Metodi per lo sviluppo agile, Master Universitario in Governo dei Sistemi Informativi: sviluppo, gestione, monitoraggio, 2009/2010 and 2011/2012

He is a tutor for the following academic courses at Facoltà di Ingegneria, Università Telematica Internazionale UNINETTUNO:

- Sistemi Informativi e Basi di dati, Corso di Studi in Ingegneria Informatica ed Ingegneria Gestionale, academic year 2011/2012
- Ingegneria del Software e Programmazione ad Oggetti, Corso di Studi in Ingegneria Informatica, 2011/2012

Other Institutional Teaching Activities

During academic years 2004/2005, 2005/2006, 2006/2007, and 2007/2008 he designed and supervised the development of a web application for partially automating the exams of several programming courses of Facoltà di Ingegneria, Università degli Studi Roma Tre, including Programmazione Orientata agli Oggetti, Fondamenti di Informatica I, Laboratorio di Informatica, and Programmazione Concorrente.

Professional Teaching Experience

He has taught the following courses:

- Progettazione Banche Dati, for Engineering Ingegneria Informatica SpA (2000-2003)
- Progettazione Banche Dati, Il linguaggio XML, Sistemi Operativi, for Direzione Corsi Elettronica, Optoelettronica ed Informatica, for Ministero della Difesa (2003-2007)
- Basi di Dati, for Scuola di Polizia Tributaria
- Specialista Sviluppo Applicazioni Object Oriented, Analista Programmatore, for Centro Italiano Opere Femminili Salesiane - Formazione Professionale
- Il linguaggio UML, for Sudgest S.C.p.a.
- Progettista di Siti Web, for ENAIP Lazio, 2010

Publications

International Journals

[1] V. Crescenzi and G. Mecca. Grammars Have Exceptions. Information Systems, 23(8): 539-565 (1998).
[2] V. Crescenzi and G. Mecca. Automatic information extraction from large websites. Journal of the ACM, 51(5): 731-779 (2004).
[3] V. Crescenzi, P. Merialdo and P. Missier. Clustering Web pages based on their structure. Data & Knowledge Engineering, 54(3): 279-299 (2005).
[4] V. Crescenzi and P. Merialdo. Wrapper Inference for Ambiguous Web Pages. Applied Artificial Intelligence, 22(1): 21-52 (2008).
[5] L. Blanco, V. Crescenzi and P. Merialdo. Structure and Semantics of Data-intensive Web Pages: an Experimental Study of their Relationships. Journal of Universal Computer Science. Special Issue on Wrapping Web Data Islands.

International Conference Proceedings

[6] G. Mecca, P. Merialdo, P. Atzeni and V. Crescenzi. The (short) Araneus Guide to Web Site Development. Second Workshop on Databases and the Web (WebDB 99) in conjunction with ACM SIGMOD 99, Philadelphia (Pennsylvania), June 1999.
[7] V. Crescenzi, G. Mecca and P. Merialdo. The RoadRunner Project: towards Automatic Extraction of Web Data. International Workshop on Adaptive Text Extraction and Mining (ATEM 2001) in conjunction with Seventeenth International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle (Washington) (2001).
[8] V. Crescenzi, G. Mecca and P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. Proceedings of the 27th International Conference on Very Large Databases (VLDB 2001), Rome (Italy), pp. 109-119, Morgan Kaufmann (2001).
[9] V. Crescenzi, G. Mecca and P. Merialdo. Automatic Web Information Extraction in the RoadRunner System. International Workshop on Data Semantics in Web Information Systems (DASWIS 2001) in conjunction with 20th International Conference on Conceptual Modeling (ER 2001), Yokohama (Japan). Lecture Notes in Computer Science 2465, Springer (2002).
[10] V. Crescenzi, G. Mecca and P. Merialdo. Wrapping-oriented classification of web pages. ACM Symposium on Applied Computing (SAC), March 10-14, 2002, Madrid (Spain). ACM Press (2002).
[11] L. Arlotta, V. Crescenzi, G. Mecca and P. Merialdo. Automatic annotation of data extracted from large Web sites. Sixth Int. Workshop on Databases and the Web (WebDB 2003) in conjunction with ACM SIGMOD 03, San Diego (California), June 2003.
[12] V. Crescenzi, P. Merialdo and P. Missier. Fine-grain Web Site Structure Discovery. Fifth ACM CIKM International Workshop on Web Information and Data Management (ACM WIDM 2003), November 2003, New Orleans (Louisiana). ACM Press (2003).
[13] V. Crescenzi, G. Mecca and P. Merialdo. Handling irregularities in RoadRunner. The AAAI-04 International Workshop on Adaptive Text Extraction and Mining (ATEM 2004), July 26, 2004, San Jose (California) (2004).
[14] V. Crescenzi, G. Mecca, P. Merialdo and P. Missier. An Automatic Data Grabber for Large Web Sites. Proceedings of the 30th International Conference on Very Large Databases (VLDB 2004), September 2004, Toronto (Ontario, Canada) (2004).
[15] L. Blanco, V. Crescenzi and P. Merialdo. Efficiently Locating Collections of Web Pages to Wrap. First International Conference on Web Information Systems and Technologies, May 2005, Miami (Florida) (2005).
[16] V. Crescenzi and P. Merialdo. Efficient Techniques for Effective Wrapper Induction. Proceedings of the 22nd International Conference on Data Engineering Workshops, ICDE 2006, April 2006, Atlanta (Georgia), USA.
[17] L. Blanco, V. Crescenzi, P. Merialdo and P. Papotti. Flint: Google-basing the Web. 11th International Conference on Extending Database Technology, Nantes, France, March 2008.
[18] C. Bertoli, V. Crescenzi and P. Merialdo. Crawling Programs for Wrapper-based Applications. The 2008 IEEE International Conference on Information Reuse and Integration (IEEE IRI-08), July 13-15, 2008, Las Vegas, USA.
[19] L. Blanco, M. Bronzi, V. Crescenzi, P. Merialdo and P. Papotti. Exploiting information redundancy to wring out structured data from the web. The 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010.
[20] L. Blanco, M. Bronzi, V. Crescenzi, P. Merialdo and P. Papotti. Redundancy-Driven Web Data Extraction and Integration. The 13th International Workshop on the Web and Databases, WebDB 2010, Indianapolis, Indiana, USA, June 6, 2010.
[21] L. Blanco, V. Crescenzi, P. Merialdo and P. Papotti. Probabilistic Models to Reconcile Complex Data from Inaccurate Data Sources. The 22nd International Conference on Advanced Information Systems Engineering, CAiSE 10, Hammamet, Tunisia, June 2010.
[22] L. Blanco, V. Crescenzi, P. Merialdo and P. Papotti. Automatically Building Probabilistic Databases from the Web. The 20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, March 18-April 1, 2011.
[23] M. Bronzi, V. Crescenzi, P. Merialdo and P. Papotti. Wrapper Generation for Overlapping Web Sources. Web Intelligence 2011, WebDB 2010, Lyon, France, August 22-27, 2011.

National Conference Proceedings

[24] G. Mecca, P. Merialdo, P. Atzeni and V. Crescenzi. The ARANEUS Guide to Web Site Development (extended version of [6]). Atti del Settimo Convegno Nazionale su Sistemi Evoluti per Basi di Dati (SEBD 99), pp. 167-177, Como, June 23-25, 1999.
[25] G. Mecca, P. Merialdo, P. Atzeni and V. Crescenzi. Experiences in XML data management. Atti dell'Ottavo Convegno Nazionale su Sistemi Evoluti per Basi di Dati (SEBD 2000), pp. 109-119, L'Aquila, June 24-26, 2000.
[26] V. Crescenzi, G. Mecca and P. Merialdo. The RoadRunner Web Data Extraction System. Atti del Nono Convegno Nazionale su Sistemi Evoluti per Basi di Dati (SEBD 2001), Venezia, June 27-29, 2001.
[27] V. Crescenzi, G. Mecca and P. Merialdo. Back to Gold's Age: Bridging the Gap Between Traditional Grammar Inference and Web Information Extraction. Atti del Decimo Convegno Nazionale su Sistemi Evoluti per Basi di Dati (SEBD 2002), Isola d'Elba, June 2002.
[28] L. Arlotta, V. Crescenzi, G. Mecca and P. Merialdo. Automatic annotation of data extracted from large Web sites. Atti dell'Undicesimo Convegno Nazionale su Sistemi Evoluti per Basi di Dati (SEBD 2003), Cetraro (CS), June 2003.
[29] V. Crescenzi, G. Mecca and P. Merialdo. Improving the expressiveness of RoadRunner. Atti del Dodicesimo Convegno Nazionale su Sistemi Evoluti per Basi di Dati (SEBD 2004), Cagliari, June 2004.
[30] L. Blanco, V. Crescenzi and P. Merialdo. Harvesting Structurally Similar Pages. Atti del Tredicesimo Convegno Nazionale su Sistemi Evoluti per Basi di Dati (SEBD 2005), Bressanone, June 2005.
[31] L. Blanco, V. Crescenzi and P. Merialdo. Searching Entities on the Web by Sample. Atti del Sedicesimo Convegno Nazionale su Sistemi Evoluti per Basi di Dati (SEBD 2008), Mondello (PA), June 2008.
[32] L. Blanco, V. Crescenzi and P. Merialdo. Data Extraction and Integration from Imprecise Web Sources. Atti del Diciassettesimo Convegno Nazionale su Sistemi Evoluti per Basi di Dati (SEBD 2009), Camogli (GE), June 2009.
[33] L. Blanco, V. Crescenzi and P. Merialdo. Probabilistic Reconciliation of Records from Inaccurate Web Sources. Atti del Diciottesimo Convegno Nazionale su Sistemi Evoluti per Basi di Dati (SEBD 2010), Rimini, June 2010.

Technical Reports

[34] V. Crescenzi, P. Merialdo and P. Missier. Discovering the structure of large web sites. Technical Report RT-DIA-89-2004, Università degli Studi Roma Tre, Dipartimento di Informatica e Automazione (2004).
[35] L. Blanco, V. Crescenzi and P. Merialdo. Automatically Generating Reports from Large Web Sites. Technical Report RT-DIA-90-2004, Università degli Studi Roma Tre, Dipartimento di Informatica e Automazione (2004).
[36] L. Blanco, V. Crescenzi, P. Merialdo and P. Papotti. Searching Entities on the Web by Sample. Technical Report RT-DIA-121-2007, Università degli Studi Roma Tre, Dipartimento di Informatica e Automazione (2007).
[37] L. Blanco, V. Crescenzi, P. Merialdo and P. Papotti. A Probabilistic Model to Characterize the Uncertainty of Web Data Integration: What Sources Have The Good Data? Technical Report RT-DIA-146-2009, Università degli Studi Roma Tre, Dipartimento di Informatica e Automazione (2009).
[38] L. Blanco, V. Crescenzi, P. Merialdo and P. Papotti. Exploiting Information Redundancy to Extract and Integrate Data from the Web. Technical Report RT-DIA-151-2009, Università degli Studi Roma Tre, Dipartimento di Informatica e Automazione (2009).
[39] L. Blanco, V. Crescenzi, P. Merialdo and P. Papotti. Probabilistic Models to Reconcile Complex Data from Inaccurate Data. Technical Report RT-DIA-170-2010, Università degli Studi Roma Tre, Dipartimento di Informatica e Automazione (2010).

PhD Thesis

[40] V. Crescenzi. On Automatic Information Extraction from Large Websites. Collana delle Tesi di Dottorato, Università degli Studi di Roma La Sapienza (2002).

Other Cited Publications

[41] E. M. Gold. Language identification in the limit. Information and Control, 10(5): 447-474 (1967).
[42] D. Angluin. Inference of Reversible Languages. Journal of the ACM, 29(3): 741-765 (1982).