ANADOLU ÜNİVERSİTESİ BİLİM VE TEKNOLOJİ DERGİSİ ANADOLU UNIVERSITY JOURNAL OF SCIENCE AND TECHNOLOGY Clt/Vol. : 9-Sayı/No: : 7-86 (008) ARAŞTIRMA MAKALESİ /RESEARCH ARTICLE INFORMATION EXTRACTION FROM E-COMMERCE WEBSITES USING SEQUENTIAL WORD GROUP FREQUENCIES Ferkan KAPLANSEREN, Tatyana YAKHNO, M. Kemal ŞIŞ ABSTRACT In ths paper we present an approach how to extract nformaton from e-commerce web stes. Automatc nformaton extracton s appled to e-commerce web stes to construct a descrpton of products. Ths descrpton contans set of features of the product and ther possble values. We mplement a new algorthm based on sequental word group frequences and syntactcal rules to extract the semantcs. Results are presented and nterpreted for future works toward desgnng an e-commerce shoppng agent. Keywords: Semantcs, Sequental word group frequency, Informaton extracton, E-commerce agent. ARDIŞIK KELİME GRUP FREKANSLARINI KULLANARAK E-TİCARET WEB SİTELERİNDEN BİLGİ ÇIKARIMI ÖZ Bu çalışmada e-tcaret web stelernden blg çıkarımının nasıl yapılacağına dar br yaklaşımı sunuyoruz. Otomatk blg çıkarımı, e-tcaret web steler üzernde uygulanarak ürünlern tanımlanması sağlanır. Bu ürün tanımlaması, ürünün özellkler kümesn ve muhtemel değerlern çerr. Anlam çıkarımını sağlamak çn ardışık kelme grubu frekanslarına ve söz dzmsel kurallara dayanan yen br algortma gerçekleştrlmştr. Sonuçlar gelecek çalışmalarda br e-tcaret alışverş ajanının tasarımı çn sunulmuş ve yorumlanmıştır. Anahtar kelmeler: Anlam, Ardışık kelme grubu frekansı, Blg çıkarımı, E-tcaret ajan.. INTRODUCTION The World Wde Web (WWW) s now the largest source for Informaton Extracton (IE) but, because of ts structure, t s dffcult to use ths nformaton n a systematc way. The scope of the data n web pages conssts of text, vdeo, mages, sounds and etc. Much of the nformaton s embedded n text documents and reles on human nterpretaton and processng, at the same slow pace as before (Hu and Yu, 005). Accordng to Moens (006) the use of the term extracton mples that the semantc target nformaton s explctly present n a text s lngustc organzaton,.e., that t s readly avalable n the lexcal elements (words and word groups),, Dokuz Eylül Ünverstes İşletme Fakültes, İşletme Bölümü 560 Buca, İZMİR. Fax: 0..45. 50.6, E-mal: ferkan.kaplanseren@deu.edu.tr Receved: 8 August 007; Revsed: January 008; Accepted: 5 February 008
74 Anadolu Ünverstes Blm ve Teknoloj Dergs, 9 () the grammatcal constructons (phrases, sentences, temporal expressons, etc.) and the pragmatc orderng and rhetorcal structure (paragraphs, chapters, etc.) of the source text. Relaton between IE and the felds of Natural Language Processng (NLP), Knowledge Acquston (KA), Machne Learnng (ML) and Data Mnng (DM) of artfcal ntellgence s mportant to process the text n WWW. Statstcs and layout of html pages play a strong role n processng the unstructured text. Flatova and Hatzvassloglou (00) present a statstcal system that detects, extracts, and labels atomc events at the sentence level wthout usng any pror world or lexcal knowledge. Burget (005) proposes an alternatve nformaton extracton method that s based on modelng the vsual appearance of the document. Burget (004) also explans a herarchcal model for web documents that smply presents some structured data such as prce lsts, tmetables, contact nformaton, etc. Objectve of Mukherjee, Yang, Tan and Ramakrshnan (00) s to take a HTML document generated by a template and automatcally dscover and generate a semantc partton tree where each partton wll consst of tems related to a semantc concept. Hdden Markov Model (HMM) s one of the man methods that uses the probablty for lexcal analyss. Bkel, Schwartz and Weschedel (999) present IdentFnder, a HMM that learns to recognze and classfy names, dates, tmes, and numercal quanttes. Borkar, Deshmukh and Sarawag (00) present a method for automatcally segmentng unformatted text records nto structured elements. Ther tool enhances on HMM to buld a powerful probablstc model that corroborates multple sources of nformaton ncludng, the sequence of elements, ther length dstrbuton, dstngushng words from the vocabulary and an optonal external data dctonary. On the other hand, most web users expect more human-lke ntellgence from ther web applcatons. Recently, e-commerce s one of the most mportant and money makng engne of web based applcatons. The e-commerce platform wll eventually provde a platform where manufacturers, supplers and customers can collaborate (Chaudhury and Klboer, 00). The expectaton may be realzed, apart from other factors, by subscrbng more partcpants n ths collaboraton. The pont of nterest beng e- commerce, trustng one s personal computer (t s applcaton n realty) s a matter of ntellgent e-commerce agent that can extract the features of desred commodty. E-commerce shoppng (buyer) agents help customers to fnd the products and servces from nternet. Accordng to Leung and He (00), one of the thngs that an ecommerce agent can do s to montor and retreve useful nformaton, and make transactons on behalf of ther owners or analyze data n the global markets. Shoppng bots (buyer agents) agent of ebay lke ebay (http://www.ebay.com) fnd products and servces that customers are searchng for. Jango (http://www.jango.com) and adone (http://www.addone.com) are other examples for e-commerce agents (Chaudhury and Klboer, 00). Jango does comparatve prce shoppng. It s a wrapper vstng several stes for product queres and obtans prces. adone searches classfed advertsements lke used cars. Another shoppng bot that s used by akakce (onlne web store) s called as blbot (http://www.akakce.com) makes prce comparson (Karataş, 00). Ahmed, Vadrevu and Davulcu (006) mproved a system called Datarover whch can automatcally crawl and extract all products from onlne catalogues. Ther purpose s to transform an onlne catalogue nto database of categorzed products. Crescenz, Mecca and Meraldo (00) nvestgate technques for extractng data from HTML stes through the use of automatcally generated wrappers. Ther software RoadRunner tokenzes, compares and algns the HTML token sequences tag by tag. Arasu and Garca-Molna (00) study the problem of automatcally extractng the database values from such template-generated web pages wthout any learnng examples or other smlar human nput. Semantc web project has been around for over a decade, but t seems that ths project wll not be realzed for near future. Managng the unstructured nformaton of nternet mples dscoverng, organzng and montorng the relevant knowledge to share t wth other applcatons or users. To extract the semantcs from text documents wll be a matter unless web pages are desgned n a standard format. Therefore, dervng semantcs from unstructured or semstructured web wll be as mportant as desgnng a semantc web. In ths study, we propose a new algorthm based on nformaton extracton and knowledge dscovery from e-commerce web pages. We study lexcal contextual relatons to show that
Anadolu Unversty Journal of Scence and Technology, 9 () 75 relatve frequences of sequental word groups ncludng n-word(s) allow agent to derve semantcs from a collecton of e-commerce web pages. For that reason, the nput set of ths study s the set of web pages, WP = {wp, wp,, wp n }, from e-commerce web stes n nternet. The purpose s to produce the output whch s the defnton of a product n terms of possble features and values. A product s defned as a set of features lke P = {F, F, F,,F n } and a feature s defned as a set of values lke F ={V, V, V,,V k }. We developed a software called FERbot (Feature Extractor Recognzer bot) to evaluate our approach. FERbot learns the features of products automatcally by extractng sequental word groups () from web pages and fnds out the values of these features to automatcally construct a knowledge base for products. Ths algorthm can be consdered as an alternatve for the current statstcal and machne learnng technques for nformaton detecton and classfcaton. For ths reason, usage of the sequental word group frequences s the man dfference of ths study from the other agent based applcatons. Also, text mners can use our algorthm for knowledge dscovery wthn the concept of statstcal data mnng technques to operate the nformaton extracted from texts. The paper s organzed as follows. The detals of our algorthm how to extract nformaton and buld a knowledge base are explaned n secton. We present the archtecture of FERbot secton. Case studes and results are presented n secton.. In Secton. we present the comparson of results wth smlar works. Fnally n secton 4 we state our concluson and possble future extensons of ths study.. INFORMATION EXTRACTION FOR BUILDING KNOWLEDGE BASE Every product s defned by ts features. An advertsement about a product n an e-commerce ste frstly ncludes prce of that product and some addtonal nformaton lke sze, brand, power and etc. Because every e-commerce web page does not nclude the same standard notaton for nformaton about same product, t wll be nterestng to defne the general characterstcs of that product. Every product has some features and every feature has some values. For example, a car has some features lke color, fuel type and motor power, etc. A color feature of a car has some values lke red, blue and whte. If we denote a product by P and f F and V j refer to the th feature and j th value, respectvely, we defne a product as a set of features, P = {F,, F n } and a feature as a set of values, + F = { V,, V m },, j, n, m Z Fnally, a product can be represented as P = { F = V,, V },, F = V,, V }}, { k + n { n n t k, t Z. Every product may have dfferent number of features and every feature may have dfferent number of values. For example, feature and value representaton of product car can be shown as: Car = {color={red, blue, green},fuel type={benzne, desel, lpg}, producton year = {946, 004, 005, 006}}. Accordng to Grshman (997), the process of nformaton extracton has two major parts; frst the system extracts ndvdual facts from the text of a document through local text analyss. Second, t ntegrates these facts, producng larger facts or new facts (through) nference. Stevenson (006) emphaszes the mportance of usage of multple sentences for facts nstead of sngle sentence. Karankas (00) thnks that prmary objectve of the feature extracton operaton s to dentfy facts and relatons n text. Some knowledge engneers try to construct knowledge bases from web data based on nformaton extracton usng machne learnng (Craven, DPasquo, Fretag, McCallum, Mtchell, Ngam and Slattery, 000). To collect knowledge about a product, Lee (004) works wth specally desgned nterfaces to allow doman experts easly embed ther professonal knowledge n the system. Gomes and Segam (007) study the automatc constructon of databases from texts for problem solvers and queryng text databases n natural language, n ther research. The system that s offered by Shamsfard and Barforoush (004) starts from a small ontology kernel and constructs the ontology through text understandng automatcally. Assume that we have a set of e-commerce web pages WP. WP = {wp, wp,, wp n }, + n Z. Every wp s a par of words and keywords, wp = <W, K >, where W = [ w,, w ] p + and K = { k,, k s },, p, s Z. W s the lst of words where every w j appears n the body part of th html page, K s the set of keywords of
76 Anadolu Ünverstes Blm ve Teknoloj Dergs, 9 () html pages where every k s defned n metakeyword-tag of th html j page. If we combne all lsts of words, W, successvely we create a new lst of words, S = [ W, W, W,, Wn ]. Order of words s mportant n lst S and a word can appear more than one tme n ths lst. If we collect all sets of keywords, K, n a sngle set, we create a new set of keywords, K = { K, K, K,, K n}. We analyze the lst S s n secton. to fnd the features and values of a product and we work over the set K n secton. to refne these features and values... Sequental Word Groups In our study, frequences of n-words groups are used to extract features and values from the lst S. We assume that frstly the name of the feature appears and then the value of ths feature appears n a text to explan a product. Also we assume that a feature can be represented by a sngle word or by a sequence of words and a value can be represented by a sngle word. r Let denote a sequental word group th consstng of r words, startng from word of lst S. The lst of sequental word groups that have only -word can be shown by where = [,, ] = S. The lst of sequental word groups that have -words can be shown as = [,, ] where n = w w+ Fnally, the lst of sequental word groups that have r-words can be shown as r r r = [,, n ] r r r r where = w w+ w+ r-. We analyze the lst S to fnd the frequences, f, of sequental word groups. We n keep sequental word groups ncludng r-word(s) and ther frequences n dfferent text fles. These text fles are bascally arrays of word(s) and ther frequences. A feature of a product can only be a sngle word or t can be a noun phrase or an adjectve n phrase. For example, features of a car are color, fuel type, motor power and etc. For extractng features represented by two-words, we use three lsts, and. We also use frequences of every, f n, for comparson of frequences of dfferent s. Let us consder that frst n unque s of whch have the hghest frequences are the features of the product P. Emprcally n can be set to an nteger number between 50 and 00. We compare frequency of frst n unque of and frequences of of begnnng wth the same word. If they are approxmately equal, the feature s represented by twowords nstead of sngle word. Formally, f f = f + t then j = w j w j+ F k = j where F k shows a feature of product P = {F,, F n }, j = wj+ = w j w and t s an nteger number to show that frequences are ap- j+ proxmately equal. After selectng a feature, t s tme to fnd the values of ths feature F k. For every from, f the frst two words of are equal to the feature, F k, then the thrd word that comes after these two words s set to a value of ths feature, F k. Formally, If j = Fk w j+ then V kt = w j+ where F k = w j w j+ and Vk t s one of the values of feature F k, Vk t F={V, V, V,,,V n }. Ths process contnues untl all features that nclude two-words and values of these features are found. To fnd the features ncludng one-word, agan we take nto account of n unque of whch have the hghest frequences. We elmnate the s whch are used for any feature ncludng two words or used for a value of these features. Formally, f P or F then F k = where F k becomes a feature of product P. Values of ths feature F k becomes the second words of, begnnng wth the feature F k. Formally, f j = F k w j+ then V kt = w j+ where F k w andv s one of the values of feature F k, = j k t
Anadolu Unversty Journal of Scence and Technology, 9 () 77 Vk t F. Ths process contnues untl to all features that nclude one-word phrases and values of these features are found. A descrpton of a product P, P={ F = { V,, V },, F {,, } k n = Vn Vn t }, s completed gnorng features ncludng more than two-words. Fgure shows how to fnd the features ncludng more than two words and values of ths feature. For example, n the followng D matrx, has a value of and means that second keyword of set K s the keyword of the frst web page, formally, k wp. 0 D = 0 0 0 0 0.. Number of s n a row gves the frequency, f k, of th keyword, k, used n web pages, wp. Frequency of a keyword can be computed as f = n k d j j =. mxn All frequences of keywords are computed and then keywords are elmnated from the set K f they have a frequency under the standard devaton of the average of keyword frequences. A new set whch s the subset of K can be shown as K = { k f f k > (A- σ )} where A s the average of the frequences of keywords and σ s the standard devaton of the frequences of keywords. A s represented by the formula Fgure : Pseudo code of fndng features ncludng more than two words and values of these features... Defnng Mostly Used Keywords to Refne Features and Values Features and values of a product can be refned by usng the set of keywords, K={k, k,, k n }, of html pages where k appears n metakeyword-tag of html page. D = [ d j ] mxn s a bnary matrx shows whch web pages use the keyword, k, n meta-keyword-tag. m shows number of keywords and n shows number of web pages. A keyword may nclude more than one word. Value of each entry of matrx D can be computed by the followng condtons., d j = 0, d f th keyword th occurs n j document f th keyword th doesn' t occur n j document A m = = Members of the set P are compared wth the members of set K. The words n the ntersecton set of these sets defne the mportance of the features and values. If k K and k P then k s certanly a feature of product P. If k K and k F then k s certanly a value of feature F.. ARCHITECTURE OF FERBOT To demonstrate that sequental word group frequences are effectvely used for nformaton extracton we developed a software called FERbot (Feature Extractor Recognzer Bot). FERbot classfes features and ther values nto dfferent categores usng the algorthm specfed n Fgure whch s based on sequental word frequences. There are three man parts of FERbot system as shown n Fgure, where IG represents the Informaton Gatherng part, UI represents the User Interface part and KM represents the Knowledge Modellng part. m f k
78 Anadolu Ünverstes Blm ve Teknoloj Dergs, 9 () Fgure : Archtecture of FERbot: IG (Informaton Gatherng), KM (Knowledge Modelng), UI (User Interface). At the Informaton Gatherng part, FERbot vsts nternet pages and stores them nto a local machne to work over these web pages. The man questons for IG are Whch stes must be vsted? and Whch nformaton must be collected from each ste? Everyday the number of the web stes to be vsted s ncreasng. Processng every document n WWW s often expensve and unnecessary. Chuang and Chen (005) fght wth ths huge amount of data by combnng ther study wth search process of search engnes to extract features from the retreved hghly ranked search-result snppets for each text segment. Ther am s to organze text segments nto herarchcal structure of topc classes. Can (006) summarzes the nformaton exploson problem and llustrates some recent developments for nformaton retreval research of Turksh Language. He emphaszes some methods lke vector space models, term frequency, nverse document frequency and document query matchng for more effectve nformaton retreval aganst the nformaton exploson. FERbot knows whch page to vst by flterng the addresses of web pages from Google s man page. To obtan more specfc result page, FERbot sends the product name as query to Google Search engne by customzng some searchng optons. To vst just e-commerce stes that sell products, web ste addresses ncludng doman name extensons lke edu, ml, org, net, gov, aero, bz, coop, nfo, pro and fles havng fle extenson names lke pdf, doc, ppt, xls, jpg, jpeg, gf, pct, bmp, tf, png, co are elmnated from the result page. Google apples Latent Semantc Index (LSI) to every query. LSI can be used to fnd relevant documents that may not even share any search terms provded by the user (Berry, Dumas and O Bren, 995). Google search results nclude both the product names and the relevant terms of product names. Every search-result of Google ncludes the web ste name, web ste address and a small summary ncludng the quered word(s). Google uses dfferent styles to nsert hyperlnks for web ste names, addresses and summares. To leave alone the lnks showng the web ste address of the query, Google s search results are parsed. Parsed hyperlnks of web ste addresses are wrtten nto a text fle to keep them for reuse. Every lnk ncludes an dentfer showng ts order. The dentfer can be consdered as the ponter of that lnk. Sometmes Google produces less than 000 results. At that stuaton the hghest numbered ponter s the total results. Also t s possble to see same addresses at the result page. After downloadng lnks, same web ste addresses are deleted. Accordng to Google Search Engne results, most popular 000 lnks are refned and ready to
Anadolu Unversty Journal of Scence and Technology, 9 () 79 be vsted. Snce FERbot s a centralzed system, t begns to download web stes from the frst lnk to the last one. There are some reasons that prevent FERbot to download web pages. Frst, some web pages are protected for downloadng. Second, some web pages are not avalable and thrd, some downloadngs are not completed. At these stuatons FERbot skps the web page and downloads the next web page. At the Knowledge Modellng part whch s the most mportant, FERbot parses these downloaded html fles to have a structured form for data. After downloadng all web pages, to fnd the frequences of sequental word groups of the collected data, FERbot apples a regular process. All html fles have to be detagged. FERbot cleans all java, php, asp and etc. scrpts, and deletes all markups of html codes. FERbot wrtes all text wthout markups and scrpts nto a text fle havng word-space-word format. Cleanng mark ups of html fles s not the only thng to get a standard fle to work over t. FERbot changes all letters nto lower case n a document to make that document be more standardzed. Also, FERbot converts all text formats nto a same text format and deletes all punctuaton marks. There are some stopwords n a language to lnk sentences or phrases. For example n Turksh language they are çn, km, bu and etc. Cleanng these stopwords decreases the work on analysng whole data. FERbot cleans approxmately 80 types of stopwords of Turksh language from documents. Fnally, FERbot use the algorthm based on sequental word group frequences and keyword-document matrx. At the user nterface part, FERbot allows customers to present ther queres. Customers may select the query from the lst or they may wrte ther queres. After acceptng query, related answer s retreved from database and sent back to user... Case Studes of Sequental Word Group Frequences for Informaton Extracton FERbot collects the data from the web to extract the features of the product before presentng t to the customers. FERbot apples a regular process (collectng nformaton, knowledge modelng and presentng data) for every product. Let us llustrate the descrpton for the product araba ( whch means car n Turksh) that FERbot retreves. The frst step of FERbot s to fnd the web pages related to araba from the Internet usng Google search engne. Every search-result of Google ncludes the web ste name, web ste address and a small summary ncludng the quered word araba. Google uses dfferent styles to nsert hyperlnks for web ste names, addresses and summares. To leave alone the lnks showng the web ste address of the query araba, Google s search results are parsed. Hyperlnks of web page addresses about araba are wrtten nto a text fle to vst and download them one by one. FERbot creates a collecton whch conssts of 95 web pages related to araba, to fnd the frequences of sequental n-word(s) groups of these web pages. FERbot apples the regular process whch s mentoned n secton and secton. After ths process, FERbot creates an nput fle called S whch s the lst of words obtaned from collected web pages. FERbot analyzes the nput fle and wrtes all one-word, two-words, sx-words groups (for example a two-words group means two sequental words consderng order s mportant) and ther frequences nto separate text fles. Sectons of three of these fles about araba are presented n Table. Table. Sectons of,, and ther frequences, f, f, f A Secton from the fle of f and motor motora motorbke motorbsklet. f 58 4 A Secton from the fle of f and motor gücü 4 motor 6 motor ma motor kategorde. f A Secton from the fle of f and motor gücü 00 motor gücü 0 motor gücü 0 motor gücü 0. f 6 6 5
80 Anadolu Ünverstes Blm ve Teknoloj Dergs, 9 () Table. Values and features of product araba FEATURE 'knc el' (second hand) 'motor gücü' (motor power) 'yakıt tp' (fuel type) 'slndr hacm' (cylnder volume) 'yılı' (producton date) 'renk' (color) 'ytl' (currency unt) VALUE 'araç'(vehcle), 'araba'(car), 'arabalar'(cars), 'blgsayar' (computer), 'cep'(moble), 'ford' (ford), 'knc' (second), 'mercedes' (mercedes), 'opel'(opel), 'oto'(auto), 'otomobl' (automoble), 'otomotv' (automotve), 'peugeot' (peugeot), 'renault' (renault), 'satılık'(for sale) '00' '0' '05' '0' '5' '0' '5' '0' '6' '40' '50' '6' '80' '90' '00' '0' '5' '5' '50' '00' '60' '65' '70' '75' 'benzn'(benzne) 'dzel'(desel)'lpg'(lpg) '00' '400' '490' '500' '598' '600' '800' '998' '000' '400' '500' '700' '800' 'cm' '96' '964' '97' '987' '989' '990' '99' '99' '99' '994' '995' '996' '997' '998' '999' '000' '00' '00' '00' '004' '005' '006' 'çelk'(steel), 'ateş'(fre), 'bej'(bege), 'beyaz'(whte), 'bordo' (maroon), 'füme'(smoke-colored), 'gökyüzü'(sky), 'gümüş' (slver), 'gr'(grey), 'kırmızı'(red), 'lacvert'(dark blue), 'mav' (blue), 'metalk'(metallc), 'nl'(nle green), 'sarı'(yellow), 'syah' (black), 'yeşl'(green) '' '0' '' '' '' '4' '5''6' '7' '8' '9' '0' '' '' '''4' '5' '6 '7' '8' '' '0' '4' '5' FERbot analyzes the nput fle and wrtes all one-word, two-words, sx-words groups (for example a two-words group means two sequental words consderng order s mportant) and ther frequences nto separate text fles. Sectons of three of these fles about araba are presented n Table. FERbot consders the frst sxty unque words whch have hghest frequency as the most related words for product araba. It searches relevant features among these sxty words (to get better solutons number sxty can be ncreased). FERbot extracts the features whch ncludes two-words by usng the algorthm mentoned n Fgure. For an example, the frequency of motor, f, and the frequency of two-words group begnnng wth the word motor, f, are compared, f they are approxmately equal, feature name becomes twowords group (motor gücü) nstead of motor. The frequency of motor gücü, f, and the frequency of three-words group begnnng wth motor gücü, f, are compared, f they are not equal or not approxmately equal, feature name remans same as motor gücü and values of ths feature becomes last words (..,00,0,0,0, ) of three-words group,. Words havng hgh frequences that do not take a place n the two-words feature group are consdered as sngle word feature and values of them were assocated lke values of two-words groups.
Anadolu Unversty Journal of Scence and Technology, 9 () 8 FERbot searches meta-keyword-tags to see the relaton between the product name araba and keywords of web pages about araba as mentoned n Secton. FERbot consders keywords havng a frequency hgher than the average of keyword frequences are potental features or values of the product araba. These potental features and values are compared wth the results that are found by the new algorthm to refne the results. The refned results about features and values of araba are presented n Table where frst column represents the man features of the product araba and second column represents the values of the features. Also frequences of these features and values are kept n another fle. By changng the threshold of any frequency, number of features and the range of the features (number of values) can be ncreased or decreased. To show the performance of FERbot, the features and values of the concept Renault and BMW whch are dfferent types of product araba (car) are presented n the Table and Table 4. These results are obtaned for the same thresholds from ndependent doman resources. We used a summary of the same features n both tables for better comparsons. Yakıt tp whch means fuel type s the feature of both cars that have dfferent values. Ths s the expected stuaton for all cars. We know that BMW has more powerful motors and hgher cylnder volumes than Renault has. Ths stuaton s observed for the values of features motor power and cylnder volume as shown n Table and Table 4. Renk (Color) features of these two knds of cars have almost the same values but the frequences may change. FERbot may extract meanngful nformaton about the relatons between dfferent products whch are n the same category as seen n Table and Table 4. For example: both car types have the same values for the feature yakıt tp. For ths reason these products are strongly related. Also, FERbot may show the range of the values of these products at the same tme. Table. Values and features of product Renault. FEATURE VALUE motor gücü 00, 0, 05, 0, 5, 0, 5, 40, 60, 65, 70, 7, 75, 80, 85, 90, 95, 98 yakıt tp benzn, dzel, lpg Slndr hacm 49, 00, 00, 90, 97, 98, 400, 46, 490, 500, 590, 598, 600, 700, 7, 870, 898, 900, 995, 998, 000, 500 Renk çelk, ateş, bej, beyaz,bordo, fume, gümüş, gr, kırmızı, lacvert, mav, sarı We evaluated our approach for some dfferent types of products such as, BOOK, TELEVISION, HOLIDAY, VILLA, PRIVATE LESSON etc. to compare results. We evaluate our approach for ndependent domans. Whenever we use doman specfc web pages for searchng the features and values of a product, the results becomes excellent lke the results of vlla from the doman http://www.emlak.turyap.com or özel ders from the doman http://zmr.ozelders.com/ogretmen/. Results of the products vlla and özel ders can be seen n Table 5, Table 6 and Table 7. Also, f the doman of any product s semstructured then our approach produces perfect results. Table 4. Values and features of product BMW. FEATURE VALUE motor gücü 00, 0, 05, 0, 5, 0, 5, 0, 6, 40, 6, 80, 90, 00, 0, 5, 5, 50, 00, 60,65 yakıt tp benzn, dzel, lpg slndr hacm 596, 598, 600, 796, 798, 800, 99, 995, 998, 000, 00, 498, 500, 800, 99, 998, 000, 4400, 4500 renk çelk, ateş, beyaz, bordo, füme, gümüş, gr, l, kırmızı, lacvert, mav, metalk, opsyonları, seçenekler, syah, yeşl
8 Anadolu Ünverstes Blm ve Teknoloj Dergs, 9 () Table 5. Values and features of product vlla. FEATURE 'oda sayısı' (number of rooms) 'salon sayısı' (number of salons) 'kralık' (for rent) 'ste' (ste) 'kat' (number of floors) 'bahçe' (garden) 'bölge' (regon) 'mahalle' (dstrct) VALUE '' '' '4' '5' '6' '' '' 'emlak' (estate) 'emlaklar' (estates) 'vlla' (vlla) 'hartası' (map) 'çnde' (nsde) '' '' '' '4' 'çnde' (nsde) 'var' (exst) 'çengelköy' 'bolu' 'guzelyal' 'marmars' 'akpnar' 'bahcelevler' 'merkez' Table 6. Values and features of product vlla wth a dfferent treshold. FEATURE 'oda sayısı' (number of rooms) 'salon sayısı' (number of salons) 'kralık' (for rent) 'adres' (address) 'fyat' (cost) 'lçe' (town) 'kat' (number of floors) 'mutfak' (ktchen) 'banyo' (bathroom) 'bahçe' (garden) 'mahalle' (dstrct) VALUE '' '' '4' '5' '6' '' '' 'emlak' (estate) 'emlaklar' (estates) 'vlla' (vlla) '' '' '' '4' 'var' (exst) 'Merkez' Table 5 and Table 6 are obtaned for dfferent thresholds of frequences of values of features. When we decrease the threshold of frequency we obtan the results n Table 6 whch are better feature values than results n Table 5. Table 7 shows another case study of FERbot that s studed for the product özel ders whch means prvate course n Englsh. FERbot uses web pages only from the doman http://zmr.ozelders.com/ogretmen/. Results are qute satsfactory. Table 7. Values and features of product özel ders. 'lsans'(unversty) 'anadolu' 'celal bayar 'ege' '9 eylül' 'stanbul' 'zmr' 'uludağ' 'yaş'(age) 0' '' '' '' '4' '5' '6' '7' '8' '9' '' ''. 'cnsyet'(gender) 'bay'(male) 'bayan'(female) 'sm'(name) 'elf' 'emrah' 'erkan' 'pınar' 'meslek'(job) 'öğrenc'(student) 'öğretmen'(teacher) 'eğtm durumu'(educaton) 'ünverste'(unversty) 'lse'(hgh school) 'grup ders'(group lesson) 'evet'(yes) 'hayır' (no) 'saat ücret' (cost for an hour) '0' '5' '0' '40' 'ders mekanı'(place of lesson) 'öğrencnn ev' (house of student) 'etüd'(etude) 'grup'(group) 'bulunduğu şehr' (cty of teacher) 'bornova' 'buca' 'karşıyaka'
Anadolu Unversty Journal of Scence and Technology, 9 () 8.. Comparson of results wth smlar works Informaton agents generally rely on wrappers to extract nformaton from sem structured web pages (a document s sem structured f the locaton of the relevant nformaton can be descrbed based on a concse). Each wrapper conssts of a set of extracton rules and the code requred to apply those rules (Muslea, 999). Ac cordng to Seo, Yang and Cho (00), the nformaton extracton systems usually rely on extracton rules talored to a partcular nformaton source, generally called wrappers, n order to cope wth structural heterogenety nherent n many dfferent sources. They defne a wrapper as a program or a rule that understands nformaton provded by a specfc source and translates t nto a regular form that can be reused by other agents. Table 8. Exstences of features of araba for dfferent wrappers and FERbot. Feature Wrapper for Wrapper for Wrapper for gttgdyor.com sahbnden.com araba.com FERbot Marka 0 0 Model Yılı Renk Klometre 0 Motor Hacm Yakıt Tp Motor Gücü Model 0 0 0 Model Uzantısı 0 0 0 Araç Tp 0 0 0 İç Renk 0 0 0 Durumu 0 0 Vtes 0 0 Kasa Tp 0 0 0 4 x 4 0 0 0 Durumu 0 0 0 Türü 0 0 0 Takaslı 0 0 0 Kmden 0 0 0 FERbot s a knd of nformaton extracton system whch has the same purpose as wrappers have. The frst advantage of FERbot s ts own algorthm. It can be used for doman specfc or doman ndependent resources. Second advantage s the ablty to work wth the labeled or unlabeled web pages. The nput of FERbot can be any knd of web pages (They are desgned or not desgned by XML). FERbot understands the features of products by fndng ther frequences. FERbot uses labels to extract the keywords of web pages to construct the keyword-document matrx for refnng the results of features of products. Fnally, we may say that most of the wrappers are constructed for doman dependent and labeled documents but FERbot does not mply these knds of restrctons. In Table 8, frst column shows the features of araba that are obtaned as a result of constructon of wrappers for dfferent domans. means that the related feature s recognzed by the wrapper of that doman or FERbot. 0 means that the related feature s not recognzed by that wrapper or FERbot. FERbot generates the features from dfferent domans. Table 9 shows the correlaton matrx of wrappers and FERbot. As seen n Table 9, FERbot features have a strong postve relaton wth features of wrapper for araba.com. Also, Results of FERbot are mostly related features wth the features of wrapper for gttgdyor.com. In Table 0, FERbot generates the features from the domans zmr.ozelders.com/ogretmen and emlak.turyap.com.tr. When the features are compared t s seen that FERbot automatcally generates most of the features that are captured by the doman dependent wrappers.
84 Anadolu Ünverstes Blm ve Teknoloj Dergs, 9 () Table 9. Correlaton matrx of wrappers and FERbot. Wrapper for gttgdyor.com Wrapper for sahbnden.com Wrapper for araba.com FERbot Wrapper for gttgdyor.com Wrapper for sahbnden.com -0.944059 Wrapper for araba.com 0.574857 0.944059 FERbot 0.45645465 0.086067 0.78460796 Table 0. Doman dependent Features for wrappers and FERbot. Wrapper for zmr.ozelders.com/ogre tmen FERbot Wrapper for emlak.turyap.com.tr FERbot sm: sm oda sayısı 'oda sayısı' yaş: yaş salon sayısı 'salon sayısı' cnsyet: cnsyet mahalle 'mahalle' meslek: meslek bahçe 'bahçe' saat ücret: saat ücret lçe 'lçe' grup ders: grup ders fyat 'fyat' bulunduğu şehr: 'bulunduğu şehr mutfak 'mutfak' ders mekanı: ders mekanı banyo 'banyo' rtbat telefonu: lsans toplam kat 'kat' verdğ dersler eğtm durumu ste çnde 'ste' alan sıtma kod güvenlk yapım yılı balkon sayısı Consstency: 80% Consstency: 7% 'bölge' 'kralık' 'adres' 4. CONCLUSIONS In ths study, we proposed a new algorthm based on nformaton extracton and knowledge dscovery from e-commerce web pages. We developed a software called FERbot (Feature Extractor Recognzer bot) to evaluate ths new algorthm. FERbot learns the features of products automatcally by extractng sequental word groups from hundreds of web pages and fnds out the values of these features to automatcally construct a knowledge base for products. Fnally, n secton we presented the results of experments about products araba, Renault, BMW, vlla and özel ders. Our algorthm produces perfect results f the nput set s mostly consst of sem-structured web pages nstead of unstructured web pages. Comparsons of results of FERbot and other wrappers are defned at the end of Secton. Ths work has shown that sequental word group (sngle, two-words,., n-words) frequences are effectve to derve semantcs from e- commerce stes to defne product features and feature values. We demonstrated that structure and semantcs of a text can be systematcally extracted by lexcal analyss. We presented a new algorthm to use the relatve frequences of sequental word groups for nformaton extracton. Ths algorthm s an alternatve for the current statstcal and machne learnng technques for nformaton detecton and classfcaton. Also, text mners can use our algorthm for knowledge dscovery wthn the concept of statstcal data mnng technques to operate the nformaton extracted from texts.
Anadolu Unversty Journal of Scence and Technology, 9 () 85 If ntellgent agents especally shop-bots that derve semantcs from semantc web stes have capablty to understand the non-standard web stes, they wll complete ther job to fnd the product features at the same tme. Such agents may suggest new product features as mportant keywords to be taken nto account for refned searches. If proposed new features are accepted by the user, the agent wll search and shop n Internet accordngly. These agents searchng the desred product n the Internet, may record possble new features at the same tme and nform the central server from whch they are dspatched and where the product feature fles are mantaned. Our algorthm and work n ths paper for extractng new features of products from e- shoppng web-stes wll be realzed as an agent based applcaton. One part of e-commerce agents deal wth the subject of nformaton gatherng whch s the act of collectng nformaton. Ths s the man dea of our study whch s collectng nformaton from Web and extractng meanngful relatons between products and ther propertes to nform customers. Ths wll also present new possbltes towards nformaton extractng from structured or unstructured html web stes. Most of the wrappers are constructed for doman specfc resources. For that reason, they get the sem structured or structured web pages as nput set to extract the nformaton. They manly segment the page layout and extract the labeled nformaton. We have searched how FERbot produces dfferent results accordng to the type of resource of web pages. FERbot s ready for every knd of web pages but, our algorthm produces perfect results f the nput set s mostly full of sem-structured web pages nstead of unstructured web pages. 5. ACKNOWLEDGMENTS We would lke to thank Dr. Adl Alpkocak for hs useful comments and remarks. REFERENCES Ahmed S, T., Vadrevu S., Davulcu H. (006). DataRover: An Automated System for Extractng Product Informaton, In Advances n Web Intellgence and Data Mnng, Volume, pp. -0, Sprnger Berln/Hedelberg Publshers. Arasu A., Garca-Molna H. (00). Extractng structured data from Web pages, ACM SIGMOD, pp. 7 48. Arasu A., Garca-Molna H. (00). Extractng structured data from Web pages, ACM SIGMOD, pp. 7 48. Bkel D. M. Schwartz R., Weschedel R. M., (999). An algorthm that learns what's n a name. Machne Learnng. 4, -. Borkar V. Deshmukh K., Sarawag S., (00). Automatc segmentaton of text nto structured records, Internatonal Conference on Management of Data Archve Proceedngs of the 00 ACM SIGMOD nternatonal conference on Management of data, ISSN:06-5808, pp. 75 86. Burget R. (004). Herarches n HTML Documents: Lnkng Text to Concepts, 5th Internatonal Workshop on Database and Expert Systems Applcatons, pp.86-90. Burget R. (005). Vsual HTML Document Modelng for Informaton Extracton, CZ, FEI VŠB, ISBN 80-48-0864-, Ostrava pp.7-4. Can F. (006). Turksh Informaton Retreval: Past Changes Future, Advances n Informaton Systems, Sprnger Berln Hedelberg New York, pp.-. Chaudhury A. Klboer J., (00). E-busness and E-commerce Infrastructure, Technologes Supportng the E-busness Intatve (Chapter ), McGraw Hll. Chuang S. L., Chen L.F. (005). Taxonomy generaton for text segments:a practcal web-based approach, ACM Transactons on Informaton Systems, pp. 6-96. Craven M., DPasquo D., Fretag D., McCallum A., Mtchell T., Ngam K., Slattery S. (000). Learnng to construct knowledge bases from the World Wde Web, Artfcal Intellgence, 8, pp. 69. Crescenz R., Mecca G., Meraldo P. (00). RoadRunner: Towards Automatc Data Extracton from Large Web Stes, 7th Internatonal Conference on Very Large Data Bases.
86 Anadolu Ünverstes Blm ve Teknoloj Dergs, 9 () Flatova E., Hatzvassloglou V. (00). Doman-Independent Detecton, Extracton, and Labelng of Atomc Events, Proceedngs of the Fourth Internatonal Conference on Recent Advances n Natural Language Processng. Gomes G., Segam C. (007). Semantc Interpretaton and Knowledge Extracton, Knowledge-Based Systems, Volume 0, Issue, pp. 5-60. Grshman R. (997). Informaton Extracton: Technques and Challenges, Materals for Informaton Extracton, Sprnger-Verlag. Hu B., Yu E. (005). Extractng conceptual relatonshps from specalzed documents. Data & Knowledge Engneerng. 54 (), 9-55. Karankas H., Theodoulds B. (00). Knowledge Dscovery n Text and Text Mnng Software, Techncal report, UMIST- CRIM, Manchester. Karataş K. (00). The Anatomy of an Internet Robot : blbot.. Meteksan Sstem ve Blgsayar Teknolojler A.Ş. Blkent, Ankara, Türkye de Internet konferansları VII., http://net tr.org.tr/netconf7/program/ 97.html. Lee W.P. (004). Applyng doman knowledge and socal nformaton to product analyss and recommendatons: an agent-based decson support system. Expert systems. (), 8-48. Leung H., He M. (00). Agents n E- commerce: State of the Art, Knowledge and Informaton Systems, pp. 57-8. Moens M.F. (006). Informaton Extracton: Algorthms and Prospects n a Retreval Context, (Chapter ), pp. -, Sprnger. Mukherjee S. Yang G., Tan W., Ramakrshnan I.V., (00). Automatc Dscovery of Semantc Structures n HTML Documents, 7th Internatonal Conference on Document Analyss and Recognton. Muslea I. (999). Extracton Patterns for Informaton Extracton Tasks: A Survey, The AAAI-99, Workshop on Machne Learnng for Informaton Extracton. Seo H., Yang J., Cho J. (00). Buldng Intellgent Systems for Mnng Informaton Extracton Rules from Web Pages by Usng Doman Knowledge, IEEE Internatonal Symposum on Industral Electroncs, pp. -7, Pusan, Korea. Shamsfard M., Bardoroush A.A. (004). Learnng Ontologes from Natural Language Texts. Internatonal Journal of Human- Computer Studes. 60 () 7-6. Stevenson M. (006). Fact dstrbuton n Informaton Extracton, Language Resources and Evaluaton, Sprnger, pp. 8-0. Ferkan KAPLANSEREN, Lsans eğtmn Ege Ünverstes Matematk Bölümü nde, yüksek lsans ve doktora eğtmn se Dokuz Eylül Ünverstes Blgsayar Mühendslğ Bölümü nde tamamlamıştır. Çalışmalarını akıllı ajanlar, elektronk tcaret ve yapay zeka üzerne yoğunlaştırmıştır. D.E.Ü. İşletme Fakültes nde öğretm görevls olarak görev yapmaktadır. Tatyana YAKHNO, Novosbrsk State Unversty, Computatonal Lngustc bölümü mezunudur. Aynı bölümde yüksek lsansını tamamlamıştır. Doktora eğtmn Sberan Branch of Russan Academy of Scences akedems blgsayar merkeznde tamamlamıştır. Çalışma alanları arasında yapay zeka, blg gösterm ve uzman sstemler, mantıksal programlama, kısıtsal programlama, çoklu ajan sstemler ve evrmsel algortmalar konuları yer alır. D.E.Ü. Blgsayar Mühendslğ Bölümü nde Prof. Dr. olarak görev yapmaktadır. M. Kemal ŞİŞ, İstanbul Teknk Ünverstes Elektrk Mühendslğ Bölümü nde lsans ve yüksek lsans eğtmn tamamlamıştır. Doktora eğtmn Polytechnc Unversty (NewYork A.B.D.) de btrmştr.çalıştığı konular arasında elektronk tcaret, sayısal ses şleme, blgsayar ağları ve quantum hesaplamaları yer alır. D.E.Ü. Blgsayar Mühendslğ Bölümü nde Öğr. Gör. Dr. olarak görev yapmaktadır.