Natural Language Processing for Verbatim Text Coding and Data Mining Report Generation



Similar documents
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

A Chart Parsing implementation in Answer Set Programming

Reusable Knowledge-based Components for Building Software. Applications: A Knowledge Modelling Approach

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Data Mining Solutions for the Business Environment

A Case Study of Question Answering in Automatic Tourism Service Packaging

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

Domain Classification of Technical Terms Using the Web

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

Overview of the TACITUS Project

How To Write A Summary Of A Review

Semantic Search in Portals using Ontologies

Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System

PRACTICAL DATA MINING IN A LARGE UTILITY COMPANY

Natural Language to Relational Query by Using Parsing Compiler

Special Topics in Computer Science

Standardization of Components, Products and Processes with Data Mining

Mining an Online Auctions Data Warehouse

A Service Modeling Approach with Business-Level Reusability and Extensibility

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

IT services for analyses of various data samples

Single Level Drill Down Interactive Visualization Technique for Descriptive Data Mining Results

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING

Distributed Database for Environmental Data Integration

Programming Languages

72. Ontology Driven Knowledge Discovery Process: a proposal to integrate Ontology Engineering and KDD

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1

SPATIAL DATA CLASSIFICATION AND DATA MINING

A Knowledge-based System for Translating FOL Formulas into NL Sentences

Explanation-Oriented Association Mining Using a Combination of Unsupervised and Supervised Learning Algorithms

Chapter 1. Dr. Chris Irwin Davis Phone: (972) Office: ECSS CS-4337 Organization of Programming Languages

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Blog Post Extraction Using Title Finding

Customizing an English-Korean Machine Translation System for Patent Translation *

Master of Arts in Linguistics Syllabus

NATURAL LANGUAGE QUERY PROCESSING USING SEMANTIC GRAMMAR

Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control

The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006

Healthcare Measurement Analysis Using Data mining Techniques

Text Mining: The state of the art and the challenges

Get the most value from your surveys with text analysis

XFlash A Web Application Design Framework with Model-Driven Methodology

A Framework of Model-Driven Web Application Testing

The Prolog Interface to the Unstructured Information Management Architecture

Natural Language Database Interface for the Community Based Monitoring System *

Extending Data Processing Capabilities of Relational Database Management Systems.

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone:

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR

Knowledge Based Descriptive Neural Networks

A Workbench for Prototyping XML Data Exchange (extended abstract)

Functional Decomposition Top-Down Development

ANALYSIS OF WEBSITE USAGE WITH USER DETAILS USING DATA MINING PATTERN RECOGNITION

Parsing Software Requirements with an Ontology-based Semantic Role Labeler

Database Marketing, Business Intelligence and Knowledge Discovery

An Introduction to Data Mining

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System

Corpus-Based Text Analysis from a Qualitative Perspective: A Closer Look at NVivo

An Overview of a Role of Natural Language Processing in An Intelligent Information Retrieval System

Computer-aided Document Indexing System

Analyzing survey text: a brief overview

How To Use Neural Networks In Data Mining

Artificial Intelligence Exam DT2001 / DT2006 Ordinarie tentamen

Multi language e Discovery Three Critical Steps for Litigating in a Global Economy

Building A Smart Academic Advising System Using Association Rule Mining

Software Engineering of NLP-based Computer-assisted Coding Applications

Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery

Using reporting and data mining techniques to improve knowledge of subscribers; applications to customer profiling and fraud management

Component visualization methods for large legacy software in C/C++

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Doctor of Philosophy in Computer Science

Statistical Machine Translation

Data Mining System, Functionalities and Applications: A Radical Review

Neural Networks in Data Mining

Domain Knowledge Extracting in a Chinese Natural Language Interface to Databases: NChiql

CAPTURING THE VALUE OF UNSTRUCTURED DATA: INTRODUCTION TO TEXT MINING

Text Analytics with Ambiverse. Text to Knowledge.

The Scientific Data Mining Process

DATA MINING TECHNIQUES AND APPLICATIONS

Learning Translation Rules from Bilingual English Filipino Corpus

Programming Risk Assessment Models for Online Security Evaluation Systems

Requirements Analysis Concepts & Principles. Instructor: Dr. Jerry Gao

Semantic annotation of requirements for automatic UML class diagram generation

Why SBVR? Donald Chapin. Chair, OMG SBVR Revision Task Force Business Semantics Ltd

Master Specialization in Knowledge Engineering

Financial Trading System using Combination of Textual and Numerical Data

Implementation of hybrid software architecture for Artificial Intelligence System

TREC 2003 Question Answering Track at CAS-ICT

RRSS - Rating Reviews Support System purpose built for movies recommendation

Spam Detection Using Customized SimHash Function

Ming-Wei Chang. Machine learning and its applications to natural language processing, information retrieval and data mining.

Why are Organizations Interested?


Natural Language Processing for Verbatim Text Coding and Data Mining Report Generation
by Josef Leung and Ching-Long Yeh

This paper highlights the techniques of natural language processing and their applications in survey research, particularly in verbatim text coding and data mining report generation. To exemplify the use of natural language analysis, a prototype system called Codia was developed for coding Chinese verbatim answers collected from survey projects. The verbatim answers of different respondents are grouped according to their similarity with a selected answer item or input text. When similar answers become neighbors, they can easily be selected and classified into groups. As an example of natural language generation, a text generator was developed to produce an English description of data mining results. The text generator takes the data mining results and survey data definitions as input and generates a description of the data mining results in English sentences, which are easier to read than the obscure notations produced by ordinary data mining software.

Introduction

This paper highlights some techniques of natural language processing (NLP) and their applications in survey research. Natural languages are the languages that ordinary people read, speak and/or write, such as English, French, German, Spanish, Chinese and Japanese. Robust natural language analysis helps extract information from moderately ill-formed text, which is common in verbatim answers to open-ended questions. Techniques of natural language generation (NLG) have already been applied in specific domains to produce natural language texts. For instance, generation of English explanatory text from a medical knowledge base may help patients understand their medical records [Cawsey et al. 95]. Automatic generation of technical documents provides advantages such as quicker creation, easier maintenance, higher consistency, better conformance to standards, tailored presentation, multilingual text, multiple visual formats, etc. [Reiter et al. 95]. While data mining techniques have proved useful in the analysis of various kinds of data, many data mining systems give results in the form of obscure rules or computer program code. To help people understand and use the data mining results, it is preferable to generate the data mining reports in natural language. In this paper, we describe two important applications of natural language processing techniques in survey research: (1) coding verbatim text answers and (2) generating data mining reports in English.

Understanding and Generation of Natural Language

The use of natural language in our daily communication consists of two actions: one is to understand what people say, and the other is to produce what we intend to say. Similarly, the processing of natural language in a computer is concerned with understanding and generation. To provide background information, we briefly describe the general concepts and techniques involved in computer processing of natural language.
Readers who are interested in the theoretical investigation of human-machine communication are encouraged to consult the literature on natural language processing or computational linguistics.

Computer understanding of natural language

Computer understanding of natural language can be viewed as taking a sentence as input and producing the meaning of that sentence in return. The internal processing is generally accomplished by a pipeline architecture, as shown in Figure 1.

Figure 1: Modules of natural language understanding (input sentence → pre-processing → tokens → parsing → syntactic structures → semantic interpretation → semantic representation → contextual interpretation → knowledge representation).

The first block, pre-processing, recovers the fundamental elements from the words in the input sentence. The words as we see them in the input sentence may be inflected forms of a single word, such as love/loves, or derived from existing ones, such as friend/friendly/friendliness. After this processing, the parsing procedure takes the pre-processed sentence as input and constructs the syntactic structure by consulting a syntax rule database. The syntax rules specify how several smaller constituents combine to form a larger one. For example, the following partial set of rules indicates that a sentence (S) is composed of a noun phrase (NP) and a verb phrase (VP), a noun phrase consists of an article (ART) and a noun (N), and a verb phrase consists of a verb (V) and a noun phrase:

S → NP VP      N → dog | cat
NP → ART N     V → chased
VP → V NP      ART → the

The construction of a syntactic structure can be carried out by a top-down and/or bottom-up parsing strategy. The former strategy starts with the S symbol (representing the sentence) as the root node and appends the symbols on the right-hand side of the S rule as the children of

S. The process goes further down the remaining unexplored nodes in the tree until terminals (i.e., the words in the sentence) are reached. The latter strategy, on the other hand, matches each of the words in the sentence with the right-hand side of a rule, labels the parent node with the left-hand-side symbol, and continues to go upwards in this manner until S is successfully reached. The result of sentence parsing is a parse tree as exemplified in Figure 2.

Figure 2: A sample syntactic structure (the parse tree of the sentence "the dog chased the cat").

This parse tree indicates the modification relationships among words and the grouping of words into phrases. For example, in Figure 2, both cat and dog are modified by the article the, and each is grouped with its article to form a noun phrase. The verb chased and the noun phrase the cat form a verb phrase. More sophisticated syntactic structures bear additional information (see [Allen 95; Gazdar and Mellish 89] for details). The semantics component expresses the meaning of sentences in some semantic representation, among which logic (i.e., first-order predicate calculus) is widely employed [Allen 95]. Semantic interpretation is the process of mapping the syntactic structure of a sentence to the logic-based representation. The output of semantic interpretation is the context-independent meaning of the sentence. The process of mapping the semantic representation to the knowledge representation, contextual interpretation, is then performed to obtain the way the sentence is used in a particular context. For example, after contextual interpretation, the above sentence can be represented as chase(dog38, cat45), where dog38 and cat45 are the specific dog and cat in some context.

Natural language generation in computer

Natural language generation (NLG) proceeds in the opposite direction of natural language understanding.
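As a concrete illustration of the understanding side just described, the toy grammar of Figure 2 can be parsed with a simple top-down (recursive-descent) procedure. The following is an illustrative sketch, not code from the systems described in this paper:

```python
# A minimal top-down (recursive-descent) parser for the toy grammar:
#   S -> NP VP,  NP -> ART N,  VP -> V NP
#   ART -> "the",  N -> "dog" | "cat",  V -> "chased"
LEXICON = {"the": "ART", "dog": "N", "cat": "N", "chased": "V"}

def parse(words):
    tree, rest = parse_s(words)
    if rest:                       # input not fully consumed: no parse
        return None
    return tree

def parse_s(words):
    np, rest = parse_np(words)     # expand S -> NP VP, left to right
    vp, rest = parse_vp(rest)
    return ("S", np, vp), rest

def parse_np(words):
    art = expect(words, "ART")     # expand NP -> ART N
    n = expect(words[1:], "N")
    return ("NP", art, n), words[2:]

def parse_vp(words):
    v = expect(words, "V")         # expand VP -> V NP
    np, rest = parse_np(words[1:])
    return ("VP", v, np), rest

def expect(words, category):
    # Match the next word against a lexical category (terminal rule).
    if words and LEXICON.get(words[0]) == category:
        return (category, words[0])
    raise ValueError(f"expected {category} at {words[:1]}")

print(parse("the dog chased the cat".split()))  # the tree of Figure 2
```

A bottom-up parser would instead start from the words, replace matched right-hand sides by their left-hand-side symbols, and stop when a single S remains.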
NLG starts with a semantic representation selected from the knowledge base and produces one or more sentences. The field of natural language generation has made a great deal of progress in the generation of multisentential text in recent years [Dale 92; McKeown 85; Maybury 92; Hovy 93]. A widely adopted architecture of natural language generation, shown in Figure 3, consists of two components: a strategic ("what to say") component and a tactical ("how to say") component [Reiter 94]. The strategic component, for text planning, is concerned with selecting and organising the message content to be generated. Subsequently,

the tactical component, for linguistic realisation, maps the organised results into a sequence of surface sentences.

Figure 3: A general architecture of natural language generation. The strategic component (text planning) draws on the domain KB, the user's goal, the planning operators, a user model and a discourse model; the tactical component (linguistic realisation) draws on linguistic rules and a lexicon and outputs the surface sentences.

The primary concern of text planning is the conceptual integration of the selected content. Basically, the integration is achieved through a set of semantic relations which hold between sentences corresponding to the units in the message content [McKeown 85; Maybury 90; Hovy 93]. The semantic relations, originally obtained by the analysis of sample texts created by human experts, are operationalised as a set of rules to be used by the text planning component. The text planning component selects and organises the message content by consulting these rules. There are generally two approaches to the control of text planning: the schema-based approach [McKeown 85; Paris 87] and the plan-based approach [Dale 92; Hovy 93; Maybury 90]. The former pre-compiles the rules into a number of schemata, which are script-like networks; the control of text planning is basically the traversal of the appropriate network. The latter employs the concept of planning in artificial intelligence: it takes the user goal and attempts to build a plan to satisfy the goal. The rules for text planning are called planning operators. As shown in Figure 3, the system accepts whatever a user wants as the input goal. The text planning component then consults the planning operator library to get an appropriate operator that satisfies the goal. The operator is then decomposed into other operators if it is not a primitive one. The decomposition process is performed repeatedly until primitive operators are reached.
At the terminal nodes of the hierarchical structure built by the decomposition process, the semantic content of the message units corresponding to the nodes is obtained from the domain knowledge base. In addition to the input goal, other constraints affect the selection of message content, e.g., the specific features of the user in question and the context in which the generation task occurs. The linear sequence of message units, which generally correspond to clauses in the final result, is attached to the terminal nodes and thus forms the message content. The linguistic realisation component then linearises the message content, which is retrieved from the hierarchical structure in depth-first order, determines some coherence factors (e.g., inserting adverbial connectives, deciding appropriate anaphoric expressions) to generate a cohesive unit of text, and finally maps it into the surface sentences.
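The plan-based decomposition described above can be sketched in a few lines. The operator and goal names below are invented purely for illustration; a real planner would also consult the user model, discourse model and domain KB:

```python
# Sketch of plan-based text planning: a goal is decomposed via planning
# operators until primitive operators (leaves) are reached; the leaves,
# read in depth-first order, form the linear message content.
OPERATORS = {
    # non-primitive operator -> its decomposition (hypothetical names)
    "describe-segment": ["identify-segment", "list-features"],
    "list-features": ["state-location", "state-income"],
}

def plan(goal):
    """Return the depth-first sequence of primitive message units."""
    if goal not in OPERATORS:          # primitive operator: a message unit
        return [goal]
    units = []
    for subgoal in OPERATORS[goal]:    # decompose and recurse
        units.extend(plan(subgoal))
    return units

print(plan("describe-segment"))
```

Each resulting unit would then be filled with content from the domain knowledge base and handed to the linguistic realisation component.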

Natural Language Processing in Verbatim Text Coding

Coding verbatim answers to open-ended questions

Verbatim text coding is understood as a task of text content classification and code assignment. The verbatim answers to open-ended questions are usually classified and coded by human coders, even though there have been reports in recent years of computer-assisted coding systems for qualitative research [Kelle 95; Weitzman and Matthew 95] and quantitative survey research [Luyens 95; Wong et al. 95]. To suit our needs for Chinese language processing, user-interface design and possible extensions, we developed a coding system prototype called Codia in the hope that verbatim text coding can become easier.

Implementation of the Codia system

Many of our natural language processing techniques were developed in Prolog before the actual implementation of Codia in Borland Delphi under Windows 95/NT. In our Codia prototype (Figure 4), each window handles all the answers to a single open-ended question. The sizes and positions of windows can be changed as desired. Each window comprises an Answer panel (left) and a Classifier panel (right), whose sizes are adjustable by moving a splitter bar between the two panels. While the Answer panel provides a table view of the answers obtained from multiple respondents, the Classifier panel provides a tree view of the hierarchical structure of the corresponding code list. To do coding, we only need to select and move specific answers from the Answer panel to the appropriate categories in the Classifier panel; the Codia system will then assign codes to the answers. Recoding is as simple as moving answers from one category to another in the Classifier panel. Code frames (or lists of code category labels) of past projects or of other questions in the same survey are reusable by copying them between windows.

Figure 4: A screen snapshot of the Codia prototype system for coding Chinese verbatim text.

Need for natural language processing techniques

We found that natural language processing can save the time and effort of human coders. In general, the number of answers ranges from hundreds to thousands. To develop a code frame, human coders usually need to find similar answers in a long list of verbatim answers. This task is tedious even when done on a computer: the small window/monitor screen limits our view of the verbatim answers, and it is difficult to find and group conceptually similar answers in such a small window/screen. Thus, it is preferable to have effective facilities to cluster similar answers; with them, we do not need to spend so much time scrolling the Answer panel. In the Codia system, we rank verbatim answers according to their text similarity with a representative answer or input text. As shown in Figure 4, most of the answers relevant to the category "bactericidal activities of toothpaste" come together at the top of the Answer panel. Such evaluation of text similarity requires natural language processing and information retrieval techniques.

Simple method to group similar verbatim answers

Codia ranks verbatim answers in accordance with the semantic categories of their constituent words. To achieve this, we need to segment the verbatim answers into meaningful words and find their semantic categories in a knowledge base. In our preliminary implementation, we modified a Chinese thesaurus by using the available Chinese linguistic resources [Mei 83; BLA 86; Liu 95] to provide a word list with semantic categories. Although the present system is implemented to classify Chinese verbatim answers, its method should be generally applicable to other languages. The text similarity evaluation method is outlined as follows:

1. We are given a list of verbatim answers as displayed in the Answer panel.

2. We select a verbatim answer, or input a text string, which contains the most representative words for a candidate concept.
We press a button to extract keywords for text similarity evaluation.

3. The Codia system automatically looks up the thesaurus (or knowledge base) and retrieves all the words of the same semantic categories.

4. The Codia system calculates a similarity score (or relevance score) for each verbatim answer according to a similarity evaluation scheme based on the presence of words representing the same semantic concept.

5. The verbatim answers are ranked by similarity score in descending order and re-displayed in the Answer panel.

As a result, the answers similar/relevant to the key answer are clustered at the top of the Answer panel for easy transfer to the Classifier panel. Multiple answers can be selected in the Answer panel and transferred to the Classifier panel at one time. The evaluation of text similarity scores is important in clustering verbatim answers. We have developed a few simple formulae for similarity scores (Tables 1 and 2). For the weighted scores, users can set the weights to meet the requirements of specific situations.
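The five steps above can be sketched as follows. The thesaurus and answers are English stand-ins invented for illustration (Codia's actual resources are Chinese), and the score here is the simplest category-overlap count:

```python
# Sketch of Codia-style ranking: answers sharing semantic categories
# with the key answer float to the top of the list.
THESAURUS = {  # word -> semantic category (illustrative stand-in)
    "germs": "hygiene", "bacteria": "hygiene", "clean": "hygiene",
    "price": "cost", "cheap": "cost", "taste": "flavour",
}

def categories(text):
    """Semantic categories of the words in a segmented text."""
    return {THESAURUS[w] for w in text.split() if w in THESAURUS}

def rank(answers, key_answer):
    """Rank answers by how many words share a category with the key."""
    key_cats = categories(key_answer)
    def score(answer):
        return sum(1 for w in answer.split() if THESAURUS.get(w) in key_cats)
    return sorted(answers, key=score, reverse=True)

answers = ["nice taste", "kills germs and bacteria",
           "very cheap", "keeps teeth clean"]
print(rank(answers, "removes bacteria"))
```

The two hygiene-related answers come first, ready to be selected together and moved to the Classifier panel.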

Text similarity measures

Basically, the text similarity (or relevance) scores are evaluated according to the matching frequency of words and their semantic categories. The higher the similarity score, the more similar or relevant the answer item is to the conceptual category of interest. To allow variations and choices in the evaluation of similarity scores, we devise five basic similarity scores (Table 1) as well as their derivative scores (Table 2). More sophisticated similarity evaluation schemes are being developed.

Table 1: Basic similarity scores. Each basic count (NFS, SFS, NWS, SWS, UWS) over the total N is used in a normalised form and an information form:

  NFS / N    -log2(N / NFS)
  SFS / N    -log2(N / SFS)
  NWS / N    -log2(N / NWS)
  SWS / N    -log2(N / SWS)
  UWS / N    -log2(N / UWS)
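Assuming, as Table 1 suggests, that each basic count X is used both as a ratio X / N and as an information-style score -log2(N / X), the two forms can be computed as:

```python
import math

def normalised_score(x, n):
    """Normalised form of a basic count x over a total n: x / n."""
    return x / n

def information_score(x, n):
    """Information (log) form: -log2(n / x); 0 when x == n,
    increasingly negative as the match count x shrinks."""
    return -math.log2(n / x)

# e.g. 3 of 8 words match the target semantic categories:
print(normalised_score(3, 8))   # 0.375
print(information_score(3, 8))
```

Either form ranks answers identically (both are monotone in x); the log form mainly changes how scores combine when weighted.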

Possible implementations of automatic text classification in Codia

We are going to refine our similarity evaluation methods to incorporate case-based reasoning facilities. We are also developing classification schemes (domain-specific knowledge bases) based on a few representative survey research projects. Once these are done, the Codia system will incorporate artificial intelligence techniques such as genetic algorithms to do automatic classification. Neural networks are also useful for coding verbatim text, especially when explicit knowledge of the coding process is not yet available in the computer. In any case, we think that the results of automated coding should be refined by human coders so that the quality of coding can be ensured.

Natural Language Processing in Data Mining Report Generation

Data mining results as rules

Data mining discovers and extracts the patterns of multivariate data which are usually large in size and stored in a regular format such as databases. Its use has been demonstrated in various data analysis applications [Fayyad et al. 96], including marketing research [Ciesielski and Palstra 96; Stone et al. 96; Stone et al. 97]. Data mining is usually automatic and efficient. In almost all reports, data mining is found able to discover data patterns which human experts overlook or cannot find at such speed. Although it is unlikely that data mining can replace conventional data analysis without the incorporation of domain knowledge, data mining may be performed in parallel with conventional data analysis methods in order to search for non-obvious data patterns. Data mining results are usually not very readable to people who are not familiar with computer languages or notations. A majority of data mining software tools can generate rules to describe the data patterns. Such rules are usually in IF... THEN...
format or executable program code (e.g., Prolog), together with certainty values (e.g., the percentage of cases covered by the rule). The KnowledgeSeeker and SIPINA software packages can produce data mining results in both rule and Prolog code formats. Each rule represents a pattern of the data. For example, the following rule represents a data mining result:

if q12 = 4 and q31 = 6 and q35 = 3 then q38 = 3

We would need to look up the question variables and values in a code book for the meaning of the rule. When it is possible to make the variable names self-explanatory, the rule may look less obscure. For example:

if household_income = 4 and city = 6 and car_owner = 3 then user = 3

To understand this rule, we still need to look up the code book to find out (1) what household income code 4 stands for, (2) which city code 6 stands for, (3) which type of car code 3 stands for, and (4) which product the respondents are users of. Instead of looking up multiple files/records for the meaning of rule syntax and variable names, non-technical persons prefer reading a brief description in natural language of the patterns in the data. It is therefore desirable to have a natural language generation system produce this kind of data mining report.
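The code-book lookup itself is mechanical, which is what makes automatic report drafting feasible. A minimal sketch, with code-book entries invented to match the example above:

```python
# Decode a mined rule's coded values through a code book.
# Variable names and code values here are hypothetical examples.
CODE_BOOK = {
    ("household_income", 4): "high monthly household income",
    ("city", 6): "residence in Shanghai",
    ("car_owner", 3): "ownership of imported cars",
    ("user", 3): "product X user",
}

def decode(conditions, conclusion):
    """Turn a (conditions, conclusion) rule into a readable phrase list."""
    feats = [CODE_BOOK[c] for c in conditions]
    return f"{CODE_BOOK[conclusion]}: " + ", ".join(feats)

rule = ([("household_income", 4), ("city", 6), ("car_owner", 3)],
        ("user", 3))
print(decode(*rule))
```

Decoding alone still yields a phrase list rather than fluent prose; turning it into grammatical sentences is the job of the generation component described next.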

Need for natural language generation techniques

Once we have all the background information, we can describe the data mining results in natural language (e.g., English) so that users can understand the findings quickly. Translating one or two simple rules seems trivial, but it is too tedious to translate tens or hundreds of such rules into English, and the task becomes huge if multilingual versions of the report are required. Writing data mining reports in natural language may take longer than the data mining process itself, and the report writing task diminishes the productivity advantage of data mining. As such, we tried to use natural language generation techniques in the hope that the data mining reports can be drafted automatically. Our findings are reported in the subsequent sections.

Implementation of a report generator

Our report generator for producing English text is extended from Michael Elhadad's FUF [Elhadad 92], which employs the technique of functional unification grammars. The system is developed using Common Lisp and Edinburgh-standard Prolog. We run the report generator on a Sun SPARCstation (running Solaris 2.5.1) or a Pentium PC (running Linux 2.0) with Kyoto Common Lisp and SICStus Prolog / SWI-Prolog.

How the report generator works

To generate English sentences, we first input the data mining results (rules) and background information (e.g., variable names, categories of questions, etc.) into the report generator. Subsequently, we set a text goal so that the report generator can produce an English text accordingly. The generated text can be output to the screen or to a file in plain text or HTML format (for reading in WWW browsers). A text goal represents a method of describing the data mining results. With a text goal, the report generator can combine all the available information and produce English sentences to describe the data mining result.
To generate sentences as correct as possible, the report generator makes use of lower-level semantic representations and grammars, which are project-independent. The grammar writing procedure can be outlined as follows:

1. Determine the input.
2. Identify the types of sentences to produce.
3. For each type of sentence, identify the constituents and their functions in the sentence to produce.
4. Determine the output sentence in terms of constituent structures.
5. Determine the difference between the input and the output.
6. For each category of constituent, write a branch of the grammar.

Besides complying with the grammars, the report generator can also choose randomly among expressions of similar meaning to increase the variation of the descriptions. The use of the report generator can be exemplified as follows. Suppose we have a rule (r1):

if q5 = 2 and q12 = 4 and q20 = 4 and q31 = 6 and q35 = 3 then q38 = 3

We can set a simple text goal as follows:

say(feature,[r1]).

This simple text goal instructs the system to describe the features of the respondents who are covered by r1. With this text goal for r1, the report generator can generate a text like:

r1: The segment of respondents who are product X users is characterized by residence in Shanghai, consumption of brand Y cigarettes, overseas travel in the past twelve months, ownership of imported cars, and high monthly household income (RMB 10000-14999).

where the word "respondents" can be replaced randomly by other words such as "people", and the word "segment" can be replaced randomly by other words such as "group", etc. Multiple rule names can be put in the square brackets (e.g., say(feature,[r1,r2,r3,r4]).) to generate descriptions of those rules in the same manner of expression. If we want to describe the results in different ways, we may set different text goals. For example:

say(general,[r1]).
say(likely,[r1]).
say(reason,[r1]).

These text goals will instruct the system to produce:

r1: Basically, the respondents who are product X users have residence in Shanghai, consumption of brand Y cigarettes, overseas travel in the past twelve months, ownership of imported cars, and high monthly household income (RMB 10000-14999).

r1: It is likely that the people who have residence in Shanghai, consumption of brand Y cigarettes, overseas travel in the past twelve months, ownership of imported cars, and high monthly household income (RMB 10000-14999) are product X users.

r1: The respondents are product X users because they have residence in Shanghai, consumption of brand Y cigarettes, overseas travel in the past twelve months, ownership of imported cars, and high monthly household income (RMB 10000-14999).

Some of the wordings can be chosen randomly by the system. For instance, the word "basically" may be replaced randomly by words such as "generally", and the word "likely" by words such as "probable" or "believed", etc.
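The text-goal mechanism and the random lexical variation can be approximated in a few lines. This is only a rough template-based sketch with invented data, not FUF's unification-based realisation:

```python
import random

# Rough sketch of text-goal-driven realisation with random lexical
# variation; the real system uses a functional unification grammar.
SYNONYMS = {"respondents": ["respondents", "people"],
            "segment": ["segment", "group"]}
FEATURES = {"r1": ["residence in Shanghai",
                   "ownership of imported cars",
                   "high monthly household income (RMB 10000-14999)"]}

def say(goal, rules, rng=random):
    """Realise one sentence per rule according to the text goal."""
    texts = []
    for r in rules:
        who = rng.choice(SYNONYMS["respondents"])
        unit = rng.choice(SYNONYMS["segment"])
        feats = ", ".join(FEATURES[r])
        if goal == "feature":
            texts.append(f"{r}: The {unit} of {who} who are product X users "
                         f"is characterized by {feats}.")
        elif goal == "likely":
            texts.append(f"{r}: It is likely that the {who} who have {feats} "
                         f"are product X users.")
    return texts

print(say("feature", ["r1"]))
```

Passing a seeded `random.Random` instance as `rng` makes the lexical choices reproducible for testing.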
Possible improvements of the data mining report generator

The present report generator is being improved. In our preliminary tests, we found it necessary to parse the rule variable names/labels for automatic detection of their syntactic structures.

Besides, it is desirable to analyze the relationships between rules so that their similarities or differences can be described automatically. We also need to implement some text plans to generate long reports with linguistic coherence. In addition to improving the English report generator, we are going to develop a Chinese text generator based on recently developed Chinese language generation techniques [Yeh and Mellish 94; Yeh and Mellish 95a; Yeh and Mellish 95b; Yeh and Mellish 96].

Concluding Remarks

Natural language processing (NLP) techniques are found useful in verbatim text coding and data mining report generation. With NLP, similar verbatim answers can be collected for semi-automatic classification in our Codia system, and data mining results can be described in English text by a report generator. To achieve automatic classification of verbatim answers, we are developing better facilities for robust syntax parsing and sophisticated text similarity evaluation. We are also writing re-usable grammars (declarative specifications) for producing even more readable data mining reports. While it is very difficult to construct a knowledge base that complies with the many explanatory models in marketing research [O'Shaughnessy 92], it is feasible to generate natural language reports that describe basic patterns of data. We believe that data mining report generation is one of the niches in which natural language generation techniques will be very useful.

References

[Allen 95] James Allen. Natural Language Understanding. Second Edition, Benjamin/Cummings, U.S.A., 1995.

[BLA 86] Beijing Linguistics Academy. Word Frequency Lists of Modern Chinese. Beijing Linguistics Academy Press, Beijing, China, 1986.

[Cawsey et al. 95] Alison Cawsey, Kim Binsted, and Ray Jones. Personalised Explanation for Patient Education. Proceedings of the 5th European Workshop on Natural Language Generation, pp. 59-74, 1995.

[Ciesielski and Palstra 96] Victor Ciesielski and Gregory Palstra.
Using a Hybrid Neural / Expert System for Data Base Mining in Market Survey Data. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 36-43, 1996. [Covington 95] Michael Covington. Natural Language Processing for Prolog Programmers. Prentice-Hall, U.S.A., 1995. [Dale 92] Robert Dale. Generating Referring Expression: Constructing Descriptions in a Domain of Objects and Processes. MIT Press, Cambridge, MA, U.S.A., 1992. [Elhadad 92] Michael Elhadad. Using Argumentation to Control Lexical Choice: A Functional Unification-based Approach. PhD Thesis, Department of Computer Science, Columbia University, New York, U.S.A., 1992. [Fayyad et al. 96] Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, Cambridge, MA, U.S.A., 1996.

[Gazdar and Mellish 89] Gerald Gazdar and Chris Mellish. Natural Language Processing in Prolog. Addison-Wesley, U.S.A., 1989.
[Kelle 95] Udo Kelle. Computer-aided Qualitative Data Analysis. Sage Publications, London, U.K., 1995.
[Liu 95] Shu-Xin Liu. Descriptive Lexicology of Chinese. The Commercial Press, Beijing, China, 1995.
[Luyens 95] Serge Luyens. Coding Verbatims by Computer. Marketing Research. 7(2) 21-25, 1995.
[McKeown 85] Kathleen McKeown. Text Generation. Cambridge University Press, U.K., 1985.
[Maybury 90] Mark Maybury. Planning Multisentential English Text Using Communicative Acts. PhD Thesis, Cambridge University, U.K., 1990.
[Mei 83] Jia-Ju Mei. TongYiCiCiLin (The Thesaurus). Shanghai Cishu Press, Shanghai, China, 1983.
[Ohaughnessy 92] John Ohaughnessy. Explaining Buyer Behavior - Central Concepts and Philosophy of Science Issues. Oxford University Press, New York, U.S.A., 1992.
[Paris 87] Cecile Paris. The Use of Explicit User Models in Text Generation: Tailoring to a User's Level of Expertise. PhD Thesis, Columbia University, New York, U.S.A., 1987.
[Reiter 94] Ehud Reiter. Has a Consensus NL Generation Architecture Appeared, and Is It Psycholinguistically Plausible? Proceedings of the 1994 International Natural Language Generation Workshop, 1994.
[Reiter et al. 95] Ehud Reiter, Chris Mellish, and John Levine. Automatic Generation of Technical Documentation. Applied Artificial Intelligence. 9:259-287, 1995.
[Stone et al. 96] Merlin Stone, Richard Sharman, Bryan Foss, Evangelos Simoudis, Richard Lowrie, and John Hallick. Managing Data Mining in Marketing - Part I. Journal of Targeting, Measurement and Analysis for Marketing. 5(2) 125-150, 1996.
[Stone et al. 97] Merlin Stone, Richard Sharman, Bryan Foss, Evangelos Simoudis, Richard Lowrie, and John Hallick. Managing Data Mining in Marketing - Part II. Journal of Targeting, Measurement and Analysis for Marketing. 5(3) 247-264, 1997.
[Wong et al. 95] Kam-Fai Wong, Haihau Pan, Boon-Toh Low, Chun-Hung Cheng, Vincent Lum, and Sze-Sing Lam. A Tool for Computer-assisted Open Response Analysis. Proceedings of the 15th International Conference on Computer Processing of Oriental Languages, pp. 191-198, 1995.
[Yeh and Lee 91] Ching-Long Yeh and Hsi-Jian Lee. Rule-based Word Identification for Mandarin Chinese Sentences -- A Unification Approach. Computer Processing of Chinese and Oriental Languages, 5(2): 97-118, 1991.
[Yeh and Mellish 94] Ching-Long Yeh and Chris Mellish. An Empirical Study on the Generation of Zero Anaphors in Chinese. Proceedings of the 15th International Conference on Computational Linguistics, Kyoto, Japan, pp. 732-736, 1994.
[Yeh and Mellish 95a] Ching-Long Yeh and Chris Mellish. An Empirical Study on the Generation of Description for Nominal Anaphors in Chinese. Proceedings of Recent Advances in Natural Language Processing, Velingrad, Bulgaria, 1995.
[Yeh and Mellish 95b] Ching-Long Yeh and Chris Mellish. An Empirical Study on the Generation of Description for Nominal Anaphors in Chinese. In R. Mitkov and N. Nicolov (eds), Recent Advances in NLP 1995, Series: Current Issues in Linguistic Theory, Vol. 136, John Benjamins, Amsterdam, Netherlands, 1995.
[Yeh and Mellish 96] Ching-Long Yeh and Chris Mellish. An Evaluation of Anaphor Generation in Chinese. Proceedings of the 8th International Workshop on Natural Language Generation, Sussex, U.K., 1996.

The Authors

Josef Leung is Technical Manager at A.C.Nielsen-SRG, Hong Kong, China. Ching-Long Yeh is Associate Professor in the Department of Computer Science and Engineering, Tatung Institute of Technology, Taipei, Taiwan.