INTEGRATED DEVELOPMENT ENVIRONMENTS FOR NATURAL LANGUAGE PROCESSING
Text Analysis International, Inc.
Hollenbeck Ave. #501
Sunnyvale, California 94087
October 2001
INTRODUCTION

From the time of the first Russian-to-English translation projects of the 1950s, a tremendous amount of research and development has been expended to automate the analysis and understanding of unstructured or "free" text. Children universally become fluent in their native language within the first few years of life, leading us to expect that language should be easy to process using computers. However, language processing has proved to be an extremely complex field of endeavor. Natural Language Processing (NLP) is the branch of Artificial Intelligence (AI) devoted to this area, along with its sister discipline, Computational Linguistics (CL). Knowledge Representation (KR) theories also play an important role in the use of linguistic, conceptual, and domain knowledge required for properly automating language processing.

The past ten years have seen rapid and substantial growth in the commercialization of NLP, along with the widespread assessment that NLP is currently a critical technology for business. Two trends are largely responsible for this: the ascent of Natural Language Engineering (NLE) and statistical NLP. Both arose in response to the brittleness of previous approaches based on formal theories and crudely handcrafted systems.

NLE emerged almost overnight, propelled by successes such as the FASTUS system [Hobbs et al., 1993, SRI International] at the fourth DARPA Message Understanding Conference (MUC-4). Here, an internationally respected research group that had spent decades on a formal approach to language processing reversed course and implemented an "unprincipled" information extraction (IE) system that achieved among the highest scores in competitive blind testing.

Statistical NLP approaches have yielded strong commercial successes in recent years, primarily in text categorization applications. The popularity of weak methods such as Bayesian statistics, machine learning, and neural networks is due to the ease of their implementation and the fact that developers don't have to "write the rules." However, systems based solely on such shallow methods are limited in terms of accuracy, extensibility, and applicability to NLP tasks.

What is the next required step in NLP? To quickly build a fast, accurate, extensible, and maintainable NLP application, one requires excellent tools that integrate multiple methods and strategies. Currently, the field primarily employs discrete, ad hoc tools and components, with no overall framework for building a complete text analysis system.

WHERE ARE THE TOOLS?

1.1 Requirements for an IDE

In every well-established area of computer software, tools have sprung up to organize, accelerate, and standardize the development of applications. Prime examples are computer programming applications, which are now universally developed via integrated development environments (IDEs) such as Microsoft Visual C++. IDEs offer universally recognized features such as:
- A graphical user interface (GUI) that enables users to conveniently manage the task at hand
- A complete environment for building an application
- Helpful views of the application and components under construction
- Cross-references to provide instant feedback about changes to the application
- Prebuilt framework, components, and tools that work together seamlessly
- Open architecture for integrating multiple methods and external components
- Debugging facilities to step through the running of the application and identify errors
- Integrated documentation to help the developer answer technical questions quickly

The IDE used in building an application doubles as a platform for maintaining and enhancing the application. The modular structure of an application built with an IDE enables developers other than the creators of the application to understand, maintain, and enhance it. In addition, nonexpert users of the application should be able to maintain and enhance it, to a reasonable extent, using the IDE.

The benefits of IDEs include:

- Accelerated development of applications
- Properly engineered, modular, and organized applications
- Applications with reusable components
- Highly extensible and maintainable applications
- Bottom line: reduced resources needed to build and maintain applications

1.2 NLP Tools

While many excellent tools have been built for NLP, precious few merit the label "IDE." A variety of tool kits and software development kits (SDKs) exist, all claiming to substantially aid in the construction of NLP applications. In some of these offerings, the tool kit is little more than a bag of tricks that cannot easily be combined. Each program in the kit may perform its task well and quickly, but developers must cobble the programs together into a patchwork system of diverse components and data representations, and must provide their own programmatic "glue", somehow adding hand-built components so that the application can actually perform the tasks required of it.

Typical tool kits may include programs for:

- Tokenizing
- Morphological analysis
- Spelling correction
- Part-of-speech tagging
- Entity recognition or simple extraction (names, titles, locations, dates, quantities)
- Constituent recognition (e.g., noun phrase recognition)

It is often the case that these components are given as "black boxes." That is, they cannot be read, modified, or enhanced, but must be run as they are given.

More integrated tool kits are also available, in which the component programs can run in multiple passes. For example, the part-of-speech tagger (that identifies nouns, verbs, and so on) may use the tokenizer's output, and so on. In this way, the programs can be cascaded so as to work in concert.
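To make the cascading concrete, here is a minimal sketch in Python of a two-stage pipeline in which a tokenizer feeds a part-of-speech tagger. The component functions and the tiny tag lexicon are invented for illustration and do not correspond to any particular vendor's tool kit.

```python
# Minimal sketch of a cascaded tool kit: tokenizer output feeds the tagger.
# All names and the toy lexicon are hypothetical, for illustration only.

from typing import List, Tuple

def tokenize(text: str) -> List[str]:
    """Pass 1: split raw text into tokens (naive whitespace/punctuation split)."""
    tokens = []
    for raw in text.split():
        word = raw.strip('.,;:!?"()')
        if word:
            tokens.append(word)
    return tokens

TAG_LEXICON = {            # toy lexicon standing in for a real tagger model
    "the": "DET", "a": "DET",
    "company": "NOUN", "acquisition": "NOUN",
    "announced": "VERB", "completed": "VERB",
}

def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Pass 2: consume the tokenizer's output and attach part-of-speech tags."""
    return [(t, TAG_LEXICON.get(t.lower(), "UNK")) for t in tokens]

if __name__ == "__main__":
    text = "The company announced the acquisition."
    print(tag(tokenize(text)))   # [('The', 'DET'), ('company', 'NOUN'), ...]
```

In a real tool kit, each stage is a separate black-box program rather than a function one can read and modify, which is precisely what makes inserting a new pass between stages awkward.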
But in such an integrated tool kit, at least some of the components are still black boxes. Furthermore, placing additional passes between the given programs requires programmers to build ad hoc passes that conform to the output of one program and the input of the other.

Many critical components are noticeably absent from available tool kits, including parsing, semantic analysis, and discourse analysis. All of these present inordinate difficulties for most tool vendors: they don't have an adequate solution and typically offer no framework or support for crafting a solution.

Even with the integrated tool kit, the output of the final cascaded program typically bears no resemblance to the output required of the application. If developers need an excellent part-of-speech tagger, perhaps they can use one that is provided, but then they must complete the application, somehow adding the most difficult components, such as parsing, semantic analysis, and discourse analysis, with no assistance from the tool kit. And the only way to enhance the provided tagger is to wait for an upgrade from the vendor.

1.3 FASTUS

One of the attractions of FASTUS, mentioned earlier, is that it supports the construction of information extraction applications. FASTUS uses a set of cascaded components that do the complete job. The multiple passes of FASTUS are as follows (depending on which historical version is being referenced):

- Name Recognizer
- Parser
- Combiner
- Domain Phase

In essence, FASTUS looks very much like the integrated tool kits described earlier. But it supports the functionality needed for an end-to-end information extraction system, and it provides modifiable components: users may add rules and patterns to the system.

While FASTUS may have been one of the better performers at the MUC conferences [REFERENCE], its rating in the information extraction task at MUC-6 (one of the later MUC conferences) of 44% recall and 61% precision leaves substantial room for improvement. To quote one of the FASTUS papers [FASTUS: Extracting Information from Natural Language Texts, by Jerry Hobbs et al.]: "While FASTUS is an elegant achievement, the whole host of linguistic problems that were bypassed are still out there, and will have to be addressed eventually for more complex tasks, and to achieve higher performance on simple tasks."

1.4 NLP Frameworks

To support the creation of accurate and complete NLP systems, we require much more than a medley of components, no matter how good they are individually. We also require a more flexible arrangement than a hard-wired ordering of passes and algorithms, even when those
passes can be modified by the user. In short, we require a generalized framework for managing passes, algorithms, and their associated rules, data structures, and knowledge.

Given such a flexible arrangement, one can organize an NLP system in the best possible way for any given application. Furthermore, the ability to insert passes into an existing set of passes enables a system to grow in a flexible and modular fashion. For example, some passes can be devoted entirely to syntax, others to lexical processing, others to domain-specific tasks, and so on. Each pass may consist of a simple process, such as segmenting text into lines, or a complex subsystem, such as a recursive grammar for handling lists. Also, by conditionally executing passes, a single NLP system may identify, process, and unify a wide variety of input formats such as RTF, PDF, and HTML, in addition to plain text.

Consider three enticing scenarios for a flexible multi-pass framework:

High confidence processing. Imagine an NLP system that performs a complete set of processes, but only when the independent confidence of each action exceeds 99%. That is, it performs lexical, syntactic, semantic, and discourse processing only where actions can be performed with the highest accuracy. While not many actions may actually be executed, those that are performed will be extremely accurate. Furthermore, matched constructs will serve as context for subsequent actions. After one complete cycle, lower the confidence threshold to 98% and repeat the processing cycle, then lower it to 97%, and so on, until some minimal threshold is reached. Each cycle works in more constrained contexts that have been characterized with high confidence. In this way, a very accurate overall processing of text can occur.

Simulating a recursive grammar. While recursive grammar approaches are typically plagued by problems of multiple interpretations and ambiguity, imagine a parser that constructs only a single parse tree. Each step in a multi-pass NLP system can assess and control the modifications to the parse tree, allowing implementation of a "best first" stepwise approach to simulating the work of a recursive grammar.

Characterization. In addition to passes that modify a parse tree, one can intersperse any number of passes whose sole purpose is to characterize the text at various points, in order to gather evidence and preferences for subsequent processing tasks. For example, one can gather various types of evidence about a period and its context to ascertain its likelihood of being an end of sentence. Segmenting at ambiguous ends of sentence can be deferred until enough evidence has been accumulated for or against.
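The high-confidence scenario is essentially a loop over the same set of passes with a gradually relaxed threshold. A minimal sketch of that control flow is given below in Python, assuming hypothetical pass functions that propose (action, confidence) pairs; the names, scores, and threshold schedule are illustrative only, not part of any actual product.

```python
# Sketch of "high confidence processing": run every pass, but commit only the
# actions whose confidence exceeds the current threshold, then lower the
# threshold and repeat. All names, scores, and thresholds are illustrative.

from typing import Callable, Dict, List, Tuple

# A pass proposes (action, confidence) pairs given the current analysis context.
Pass = Callable[[Dict], List[Tuple[str, float]]]

def lexical_pass(context: Dict) -> List[Tuple[str, float]]:
    # Placeholder: a real pass would inspect the text or parse tree in `context`.
    return [("treat 'Inc.' as a company suffix", 0.995),
            ("treat trailing period as sentence end", 0.93)]

def run_high_confidence(passes: List[Pass], context: Dict,
                        start: float = 0.99, stop: float = 0.90,
                        step: float = 0.01) -> Dict:
    """Cycle over all passes, committing only high-confidence actions; each
    later cycle relaxes the threshold and works in the richer, more constrained
    context established by the earlier, highly accurate commitments."""
    committed = context.setdefault("committed", [])
    threshold = start
    while threshold >= stop:
        for analysis_pass in passes:
            for action, confidence in analysis_pass(context):
                if confidence > threshold and action not in committed:
                    committed.append(action)  # a real system would update the parse tree here
        threshold -= step
    return context

if __name__ == "__main__":
    result = run_high_confidence([lexical_pass], {"text": "Acme Inc. bought Beta Corp."})
    print(result["committed"])
```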
THE SHALLOW ONES

A host of companies in the marketplace claim that hand-building NLP systems is not needed. This may be so in the case of shallow applications such as categorization, summarization, and attribute extraction, where high accuracy is not as important as getting a system up quickly. In some simple applications, high accuracy has also been achieved, but as the tasks become more complex, the accuracy falls off rapidly. The primary technologies used for shallow processing are statistical NLP (e.g., Bayesian methods), probabilistic methods (e.g., Hidden Markov Models), neural networks (NN), and machine learning (ML).

These are excellent technologies that enable systems to be constructed automatically or with simpler setups than handcrafting an NLP system, and they should be part of a complete framework for NLP. However, these methods by themselves cannot support high precision and recall in many applications, and they cannot by themselves support deep, accurate, and complete analysis and understanding of text.

The most advanced types of systems built exclusively with these methods are on the order of "name recognition" or "entity recognition." That is, if a text has items to be identified individually, then statistical methods may be able to support an accurate and complete solution. Thus, for example, a shallow system can be built to accurately find the pieces of a job description, including the job title, contact information, years of experience, and other discrete, non-repeating data types. But for texts that mix multiple events or repeating data items, such as a Wall Street Journal article or an employment resume, statistical methods will in general not suffice to accurately extract and synthesize business event records or contact, employment, and education records, respectively. The more complex the domain of discourse and the more structured the desired information, the more NLP and customization will be required, relative to shallow automated methods.

Implementing shallow methods is relatively cut-and-dried, and therefore competition is fierce among companies that critically depend on them for their technology and products. While substantial hype emanates from many such companies, customer experience is often at odds with the exaggerated claims.

Categorization is an interesting application to examine, since it is widely implemented by text analysis vendors. Automatically distinguishing sports articles from business articles accurately, for example, is feasible with all the shallow methods. However, as finer-grained or overlapping categories are specified, e.g., indoor versus outdoor sports, the accuracy of automatically generated categorizers quickly drops off. Similarly, processing texts that entangle multiple categories of interest will typically yield unsatisfactory results with these methods. Categories that share substantial numbers of keywords present difficult obstacles to these methods.

One qualitative difference between these technologies and NLP for many applications is that with shallow methods, a person still has to read every text of interest, while deeper processing can convert text to a database or other structured representation, so that people need not read the voluminous text. Instead, they can query a database to immediately get precise answers to queries such as "list all the acquisitions in the last quarter valued at between $10M and $50M."

Another grand claim of shallow methods is that they are "language independent." While this is true and is a positive feature in that the methods can be applied quickly, it also means that these methods can never suffice to understand language. They work only by gathering information about frequency, proximity, and other statistics relating to patterns of letters. While shallow methods may eventually be used to accurately perform complex language analysis tasks, the state of the art is years away from that point. Statistical NLP is still in its infancy, and we cannot rely on it exclusively for commercial-strength solutions to complex language processing tasks.
It also seems to be the case that the more fine-grained the learning required of a system, the more manually prepared data is required. It is not at all clear that the preparation of such data will be less resource-intensive than manually engineering an NLP system.
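For concreteness, the sketch below shows how little machinery the shallow categorization approach discussed above requires: a tiny Naive Bayes model trained on a handful of labeled snippets. The training data and categories are invented for this illustration; the ease of setup is exactly why such methods proliferate, and their reliance on shared keywords is exactly why they degrade on fine-grained or overlapping categories.

```python
# A minimal Naive Bayes text categorizer, illustrating the "shallow" approach.
# Training data and categories are toy examples invented for this sketch.

import math
from collections import Counter, defaultdict

TRAINING = [
    ("sports",   "the team won the final game last night"),
    ("sports",   "the coach praised the players after the match"),
    ("business", "the company reported strong quarterly earnings"),
    ("business", "shares rose after the acquisition was announced"),
]

def train(examples):
    word_counts = defaultdict(Counter)   # category -> word frequencies
    doc_counts = Counter()               # category -> number of documents
    vocab = set()
    for category, text in examples:
        doc_counts[category] += 1
        for word in text.split():
            word_counts[category][word] += 1
            vocab.add(word)
    return word_counts, doc_counts, vocab

def classify(text, word_counts, doc_counts, vocab):
    total_docs = sum(doc_counts.values())
    best, best_score = None, float("-inf")
    for category in doc_counts:
        # log prior + log likelihood with add-one (Laplace) smoothing
        score = math.log(doc_counts[category] / total_docs)
        denom = sum(word_counts[category].values()) + len(vocab)
        for word in text.split():
            score += math.log((word_counts[category][word] + 1) / denom)
        if score > best_score:
            best, best_score = category, score
    return best

if __name__ == "__main__":
    model = train(TRAINING)
    print(classify("the players won the match", *model))           # likely "sports"
    print(classify("earnings rose after the acquisition", *model))  # likely "business"
```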
INTEGRATED DEVELOPMENT ENVIRONMENTS

The purpose of an IDE is to accelerate the development of an accurate, fast, robust, and maintainable application. To be sure, there has been a substantial effort to develop IDEs for NLP in recent years. Even going back to the 1980s, Carnegie Group's LanguageCraft was a notable early attempt to commercialize an IDE. But knowledge of NLP then was not what it is today, and hardware was not as fast and capable, so LanguageCraft and some other early efforts did not succeed commercially. Several IDEs under development today are found in universities and research organizations. VisualText is by far the most ambitious and groundbreaking commercial IDE developed to date.

The reason we have not seen more effort in this area is that the shallow methods alluded to earlier have been the "hot topic" of the 1990s. Why build and use an IDE when you can automate the processing of text? As we have seen, the shallow methods are very useful, but they are insufficient for engineering a complete and accurate NLP system for a task of substantial complexity.

For most complex tasks, a combination of strategies is required. The ideal IDE will integrate shallow methods with methods based on tokens, patterns, grammars, keywords, logic, and finite state automata. It will integrate knowledge bases, expert systems, experts, planners, and other artificial intelligence systems with the overall framework. VisualText is unique in that it incorporates a programming language to help "glue" the various methods and systems together, as well as to specify actions in the most general way possible. For example, an entire program can accompany the matching of a single pattern in NLP++.

We stress that IDEs in no way conflict with shallow methods. IDEs should serve as a framework for incorporating and integrating shallow and deep methods for text analysis. They should further serve to support ever-greater degrees of automation in the construction of text analyzers. Similarly, IDEs do not conflict or compete with particular paradigms for NLP, such as HPSG (Head-Driven Phrase Structure Grammar) and other popular approaches. IDEs constitute a "meta" approach that can incorporate all of these methods.

It has been claimed that handcrafting analyzers requires an inordinate amount of time, effort, and, bottom line, cost. This is no doubt true when inexperienced developers attempt to build text analysis solutions from scratch. The result in these cases is typically a brittle, under-performing, and inextensible analyzer that can only be maintained by its developers.

In the last decade, the indications are strong that NLP has found a place in the market. Commercial-grade NLP systems have been built and are being built in such areas as medical coding and analysis, legal analysis, resume analysis, chat management, and more. But the downsides are still maintainability, cost to build, expertise, and resources.
NLP developers are not generally aware that IDEs for text analysis exist, but we expect that situation to change substantially in the coming years, with the commercial release of IDEs. The time is now. With several academic IDEs gaining substantial popularity, the window of opportunity for IDEs for NLP is expanding quickly. With the growing popularity of multi-pass frameworks such as FASTUS, further inroads into the marketplace are being established. We expect VisualText to play a major role in transforming the development and maintenance of NLP systems.

IDEs support, organize, and dramatically accelerate the construction of accurate, fast, extensible, and maintainable text analyzers. Furthermore, because multi-pass analyzers can be constructed in modular fashion, each new analyzer built can reuse substantial parts of previously built analyzers.

1.5 A Practical Development Scheme

Typically the most important measure of performance for an analyzer is recall, which is a measure of the completeness and coverage of the analyzer. That is, if it accurately extracts all the required information, then it has high recall. A second measure is precision. If every output that an analyzer produces is correct, even though it may not extract everything it needs to, then the analyzer has high precision.

When one starts building an analyzer, its recall and precision start at 0%. But the minute it outputs an extracted datum, precision can shoot up to 100% if that datum is always correct. Therefore, an obvious and very practical methodology strives to keep precision as high as possible while successively increasing the coverage of the analyzer. It turns out that, as the analyzer extracts more information, or does more "work," mistaken outputs will necessarily increase. Therefore it is axiomatic that as an analyzer is developed, precision will decrease as recall is increased. However, by carefully crafting the analyzer and looking to shore up precision every time coverage is increased, the decrease in precision can be minimized relative to the increase in recall.

Another important guideline for analyzer building is to have the analyzer assess its own confidence in the decisions it makes, the actions it takes, and the outputs that it produces. In this way, the user of the analyzer can control the precision-recall tradeoff by specifying a confidence threshold that all outputs must exceed.
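As a concrete sketch of this development scheme, the following Python fragment scores an analyzer's outputs against a hand-built answer key and shows how raising a confidence threshold trades recall for precision. The sample outputs, answer key, and threshold values are invented for the illustration.

```python
# Sketch: scoring extracted items against an answer key, and filtering by a
# confidence threshold to trade recall for precision. Data here is invented.

def score(extracted, answer_key, threshold=0.0):
    """extracted: list of (item, confidence); answer_key: set of correct items."""
    kept = [item for item, conf in extracted if conf >= threshold]
    correct = [item for item in kept if item in answer_key]
    precision = len(correct) / len(kept) if kept else 0.0
    recall = len(correct) / len(answer_key) if answer_key else 0.0
    return precision, recall

if __name__ == "__main__":
    answer_key = {"Acme Corp", "Beta Inc", "Gamma LLC"}
    extracted = [("Acme Corp", 0.99), ("Beta Inc", 0.80), ("Delta Co", 0.55)]
    print(score(extracted, answer_key))                 # all outputs kept
    print(score(extracted, answer_key, threshold=0.9))  # higher precision, lower recall
```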
1.6 Knowledge

To achieve the highest levels of accuracy in automatically processing text, the analyzer must in general understand the text much as a human being does. Overlapping with NLP is a field of endeavor called natural language understanding (NLU) that emphasizes this aspect of processing text. To understand a text, or to attempt to simulate human understanding of a text, many types of knowledge must be brought to bear, including at least:

- Linguistic knowledge of language, its use, and its processing
- Conceptual general knowledge of the world, or commonsense knowledge
- Domain-specific knowledge about the specialized domains of discourse touched on in a text

Some text analysis technologies emphasize knowledge over the practical aspects of processing texts. For example, CYC has expended tremendous resources to compile a comprehensive commonsense knowledge base. Our view is that knowledge is a tool to be used and balanced with other tools. It is easy to be swamped with too much knowledge. For example, a dictionary entry for the word "take" may list 50 distinct senses, but for a text analysis system to attempt to choose among all of these senses is ludicrous in most cases.

One successful method for building text analyzers stresses "just in time" knowledge. That is, knowledge is incorporated by the developers only where it proves essential to make progress. This approach works very well for text analyzers in restricted domains (e.g., business and medical). While broad-coverage analyzers exist for applications such as categorization, the full utilization of large-scale dictionaries has not yet been accomplished in practice.

Another problem with knowledge is that it hasn't been standardized. Every dictionary and ontology has its own biases, and disagreements about standardization abound. IDEs must be flexible enough to enable the incorporation of diverse and sometimes competing knowledge representation schemes. IDEs may also serve as a platform for moving to practical and standardized representations of knowledge.

1.7 Beyond Text Analysis

IDEs for NLP can be designed along two lines. The first is as a completely specialized environment for NLP and not much else. The second is as a general development environment augmented with specializations for NLP. Because the second design is a superset of the first, it is clearly preferable. Basically, it allows the IDE to serve additional needs of developers without their having to switch between a programming environment and an NLP environment. To support such a capability, the IDE should provide a programming language that addresses not only the needs of NLP but also those of general programming. NLP++ is the first and only programming language, to our knowledge, that addresses both NLP and general programming needs. Though it is an evolving programming language, even in its current state it can serve to "glue" components and extend the processing capability of an NLP system in ways that no other development framework can match.

Because NLP is extremely complex and difficult, applying NLP-grade technology to simpler problems such as pattern matching and low-level document processing should be "overkill." That is, it should be easy to develop applications such as:

- Compilers for programming languages and other formal languages
- Error-checking and error-correcting compilers
- General pattern matching
- General file processing
- Format conversion
- HTML generation (as performed by Perl and other CGI languages)

NLP++ supports these types of applications very well. Indeed, NLP++ itself is parsed and compiled by an analyzer written in NLP++. A bootstrapping analyzer for a simple subset of NLP++ is built into the VisualText analyzer engine. This parses the full NLP++ specification, which then parses the user-defined analyzer.

1.8 NLP Programming Language

What is a programming language for NLP, and what should it look like? What is lacking in Prolog, Lisp, C++, Java, and other programming languages when building NLP systems? An NLP programming language should empower developers to engineer a complete text analyzer. It should support:

- Processing at all levels, including tokenization and morphological, lexical, syntactic, semantic, and discourse analysis
- Constructing all needed data structures at each level of processing
- Deploying all the methods needed for text analysis, including pattern, grammar, keyword, and statistical methods
- Executing pattern matching mechanisms for all methods
- Attaching arbitrary actions to matched patterns
- Elaborating the heuristics that augment and control text analysis

An NLP programming language should, at a minimum, support both patterns and actions associated with matching those patterns. It should also provide the "glue" we need to build a practical application, so that we don't need to cobble together ad hoc components of an overall analyzer.

One early system still in use for combining grammars and general programming languages is YACC (Yet Another Compiler Compiler); Bison is a popular clone for GNU/Linux systems. YACC is attractive in that, when a rule matches, one can specify arbitrary actions in a general programming language. However, YACC is best suited for processing formal grammars and languages, rather than natural languages. Some extensions to YACC address ambiguous languages, leading to somewhat better handling of natural languages, but these enhancements are not sufficient to support a complete NLP system. Furthermore, YACC is limited to a single recursive grammar executed by a single processing algorithm.

Natural language requires a more flexible approach. As we've described earlier, the multi-pass methodology has proved ideal for building analyzers. In this way, each pass can employ its own processing algorithm, rule set, knowledge, and actions. An NLP programming language should encompass the multi-pass methodology and all the objects and processes needed in each pass of such a methodology.

The multiple passes of such a methodology must communicate with each other. The output of one pass should serve as the input to the next. A parse tree is universally recognized as an ideal data structure for holding at least a subset of such information. Further, the multiple passes should be able to post to and utilize information from a shared knowledge base. An NLP programming language should facilitate the management of parse trees and knowledge bases.
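To suggest the kind of machinery being described, the sketch below (in Python, not NLP++) shows a single shared parse tree, a sequence of passes that rewrite it, and a pattern-action rule that also posts to a shared knowledge base. The node structure, the pass functions, and the toy pattern are hypothetical stand-ins; this is not a description of NLP++ or VisualText internals.

```python
# Illustrative sketch (not NLP++): a shared parse tree, multi-pass control,
# and a pattern-action rule. All structures here are hypothetical.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class Node:
    """A parse tree node; passes rewrite the children of the root in place."""
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    attrs: Dict[str, str] = field(default_factory=dict)

def tokenize_pass(root: Node, kb: Dict) -> None:
    root.children = [Node("token", t) for t in root.text.split()]

def company_pass(root: Node, kb: Dict) -> None:
    # Pattern: a capitalized token followed by "Inc."  Action: fold both tokens
    # into a single "company" node and post the name to the knowledge base.
    out: List[Node] = []
    i = 0
    kids = root.children
    while i < len(kids):
        if (i + 1 < len(kids) and kids[i].text[:1].isupper()
                and kids[i + 1].text == "Inc."):
            name = f"{kids[i].text} Inc."
            out.append(Node("company", name, children=kids[i:i + 2]))
            kb.setdefault("companies", []).append(name)
            i += 2
        else:
            out.append(kids[i])
            i += 1
    root.children = out

Pass = Callable[[Node, Dict], None]

def analyze(text: str, passes: List[Pass]) -> Tuple[Node, Dict]:
    root, kb = Node("text", text), {}
    for analysis_pass in passes:      # every pass reads and rewrites the same tree
        analysis_pass(root, kb)
    return root, kb

if __name__ == "__main__":
    tree, kb = analyze("Acme Inc. acquired Beta Inc. yesterday",
                       [tokenize_pass, company_pass])
    print([(n.label, n.text) for n in tree.children])
    print(kb["companies"])            # ['Acme Inc.', 'Beta Inc.']
```

Because every pass operates on the same tree, a later pass automatically benefits from the structure committed by earlier ones, which is the essence of the single-parse-tree discipline described next.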
A reasoning engine should also be supported. Ideally, an expert system would be integrated with the multi-pass framework and the knowledge base; at a minimum, the NLP programming language should support arbitrary heuristics and algorithms in the absence of an expert system. Declarative and procedural knowledge should be segregated as far as possible, with the understanding that rules and their associated actions must be bound in some manner.

To truly support the construction of a complete NLP system, the programming language must incorporate many of the features of a general programming language such as C++ or Java. On top of these, it must provide an additional layer or libraries that support the specialized needs of NLP. In these regards, NLP++ is by far the most advanced programming language for NLP. NLP++ is an evolving general programming language that "talks" parse tree and knowledge base, as well as supporting a spectrum of actions tailored to rule and pattern matching. Parse tree nodes and knowledge base objects are built-in NLP++ data types, enabling simple programmatic manipulation of these objects. NLP++ also enables dynamically decorating the parse tree and knowledge base with semantic representations.

A key feature of NLP++ is that its multi-pass framework communicates via a single parse tree. In this way, combinatorial explosion is not an option: the text analyzer must be engineered so that the best possible parse tree is elaborated in stages. Further, constraining the analyzer to deal with a single parse tree helps guide the development of a fast and efficient text analyzer.

Unlike recursive grammar approaches, the multi-pass methodology enables the elaboration of a parse tree that represents the best understanding of a text. It is not required to reduce every sentence to a "sentence node." Where the coverage is sufficient, such a node will be produced. But useful information and patterns may be found in sentences that are malformed, ungrammatical, terse, or otherwise not covered by the text analyzer's "grammar." Hence, the multi-pass approach also yields robust text analyzers that can operate on messy input (such as text from speech and from OCR).

1.9 Learning and Training

We have described the recent popularity of statistical methods for text processing, including Bayesian and probabilistic methods, Hidden Markov Models, neural networks, and machine learning. An IDE should support the use of statistical methods where appropriate, as well as deeper NLP methods where they are best suited. In a multi-pass analyzer, a sequence of passes can be dedicated to automated training by statistical methods. Once training is complete, those passes can be deactivated or skipped so that the analyzer can be run against "real" input texts. Various VisualText applications have utilized statistical methods. For example, a generic analyzer framework has been built that incorporates statistical part-of-speech tagging of text.

A related method that VisualText supports is an automated rule generation (RUG) facility. Developers and users may highlight text samples and categorize them (i.e., annotate text), letting RUG automatically generalize and merge the patterns represented by the samples, automatically creating accurate rules. Various statistical methods require large bodies of annotated text. RUG, by contrast, is a powerful methodology for several reasons:

- Developers control rule generalization, merging, and generation
- Accurate rules are generated from relatively few samples
- High quality rules result from integrating NLP passes with RUG passes
- Generated rules are self-maintaining: they adapt to changes in the analyzer

We are aware of no capability comparable to RUG, other than the general similarity with statistical methods that utilize annotated (i.e., tagged) texts. The VisualText IDE exploits multiple methods to create novel advances such as RUG.
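To suggest what sample-driven rule generation can look like in principle, the sketch below generalizes two annotated samples into a single pattern by merging token-level features position by position. The feature scheme and merge strategy are invented for this illustration and are not a description of how RUG itself works.

```python
# Simplified illustration of generating a rule by generalizing annotated samples.
# The feature scheme and merge strategy are invented; this is not RUG itself.

from typing import List

def token_features(token: str) -> str:
    """Map a token to a coarse feature used for generalization."""
    if token.isdigit():
        return "NUMBER"
    if token[:1].isupper():
        return "CAPITALIZED"
    return token.lower()          # keep literal lowercase words as-is

def generalize(samples: List[List[str]]) -> List[str]:
    """Merge same-length annotated samples position by position: where the
    samples agree, keep the value; where they differ, fall back to a wildcard."""
    pattern = [token_features(t) for t in samples[0]]
    for sample in samples[1:]:
        for i, token in enumerate(sample):
            if token_features(token) != pattern[i]:
                pattern[i] = "ANY"
    return pattern

if __name__ == "__main__":
    # Samples a developer might highlight as instances of "job experience".
    samples = [
        ["5", "years", "of", "Java", "experience"],
        ["10", "years", "of", "Cobol", "experience"],
    ]
    print(generalize(samples))
    # ['NUMBER', 'years', 'of', 'CAPITALIZED', 'experience']
```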
Existing IDEs

PRODUCT             ORGANIZATION                 DESCRIPTION
Alembic Workbench   MITRE                        Engineering environment for tagged texts
GATE                University of Sheffield      General architecture for text engineering, including information extraction
INTEX               LADL                         For developing linguistic finite state machines with large-coverage dictionaries and grammars
LexGram             University of Stuttgart      Grammar development tool for categorial grammars
PAGE                DFKI                         NLP core engine for development of grammatical and lexical resources
Pinocchio           ITC-Irst                     Development environment with rich GUI for developing and running information extraction applications
VisualText          Text Analysis International  Comprehensive IDE supporting deep NLP, with integrated NLP++ language and knowledge base management

The table above lists the major GUI environments for NLP available today. Some are the work of academic, rather than commercial, entities. VisualText is far and away the most advanced IDE for natural language processing ever developed and commercialized. By virtue of the NLP++ programming language, it is unique in its ability to support the development of a complete text analyzer entirely within a GUI environment.