HLT Reading Group 2002/2003
A brief introduction is given to research on natural language technology at the Harbin Institute of Technology. The focus will be on demonstrations of a crummy Machine Translation system, an explanation of the technology in each module (morphological analysis, phrase bracketing, named-entity identification), and some application systems for information extraction, TTS etc.
Large-coverage unification grammars are extremely time-consuming, expensive and difficult to obtain. Recognition with them also tends to run in exponential time when the grammars used are not equivalent to context-free grammars, or when all possible parses are generated. We have designed an algorithm to automatically generate a large-scale LFG grammar, and have applied very simple polynomial statistical parsing techniques to it. We automatically annotate a large corpus of newspaper text (the Penn-II Treebank, Wall St. Journal section) with LFG f-structure equations, from which we can automatically extract a large-coverage unification grammar. We have designed two parsing architectures which use this automatic annotation technique to automatically generate f-structures for new text.
In this talk I will present joint work with Josef van Genabith, Andy Way and Aoife Cahill. Large-coverage unification grammars are extremely time-consuming, expensive and difficult to obtain. We have designed an algorithm to automatically annotate a large corpus of newspaper text (the Penn-II Treebank, Wall St. Journal section) with LFG f-structure equations, from which we can automatically extract a large-coverage unification grammar.
In this talk I will give a brief introduction to LFG and an overview of our automatic annotation algorithm.
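The annotation-plus-extraction idea can be sketched in a few lines of Python. The annotation principles, category names and head table below are invented simplifications for illustration, not the actual algorithm: each local tree is traversed, every daughter receives a toy f-structure equation (head daughters get up=down, an NP under S gets up.SUBJ=down), and the annotated productions are collected with frequencies.

```python
from collections import Counter

# Toy head table and annotation principles (simplified stand-ins
# for the real annotation algorithm):
HEADS = {"S": "VP", "VP": "V", "NP": "N", "PP": "P"}

def annotate_daughters(label, daughters):
    """Attach a toy f-structure equation to each daughter category."""
    eqs = []
    for dlabel, _ in daughters:
        if dlabel == HEADS.get(label):
            eqs.append((dlabel, "up=down"))        # head daughter
        elif label == "S" and dlabel == "NP":
            eqs.append((dlabel, "up.SUBJ=down"))   # NP under S = subject
        elif label == "VP" and dlabel == "NP":
            eqs.append((dlabel, "up.OBJ=down"))    # NP under VP = object
        else:
            eqs.append((dlabel, "up=down"))        # default
    return eqs

def extract_rules(tree, rules):
    """Collect annotated productions with frequencies from one tree."""
    label, children = tree
    if children and isinstance(children[0], tuple):   # not a leaf
        eqs = annotate_daughters(label, children)
        rhs = tuple(f"{d}[{e}]" for d, e in eqs)
        rules[(label, rhs)] += 1
        for child in children:
            extract_rules(child, rules)

tree = ("S", [("NP", [("N", ["John"])]),
              ("VP", [("V", ["saw"]), ("NP", [("N", ["Mary"])])])])
rules = Counter()
extract_rules(tree, rules)
for (lhs, rhs), n in rules.items():
    print(lhs, "->", " ".join(rhs), f"({n})")
```

On a full treebank the frequencies over such annotated productions would supply the probabilities for the statistical parsing step.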
Speech synthesis and recognition are already widely used in daily life, e.g. voice dialling on mobile phones (recognition), synthesis of phone numbers when calling directory enquiries, and applications to aid the visually and physically impaired. However, there is still much room for improvement in both disciplines. A problematic area of speech synthesis is achieving natural-sounding speech, i.e. speech that sounds as if it were produced by a human. If we can learn speaker characteristics, then we can use them in our synthesis to produce natural and intelligible synthetic speech as produced by a specific speaker rather than a "generic" speaker. In the area of speech recognition, it has proven difficult to develop a recogniser that is independent of the speaker, i.e. one that will recognise speech spoken by any speaker. By removing speaker-characteristic qualities, can we eliminate some of these problems?
LFG-DOT is a hybrid architecture for robust MT. This research improves upon the LFG-MT system, a constraint-based approach to translation, as well as DOT, a statistical MT system based on constructing new translations from pairs of source and target tree fragments in the system database.
Work on the topic of LFG-DOT has been well-received by researchers working with both the rule-based and statistical approaches to Machine Translation. However, the work carried out to date has been largely theoretical in nature. 'Proof of concept' of these models has been provided by testing them on limited datasets, but it is essential to determine whether these translation models remain viable when scaled up. The development of a large-scale LFG-DOT system will thus facilitate the study of the impact on system requirements of an increased number of fragments, allowing experimentation to determine the optimal system set-up for this type of translation engine. The merits of the LFG-DOT approach will be further appreciated if it can be shown that this system design remains robust in the face of larger corpora.
In this talk I will present an ICALL system which I developed for my Master's thesis. The system enables the user both to learn about the syntactic structures of Spanish (using a demonstration module) and to enter their own sentences (which are then analysed by a sophisticated analysis module).
The analysis module, which contains a robust parser, analyses any simple Spanish sentence entered by the user. It performs a sophisticated error analysis and gives the user detailed feedback, with plenty of hints about the errors committed. The module recognises agreement errors as well as other syntactic and semantic errors. Errors are detected via unification operations, lexically based phrase-structure templates and the use of mal-rules; semantic features are used to control semantic adequacy. The data basis for the analysis module is an XML-based lexicon, which contains complex syntactic and semantic information.
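As a rough illustration of how unification can detect the kind of agreement errors the module reports, here is a toy Python sketch. The lexicon entries and feature names are invented for illustration, not taken from the actual system:

```python
def unify(f1, f2):
    """Unify two flat feature dicts; return None on a feature clash."""
    out = dict(f1)
    for k, v in f2.items():
        if k in out and out[k] != v:
            return None          # clash, e.g. gender mismatch
        out[k] = v
    return out

# Toy lexicon with Spanish agreement features (invented entries)
lexicon = {
    "la":   {"cat": "det", "gen": "fem",  "num": "sg"},
    "el":   {"cat": "det", "gen": "masc", "num": "sg"},
    "casa": {"cat": "n",   "gen": "fem",  "num": "sg"},
}

def check_np(det, noun):
    """Report an agreement error when determiner and noun clash."""
    agr = unify({k: lexicon[det][k] for k in ("gen", "num")},
                {k: lexicon[noun][k] for k in ("gen", "num")})
    return "ok" if agr else f"agreement error: '{det} {noun}'"

print(check_np("la", "casa"))   # fem det + fem noun -> ok
print(check_np("el", "casa"))   # masc det + fem noun -> error
```

A mal-rule would work the same way in reverse: a rule that deliberately licenses the ungrammatical combination, so the parser succeeds but tags the parse with an error diagnosis.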
Although the demonstration module does not use NLP techniques, it nevertheless represents an important part of the software. Via this module (which consists of several Flash animations) the user can familiarise themselves with different syntactic structures of Spanish before using the analysis module. This is an important point, because the teaching of syntax is often neglected in language teaching. Without any explicit explanation of the possible syntactic structures, receiving a large number of error warnings from the analysis module could be quite frustrating.
Many techniques from the field of Computational Linguistics (CL) could potentially be of benefit in Computer Assisted Language Learning (CALL) applications. However, until now there have been very few implementations that have successfully combined CL techniques with CALL. There are several reasons for this. CL specialists are not particularly interested in CALL, preferring to work on other areas of language. CALL specialists are constantly trying to determine which CALL technologies really work in actual language-learning situations, and do not have the "luxury" of time to investigate how CL techniques could benefit them. Furthermore, the integration of CL techniques in CALL applications is difficult and may not always lead to pedagogically sound applications.
This talk briefly reviews the topic of CL/CALL integration and addresses the question of "Can it be done?". Some ideas will be put forth on the question of "Is it worth the effort?" but input is welcomed from the floor. There will be plenty of time for discussion so get your thinking caps on.
In this talk I will present the details of the project I completed as part of my master's degree. The development of grammatical resources for languages other than English has in recent years become increasingly important. While grammars have been developed in various frameworks for various languages, this is frequently approached in an ad hoc manner, with grammars for new languages written from scratch rather than exploiting previous development efforts. The alternative is a systematic or parallel approach, which implies the use of a grammatical framework that facilitates cross-linguistic grammar development and conversion, together with the application of established techniques to the process of multilingual grammar construction. An examination of current and recent work involving the systematic development of grammars for the analysis of multiple languages reveals certain practices and techniques. We combined a number of these practices and applied them in a cross-linguistic grammar development study which involved the analysis of English and German noun phrases. LFG was chosen to encode the grammar, as its language-independent f-structure representation facilitates the production of cross-linguistically parallel functional analyses. After an initial contrastive analysis of the data set, a German grammar was written using annotated c-structure rules, which were then used as the basis for an English one. The resultant grammars were equivalent in terms of coverage and analyses. During the experiment, a number of measurements were recorded, including the time taken for development and the number of rules which had to be added, deleted or changed. Significantly, there was a reduction in the time and effort required for the development of the second grammar, which may be attributed mainly to the use of rule sharing.
This presentation describes a multi-layered collaborative CALL project involving curriculum development, classroom teaching in a disadvantaged area, and the development and presentation of related CALL software. Coming from this background, the target students had many special needs, and the courseware was designed to address their specific needs and abilities. The project demonstrates the benefits that collaboration with various sources can have on the development and implementation stages of a CALL application. These sources - the target students, their teacher and the school - provided many forms of continuous feedback, all of which were integrated into the ongoing and revised development of the courseware.
Since the cognitive revolution in the late 50s and early 60s, human language has often been viewed as a cognitive module equipped with a language specific, abstract, and inherently symbolic rule-system. Rule-based processing, mostly syntactic in nature, was expected to form the cognitive basis of human language behaviour. In recent years, alternative models have been proposed, most of which are associative and analogical in nature, that is, they model language without abstract or symbolic rules. These models are inspired by the idea that language performance is based on the direct reuse (by extrapolation) of previous experience rather than on the use of abstractions extracted from that experience - an approach that mainly draws on work in AI, Linguistics, Computational Linguistics, and statistical pattern recognition (k-nn models). The best known associative models are connectionism and exemplar- or memory-based models, all of which work with generalisations by analogy and similarity. Generally, these approaches contrast with the Chomskian generative view, which conceives of language as a rule-governed device and discredits associations and analogy as a vague metaphor.
In my talk, I will discuss the basic concept behind analogical modeling of language and will briefly introduce one software package (TiMBL, the Tilburg Memory Based Learner), which implements that kind of modeling. Time permitting, I will present some ideas on how to incorporate this kind of modeling into my PhD.
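To make the exemplar-based idea concrete, here is a minimal k-nearest-neighbour classifier with the flat overlap metric, in the spirit of TiMBL's IB1 classifier. This is a toy sketch, not TiMBL's actual implementation, and the task and feature encoding are invented for illustration:

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of mismatching feature values (flat overlap metric)."""
    return sum(x != y for x, y in zip(a, b))

def classify(memory, query, k=3):
    """memory: list of (features, label) exemplars.
    Return the majority label among the k nearest exemplars -
    no rules are extracted, every training item is simply stored."""
    nearest = sorted(memory,
                     key=lambda ex: overlap_distance(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy task: predict German noun gender from the final three letters
memory = [(("u", "n", "g"), "fem"),    # -ung -> feminine
          (("i", "o", "n"), "fem"),    # -ion -> feminine
          (("h", "e", "n"), "neut"),   # -chen -> neuter
          (("e", "r", "_"), "masc")]   # -er -> masculine

# An unseen word ending in -ng is closest to the -ung exemplar
print(classify(memory, ("x", "n", "g"), k=1))
```

The point of the sketch is the architecture: classification is extrapolation from stored experience by similarity, with no abstract rule ever induced.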
Approaches to speech coding in which the speech signal is modelled as a sum of sine waves have proven to provide a flexible framework for the implementation of high-quality speech synthesis and modification algorithms. In this talk I will outline the original sinusoidal analysis/synthesis model.
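The core of the model is representing a speech frame as a sum of sinusoids, s(t) = Σ_l A_l cos(2π f_l t + φ_l). Below is a minimal synthesis-side sketch in Python; the partial parameters are invented, and a real system would estimate them frame by frame from the analysis stage:

```python
import math

def synthesise(partials, sr=8000, dur=0.01):
    """Sum-of-sinusoids synthesis for one frame.
    partials: list of (amplitude, freq_hz, phase) triples.
    Returns a list of samples at sampling rate sr."""
    n = int(sr * dur)
    return [sum(a * math.cos(2 * math.pi * f * t / sr + p)
                for a, f, p in partials)
            for t in range(n)]

# A crude voiced frame: a 120 Hz fundamental plus two harmonics
frame = synthesise([(1.0, 120, 0.0), (0.5, 240, 0.0), (0.25, 360, 0.0)])
print(len(frame), round(frame[0], 3))
```

Modification algorithms (time-scaling, pitch-shifting) then amount to manipulating the (A_l, f_l, φ_l) tracks before resynthesis.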
This presentation will describe the development of an Irish language spellchecker. The spellchecker was implemented by representing a large word list as a finite state automaton (FSA). The presentation will focus mainly on the algorithms and data structures developed for the minimisation of a large FSA. The FSA minimisation process occurs offline allowing the minimised structure to be encoded in a compressed data file. From this data file the FSA can be efficiently reproduced online. Other issues that will be covered include the development of the lexicon, the algorithm used for generating spelling corrections and the process of interfacing the tool to Microsoft Office applications.
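As an illustration of the offline minimisation step, here is a toy Python sketch that builds a trie from a word list and then merges identical suffix subtrees bottom-up, so that equal suffixes share a single state. The real system's algorithms, data structures and encoding are of course more elaborate, and the word list here is invented:

```python
def build_trie(words):
    """Build a plain trie; '#' marks end of word."""
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node["#"] = True
    return trie

def minimise(node, registry):
    """Merge isomorphic subtrees bottom-up (hash-consing), so that
    identical suffixes share one state in the resulting automaton."""
    for ch in node:
        if ch != "#":
            node[ch] = minimise(node[ch], registry)
    key = tuple(sorted((ch, 0 if ch == "#" else id(child))
                       for ch, child in node.items()))
    return registry.setdefault(key, node)

def accepts(node, word):
    """Spellcheck = membership test in the automaton."""
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "#" in node

def count_states(node, seen):
    if id(node) not in seen:
        seen.add(id(node))
        for ch, child in node.items():
            if ch != "#":
                count_states(child, seen)
    return len(seen)

words = ["cat", "cait", "bád", "báid"]   # toy Irish word list
trie = build_trie(words)
print("trie states:", count_states(trie, set()))
fsa = minimise(trie, {})
print("minimised states:", count_states(fsa, set()))
print(accepts(fsa, "cait"), accepts(fsa, "caid"))
```

Even on four words the merging shares the final states; on a full Irish lexicon the saving is what makes the compressed offline encoding practical.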
Last update: 1st October 2010