NCLT Seminar Series 2004/2005

Johann Roturier will present the next seminar, entitled "Controlled Language And Its Impact On Translation Automation", on Wednesday March 30th at 4pm in Room L2.21.
The schedule of presenters for the 2004/2005 series is as follows:
The Cross-Language Evaluation Forum (CLEF) organises an annual workshop comparing information retrieval systems on a range of European language retrieval tasks. DCU participated in a number of the tasks at CLEF 2004 including French and Russian retrieval, bilingual and multilingual retrieval, and cross-language image retrieval.
Kalman filtering (KF) is a probabilistic technique for producing optimal estimates of a system's hidden state given noisy measurements of the system. This technique can be applied to the problem of tracking vocal tract (VT) parameters from an acoustic speech signal. However, the KF requires that the state be linearly related to the measurements. This limits the KF to tracking linear prediction coefficients, which are prone to instabilities.
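To make the predict/update cycle concrete, here is a minimal scalar Kalman filter smoothing noisy measurements of a slowly varying quantity. This is a generic illustration only: the VT-tracking setting discussed in the talk uses vector-valued states, and the noise parameters below are invented.

```python
# Minimal scalar Kalman filter: estimate a slowly varying hidden state x
# from noisy measurements z, under a random-walk state model.
# Illustrative parameters only (q, r, x0, p0 are assumptions).

def kalman_1d(measurements, q=1e-4, r=0.5, x0=0.0, p0=1.0):
    """q: process noise, r: measurement noise, x0/p0: initial estimate/variance."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: state stays put, uncertainty grows by the process noise.
        p += q
        # Update: blend prediction and measurement via the Kalman gain.
        k = p / (p + r)
        x += k * (z - x)
        p *= (1.0 - k)
        estimates.append(x)
    return estimates

ests = kalman_1d([1.2, 0.9, 1.1, 1.0, 0.95])
```

Each estimate is a convex combination of the prediction and the new measurement, with the gain k shrinking as the filter becomes more confident.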
"RoBerT" is an online machine translation system. It is bilingual and unidirectional, translating English into Irish Sign Language (ISL) for the domain of weather reports. The system is based on the transfer method, with special emphasis on robustness. This approach is rule-based and indirect, comprising three stages: analysis, transfer and generation. To ensure grammaticality of the user's input and output, the data is parsed according to the English and ISL grammars respectively. Between these parsing stages, the sentence structures are altered through the application of language-dependent transfer rules. The final translation, a playlist of the appropriate ISL videos, is generated from the output. In this presentation, we will present the principal modules of the system, discuss the animation process and demonstrate the translator in action.
The merits of combining the positive elements of the rule-based and data-driven approaches to MT are clear: a combined model has the potential to be highly accurate, robust, cost-effective to build and adaptable. While the merits are clear, however, how best to combine these techniques into a model which retains the positive characteristics of each approach, while inheriting as few of the disadvantages as possible, remains an unsolved problem. One possible solution to this challenge is the Data-Oriented Translation (DOT) model originally proposed by Poutsma (1998, 2000, 2003), which is based on Data-Oriented Parsing (DOP) (e.g. Bod, 1992; Bod et al., 2003) and combines examples, linguistic information and a statistical translation model.
Scaling wide-coverage, constraint-based grammars such as Lexical-Functional Grammars (LFG) (Kaplan and Bresnan, 1982; Bresnan, 2001) or Head-Driven Phrase Structure Grammars (HPSG) (Pollard and Sag, 1994) from fragments to naturally occurring unrestricted text is knowledge-intensive, time-consuming and (often prohibitively) expensive. A number of researchers have recently presented methods to automatically acquire wide-coverage, probabilistic constraint-based grammatical resources from treebanks (Cahill et al., 2002; Cahill et al., 2003; Cahill et al., 2004; Miyao et al., 2003; Miyao et al., 2004; Hockenmaier and Steedman, 2002; Hockenmaier, 2003), addressing the knowledge acquisition bottleneck in constraint-based grammar development. Research to date has concentrated on English and German. In this paper we report on an experiment to induce wide-coverage, probabilistic LFG grammatical and lexical resources for Chinese from the Penn Chinese Treebank (CTB) (Xue et al., 2002) based on an automatic f-structure annotation algorithm. Currently, 96.751% of the CTB trees receive a single, covering and connected f-structure, 0.112% do not receive an f-structure due to feature clashes, while 3.137% are associated with multiple f-structure fragments. From the f-structure-annotated CTB we extract a total of 12,975 lexical entries with 20 distinct subcategorisation frame types. Of these, 3,436 are verbal entries with a total of 11 different frame types. We extract a number of PCFG-based LFG approximations.
Currently our best automatically induced grammars achieve an f-score of 81.57% against the trees in unseen articles 301-325. Against the dependencies derived from the f-structures automatically generated for the original trees in articles 301-325, they score 86.06% (all grammatical functions) and 73.98% (preds-only); against the dependencies derived from the manually annotated gold-standard f-structures for 50 trees randomly selected from articles 301-325, they score 82.79% (all grammatical functions) and 67.74% (preds-only).
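The dependency f-scores above are, in principle, computed as the harmonic mean of precision and recall over sets of dependency triples. A minimal sketch, with invented example triples:

```python
# F-score over dependency triples (head, relation, dependent).
# The gold/test triples below are invented for illustration.

def f_score(gold, test):
    correct = len(gold & test)
    p = correct / len(test) if test else 0.0   # precision
    r = correct / len(gold) if gold else 0.0   # recall
    return 2 * p * r / (p + r) if p + r else 0.0

gold = {("see", "subj", "I"), ("see", "obj", "dog"), ("dog", "det", "the")}
test = {("see", "subj", "I"), ("see", "obj", "dog"), ("dog", "det", "a")}
print(round(f_score(gold, test), 2))  # 0.67
```

A "preds-only" evaluation would simply restrict both sets to the predicate-bearing relations before scoring.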
In this talk, I will introduce some of our group's work on NLP. This work includes Chinese word segmentation (there are no separators between words in Chinese sentences) and part-of-speech tagging, syntactic parsing (including full parsing and shallow parsing), semantic analysis, Chinese grammar theory, and Chinese-English Machine Translation systems based on rules and corpora.
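Because Chinese sentences have no word separators, segmentation must be recovered algorithmically. A classic baseline, shown here for illustration with an invented toy lexicon, is greedy forward maximum matching; the group's actual systems are considerably more sophisticated.

```python
# Greedy forward maximum matching for Chinese word segmentation.
# The lexicon is a toy example; real systems use large dictionaries
# plus statistical disambiguation.

LEXICON = {"中国", "人民", "中", "国", "人", "民"}

def max_match(text, lexicon, max_len=4):
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary match starting at position i.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as a single-character word.
            words.append(text[i])
            i += 1
    return words

print(max_match("中国人民", LEXICON))  # ['中国', '人民']
```

Greedy matching fails on genuinely ambiguous strings, which is exactly where part-of-speech and statistical information become necessary.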
Research in plurilingual teaching and learning of Romance languages has shown that a combined approach to teaching Romance languages is very promising. It can exploit the similarities between these languages in many ways in order to teach them contrastively. Thus far several European projects have been devoted to plurilingual teaching of Romance languages. However, materials for plurilingual learning of Romance languages almost exclusively focus on receptive skills and lack any kind of intelligent automatic analysis of learner input, as well as flexible and dynamic feedback.
The development of large-scale rules and grammars for a Rule-Based Machine Translation (RBMT) system is labour-intensive, error-prone and expensive. Current research in Machine Translation (MT) tends to focus on the development of corpus-based systems which can overcome the problem of knowledge acquisition.
Audio time-scale modification is an effect that alters the duration of an audio signal without affecting its pitch or timbre. In other words, the duration of the original signal is increased or decreased but the perceptually important features of the original signal remain unchanged; in the case of speech, the time-scaled signal sounds as if the original speaker has spoken at a quicker or slower rate; in the case of music, the time-scaled signal sounds as if the musicians have played at a different tempo.
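The simplest way to see how duration can change while local structure is preserved is plain overlap-add (OLA): analysis frames are taken at one hop size and re-laid at another. The sketch below uses invented parameters and is only illustrative; plain OLA produces phase artefacts on real audio, which is why practical systems use synchronised (SOLA/WSOLA) or phase-vocoder variants.

```python
# Minimal overlap-add (OLA) time-scale modification sketch.
# rate < 1 stretches the signal (longer duration, same local waveform shape).

import math

def ola_stretch(signal, rate, frame=256, hop_out=128):
    """Stretch `signal` by a factor 1/rate (rate=0.5 doubles the duration)."""
    hop_in = int(round(hop_out * rate))  # analysis hop differs from synthesis hop
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    out_len = int(len(signal) / rate) + frame
    out = [0.0] * out_len
    norm = [0.0] * out_len
    t_in = t_out = 0
    while t_in + frame <= len(signal):
        for n in range(frame):
            out[t_out + n] += signal[t_in + n] * window[n]
            norm[t_out + n] += window[n]
        t_in += hop_in
        t_out += hop_out
    # Normalise by the summed window so overlapping frames average cleanly.
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]

slow = ola_stretch([math.sin(0.1 * n) for n in range(2048)], rate=0.5)
```

Because each output sample is a weighted average of input samples, the local waveform (and hence pitch and timbre, approximately) is preserved while the overall duration changes.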
This talk is concerned with the parsing of ungrammatical written English sentences. A 20,000 word corpus was developed which consists of ungrammatical sentences which were noticed while reading a variety of English texts. Each sentence in this corpus was corrected, producing a second corpus of grammatical sentences. In this talk I argue that the compilation of such a corpus is a useful computational linguistic resource, outline the methodological decisions which were made in compiling the corpus, present the results of a small questionnaire study which was used to investigate the reliability of the corpus data, and briefly describe three parsing applications of the corpus. The first is a parser which uses a bottom-up active chart parsing algorithm and an error grammar to parse ungrammatical sentences. The error grammar is derived from a conventional grammar and the differences between the sentences in the ungrammatical corpus and the corrected grammatical corpus are used to inform this derivation process. The second application is rooted in the linguistic framework of typed feature structures. An extended notion of a typed feature structure is presented which allows the inconsistent information contained in an agreement error to be stored. A form of relaxed unification is also defined which operates on these feature structures so that sentences containing an agreement error can be parsed. This idea was tested on corpus sentences by modifying the parser in the Linguistic Knowledge Base, a widely-used natural language parser/generator which employs typed feature structures as linguistic objects. The third application is a parser evaluation method which measures a parser's ability to parse ungrammatical sentences by comparing the parses it produces for the ungrammatical sentences from the corpus to the parses it produces for the equivalent grammatical sentences in the corrected corpus. 
This method is flexible enough to be applied to any type of parser, regardless of the linguistic framework used to encode analyses. The method was applied to two wide-coverage probabilistic parsers, and the results of the evaluation are presented.
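The idea behind the second application can be sketched with flat feature structures: where ordinary unification would fail on a value clash (as in an agreement error), relaxed unification retains both values so parsing can continue. This is a deliberate simplification of the typed-feature-structure machinery described above; the feature names are invented.

```python
# Sketch of relaxed unification over flat feature structures: a value clash
# is recorded as a set of inconsistent values instead of causing failure.

def relaxed_unify(fs1, fs2):
    result = dict(fs1)
    for feat, val in fs2.items():
        if feat not in result or result[feat] == val:
            result[feat] = val
        else:
            # Clash (e.g. an agreement error): keep both values.
            result[feat] = frozenset({result[feat], val})
    return result

# "The dogs barks": plural subject meets a verb demanding singular agreement.
subj = {"cat": "np", "num": "pl"}
verb_demands = {"num": "sg"}
print(relaxed_unify(subj, verb_demands)["num"])  # a frozenset holding 'pl' and 'sg'
```

The stored inconsistent pair is precisely the information a diagnosis or correction component would later need.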
Most of the freely available, wide-coverage Machine Translation systems on the Internet are based on a rather simple architecture which is often unable to correctly interpret complex sentences. Our project aims at boosting the quality of MT engines by reducing those complex structures to simple sentences, which, embedded within a minimal context, are to be spoon-fed to the MT system.
Drawing upon our work across a range of educational and cultural projects, I propose to show examples of the different ways that virtual reality software has been used with language learners in order to enhance their understanding and their abilities to use new technologies. My aim is to create intuitive three-dimensional learning spaces into which the new tools and resources widely available on the internet can be integrated to allow learners to learn effectively and creatively.
Statistical Machine Translation (SMT) typically takes as its basis a noisy channel model in which the target language sentence T is distorted by the channel into the source language sentence S. To recover the original target language sentence, it makes use of a language model Pr(T) and a translation model Pr(S|T).
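Decoding then amounts to choosing the T that maximises Pr(T) * Pr(S|T). A toy sketch, with an invented candidate set and invented probabilities (real SMT decoders search enormous hypothesis spaces rather than a fixed list):

```python
# Toy noisy-channel decoder: argmax over candidate targets T of
# Pr(T) * Pr(S|T). All sentences and probabilities are invented.

def decode(source, candidates, lm, tm):
    """lm: language model Pr(T); tm: translation model Pr(S|T) keyed by (S, T)."""
    return max(candidates,
               key=lambda t: lm.get(t, 0.0) * tm.get((source, t), 0.0))

lm = {"the house": 0.6, "house the": 0.05}
tm = {("das Haus", "the house"): 0.5, ("das Haus", "house the"): 0.5}
print(decode("das Haus", ["the house", "house the"], lm, tm))  # the house
```

Here the translation model cannot distinguish the two word orders, so the language model's preference for fluent English decides the output, which is exactly the division of labour the noisy channel formulation intends.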
Johann Roturier is currently involved in a research project whose objective is to automate the translation process of technical documents in the field of computer security. Due to the time-critical nature of this type of communication, which needs to be promptly distributed in a number of languages, MT presents itself as a prospective candidate. The limitations of RBMT are often epitomized by its inability to process unrestricted input to produce consistent translations of acceptable quality. However, the quality of this output can be significantly improved if writers create documents with MT in mind (Bernth & Gdaniec, 2001). Previous initiatives showed that certain Controlled Language (CL) rules must be applied to the source text to achieve this objective. By applying lexical, syntactic and semantic restrictions, CL attempts to improve the clarity of the source text so as to reduce ambiguities during the automatic translation process (Kamprath et al., 1998). In this talk, I will first introduce the concept of CL and reflect on its relevance for translation automation. The findings of a preliminary study that was conducted to assess the effectiveness of CL rules on MT output will then be presented. Finally, I will discuss the opportunity that a CL environment creates for the possible automation of the Post-Editing (PE) process when the minimal PE tasks require no linguistic analysis.
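A CL checker of the kind described enforces restrictions mechanically before text reaches the MT engine. The sketch below applies two invented CL-style rules (a length limit and a ban on ambiguous connectives); deployed checkers implement far richer lexical, syntactic and semantic restrictions.

```python
# Illustrative controlled-language checker with two invented rules.
# Real CL rule sets (e.g. for technical documentation) are much larger.

MAX_WORDS = 25
AMBIGUOUS = {"once", "since", "while"}  # temporal/causal double readings

def check(sentence):
    words = sentence.rstrip(".").split()
    issues = []
    if len(words) > MAX_WORDS:
        issues.append(f"sentence exceeds {MAX_WORDS} words")
    for w in words:
        if w.lower() in AMBIGUOUS:
            issues.append(f"ambiguous connective: '{w}'")
    return issues

print(check("Since the update, restart the server once installation completes."))
# flags 'Since' and 'once'
```

Each flagged violation corresponds to a rewrite the author can make before translation, which is how CL shifts effort from post-editing to authoring.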
Last update: 1st October 2010