National Centre for Language Technology

Dublin City University, Ireland

NCLT Seminar Series 2005/2006

The (final!) NCLT Seminar will be presented by Yanjun Ma on Wednesday, 23rd August, at 4pm in Room L2.21
Thank you to everyone who participated in this year's seminar series!

The schedule of presenters for the 2005/2006 series is as follows:

November 9th 2005 Monica Ward Using Intelligence to Improve Feedback in CALL
November 15th 2005 Aoife Cahill Comparing Hand-Crafted with Automatically Acquired English Constraint-Based Parsing Resources
November 22nd 2005 Thomas Koller Tools and resources (not only) for French, Italian and Spanish
November 30th 2005 Mary Hearne Data-Oriented Natural Language Processing Using Lexical-Functional Grammar
December 7th 2005 John Judge Partial Prebracketing to Improve Parser Performance
January 11th 2006 Nicolas Stroppa Generative & Discriminative Approaches in NLP
January 18th 2006 Djamé Seddah Introduction to Lexicalized Tree Adjoining Grammar and Syntax Semantic Interfaces
January 25th 2006 Yafa Al Raheb Speaker/Hearer Representation in a DRT Model of Presupposition
February 8th 2006 Joachim Wagner Approximating Parse Probabilities with Simple Probabilistic Models
February 15th 2006 Grzegorz Chrupala Learning to Assign Grammatical Function Labels
March 1st 2006 Amine Akrout Arabic morphology and syntax within the frameworks of LMF and LFG
March 8th 2006 Hany Hassan GALE and UIMA
March 22nd 2006 Ines Rehbein German Particle Verbs and Pleonastic Prepositions
March 29th 2006 Dietmar Janetzko Dialogue-Based Authoring of Units of Learning
April 5th 2006 Ying Zhang Dictionary-based Query Translation in Chinese-English Cross-language IR
April 12th 2006 Ríona Finn Data-Oriented Parsing Incorporating Lexical Functional Grammar
April 19th 2006 Antal van den Bosch Constraint Satisfaction Inference for Discrete Sequence Processing in NLP
April 26th 2006 Karolina Owczarzak Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation
May 3rd 2006 Benoît Sagot (Atoll project, INRIA) Parsing French: resources, formalisms, parsers
May 10th 2006 Kepa Sarasola Technology is an effective tool to promote use of Basque. Strategies to develop HLT for minority languages.
May 24th 2006 Declan Groves Hybridity in MT: Experiments on the Europarl Corpus
May 31st 2006 Yvette Graham Services for Experimentation in the Human Sciences: An Online Experimentation Tool
June 7th 2006 ACL4 Presentations Conor Cafferkey: Towards Parsing Unrestricted Text into PropBank Predicate-Argument Structures

Gary Madden: Re-ranking Documents Segments to Improve Access to Relevant Content in Information Retrieval

Neasa Ní Chiarán: The Development of a Phonological Awareness Test to Screen for Dyslexia

John Tinsley: Morphological Analysis and Generation of Spanish using Xerox Finite-State Tools
June 14th 2006 EAMT06 Presentations Mary Hearne: Disambiguation Strategies for Data-Oriented Translation

Declan Groves:

Bart Mellebeek: A Syntactic Skeleton for Statistical Machine Translation
June 21st 2006 Sara Morrissey Lending A Hand: Sign Language Machine Translation
July 12th 2006 Stephen Armstrong Translating DVD subtitles using Example-Based Machine Translation
July 26th 2006 Yuqing Guo Automatic Treebank-Based Acquisition of LFG Grammar for Chinese
August 2nd 2006 AMTA presentations Andy Way: EBMT of the Basque language

Karolina Owczarzak: Wrapper Syntax for Example-Based Machine Translation

Bart Mellebeek: Multi-Engine Machine Translation by Recursive Sentence Decomposition
August 16th 2006 Jennifer Foster Beyond the Wall Street Journal: Improving the performance of probabilistic parsers on non-WSJ text
August 23rd 2006 Yanjun Ma Extracting equivalent chunks from Chinese-English bilingual corpus

Using Intelligence to Improve Feedback in CALL

Feedback is often given as one of the benefits of Computer Assisted Language Learning (CALL). The idea that learners can receive immediate feedback on their errors in a non-threatening environment as often as they like seems to make good pedagogical sense. However, the fact is that learners generally do not avail of feedback and it is an under-utilised component of CALL. Although there are many reasons why this is the case, one factor to consider is the quality of the feedback provided. On the surface it might seem easy to provide feedback, but the reality is quite different. In order to provide good quality, appropriate and personalised feedback, one must draw on the disciplines of Computational Linguistics (CL), Second Language Acquisition (SLA), Language Pedagogy, Artificial Intelligence (AI), Software Engineering (SE) and Human Computer Interaction (HCI). This talk reviews feedback in CALL and looks at how these disciplines can help improve the quality of the feedback provided.

Comparing Hand-Crafted with Automatically Acquired English Constraint-Based Parsing Resources View slides

Traditionally, rich, constraint-based grammatical resources have been hand-coded. Scaling such resources beyond toy fragments to unrestricted, real text is knowledge-intensive, time-consuming and expensive. We have developed a method for treebank-based, wide-coverage, deep, constraint-based grammar acquisition. The resulting PCFG-based LFG approximations parse the Penn-II treebank with wider coverage (measured in terms of complete spanning parse) and parsing results comparable to or better than those achieved by the best hand-crafted grammars, with, we believe, considerably less grammar development effort. In this talk, I will briefly describe our approach and outline some recent experiments to compare our automatically induced English parsing resources against the hand-crafted XLE LFG parser and RASP dependency parser.

Tools and resources (not only) for French, Italian and Spanish View slides

In this show & tell presentation I want to give you an overview of the existing tools and resources I have reused in my Ph.D. work and of the tools and resources I have developed for plurilingual learning of French, Italian and Spanish. Not all the tools and resources are restricted to French, Italian and Spanish, so my presentation may also be interesting for those of you who are working on other languages.
Tools developed thus far comprise an error-sensitive input analysis module, animated grammar presentations, an animation authoring tool and two dictionary tools. The analysis module - including a plurilingual island parser - is able to analyse learner input (phrases, simple sentences and paragraphs of simple sentences) in all three languages and to dynamically create flexible and precise feedback. The analysis module contains a plurilingual general lexicon and a multilingual verb lexicon. Both lexicons can also be used independently. Animated grammar presentations dynamically present grammatical properties and processes. The animation authoring tool allows any user to easily integrate any kind of animated text into web-based learning materials (or to create general slide-based presentations). It is language- and topic-independent, so you can use it for any kind of information. The two dictionary tools enable the user to get broad word-by-word translation information for unrestricted text in French, Italian and Spanish. The multilingual dictionary tool can be adapted quite easily to other languages.
Resources created thus far comprise multilingual XML lexicons (43 topics, around 13000 lemmas per language: German, English, French, Italian and Spanish), full-form verb conjugation lists for French, Italian and Spanish (such lists could also be created semi-automatically for German and English), a plurilingual lexicon with 1800 entries and verb lexicons with information about required prepositions and "cases" (transitive, intransitive, etc.) for each verb.
I will also give some information about the software architecture I am using and discuss the pros and cons of it. You can access most of the developed tools at

Data-Oriented Natural Language Processing Using Lexical-Functional Grammar View slides

While experimental results show that the Tree-DOP and Tree-DOT models achieve excellent parse and translation accuracy, the power of these models is nevertheless limited by the corpus representations they assume. It is known that these representations, which reflect surface syntactic phenomena only, do not adequately describe many aspects of human language. The Lexical-Functional Grammar (LFG) formalism, on the other hand, is known to be beyond context-free. There is clear motivation for the use of parsing and translation models which employ this type of linguistic analysis, hence the LFG-DOP and LFG-DOT models.
In this talk, we will outline some of the open questions for the data-oriented models which use LFG representations. We will look at how to define the root and frontier operations so that we get meaningful fragments while still handling recursive and re-entrant structures. We will also discuss the interaction between the fragmentation process and the need for the application of discard. We will discuss the application of constraints in the translation process - where constraints are bilingual and no representation is output - and investigate the possibility of learning which constraints will help to predict good translations. Finally, the fact that unification is an operation with global (rather than local) effects has implications for LFG-based models. We will discuss enforcement of the LFG well-formedness conditions, the issue of 'leaked' probability mass and the computation of sampling probabilities.

Partial Prebracketing to Improve Parser Performance View slides

Parse-annotated corpora (Treebanks) are crucial in the development of machine-learning and statistics-based parsing systems. Such systems induce the grammatical information used for parsing from treebanks like the Penn-II Treebank.
Treebanks of sufficient size to induce wide-coverage, high-performance grammars are available for some (but not all) major languages, such as English, German, Chinese, Arabic and French. However, for other languages such as Spanish, Urdu and Hindi there is either no treebank or only a relatively small one to use for grammar induction. Treebank construction is usually a semi-automatic process (Penn-II Treebank, NEGRA) whereby raw text is parsed and then hand-corrected by human annotators.
In this talk I will present work on an alternative method for use in semi-automatically creating treebank trees. Rather than parsing followed by a post-editing phase, I use pre-editing to improve automatic parse quality, which should reduce the need for post-editing. The basic idea is that we manually or automatically mark up raw text with (a few) constituent structure boundaries which are respected by the parser, so that any resulting parse of the text will contain the manually or automatically inserted constituent structure(s).
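The constraint described above can be sketched as follows. This is a minimal illustration, not the implementation presented in the talk: it assumes that trees are nested Python lists with token strings as leaves and that pre-inserted brackets are given as (start, end) token spans. Both representations, the helper names and the example sentence are hypothetical.

```python
def tree_spans(tree, start=0):
    """Return (end_position, spans) for a nested-list tree with string leaves."""
    if isinstance(tree, str):
        # A leaf covers exactly one token and contributes no constituent span.
        return start + 1, set()
    pos = start
    spans = set()
    for child in tree:
        pos, child_spans = tree_spans(child, pos)
        spans |= child_spans
    spans.add((start, pos))
    return pos, spans


def respects_brackets(tree, brackets):
    """True if every pre-inserted bracket span is a constituent of the tree."""
    _, spans = tree_spans(tree)
    return all(bracket in spans for bracket in brackets)


# Hypothetical example sentence: "the old man saw the dog"
good = [["the", "old", "man"], ["saw", ["the", "dog"]]]
bad = [["the", "old"], ["man", "saw", ["the", "dog"]]]
marked = {(0, 3)}  # "the old man" was marked as a constituent before parsing

print(respects_brackets(good, marked))
print(respects_brackets(bad, marked))
```

A parser that honours the marked spans would only ever return trees for which this check succeeds, which is why fewer post-editing corrections should be needed.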

Generative & Discriminative Approaches in NLP View slides

Discriminative techniques have been successfully applied to flat classification tasks such as text categorization or word-sense disambiguation. It has been recently shown that very good results can also be obtained by using such techniques for solving more structured NLP tasks such as parsing and machine-translation: discriminative techniques now compete with or exceed generative techniques.
Many traditional generative techniques rely on alignments between inputs and outputs, which means that parts of input objects are mapped to parts of output objects. These alignments between different levels of representation thus correspond to *inter-level mappings*. Simple models such as Hidden Markov Models account for these kinds of inter-level mappings; depending on the application, more complex models can be involved.
Many discriminative approaches rely on the following simple assumption: similar inputs lead to similar outputs. In this case, *intra-level relationships* (similarity of inputs and similarity of outputs) are exploited. Inputs and outputs are not decomposed into smaller parts which are linked or mapped: only the overall relationship (similarity) is mapped from the input space to the output space.
In this presentation, we present the differences and the relationships between discriminative and generative approaches. We also propose to extend traditional discriminative approaches by considering intra-level relationships that are more complex than surface similarity and, more importantly, that are designed to take into account the specificity of linguistic data. In particular, we show how to exploit the *paradigmatic organization* of linguistic data and how this can be applied to (example-based) machine translation.

Introduction to Lexicalized Tree Adjoining Grammar and Syntax Semantic Interfaces View slides

In this presentation, I will introduce some of the linguistic basis behind the LTAG formalism: how the need to lexicalize CFG results in the TAG formalism and why LTAGs are well suited to the treatment of long-distance dependencies. I will then introduce the basis of a compositional semantic model based on TAG grammars and I will try to explain why it is necessary to deal with both the constituency structure (the so-called derived tree in TAG) and the derivation structure (the derivation tree) to obtain a proper predicate-argument structure.

Speaker/Hearer Representation in a DRT Model of Presupposition

This talk investigates the semantic/pragmatic interface in the treatment of presupposition and assertion in Discourse Representation Theory (DRT). It is argued that, as a formal semantic theory, DRT needs pragmatic enrichment in its treatment of presupposition. The pragmatic enhancement is achieved through defining presupposition in terms of beliefs, and re-interpreting the Gricean maxims in terms of ‘belief constraints’ on the making of presupposition and assertion. Additionally, presupposition is interpreted as a property of the communicative behaviour of the speaker and hearer. A series of checks is devised, which differentiate speaker generation of presupposition and assertion from hearer recognition. Further, recent advances in DRT are expanded in order to account for varying strengths of beliefs (weak and strong belief) as well as dialogue acts triggered by assertions. A new emerging Discourse Representation Structure (DRS) creates compatible representations of speaker and hearer cognitive states. The linguistic content (presupposition and assertion) is linked with the beliefs and intentions of agents in dialogue in order to move DRT towards being ‘more pragmatic’.

Approximating Parse Probabilities with Simple Probabilistic Models View slides

The core component of my approach to detecting ungrammatical sentences is a model that can predict the probability of the most likely parse of grammatical sentences without using all of the information available to the probabilistic parser. As data-driven, probabilistic parsers are extremely robust, they fail to reject ungrammatical input. The output of such a model might provide a threshold for the parse probability that allows us to distinguish grammatical from ungrammatical sentences. In this talk I will present results for a range of models I have studied so far. Features like sentence length, number of nodes of the parse tree and character trigrams prove to be very useful to get a prediction within the right order of magnitude and show that lexical information is important. However, ordinary probabilistic language models that focus on token frequencies do not perform well unless combined with the former models. I also present a model that uses the probabilities of the terminal rules of a PCFG, although the parser to be approximated is history-based. Combining this model with the previous models again improves results.
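The thresholding idea can be illustrated with a deliberately simple sketch. It assumes a single feature (sentence length), an ordinary least-squares line as the approximating model, and a fixed margin below the predicted log probability; the training figures, the margin and the function names are hypothetical and are not taken from the talk, which studies a range of richer models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def looks_ungrammatical(length, parse_logprob, model, margin):
    """Flag a sentence whose parse log probability is far below the prediction."""
    slope, intercept = model
    expected = slope * length + intercept
    return parse_logprob < expected - margin


# Hypothetical training data: sentence lengths and the log probability of the
# best parse for grammatical sentences of that length.
lengths = [5, 10, 15, 20]
logprobs = [-50.0, -100.0, -150.0, -200.0]
model = fit_line(lengths, logprobs)

print(looks_ungrammatical(10, -130.0, model, margin=20.0))
print(looks_ungrammatical(10, -105.0, model, margin=20.0))
```

The sketch only shows where the predicted value enters the decision; how well the prediction has to match the parser is exactly the empirical question the talk addresses.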

Learning to Assign Grammatical Function Labels View slides

The research described in this talk is part of a project whose aim is to induce multilingual probabilistic Lexical-Functional Grammar (LFG) resources from treebanks. For Spanish we use the Cast3LB treebank (Civit and Martí, 2004).
Some properties of Spanish and the encoding of syntactic information in the Cast3LB treebank make it non-trivial to apply the method of automatically mapping c-structures to f-structures used by Cahill et al. (2004), which assigns grammatical functions to tree nodes based on their phrasal category, the category of the mother node and their position relative to the local head. For Spanish and Cast3LB trees the difficulties are: (i) the order of sentence constituents is flexible and their position relative to the head is an imperfect predictor of grammatical function; (ii) much of the information that the Penn-II Treebank encodes as tree configurations is encoded in Cast3LB in the form of function tags.
In order to leverage functional tag information, O'Donovan et al. (2005) train the parser to output complex category-function labels. This, however, has the disadvantage of inflating the number of unique labels and can cause parse quality to deteriorate due to sparse data problems. The approach we adopt instead is to add functional tags to constituent tree parser output as a postprocessing step, following Blaheta and Charniak (2000). We train a machine-learning-based classifier to predict functional tags. From the training set we extract features encoding configurational, morphological and lexical information for the target node and neighboring context nodes. For each of the training examples we also extract the Cast3LB function tag, or assign the default null tag if no tag is present. We achieve a noticeable improvement in function assignment (approx. 5% f-score) over the baseline parser-based method.

Arabic morphology and syntax within the frameworks of LMF and LFG View slides

The talk aims to situate the Arabic language within the frameworks of previous work on adapting a tagged lexicon for French to Arabic, and of future work on inducing treebank-based LFG (Lexical Functional Grammar) resources for Arabic. The first part of the talk will provide an outline of the main characteristics of the Arabic morphosyntactic system. The second part will give an account of the previous work done within the scope of the Lexical Markup Framework (LMF), with some illustrative samples. Finally, the presentation will highlight some of the main features that should be considered when inducing treebank-based LFG resources for Arabic.

GALE and UIMA View slides

Part 1: I will talk about the GALE (Global Autonomous Language Exploitation) project, which is the largest DARPA-funded research project. GALE covers many areas, including Speech Recognition, Machine Translation and Information Extraction. I will talk about the different tasks in GALE.
Part 2: I will talk about the newly open-sourced Unstructured Information Management Architecture (UIMA) and its potential deployment in various applications.

German Particle Verbs and Pleonastic Prepositions View slides

In German there are 9 two-way prepositions which can either govern the accusative or the dative. Those two-way prepositions are able to combine with a verb stem and form the so-called particle verbs (also called separable prefix verbs).
In my talk I will discuss the behaviour of German particle verbs formed by two-way prepositions in combination with PPs including the verb particle as a preposition. These particle verbs have a characteristic feature: some of them license directional prepositional phrases in the accusative, some only allow for locative PPs in the dative, and some particle verbs can occur with PPs in the accusative and in the dative. Directional particle verbs together with directional PPs present an additional problem: the particle and the preposition in the PP seem to provide redundant information.
The talk gives an overview of the semantic verb classes influencing the phenomenon, based on corpus data, and explains the underlying reasons for the behaviour of the particle verbs. It also shows how the restrictions on particle verbs and pleonastic PPs can be expressed in a grammar theory like Lexical Functional Grammar (LFG).

Dialogue-Based Authoring of Units of Learning View slides

Authoring learning content in XML-based e-learning specification languages like IMS-LD (Koper & Tattersall, 2005) is a tedious and error-prone task. A number of graphical editors have been designed to support this step in setting up an e-learning system. This paper describes DBAT-LD (Dialogue-Based Authoring of Learning Designs), which is a chat bot (natural language dialogue system) that interacts with authors of units of learning and secures the content of the dialogue in an XML-based target format of choice. DBAT-LD can be geared towards different specifications thereby contributing to interoperability in e-learning. In the case of IMS-LD, DBAT-LD elicits and reconstructs an account of learning activities along with the pedagogical support activities and delivers a Level A description of a unit of learning.

Dictionary-based Query Translation in Chinese-English Cross-language IR View slides View examples

Cross-lingual information retrieval (CLIR) allows users to query mixed-language collections or to probe for documents written in an unfamiliar language. There are several approaches to implementation of CLIR. Perhaps the most popular is to use a dictionary to translate the queries into the target language and then use mono-lingual retrieval. As with other CLIR language pairs, in Chinese-English CLIR the accuracy of dictionary-based query translation is limited by two major factors: the presence of out-of-vocabulary (OOV) words and translation ambiguity.
A major difficulty for CLIR is the detection and translation of out-of-vocabulary (OOV) terms; for OOV terms in Chinese, another difficulty is segmentation. We have developed a new segmentation-free technique to identify Chinese OOV terms and extract English translations using the web, which has been demonstrated using the NTCIR 4 and 5 test collections. This OOV translation technique leads to a significant improvement in retrieval effectiveness, and can be used to improve Chinese segmentation accuracy.
Various techniques have been proposed to reduce the ambiguity and errors introduced during query translation. However, the use of different data sets and language pairs has meant that it has not been possible to draw clear conclusions about the relative merits of the different disambiguation techniques. We therefore compared the effectiveness of the different disambiguation techniques on the same Chinese-English data sets. Our experiments show that although each of these techniques uses different models, formulae and parameters, each achieved comparable results across multiple data sets.
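The two limiting factors named in this abstract, out-of-vocabulary terms and translation ambiguity, can be made concrete with a small sketch of dictionary-based query translation. It assumes the query has already been segmented into terms and uses a tiny hand-written dictionary; the dictionary entries and function name are hypothetical, and the sketch does not implement the segmentation-free, web-based OOV technique or any of the disambiguation techniques discussed in the talk.

```python
def translate_query(terms, dictionary):
    """Look up each query term; report translations, OOV terms and ambiguous terms."""
    translations = {}
    oov = []
    for term in terms:
        candidates = dictionary.get(term)
        if candidates:
            translations[term] = list(candidates)
        else:
            # No dictionary entry: the term cannot be translated by lookup alone.
            oov.append(term)
    ambiguous = [term for term, cands in translations.items() if len(cands) > 1]
    return translations, oov, ambiguous


# Hypothetical bilingual dictionary (Chinese term -> English candidates).
toy_dictionary = {
    "银行": ["bank"],
    "苹果": ["apple", "Apple Inc."],
}

translations, oov, ambiguous = translate_query(["苹果", "银行", "非典"], toy_dictionary)
print(translations)
print("OOV:", oov)
print("ambiguous:", ambiguous)
```

Terms that end up in the OOV list are the ones an OOV translation technique has to handle, and terms in the ambiguous list are the input to a disambiguation step before mono-lingual retrieval.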

Data-Oriented Parsing Incorporating Lexical Functional Grammar View slides

Data-Oriented Parsing (DOP) is a hybrid, language-independent, parsing formalism. Combining rules, statistics and linguistics, all parsing knowledge is learned from existing texts. However, the expressive power of the DOP model is limited by the corpus representations it assumes. DOP makes use of context-free phrase-structure trees which characterise phrasal and sentential syntax, but cannot reflect linguistic phenomena at deeper levels. The integration of Lexical Functional Grammar (LFG), which is known to be beyond context-free, enables a more linguistically detailed description of language.
In this presentation, I will discuss the DOP model and LFG theory, and demonstrate how combining the two approaches forms a robust, linguistically accurate parsing methodology. I will present the results of recent experiments in Grammatical Function DOP (GF-DOP) and summarize some future work in this area.

Constraint Satisfaction Inference for Discrete Sequence Processing in NLP View slides

A pervasive type of task in NLP is one in which a sequence at one level is mapped to a sequence at another: letters to phonemes, sequences of words to part-of-speech tags, constituent markers and labels, entity markers, etc. There is no data-driven (machine learning or statistical) model that can learn to perform the full sequence-to-sequence mapping in one go; rather, most approaches decompose the task into making local decisions, and post-processing the local decisions into a global sequence. Probabilistic approaches are often considered the ideal method, since they produce output token distributions at their local decisions, and strong search algorithms such as Viterbi search are available to search through these lattices of output token distributions to find a likely global output sequence.
In this talk I review a method that achieves the same goal for non-probabilistic, discrete classifiers. The method is based on a local classification into overlapping trigrams of output tokens, which opens up a space in which weighted constraint satisfaction inference can be applied to find the most likely output sequence - the one violating the fewest constraints (which are data-driven, not based on expert knowledge). I demonstrate the algorithm on a wide range of NLP sequence processing tasks in morpho-phonology, syntax, and information extraction, comparing to state-of-the-art probabilistic approaches such as MEMMs and CRFs.
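The idea of resolving overlapping trigram predictions can be sketched in a much-simplified, unweighted form. The sketch assumes that a local classifier has already predicted one (previous, current, next) trigram of output tokens per position, and it treats each trigram as three separate per-position constraints. Under that simplification, the sequence violating the fewest constraints is found by a majority vote at each position. The weighting of constraints and the actual inference procedure from the talk are not reproduced here; the function name, the padding symbol and the example predictions are hypothetical.

```python
from collections import Counter

PAD = "_"  # hypothetical padding symbol for positions outside the sequence


def infer_sequence(trigrams):
    """Pick, per position, the output token that most overlapping trigrams agree on."""
    n = len(trigrams)
    votes = [Counter() for _ in range(n)]
    for i, (left, mid, right) in enumerate(trigrams):
        if i > 0:
            votes[i - 1][left] += 1   # constraint on the previous position
        votes[i][mid] += 1            # constraint on the current position
        if i + 1 < n:
            votes[i + 1][right] += 1  # constraint on the next position
    # Ties are broken in favour of the token that was voted for first.
    return [position_votes.most_common(1)[0][0] for position_votes in votes]


# Hypothetical predictions for "the dog barks loudly" (tags D, N, V, A).
# The classifier's own prediction at position 2 is wrong (N instead of V),
# but the overlapping trigrams of its neighbours outvote it.
predicted = [
    (PAD, "D", "N"),
    ("D", "N", "V"),
    ("N", "N", "A"),
    ("V", "A", PAD),
]
print(infer_sequence(predicted))
```

The example shows why overlapping output trigrams are useful: a single wrong local decision can be corrected by the decisions made at neighbouring positions.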

Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation View slides

It is a widely known problem that automatic evaluation metrics for machine translation like BLEU or NIST, based on string matching, are insensitive to admissible lexical and syntactic differences between the translation and reference. This is the underlying reason for using a number of reference texts, to increase the chances that the translation will match a part of at least one of them. A number of attempts have been made to remedy these shortcomings.
In this talk I present a novel method for deriving paraphrases during automatic MT evaluation using only the source and reference texts, which are necessary for the evaluation, and word and phrase alignment software. By using target language paraphrases produced through word and phrase alignment, a number of alternative reference sentences are constructed automatically for each candidate translation. This method produces lexical and low-level syntactic paraphrases that are more relevant to the domain at hand than those produced by a thesaurus or WordNet, does not use external knowledge resources, and can be combined with a variety of automatic MT evaluation systems.
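How alternative references can help a string-matching metric is sketched below. The sketch assumes a hand-written paraphrase table and scores with clipped unigram precision only; in the talk the paraphrases are derived from word and phrase alignment and the metrics are BLEU and NIST, neither of which is implemented here. All names and example sentences are hypothetical.

```python
from collections import Counter


def expand_references(reference, paraphrases):
    """Build alternative references by substituting one paraphrase for one word."""
    references = [reference]
    for i, word in enumerate(reference):
        for alternative in paraphrases.get(word, []):
            references.append(reference[:i] + alternative.split() + reference[i + 1:])
    return references


def best_unigram_precision(candidate, references):
    """Clipped unigram precision of the candidate against its best-matching reference."""
    candidate_counts = Counter(candidate)
    best = 0.0
    for reference in references:
        reference_counts = Counter(reference)
        matched = sum(min(count, reference_counts[word])
                      for word, count in candidate_counts.items())
        best = max(best, matched / len(candidate))
    return best


# Hypothetical example sentences and paraphrase table.
reference = "the committee approved the proposal".split()
candidate = "the committee accepted the proposal".split()
paraphrase_table = {"approved": ["accepted"]}

print(best_unigram_precision(candidate, [reference]))
print(best_unigram_precision(candidate, expand_references(reference, paraphrase_table)))
```

The candidate is penalised for an admissible lexical choice when only the original reference is available, and the penalty disappears once the paraphrased reference is added.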

Parsing French: resources, formalisms, parsers View slides

This talk will present three lines of work that share a common goal: to develop large-coverage parsers that are both efficient and linguistically relevant, without the help of hand-crafted resources. I will focus on French, for which there are currently no existing resources like the PTB (both in terms of size and richness), and for which syntactic lexicons are still an active area of research.
Firstly, I will present and illustrate the general idea that statistical analysis of automatically generated data (raw corpora, parsing results from fully symbolic parsers) can be the basis for the development of rich and large-coverage resources (lexicons, grammars,...). To illustrate this point I will briefly describe two of the techniques I developed to apply this idea: a method to automatically acquire a morphological lexicon, and an error mining technique for parser output.
Secondly, I will present SxLFG, an efficient and robust parser generator for LFG developed in collaboration with Pierre Boullier. I will illustrate this by presenting our parsing system for French that relies on SxLFG, on a large-coverage LFG grammar for French we developed, and on the pre-syntactic processing chain SxPipe. In particular, I will show that the efficiency of SxLFG allows the parsing of large corpora without any probabilistic information, thus creating annotated corpora on which probabilistic models can be bootstrapped (for parsers, taggers,...). I will give an insight into ongoing work on learning such bootstrapped models and using them in SxLFG.
Thirdly, I will briefly describe some limitations of standard two-stage formalisms such as LFG, TAG or others ("linear" syntactic backbone + unification-based decorations), and I will propose another possible approach, namely "non-linear" formalisms ("non-linear" meaning here "closed by intersection"). I will introduce Range Concatenation Grammars (RCGs), a non-linear formalism parsable in polynomial time, and explain why it is suitable to model natural languages. In particular, I will briefly describe the medium-coverage grammar I developed in a (still polynomial) syntactic extension of RCGs, and how standard analyses (constituency, dependency, topological boxes, predicate-argument semantics) can be obtained as partial projections of the full analysis.
I will conclude with possible ways to merge the efficiency, robustness and large coverage of our SxLFG-based parser with the non-linearity and linguistic relevance of RCGs.

Technology is an effective tool to promote use of Basque. Strategies to develop HLT for minority languages. View slides

The IXA Group was created in 1988 with the aim of promoting the modernization of Basque by means of developing basic computational resources for it.
As a result of our work six kinds of applications are currently available for common users: a spelling checker, a lemmatization based Internet/Intranet search engine, three lemmatization based on-line dictionaries (Spanish-Basque, French-Basque, and a monolingual Basque), a generator of weather reports and a first version of a Spanish to Basque Transfer Based Machine Translation system for texts and websites.
The spell-checker and the lemmatiser are particularly active tools in the ongoing standardization of Basque.
The scarcity of human and linguistic resources in minority languages motivates the design of different strategies to develop HLT:
Reusing and standardizing resources across different research projects, tools and applications is always necessary. Language foundations, tools, and applications need to be developed incrementally, in a parallel and coordinated way, in order to get the best benefit from them.

Hybridity in MT: Experiments on the Europarl Corpus View slides

Previous work (Groves & Way, 2005) has demonstrated that while a Marker-based EBMT system is capable of outperforming a phrase-based SMT system trained on reasonably large data sets, a hybrid 'example-based' SMT system, incorporating marker chunks and SMT sub-sentential alignments, is capable of outperforming both baseline translation models for French-English translation.

In this presentation we will show that similar gains are to be had from constructing a hybrid ‘statistical EBMT’ system capable of outperforming the baseline system of (Way & Gough, 2005). Using the Europarl (Koehn, 2005) training and test sets we show that this time around, although all ‘hybrid’ variants of the EBMT system fall short of the quality achieved by the baseline PBSMT system, merging elements of the marker-based and SMT data, as in (Groves & Way, 2005), to create a hybrid ‘example-based SMT’ system outperforms the baseline SMT and EBMT systems from which it is derived. Furthermore, we provide further evidence in favour of hybrid systems by adding an SMT target language model to all EBMT system variants and demonstrate that this too has a positive effect on translation quality.

Services for Experimentation in the Human Sciences: An Online Experimentation Tool View slides

Traditionally, transfer rules for machine translation have been hand-coded. This hand-coding of transfer rules takes time and uses valuable resources. In this talk I present a method of automatically inducing such transfer rules from aligned bilingual corpora. The sentence-aligned corpora are annotated with LFG f-structure information. I discuss a proposed method of automatically extracting the transfer rules by making generalizations about the structure of source and target language sentences.

ACL4 Presentations

Towards Parsing Unrestricted Text into PropBank Predicate-Argument Structures View slides

I explore a novel approach to the identification of semantic roles (such as agent, patient, instrument, etc.) in unrestricted text. Current approaches to the identification of semantic relationships (predicate-argument relations) make use of machine-learning techniques (such as Support Vector Machines, Random Forests, and others) applied to the syntactic tree structures generated by a natural language parser. Such parsers are commonly trained on corpora such as the Penn Treebank. In this project, the Penn Treebank data used to train a history-based generative lexicalized parser is augmented with semantic relationship data derived from PropBank. In this way, the parser itself performs the labeling of semantic roles and relations, constituting an integrated approach to semantic parsing, as opposed to the ‘pipeline’ approach employed by current techniques.

Re-ranking Document Segments to Improve Access to Relevant Content in Information Retrieval View slides

This project processes a ranked list of search results in the following manner. It splits each document in the list into sub-sections based on the change in subject matter within the document, using a method known as TextTiling. It then infers links between the generated sub-sections based on their contents. Finally, it re-ranks the sub-sections based on the links inferred between them, in a manner similar to Google's PageRank. While this technique has been implemented for re-ranking whole documents returned by an Information Retrieval system, there is no existing method of re-ranking sub-sections of those documents. It is hoped that doing so will greatly improve access to relevant content. An evaluation of the relevance achieved by the implemented system yields promising results, and further work is proposed to improve it.
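The re-ranking step described above can be illustrated with a minimal sketch. It assumes the segments have already been produced by a topic segmenter such as TextTiling, and it infers a link between two segments whenever their bag-of-words cosine similarity exceeds a threshold. The function names, the threshold, and the similarity-based linking criterion are illustrative assumptions, not the project's actual method.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_segments(segments, threshold=0.1, damping=0.85, iterations=50):
    """Return one PageRank-style score per segment (scores sum to 1).

    A weighted link i -> j exists when the two segments are similar
    enough; this linking criterion is an assumption for illustration.
    """
    n = len(segments)
    if n == 0:
        return []
    bags = [Counter(s.lower().split()) for s in segments]
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim = cosine(bags[i], bags[j])
                if sim > threshold:
                    weights[i][j] = sim
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            out = sum(weights[i])
            for j in range(n):
                # Segments without links spread their score evenly.
                share = weights[i][j] / out if out else 1.0 / n
                new[j] += damping * scores[i] * share
        scores = new
    return scores
```

Sorting the segments by descending score gives the re-ranked list; a segment that shares no vocabulary with the others receives the lowest score.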

The Development of a Phonological Awareness Test to Screen for Dyslexia View slides

This project involves the development of a computer-based test which will discriminate between preschool children in terms of their phonological discrimination ability. A low phonological discrimination ability is strongly associated with dyslexia, and this test, known as ‘Foomy’, may be developed to the stage where it can identify likely dyslexics at a very early age and so enable adults to take remedial action. This paper details the development of the test from a technical viewpoint as well as from a psycholinguistic one. It details the piloting of the test and analyses the results obtained from administering it to 104 children in the 3-5 year age group. The project succeeded in producing a test which discriminates between the children, and it delivered results which are normally distributed. Its usefulness in detecting children with dyslexia has yet to be explored.

Morphological Analysis and Generation of Spanish using Xerox Finite-State Tools View slides

Morphological analysis is an integral element in many natural language processing tasks. Finite-state techniques have been successfully applied to computational morphology and are considered the leading method for doing so. This talk will describe an online morphological analyser and generator for Spanish implemented using the Xerox Finite-State Tools. The core of the system is a lexicon, compiled as a finite-state transducer, which handles all regular inflectional and derivational forms of Spanish. Irregularities are dealt with using replace rules, encoded as regular expressions, which make necessary alterations to surface forms. Both these elements are composed together to create a single two-level finite-state transducer. The system achieves approximately 85% coverage on unrestricted Spanish text based on evaluation performed using the 1,000 most frequent word forms in a general Spanish corpus of 100,000 words.
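The lexicon-plus-replace-rules design described above can be imitated in a few lines of plain Python. This is only a sketch: it does not use the Xerox Finite-State Tools, the three-word lexicon and the tag names are hypothetical, and analysis is emulated with a lookup table built by running generation over the lexicon, whereas a real transducer is bidirectional by construction.

```python
import re

# Toy lexicon: lemma -> part-of-speech tag (hypothetical entries).
LEXICON = {"libro": "+N", "ciudad": "+N", "lápiz": "+N"}

# Ordered replace rules over an intermediate form in which "^" marks
# the morpheme boundary. Each rule is a (regex, replacement) pair.
RULES = [
    (re.compile(r"([^aeiouáéíóú])\^s$"), r"\1^es"),  # insert e after a consonant
    (re.compile(r"z\^e"), "c^e"),                    # z becomes c before -es
    (re.compile(r"\^"), ""),                         # erase the boundary symbol
]


def generate(lemma, plural=False):
    """Map a lexicon entry to its surface form."""
    if lemma not in LEXICON:
        raise KeyError(lemma)
    form = lemma + ("^s" if plural else "")
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


def build_analyser():
    """Invert generation by enumerating every form of every lemma."""
    table = {}
    for lemma, pos in LEXICON.items():
        table.setdefault(generate(lemma), []).append(lemma + pos + "+Sg")
        table.setdefault(generate(lemma, plural=True), []).append(lemma + pos + "+Pl")
    return table


ANALYSES = build_analyser()


def analyse(surface):
    """Return all analyses of a surface form (empty list if unknown)."""
    return ANALYSES.get(surface, [])
```

Applying the rules in a fixed order mirrors the composition of rule transducers: each rule sees the output of the previous one.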

EAMT06 Presentations

Mary Hearne: Disambiguation Strategies for Data-Oriented Translation View slides

Declan Groves: View slides

Bart Mellebeek: A Syntactic Skeleton for Statistical Machine Translation View slides

Lending A Hand: Sign Language Machine Translation View slides

Sign languages are the first and preferred languages of the Deaf Community worldwide. As with other minority languages, they are often poorly resourced and in many cases lack political and social recognition. As with speakers of minority languages, Deaf people are often required to access documentation or communicate in a language that is not natural to them. In an attempt to alleviate this problem I am developing an example-based machine translation system to allow Deaf people to access information in the language of their choice. In this presentation, I will give an overview of my previous system and the issues that arose from its development. I will then discuss my current work in terms of system and corpus development. Finally, I will conclude by addressing the problems of using traditional machine translation evaluation metrics for sign languages.

Translating DVD subtitles using Example-Based Machine Translation View slides

Due to limited budgets and an ever-diminishing time-frame for the production of subtitles for movies released in cinemas and on DVD, there is a compelling case for a technology-based translation solution for subtitles (O’Hagan, 2003; Carroll, 2004; Gambier, 2005). Our research focuses on an EBMT tool that produces fully automated translations, which in turn can be edited if required. We will seed the EBMT system with a corpus consisting of existing human translations from DVD to automatically produce high quality subtitles for audio-visual content. To our knowledge this is the first time that any EBMT approach has been used with DVD subtitle translation. Schäler et al. (2003) propose that “… the time is ripe for the transformation of EBMT into demonstrators, and eventually viable products”. We attempt to answer their call with an EBMT approach to the translation of subtitles.

Automatic Treebank-based Acquisition of LFG Grammar for Chinese View slides

Previous work (McCarthy 2004, Cahill 2004 etc.) has shown that an automatic Treebank-based annotation algorithm can be used efficiently to acquire wide-coverage, deep, constraint-based grammars such as LFG grammar approximations for English. In this talk the methodology is extended to the task of Chinese LFG acquisition. Chinese differs drastically from English, which causes some difficulties for the current annotation method. The first part of the talk will present some of the main characteristics of Chinese grammar and the problems they pose within the LFG framework. Then preliminary results of the f-structure annotation of the Penn Chinese Treebank will be presented and analyzed. Finally, some alternative methods to improve the results are proposed.

AMTA Presentations

Andy Way: EBMT of the Basque language View slides

Basque is both a minority and a highly inflected language with free order of sentence constituents. Machine Translation of Basque is thus both a real need and a test bed for MT techniques. In this paper, we present a modular Data-Driven MT system which includes different chunkers as well as chunk aligners which can deal with the free order of sentence constituents of Basque. We conducted Basque to English translation experiments, evaluated on a large corpus (270,000 sentence pairs). The experimental results show that our system significantly outperforms state-of-the-art approaches according to several common automatic evaluation metrics.

Karolina Owczarzak: Wrapper Syntax for Example-Based Machine Translation View slides

TransBooster is a wrapper technology designed to improve the performance of wide-coverage machine translation systems. Using linguistically motivated syntactic information, it automatically decomposes source language sentences into shorter and syntactically simpler chunks, and recomposes their translation to form target language sentences. This generally improves both the word order and lexical selection of the translation. To date, TransBooster has been successfully applied to rule-based MT, statistical MT, and multi-engine MT. This paper presents the application of TransBooster to Example-Based Machine Translation. In an experiment conducted on test sets extracted from Europarl and the Penn-II Treebank we show that our method can raise the BLEU score by up to 3.8% relative to the EBMT baseline. We also conduct a manual evaluation, showing that TransBooster-enhanced EBMT produces better output in terms of fluency and accuracy.

Bart Mellebeek: Multi-Engine Machine Translation by Recursive Sentence Decomposition View slides

In this talk, we present a novel approach to combine the outputs of multiple MT engines into a consensus translation. In contrast to previous Multi-Engine Machine Translation (MEMT) techniques, we do not rely on word alignments of output hypotheses, but prepare the input sentence for multi-engine processing. We do this by using a recursive decomposition algorithm that produces simple chunks as input to the MT engines. A consensus translation is produced by combining the best chunk translations, selected through majority voting, a trigram language model score and a confidence score assigned to each MT engine. We report statistically significant relative improvements of up to 9% BLEU score in experiments (English -> Spanish) carried out on an 800-sentence test set extracted from the Penn-II Treebank.
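The chunk selection step described above can be sketched as follows. The abstract names three signals (majority voting, a trigram language model score, and a per-engine confidence score) but does not say how they are combined; the weighted sum, the default weights, and the function names below are assumptions made for illustration, and `lm_score` is a stand-in for a real trigram model.

```python
from collections import Counter


def select_chunk(candidates, lm_score, confidence, weights=(1.0, 1.0, 1.0)):
    """Pick one translation for a chunk from several MT engines.

    candidates: dict mapping engine name -> its translation of the chunk
    lm_score:   callable returning a language-model score for a string
                (higher is better; stands in for a trigram model)
    confidence: dict mapping engine name -> confidence score
    weights:    hypothetical mixing weights for (votes, LM, confidence)
    """
    w_vote, w_lm, w_conf = weights
    votes = Counter(candidates.values())
    best_text, best_score = None, float("-inf")
    for engine in sorted(candidates):  # sorted for deterministic tie-breaking
        text = candidates[engine]
        score = (w_vote * votes[text] / len(candidates)
                 + w_lm * lm_score(text)
                 + w_conf * confidence.get(engine, 0.0))
        if score > best_score:
            best_text, best_score = text, score
    return best_text


def consensus(chunks, lm_score, confidence):
    """Join the selected translation of each chunk, in source order."""
    return " ".join(select_chunk(c, lm_score, confidence) for c in chunks)
```

Joining the chunks in source order is a simplification; the recursive decomposition described in the abstract would also handle recomposition and word order.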

Beyond the Wall Street Journal: Improving the performance of probabilistic parsers on non-WSJ text View slides

The goal of NLP is to design software which allows computers to perform useful tasks such as translating from one language to another, producing summaries of articles, extracting relevant information from disparate sources, understanding spoken and written requests expressed in the natural language of the user, checking the grammar in a piece of text and helping people to learn a second language. In all of these tasks, parsing, or the analysis of the syntactic structure of sentences, has a helpful role to play. A parser's ability to analyse sentential structure depends on its access to grammatical information, whether in the form of hand-crafted grammar rules, or rules induced automatically from a large collection of previously parsed sentences. The parsing task is made simpler by assuming that the sentences which will be encountered by the parser conform to the structural standards of the language in question. Given the human propensity to err and given the fact that many people are forced to produce sentences in a language other than their native one, the reasonableness of this assumption may be questioned. A parser must be able to produce accurate analyses for sentences which are deviant according to human standards, yet which are routinely interpreted correctly by humans. State-of-the-art probabilistic parsers are generally robust to errors, and they will return analyses for most ungrammatical sentences. However, these robust analyses are not necessarily correct because they do not always reflect the meanings of the ungrammatical sentences. Moreover, the analyses produced by current probabilistic parsers do not explicitly recognize that an error has occurred nor offer potential corrections, something which is vital if the parser is being used to guide grammar checking or computer-aided language learning.
In the first part of this talk, an "error-aware" approach to probabilistic parsing is proposed. An "error-aware" probabilistic parser employs an explicit model of grammatical errors so that ungrammaticality can be detected and successfully analysed. Results are presented for some preliminary experiments which use features of an input sentence and its most probable parse tree to determine whether or not the sentence contains a grammatical error. The automatic creation of ungrammatical sentences, to be used to test/train an error-aware probabilistic parser, is also described.
The second part of this talk is concerned with the problem of parser adaptation, i.e. extending the coverage of a parser beyond the domain of its training data. Two approaches to parser adaptation, self-training and the manual extension of training data, are briefly described. Current work on the parsing of British National Corpus data using Charniak’s WSJ-trained probabilistic parser, and on the creation of a BNC gold standard, is described.

Extracting equivalent chunks from Chinese-English bilingual corpus View slides

Based on up-to-date research on chunking and alignment, we propose a model for chunk alignment between Chinese and English. First, we use a high-quality Chinese-English dictionary to obtain reliable links between Chinese and English words, a step we call anchor word alignment. While the recall of this word alignment is relatively low, it achieves very high precision, which is useful for chunk alignment. Bilingual chunking is then carried out with the marker hypothesis as the theoretical background, yielding a set of marker phrases headed by marker words. However, we cannot guarantee that each Chinese marker phrase can be aligned with one or more English marker phrases. We observed that adding a baseNP chunker helps: the baseNP boundaries can be used to divide the marker phrases into smaller fragments, so that more Chinese chunks can be aligned with English chunks, and vice versa. Finally, we align the identified chunks in two steps. The first step aligns the unambiguous chunks using heuristics; the second ranks the candidates for the ambiguous ones with a log-linear model. Experiments show that this model yields many equivalent chunks. The usefulness of these chunks in a machine translation system remains to be tested in the near future.
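The anchor word alignment step can be illustrated with a short sketch. It assumes pre-tokenised sentences and a dictionary that maps each Chinese word to a set of lowercase English translations; keeping only links that are unambiguous on both sides is one simple way to trade recall for precision, and is an assumption rather than the talk's exact criterion. The dictionary entries in the usage example are hypothetical.

```python
from collections import Counter


def anchor_alignment(zh_tokens, en_tokens, dictionary):
    """Return (zh_index, en_index) links supported by the dictionary.

    A link is kept only if the Chinese word matches exactly one English
    token and that English token matches exactly one Chinese word.
    """
    candidates = []
    for i, zh in enumerate(zh_tokens):
        translations = dictionary.get(zh, set())
        for j, en in enumerate(en_tokens):
            if en.lower() in translations:
                candidates.append((i, j))
    zh_counts = Counter(i for i, _ in candidates)
    en_counts = Counter(j for _, j in candidates)
    return [(i, j) for i, j in candidates
            if zh_counts[i] == 1 and en_counts[j] == 1]


# Usage with a toy dictionary (hypothetical entries).
toy_dictionary = {"我": {"i"}, "喜欢": {"like", "love"}, "书": {"book", "books"}}
links = anchor_alignment(["我", "喜欢", "这", "本", "书"],
                         ["I", "like", "this", "book"],
                         toy_dictionary)
```

Words missing from the dictionary simply stay unaligned, which is where the low recall comes from; the resulting anchors can then constrain which chunks are allowed to align.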

Dublin City University   Last update: 1st October 2010