NCLT Seminar Series 2009/2010

The NCLT seminar series usually takes place on Wednesdays from 4-5 pm in Room L2.21 (School of Computing).
The schedule of presenters will be added below as they are confirmed. Please contact Mohammed Attia if you have any queries about the NCLT 2009/2010 Seminar Series.
Some years ago, a number of papers reported an experimental implementation of an Example Based Machine Translation (EBMT) system using Proportional Analogy. This approach, a type of analogical learning, was attractive because of its simplicity, and the papers reported considerable success with the method. In this paper, we describe our attempt to use this approach for tackling English–Hindi Named Entity (NE) transliteration and English–Chinese translation, a case study relevant to MT especially because many statistical MT systems have difficulty handling unknown NEs. We have implemented an EBMT system using proportional analogy. We have found that the analogy-based system on its own has low precision but high recall, because a large number of names remain untransliterated under this approach. However, mitigating the problems of analogy-based EBMT with SMT, and vice versa, has shown considerable improvement over either approach on its own.
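To illustrate the core idea, here is a minimal sketch of solving a proportional analogy A : B :: C : D over strings, under the simplifying assumption that A and B differ only by a single contiguous substitution; the actual analogy-solving algorithms used in EBMT are more general, and all examples here are illustrative:

```python
def solve_analogy(a, b, c):
    """Return d such that a : b :: c : d, or None if no simple solution."""
    # Longest common prefix of a and b.
    p = 0
    while p < min(len(a), len(b)) and a[p] == b[p]:
        p += 1
    # Longest common suffix of the remainders.
    s = 0
    while s < min(len(a), len(b)) - p and a[len(a)-1-s] == b[len(b)-1-s]:
        s += 1
    # a = prefix + core_a + suffix ; b = prefix + core_b + suffix
    core_a, core_b = a[p:len(a)-s], b[p:len(b)-s]
    i = c.find(core_a)
    if core_a and i >= 0:
        # Apply the same core substitution to c.
        return c[:i] + core_b + c[i+len(core_a):]
    return None

print(solve_analogy("reader", "reading", "speaker"))   # speaking
print(solve_analogy("unhappy", "happy", "undo"))       # do
```

In a transliteration setting, the same mechanism is applied to character strings of names, with analogies drawn from the example base; when no analogical equation can be solved, the name is left untransliterated, which accounts for the low precision / high recall behaviour described above.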
The current state-of-the-art approach to Machine Translation scores translations using probabilities that are not necessarily related to the expected quality of its output. In addition, it has limitations which could be alleviated by the use of syntax-based models such as the Data-Oriented Translation (DOT) model. However, until recently, DOT suffered from a lack of training resources, and as a consequence it is currently immature and underperforms when compared to a Phrase-Based Statistical Machine Translation (PB-SMT) system. In this work we introduce a training mechanism which takes translation quality into account and which improves the translation quality of Machine Translation systems in general, and of PB-SMT and DOT in particular. We also propose to bridge the gap between our syntax-based DOT model and state-of-the-art PB-SMT systems by reformulating DOT as a conditional model, which allows additional features such as language model probabilities to be incorporated into the scoring model.
Source-Side Contextual Modelling in State-of-the-art Statistical Models of Machine Translation
Rejwanul Haque, CNGL, DCU
The Phrase-Based Statistical Machine Translation (PB-SMT) model has recently begun to include source context modelling, under the assumption that the proper lexical choice of the translation for an ambiguous word can be determined from the context in which it appears. Various types of lexical and syntactic features have been explored as effective source context in PB-SMT. In this work, we introduce lexical syntactic descriptions in the form of supertags as source-side context features. We also show that position-independent syntactic dependency relations can be modelled as useful source context to improve lexical selection. These features enable us to exploit source-similarity in addition to target-similarity, as modelled by the language model. A series of experiments has been carried out employing source-context features in the state-of-the-art PB-SMT model on different language pairs with varying training data sizes. However, automatic evaluation reveals that improvements are not linear with respect to training data size.
Parsing techniques have recently become efficient enough for parsers to be used as part of a pipeline in a variety of tasks. Another recent development is the rise of user-generated content in the form of blogs, wikis and discussion forums. Thus, it is both interesting and necessary to investigate the performance of NLP tools trained on edited text when applied to unedited Web 2.0 text. McClosky et al. (2006) report a Parseval f-score decrease of 5% when a WSJ-trained parser is applied to Brown corpus sentences. In this talk I describe what happens when we move even further from the WSJ by investigating the performance of the Berkeley parser (Petrov et al., 2006) on user-generated content. Gold standard phrase structure trees are created for the posts on two threads of the BBC Sport 606 Football discussion forum. The sentences in one thread, the development set, are then parsed with the Berkeley parser under three conditions: 1) when it performs its own tokenisation, 2) when it is provided with gold tokens and 3) when misspellings in the input have been corrected. A qualitative evaluation is then carried out on parser output under the third condition. Based on this evaluation, some “low-hanging fruit” are identified and an attempt is made to handle these either by transforming the input sentence or by transforming the WSJ training material. The success of these transformations is evaluated on the development and test sets, with encouraging results.
Handling Unknown Words in Statistical Latent-Variable Parsing Models for Arabic, English and French
Lamia Tounsi, NCLT, DCU
In this talk we explore the problem of rare and unknown words in parsing for three languages that exhibit different levels of morphological expressiveness: Arabic, French and English. We first compare the extent of the unknown word problem across our three datasets. We present parsing results for our three languages when unknown words are handled by a very simple technique in which rare words in the training data are mapped to one UNKNOWN token and no morphological information is exploited. We then compare this to a more language-aware approach in which morphological clues or signatures are used to assign POS tags to unknown and rare words. For English and French, we use existing signature lists, but for Arabic we define our own. We integrate information about Arabic affixes and morphotactics into our PCFG-LA parser and obtain state-of-the-art accuracy. Finally, we show how these morphological clues can be learnt automatically from a POS-tagged corpus.
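The contrast between the two strategies can be sketched as follows; the surface clues and suffix list below are purely illustrative and are not the actual signature lists used in this work:

```python
def simple_unknown(word):
    """Baseline: every rare word collapses to a single token."""
    return "UNKNOWN"

def signature(word):
    """Map a rare word to a signature encoding coarse morphological clues."""
    sig = "UNK"
    if word[0].isupper():
        sig += "-CAP"                 # capitalisation clue
    if any(ch.isdigit() for ch in word):
        sig += "-NUM"                 # contains a digit
    if "-" in word:
        sig += "-DASH"                # hyphenated form
    # Illustrative English suffixes; a real list is language-specific.
    for suffix in ("ing", "ed", "ly", "tion", "s"):
        if word.lower().endswith(suffix):
            sig += "-" + suffix
            break
    return sig

print(simple_unknown("Simulating"))   # UNKNOWN
print(signature("Simulating"))        # UNK-CAP-ing
print(signature("1980s"))             # UNK-NUM-s
```

The parser then estimates emission probabilities for these signature tokens from rare training words, so an unseen word at test time still contributes POS evidence instead of being completely opaque.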
This talk will give an introduction to the principles, methods and problems of information retrieval over spoken documents and, more generally, the interaction between IR and speech recognition systems.
The Impact of Source-Side Syntactic Reordering on Hierarchical Phrase-based SMT
Jinhua Du, CNGL, DCU
Syntactic reordering has been demonstrated to be helpful and effective for handling different word orders between source and target languages in SMT. However, for hierarchical PB-SMT (HPB), does syntactic reordering still have a significant impact on performance? This paper introduces a reordering approach which exploits the DE grammatical structure in Chinese. We employ the Stanford DE classifier to recognise the DE structures in both the Chinese training and test sentences, and then perform word reordering to make the Chinese sentences better match the word order of English. The annotated and reordered training and test data are applied to a re-implemented HPB system, and the impact of the DE construction is examined. The experiments are conducted on the NIST 2008 evaluation data, and the results show that the BLEU and METEOR scores are significantly improved by 1.83/8.91 and 1.17/2.73 absolute/relative points respectively.
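As a toy illustration of the reordering idea: in Chinese, "[modifier] 的 [head]" places the head last, while English typically places the head first ("B of/that A"). The DE classifier itself is assumed upstream here, so the spans are hand-supplied and the data is illustrative:

```python
def reorder_de(tokens, de_index, mod_start, head_end):
    """Swap [mod] 的 [head] to [head] 的 [mod] within the given span."""
    mod = tokens[mod_start:de_index]         # modifier before 的
    head = tokens[de_index + 1:head_end]     # head after 的
    return (tokens[:mod_start] + head + [tokens[de_index]] + mod
            + tokens[head_end:])

# "中国 的 经济" ("the economy of China"): the head 经济 is moved
# before the modifier 中国 to better match English word order.
print(reorder_de(["中国", "的", "经济"], de_index=1, mod_start=0, head_end=3))
# ['经济', '的', '中国']
```

Applying such a transformation consistently to training and test data lets the translation model learn more monotone phrase correspondences.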
English and Arabic differ in syntax and morphology, which poses many challenges for statistical machine translation (SMT) between the two languages. In this talk, we address one of those challenges: the morphological richness of Arabic in comparison with English. Arabic's rich morphology increases the number of distinct word forms in an Arabic corpus, which causes a data sparsity problem for SMT systems. I will present an approach to alleviating this problem by separating some Arabic prefixes (proclitics) from Arabic words, for Arabic-to-English and English-to-Arabic Phrase-Based and Hierarchical Phrase-Based SMT. I will then present experimental results for this approach and discuss its effect on translation quality.
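A much-simplified sketch of proclitic separation, shown in Buckwalter transliteration; the proclitic classes, their ordering, and the minimum-stem heuristic are illustrative, and a real segmenter must also resolve ambiguity that this greedy scheme ignores (e.g. a stem-initial letter that happens to coincide with a proclitic):

```python
# Ordered proclitic classes: conjunction, then preposition, then article.
CLASSES = [("w", "f"), ("b", "l", "k"), ("Al",)]

def segment(word, min_stem=2):
    """Strip at most one proclitic per class, keeping a minimum stem length."""
    prefix = []
    for cls in CLASSES:
        for p in cls:
            if word.startswith(p) and len(word) - len(p) >= min_stem:
                prefix.append(p + "+")
                word = word[len(p):]
                break   # at most one proclitic from this class
    return prefix + [word]

# "wAlktAb" = w+ (and) Al+ (the) ktAb (book)
print(segment("wAlktAb"))   # ['w+', 'Al+', 'ktAb']
```

Segmenting this way maps "ktAb", "AlktAb" and "wAlktAb" onto the single stem "ktAb", directly reducing the number of distinct word forms the translation model has to estimate.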
NLP techniques can be used to facilitate the automatic acquisition of didactic resources. The detection of learning material in pedagogical books and the generation of new didactic material are matters of research in the NLP and ICALL areas. Specifically, the automatic generation of questions about domain content or for language learning has acquired particular importance in recent years. In this seminar I will give some general ideas about the automatic generation of Multiple-Choice Questions (MCQs) and I will focus on the challenge of generating the incorrect options in an MCQ, i.e. the distractors. I will also present some results of an evaluation carried out with learners to measure the quality of the automatically generated distractors.
In 1999, Kevin Knight wrote a tutorial, "A Statistical MT Tutorial Workbook". Ten years later (last year), he wrote another, "Bayesian Inference with Tears", in which he explained the fundamental terminology and core techniques involved in Bayesian inference using real-world applications with minimal mathematical derivations. In this talk, the speaker will give a brief introduction to this tutorial and share his reflections on this type of method.
Probabilistic Context Free Grammars with Latent Annotations (PCFG-LAs) are an extension of PCFGs where grammar symbols are augmented with annotations that describe specific distributional properties. This formalism is used in state-of-the-art phrase structure parsers like the Berkeley Parser and the LORG Parser developed at DCU. In this talk, I will present the inference methods and the specific parsing algorithms used with PCFG-LAs. Then, I will introduce the issues raised by unknown words in phrase structure grammars and how they are dealt with in the LORG Parser.
Statistical machine translation relies heavily on parallel corpora to train its models. While more and more bilingual corpora are readily available, the quality of the sentence pairs should be taken into consideration. Some work has already been done on decoding-based data cleaning, which requires a decoding procedure to be performed on each sentence pair in the training corpus. However, this is not efficient enough for large corpora. By taking into account the word alignments extracted in the training phase, we present a novel lattice score-based data cleaning method. We use word alignments to create anchor pairs and source-side lattices, and then expand target-side phrase networks to search for approximated decoding results, which are finally fed into a BLEU score threshold filter for data cleaning. In this seminar, we will give details of this data-cleaning method and also propose some other potential uses of the method for discussion.
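The final filtering step can be sketched as follows. The upstream lattice search is stubbed out with an identity function, the smoothed sentence-level BLEU below is a minimal add-one variant for illustration only, and the threshold value is arbitrary:

```python
import math
from collections import Counter

def sentence_bleu(hyp, ref, max_n=4):
    """Add-one smoothed sentence-level BLEU with brevity penalty (toy)."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, r[g]) for g, c in h.items())   # clipped matches
        total = sum(h.values())
        log_prec += math.log((match + 1) / (total + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec)

def clean(pairs, approx_decode, threshold=0.5):
    """Keep pairs whose approximated decoding beats the BLEU threshold."""
    return [(src, tgt) for src, tgt in pairs
            if sentence_bleu(approx_decode(src), tgt) >= threshold]

# Stub "decoder": identity on the source, standing in for the lattice search.
pairs = [(["a", "b", "c"], ["a", "b", "c"]),   # consistent pair
         (["a", "b"], ["x", "y", "z"])]        # noisy pair
kept = clean(pairs, approx_decode=lambda s: s)
print(len(kept))   # 1
```

The crucial efficiency point is that `approx_decode` works from alignments already produced during training, so no full decoder run per sentence pair is needed.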
A Lexical-Functional Grammar (LFG) f-structure is a labelled directed acyclic graph (DAG) providing an abstract dependency representation for a sentence. F-structures contain grammatical functions (labelled bilexical dependencies corresponding to the arcs of the DAG, such as OBJ or SUBJ), and grammatical features (atom-valued attributes) which describe properties (sub-parts) of the f-structure such as tense, number, or adjunct type. Automatic identification of correspondences between the grammatical features used in LFG grammars for different languages is of some practical interest both for machine translation and grammar design. We present a method for automatically recognising correspondences between atomic features of LFG grammars for different languages.
We describe the design and implementation of large-scale data processing techniques for the automatic acquisition of lexical resources for Modern Standard Arabic (MSA) from annotated and un-annotated corpora, and demonstrate their usefulness for creating a wide-coverage, general-domain lexicon. Modern lexicographic principles (Atkins and Rundell, 2008) emphasise that the corpus is the only viable evidence that a lexical entry still exists in a speech community. Unlike most available Arabic lexicons, which are based on previous historical (and dated) dictionaries, our lexicon starts off with corpora of contemporary texts. The lexicon is encoded in the Lexical Markup Framework (LMF), a metamodel that provides a standardized framework for the construction of electronic lexical resources. The aim of this automatic lexical resource acquisition is to construct a lexicon of Modern Standard Arabic that is compatible with the LMF specifications.
Transfer-Based SMT is composed of three parts: (i) parsing to deep syntactic structure, (ii) transfer from source language (SL) deep structure to target language (TL) deep structure, and (iii) generation of the TL sentence. In this talk, I describe a Transfer-Based SMT system that is trained on parsed bilingual corpora. For training, similar to Phrase-Based SMT, we extract phrasal correspondences by first establishing a word alignment between pairs of sentences before extracting all phrases consistent with this alignment. In our case, the word alignment is carried out between nodes in dependency structures, and phrases take the form of pairs of dependency snippets with variables allowed at leaf level. The system includes a statistical beam-search decoder that uses a log-linear model to combine feature scores for ranking hypothesis TL structures. In the talk, I will describe experimental results for the system trained on Europarl and Newswire text and give some example translations.
The Phrase-Based Statistical Machine Translation (PB-SMT) model has recently begun to include source context modeling, under the assumption that the proper lexical choice of an ambiguous word can be determined from the context in which it appears. Various types of lexical and syntactic features such as words, parts-of-speech, and supertags have been explored as effective source context in SMT. In this paper, we show that position-independent syntactic dependency relations of the head of a source phrase can be modeled as useful source context to improve target phrase selection and thereby improve overall performance of PB-SMT. On a Dutch–English translation task, by combining dependency relations and syntactic contextual features (part-of-speech), we achieved a 1.0 BLEU (Papineni et al., 2002) point improvement (3.1% relative) over the baseline.
The strict character of most of the existing Machine Translation (MT) evaluation metrics does not permit them to capture lexical variation in translation. However, a central issue in MT evaluation is the high correlation that the metrics should have with human judgments of translation quality. In order to achieve a higher correlation, the identification of sense correspondences between the compared translations becomes really important. Given that most metrics are looking for exact correspondences, the evaluation results are often misleading concerning translation quality. Apart from that, existing metrics do not permit one to make a conclusive estimation of the impact of Word Sense Disambiguation techniques on MT systems. In this paper, we show how information acquired by an unsupervised semantic analysis method can be used to render MT evaluation more sensitive to lexical semantics. The sense inventories built by this data-driven method are incorporated into METEOR: they replace WordNet for evaluation in English and render METEOR's synonymy module operable in French. The evaluation results demonstrate that the use of these inventories gives rise to an increase in the number of matches and the correlation with human judgments of translation quality, compared to precision-based metrics.
So far, many effective hypothesis alignment metrics have been proposed and applied to system combination, such as TER, HMM, ITER and IHMM. In addition, Minimum Bayes-risk (MBR) decoding and confusion networks (CNs) have become state-of-the-art techniques in system combination. In my presentation, I will talk about a three-pass system combination strategy that combines hypothesis alignment results derived from different alignment metrics to generate a better translation. First, the different alignment metrics are used to align the backbone and hypotheses, and an individual CN is built for each alignment result; then we construct a super network by merging the multiple metric-based CNs and generate a consensus output. Finally, a modified consensus network MBR (ConMBR) approach is employed to search for the best translation. Our proposed strategy outperforms the best single CN as well as the best single system in our experiments on the NIST Chinese-to-English test set.
The presentation introduces MultiNet, a paradigm for meaning representation based on semantic networks, and its tools and resources (e.g. a syntacto-semantic parser and a computational lexicon). Four applications based on MultiNet are discussed: NLI-Z39.50, a natural language interface for information providers on the internet targeting bibliographic databases; IRSAW, an open-domain question answering system integrating multiple streams of candidate answers produced by methods ranging from deep semantic analysis to pattern matching; DeLite, a readability checking tool for web pages, which computes a global readability score and identifies text passages which are difficult to read; and GIRSA-WP, a geographic information retrieval system for Wikipedia articles, combining approaches from question answering and information retrieval. All of these applications illustrate that complex natural language processing benefits from semantic processing.
Low-Resource Machine Translation Using MaTrEx: The DCU Machine Translation System for IWSLT 2009
Tsuyoshi Okita, CNGL, DCU
We give a description of our MT system for IWSLT 2009. Two techniques are deployed to improve translation quality in a low-resource scenario. The first is to use multiple segmentations in MT training and to utilise word lattices in the decoding stage. The second is to select the optimal training data for building the MT systems. In this year's participation, we use three different prototype SMT systems, and the outputs from each system are combined using a standard system combination method. Our system is the top system for the Chinese-English CHALLENGE task in terms of BLEU score.
Extending the DCU-250 Arabic Dependency Bank Gold Standard
Hanna Bechara, NCLT, DCU
DCU has developed and used an Arabic LFG gold standard (El-Raheb et al., 2006) based on the Penn Arabic Treebank (Maamouri and Bies, 2004) for evaluation. This resource consists of 250 annotated sentences. In order to cover a larger set of grammatical phenomena and provide a more comprehensive reference, we decided to extend the existing dependency bank from 250 to 500 sentences. The process consists of three steps: i) the random selection of 250 new sentences from the Penn Arabic Treebank, ii) the application of the Arabic Annotation Algorithm to automatically annotate the new 250 trees with abstract LFG functional information (Tounsi et al., 2009), and iii) the combination of the old and new sets for a full evaluation. A number of inconsistencies have been identified and handled both automatically and manually. This investigation involves determining case for nouns in morphologically ambiguous instances, dealing with improperly marked traces, and extending the annotation scheme to cover specific cases for appositions, adjective types and adverbs. We conducted a qualitative evaluation of the annotation, measuring inter-annotator agreement (S, Pi and Kappa) on 50 sentences, and achieved a score of 0.98.
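Of the agreement coefficients mentioned, Cohen's kappa is the simplest to sketch: it corrects observed agreement between two annotators for the agreement expected by chance from their label distributions. The labels below are illustrative, not drawn from the dependency bank:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["SUBJ", "OBJ", "SUBJ", "ADJ", "OBJ"]
b = ["SUBJ", "OBJ", "SUBJ", "OBJ", "OBJ"]
print(round(cohen_kappa(a, b), 3))   # 0.667
```

A kappa of 0.98, as reported above, indicates near-perfect agreement after chance correction.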
Template-based EBMT
Sudip Naskar, CNGL, DCU
Example-based machine translation (EBMT) is essentially translation by analogy. It takes a stance somewhere between RBMT and SMT. EBMT systems differ widely in the way they store the examples. In this talk I will give an overview of EBMT and the core tasks involved: matching, alignment and recombination. I will briefly discuss different approaches to EBMT: the run-time approach, template-based EBMT, and tree-based EBMT. Then I will talk about generalised templates and template-based EBMT in detail.
In this talk I will tackle the knowledge acquisition bottleneck problem in the field of Computational Linguistics. This issue will be introduced and studied from a general point of view, with the aim of identifying key elements that could lead us a step forward. In this respect, I will argue in favour of highlighting Language Resources (LRs), Web 2.0 sources and representation standards. Subsequently, I will move to the specific by applying these guidelines to a case study: the acquisition of Named Entities (NEs). I will present an automatic procedure to build a multilingual lexicon of NEs and to connect it to other LRs and ontologies. The different phases of this methodology and the techniques involved (e.g. text similarity) will be evaluated and, furthermore, I'll show the utility of the knowledge gathered by applying it to a real-world Question Answering scenario.
Transfer-Based SMT is composed of three parts: i) parsing to deep linguistic structure, ii) transfer from source language (SL) linguistic structure to target language (TL) linguistic structure, and iii) generation of the TL sentence. Each of the three steps uses a statistical model to select the best or n-best output. In this talk, I describe a Transfer-Based SMT system that uses the LFG f-structure as the intermediate representation for transfer and is trained fully automatically on LFG-parsed bilingual corpora. For training, similar to Phrase-Based SMT, we extract phrasal correspondences by first establishing a word alignment between pairs of sentences before extracting all phrases consistent with this alignment. In our case, the word alignment is between nodes in dependency structures as opposed to surface-form sentences. In addition, the phrases extracted are pairs of dependency snippets with variables allowed at leaf level to map missing arguments to the correct position in the TL. The system includes a statistical beam-search decoder that uses a log-linear model to combine feature scores for ranking hypothesis TL structures. In the talk, I will present preliminary experimental results for the system trained on Europarl and Newswire text for a restricted sentence length of 5-15 words and tested on held-out data.
DCU employs systems that can automatically annotate Penn-II style trees and generate deep syntactic analyses based on Lexical Functional Grammar. This talk starts with a very brief overview of the existing automatic annotation algorithms developed at DCU. The remainder of the talk focuses on the English Annotation Algorithm, in particular on enriching and restructuring its output so that the resulting syntactic analyses also contain deep semantic representations.
We present an extensive empirical evaluation of collocation extraction methods based on lexical association measures and their combination. The experiments are performed on a set of collocation candidates extracted from the Prague Dependency Treebank with manual morphosyntactic annotation. The collocation candidates were manually labeled as collocational or non-collocational. The evaluation is based on measuring the quality of ranking the candidates according to their chance to form collocations. Performance of the methods is compared by precision-recall curves and mean average precision scores. Further, we study the possibility of combining lexical association measures and present empirical results of several combination methods that significantly improved the state-of-the-art in this task. We also propose a model reduction algorithm significantly reducing the number of combined measures without a statistically significant difference in performance.
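A minimal example of ranking bigram candidates by one such association measure, pointwise mutual information (PMI), on a toy corpus. The corpus is invented for illustration, and the example also shows PMI's well-known bias toward low-frequency pairs, one reason the talk combines many measures rather than relying on one:

```python
import math
from collections import Counter

def pmi_ranking(tokens):
    """Rank adjacent bigrams by PMI = log P(w1,w2) / (P(w1) P(w2))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), c in bigrams.items():
        # Maximum-likelihood estimates from raw counts.
        scores[(w1, w2)] = math.log((c / (n - 1)) /
                                    ((unigrams[w1] / n) * (unigrams[w2] / n)))
    return sorted(scores, key=scores.get, reverse=True)

corpus = ("the strong tea , the strong tea , the heavy rain , "
          "the rain , the tea").split()
ranking = pmi_ranking(corpus)
print(ranking[0])   # ('heavy', 'rain')
```

Note that the once-seen "heavy rain" outranks the twice-seen "strong tea": PMI rewards rarity, which is exactly the kind of behaviour precision-recall evaluation over manually labelled candidates can expose.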
The long-term goal of our work is to develop a system which detects errors in grammar and usage so that appropriate feedback can be given to non-native English writers, a large and growing segment of the world's population. Estimates are that in China alone as many as 300 million people are currently studying English as a second language (ESL). In particular, usage errors involving prepositions are among the most common types seen in the writing of non-native English speakers. For example, Izumi et al. (2003) reported error rates for English prepositions that were as high as 10% in a Japanese learner corpus. Since prepositions are such a nettlesome problem for ESL writers, developing an NLP application that can reliably detect these types of errors will provide an invaluable learning resource to ESL students. To address this problem, we use a maximum entropy classifier combined with rule-based filters to detect preposition errors in a corpus of student essays with a precision of 84%. In this talk, I will discuss the system as well as issues in developing and evaluating NLP grammatical error detection applications.
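A toy sketch of the kind of contextual feature extraction that would feed such a classifier; the preposition list and feature names are illustrative rather than the system's actual feature set, and POS tags are assumed to come from an upstream tagger:

```python
PREPOSITIONS = {"in", "on", "at", "of", "for", "to", "with"}   # illustrative

def preposition_features(tokens, tags):
    """Yield (index, feature_dict) for each preposition occurrence."""
    for i, (word, tag) in enumerate(zip(tokens, tags)):
        if word.lower() in PREPOSITIONS:
            yield i, {
                "prep": word.lower(),
                # Immediate lexical and POS context around the preposition.
                "prev_word": tokens[i-1].lower() if i > 0 else "<s>",
                "next_word": tokens[i+1].lower() if i+1 < len(tokens) else "</s>",
                "prev_tag": tags[i-1] if i > 0 else "<s>",
                "next_tag": tags[i+1] if i+1 < len(tokens) else "</s>",
            }

tokens = ["He", "arrived", "at", "Monday"]
tags = ["PRP", "VBD", "IN", "NNP"]
for i, feats in preposition_features(tokens, tags):
    print(i, feats["prep"], feats["prev_word"], feats["next_word"])
# 2 at arrived monday
```

A maximum entropy model trained over such feature dictionaries can then score whether "at" is likely in this context (here "on Monday" would be expected), with rule-based filters suppressing predictions in contexts the classifier handles poorly.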
Last update: 1st October 2010