NCLT Seminar Series 2008/2009
The NCLT seminar series usually takes place on Wednesdays from 4-5 pm in Room L2.21 (School of Computing).
The schedule of presenters will be added below as they are confirmed. Please contact Deirdre Hogan if you have any queries about the NCLT 2008/2009 Seminar Series.
We investigate the impact of the original source language (SL) on French-English PB-SMT. We train four configurations of a state-of-the-art PB-SMT system based on French-English parallel corpora which differ in terms of the original SL, and conduct experiments in both translation directions. We see that data containing original French and English translated from French is optimal when building a system translating from French into English. Conversely, using data comprising exclusively French and English translated from several other languages is suboptimal regardless of the translation direction. Unless the quality of training data is controlled, translation performance can decrease drastically, by up to 38% relative BLEU in our experiments.
One of the problems connected with LFG-based statistical machine translation is the risk of ending up with a "broken" f-structure: after transfer, some of the non-root f-structures may be detached from the whole because of the absence of an exact match in the training data. In this scenario it is possible to guess with some certainty which f-structure is the parent and which is the detached one. What can be harder to guess is the grammatical function of the detached f-structure. Using English and German Europarl data, we show that a Naive Bayes classifier is able to guess the missing grammatical function with reasonably high accuracy (82-91%), and that the approach improves the performance of a statistical machine translation system in terms of BLEU score.
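The Naive Bayes step described above can be sketched as follows. This is a minimal illustration, not the abstract's actual system: the feature set here (parent predicate, child predicate, case) and the German toy examples are hypothetical stand-ins for whatever cues the real classifier extracts from f-structures.

```python
from collections import Counter, defaultdict
import math

class NaiveBayesGF:
    """Toy Naive Bayes classifier guessing the grammatical function
    (e.g. SUBJ, OBJ) of a detached f-structure from simple features."""

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # (slot, label) -> Counter(value)

    def train(self, examples):
        # examples: list of (features, label); each features tuple holds
        # hypothetical cues such as (parent_pred, child_pred, child_case)
        for feats, label in examples:
            self.label_counts[label] += 1
            for slot, value in enumerate(feats):
                self.feature_counts[(slot, label)][value] += 1

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best_label, best_logp = None, float("-inf")
        for label, count in self.label_counts.items():
            logp = math.log(count / total)  # class prior
            for slot, value in enumerate(feats):
                c = self.feature_counts[(slot, label)]
                # add-one smoothing over the values seen in this slot
                logp += math.log((c[value] + 1.0) / (count + len(c) + 1.0))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

clf = NaiveBayesGF()
clf.train([
    (("sehen", "Hund", "acc"), "OBJ"),
    (("sehen", "Katze", "acc"), "OBJ"),
    (("sehen", "Mann", "nom"), "SUBJ"),
    (("sehen", "Frau", "nom"), "SUBJ"),
])
print(clf.predict(("sehen", "Maus", "acc")))  # -> OBJ (via the case feature)
```

Even for an unseen child predicate, the case feature is enough to tip the decision, which is the intuition behind guessing a detached f-structure's function from its morphosyntactic cues.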
This talk describes a hybrid machine translation approach that consists of integrating bilingual chunks obtained using example-based methods into the Apertium free/open-source machine translation platform, which uses a shallow-transfer translation approach. In the integration of bilingual chunks, special care has been taken so as not to break the application of the existing Apertium structural transfer rules, since this would increase the number of ungrammatical translations. The method consists of (i) the application of a dynamic-programming algorithm to compute the best coverage of the input sentence to translate, given the collection of bilingual chunks available; (ii) the translation of the input sentence as usual by Apertium; and (iii) the application of a language model to choose one of the possible translations for each of the bilingual chunks detected. The extraction and filtering of the bilingual chunks, together with preliminary results for translation from English to Spanish and vice versa, will also be discussed.
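Step (i) above, computing the best coverage of the input sentence given the available bilingual chunks, can be sketched with a small dynamic programme. This is a hedged illustration only: the scoring here (maximise tokens covered by known chunks, then minimise the number of segments) is one plausible criterion, not necessarily the one used in the talk's system.

```python
def best_coverage(tokens, chunks):
    """Segment `tokens` so as to cover as many tokens as possible with
    known source-side chunks, breaking ties by using fewer segments.
    `chunks` is a set of tuples of source tokens."""
    n = len(tokens)
    # best[i]: (covered_tokens, -num_segments, segmentation of tokens[:i])
    best = [None] * (n + 1)
    best[0] = (0, 0, [])
    for i in range(n):
        if best[i] is None:
            continue
        covered, neg_segs, path = best[i]
        # option 1: token i falls outside any known chunk
        cand = (covered, neg_segs - 1, path + [(tokens[i],)])
        if best[i + 1] is None or cand[:2] > best[i + 1][:2]:
            best[i + 1] = cand
        # option 2: a known bilingual chunk starts at position i
        for j in range(i + 1, n + 1):
            chunk = tuple(tokens[i:j])
            if chunk in chunks:
                cand = (covered + (j - i), neg_segs - 1, path + [chunk])
                if best[j] is None or cand[:2] > best[j][:2]:
                    best[j] = cand
    return best[n][2]

chunks = {("in", "spite", "of"), ("the", "weather")}
print(best_coverage("we went out in spite of the weather".split(), chunks))
```

The returned segmentation marks which spans Apertium should translate via the chunk inventory and which fall back to the standard shallow-transfer pipeline.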
This talk is split into two halves. First, I give an overview of my research on classifying sentences as either grammatical or ill-formed. The diverse methods I developed together with Jennifer and Josef over the past three years all outperform the baseline of parsability with the English ParGram LFG grammar (without resorting to robustness features like the fragment rule). Then I will talk about the CNGL computing cluster. The simplest way of using the cluster is to allocate a machine exclusively and run a single process. However, more experiments can be run in the same time if tasks are run in parallel and if the resource allocation (CPUs and memory) is planned more carefully.
Language learners not only write incorrect forms but also produce correct instances of the target language, and both provide useful information on their strengths and weaknesses at any given time. Natural Language Processing tools, such as taggers, are useful resources for automatically processing learners' written texts. However, using a part-of-speech tagger on learner language is likely to result in tagging errors due to the incorrect forms produced by learners. Since errors in POS tagging will result in larger errors in the analysis of incorrect grammatical or lexical forms, it is essential to encode all components in a given text with robust and consistent tags. Following a description of the creation and annotation of a learner language corpus, this presentation will explicate how the reliability of TreeTagger was increased for use with learner language. In particular, it will show how the tagging accuracy was improved by means of (a) identifying the lemmas that are unknown to the tagger, (b) checking the automatically obtained part-of-speech tags against an extended set of common-sense rules based on recurrent tagging errors, and (c) cross-referencing the part-of-speech tags with the error-encoded tags. Evaluation results will then be presented and future directions for research discussed.
Suppose that, for a given test sentence, the set of valid translations is not unique but enormously large. Suppose further that a literal translation is not an ideal translation, but in practice tends to be among the worst or most disfluent renderings, so that its adequacy score is low. Both of these conditions in fact hold for Japanese-English translation. Phrase-based SMT makes two assumptions: 1) the underlying distribution of sentences from which we assume the training set is sampled is unique (incorporating smoothing for phrases and n-grams due to the finiteness of the corpus), and 2) the distributions of the training parallel corpus and the test corpus are identical. Needless to say, the two phenomena above fall outside the scope of these assumptions. This may explain BLEU-4 scores of 15-40: if the bag-of-words (BLEU-1) score of a phrase-based SMT system is around 50 (viewed as multi-class classification, a training accuracy of only 50 percent), then the BLEU-4 score can be expected to stay around 15-40, since the bag-of-words score is an upper bound on the BLEU-4 score. One approach would be to consider what constitutes a non-literal translation in MT systems.
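The upper-bound claim above follows from BLEU-N being the geometric mean of the modified n-gram precisions p_1..p_N: whenever the precisions are non-increasing in n (as they typically are), the geometric mean cannot exceed p_1. A minimal numerical illustration, with hypothetical but typical precision values:

```python
import math

def bleu(precisions, brevity_penalty=1.0):
    # BLEU-N: brevity penalty times the geometric mean of p_1..p_N
    return brevity_penalty * math.exp(
        sum(math.log(p) for p in precisions) / len(precisions))

# hypothetical modified n-gram precisions, non-increasing in n
p = [0.50, 0.28, 0.18, 0.12]
bleu1 = bleu(p[:1])   # bag-of-words score = p_1 = 0.50
bleu4 = bleu(p)       # about 0.23
assert bleu4 <= bleu1
```

So a system with a bag-of-words score around 50 is bounded above at BLEU-4 by that same 50, and with typical higher-order precision decay lands in the 15-40 range the abstract mentions.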
How should we design and gather a corpus that will meet the needs of linguists (computational and otherwise) and others over the next ten years? The most-cited model for corpus design over the last two decades has been the British National Corpus. While its design was clearly excellent for its time and has served very well, it is now approaching twenty years old. It is from the pre-web world, when electronic text was in limited supply (and for many text types not available at all). We need new models for a world where electronic text is available in vast quantities, for most text types, so where corpora can be very large and very cheap to prepare. I will talk about two current projects. Both are for English. One (Big Web Corpus or BiWeC) concentrates on size, and the other (the New Model Corpus) concentrates on corpus structure, markup, and a collaborative model. Our hope is that the two strands will converge, giving a very large corpus which has many useful, large and well-specified subcorpora, which is richly marked up, which supports a wide range of research questions across the linguistics and language-technology worlds, and which is accessible as a web service, via a rich web API, for researchers and developers to use remotely. The talk will include a demo of the Sketch Engine (a corpus query tool capable of handling multi-billion-word, richly-marked-up corpora) and also some comments on the relation between what we do, in corpus linguistics, and what Google and other commercial search engines do.
This talk explores the issue of automatically generated ungrammatical data and its use in error detection, with a focus on the task of classifying a sentence as grammatical or ungrammatical. I present an error generation tool called GenERRate and show how GenERRate can be used to improve the performance of a classifier on learner data. I also describe initial attempts to replicate Cambridge Learner Corpus errors using GenERRate.
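The idea of automatically generating ungrammatical training data can be sketched as follows. To be clear, this is not the GenERRate tool itself, just a toy illustration of the technique: deriving ungrammatical variants of grammatical sentences via simple error types (here, function-word deletion and adjacent-token swaps; the word list and error types are illustrative assumptions).

```python
import random

def generate_errors(sentence, seed=0):
    """Derive ungrammatical variants of a grammatical sentence by
    (a) deleting a function word, (b) swapping one adjacent token pair."""
    rng = random.Random(seed)
    tokens = sentence.split()
    variants = []
    # deletion errors: drop each function word in turn
    function_words = {"the", "a", "an", "to", "of", "in"}
    for i, t in enumerate(tokens):
        if t in function_words:
            variants.append(" ".join(tokens[:i] + tokens[i + 1:]))
    # word-order error: swap one randomly chosen adjacent pair
    if len(tokens) > 1:
        i = rng.randrange(len(tokens) - 1)
        swapped = tokens[:]
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        variants.append(" ".join(swapped))
    return variants

print(generate_errors("she went to the station"))
```

Each grammatical sentence thus yields several labelled ungrammatical counterparts, which is what makes the grammatical/ungrammatical classification task trainable at scale.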
Machine translation from English to Indian languages in general, and Bengali in particular, is challenging, as Indian languages generally have free phrase order with the verb phrase usually at the end. Prepositions in English are translated to inflections and/or postpositions in Indian languages. For English-to-Bengali machine translation, we have carried out experiments with a baseline phrase-based model; a reordering model, where the input English sentences are reordered to follow the Bengali phrase order; a factored model with root, POS and inflection as the factors; and a chunk-factored model with chunk information as an additional factor. The BLEU scores of the English-Bengali machine translation systems have improved as we have moved from the baseline phrase-based model to the chunk-factored model.
The NLP/MT research activities in the Computer Science and Engineering Department, Jadavpur University, India, started rather recently. NLP is taught to both undergraduate and postgraduate students in the department. The NLP/MT group is involved in three national-level consortium projects: Cross Language Information Access, for generating the snippets and summaries of the retrieved results and translating the generated snippets from English/Hindi to several Indian languages; English to Indian language Machine Translation, for English-to-Bengali machine translation using a TAG-based system, SMT, and analysis-and-generation methods; and Indian language to Indian language machine translation, for bidirectional machine translation systems involving Bengali and Hindi. The other areas in which research and development work is being carried out include Named Entity Recognition and Classification, NE Transliteration, answer validation using Textual Entailment, opinion extraction and summarization, and emotion analysis, among others.
NLP/MT research in India is quite widespread and coordinated, with a number of consortium-mode projects at the national level funded by the Government of India. The main players are the various IITs, IIITs, university departments and government research laboratories. Industry is also playing a very supportive role. The areas in which research and development work is going on include English to Indian language machine translation, Indian language to Indian language machine translation, Cross Language Information Access, speech processing, optical character recognition and language resource creation. The Linguistic Data Consortium for Indian Languages has been formed to carry on the development of linguistic resources for Indian languages.
Most current work in statistical Machine Translation uses log-likelihood-based mechanisms to score the translation decisions made. This includes the syntax-based Data-Oriented Translation system at DCU. However, log-likelihood scoring is not necessarily correlated with the quality of a translation decision. In this work we present a novel technique to rescore fragments in the Data-Oriented Translation model based on their contribution to translation accuracy. We describe three new rescoring methods, and present the initial results of a pilot experiment on a small subset of the Europarl corpus. This work is a proof of concept, and is the first step in directly optimizing translation decision scores on the hypothesized accuracy of the potential translations that the decisions give rise to.
Phrase-based Statistical Machine Translation (PB-SMT) models - the most widely researched paradigm in MT today - rely heavily on the quality of phrase pairs induced from large amounts of training data. There are numerous methods for extracting these phrase translations from parallel corpora. In this talk I will describe phrase pairs induced from percolated dependencies and contrast them with three pre-existing phrase extraction methods. I will also present the performance of the individual phrase tables and their combinations in a PB-SMT system. I will then conclude with ongoing experiments and future research directions.
Visiwords or visiterms was an approach originally pioneered by Barnard, Duygulu, Forsyth and others for the problem of automatically labeling (or indexing) unknown still images with words (or really terms) for which a previously labeled training set of limited size and quality exists. It has since proved highly effective on a range of multimedia labeling tasks, leading to over 500 citations (according to Google Scholar) of the best-known paper on the topic. The technique has the potential to be used for other labeling tasks, including labeling out-of-vocabulary words in machine translation. The seminar will outline the approach and go on to suggest ways it could be used in MT.
In standard SMT systems, the phrases present in the phrase table are obtained by extending a word alignment over the parallel corpus and then extracting all phrases compatible with this word alignment. There are other methods by which these phrases could be obtained. Since significant improvements have been shown to be obtainable by adding resources extracted with different techniques, we are going to investigate the effects of allowing these phrases to play a privileged role in the decoding process. We investigate different methods of merging these resources with the ones obtained by an SMT system and evaluate the impact in translation quality and the percentage of phrases from each knowledge source that are used to obtain the final translations.
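One common way of merging phrase resources of the kind described above is to pool the tables while keeping per-source indicator features, so that tuning can learn to privilege one knowledge source. The sketch below assumes this strategy; the talk's actual merging schemes may differ, and the table layout here (pair -> probability) is a simplification of real phrase tables.

```python
def merge_phrase_tables(tables):
    """Merge phrase tables from several extraction methods. Each input
    table maps (src_phrase, tgt_phrase) -> probability; the merged entry
    records per-source probabilities plus 0/1 provenance indicators."""
    merged = {}
    for idx, table in enumerate(tables):
        for pair, prob in table.items():
            feats = merged.setdefault(pair, {"p": [0.0] * len(tables),
                                             "source": [0] * len(tables)})
            feats["p"][idx] = prob
            feats["source"][idx] = 1   # this knowledge source proposed the pair
    return merged

standard = {("la maison", "the house"): 0.7}
alternative = {("la maison", "the house"): 0.6,
               ("maison bleue", "blue house"): 0.4}
merged = merge_phrase_tables([standard, alternative])
print(merged[("la maison", "the house")])
```

At decoding time the indicator features make it possible to measure, per output sentence, what percentage of the used phrases came from each source, which is exactly the evaluation the abstract describes.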
In this talk, I will describe 1) the MaTrEx system we used in the WMT09 shared tasks, and 2) some new research progress on system combination. For the current MaTrEx system, we adopted a combination-based multi-engine framework which includes three individual systems and demonstrates good performance. Regarding system combination, we employed a three-tier strategy which consists of a Minimum Bayes-Risk decoder, a confusion network decoder and a re-ranking module. Besides the shared translation task, we also participated in the system combination task, and performed very well. With respect to the progress on system combination, I propose a new hypothesis alignment algorithm which can combine different individual alignment metrics and output a refined reordering model. Some preliminary experimental results will be shown in my talk. However, the new algorithm needs to be investigated more deeply in future work.
A prerequisite for MT systems is text-based data. This simple fact is a complex obstacle when dealing with signed languages given their visual-gestural nature and the fact that there is no formal writing system for sign languages (SLs). While many representation formats do exist, there is no one form recognised and accepted by SL linguists and members of the Deaf communities. In this talk I will outline the various representation methodologies available for signed languages and discuss their pros and cons in the context of MT processing. In order to provide a worked example, I will provide highlights of my PhD thesis work on data-driven MT of SLs using hand-crafted gloss representations.
This talk reports joint work with Marie Candito, Pascal Denis and Djame Seddah on the supervised parsing of French. I will describe and motivate the overall parsing system we have developed in Paris over the last two years, which provides both constituent and some restricted dependency output. The talk will first report some experiments on adapting generative parsers, first designed for English, to French. We will report results on two sources of data, the French Treebank (FTB) and the Modified French Treebank (MFT), with 4 different parsers, showing that a parser from the unlexicalised paradigm is better for modelling both constituency and dependencies of French, whatever the treebank. The second part of the talk will focus on several discriminative models we used for typing dependencies, this time taking advantage of bilexical parameters. I will show that our somewhat crude parsing setup already achieves satisfying results on French by comparison with both statistical and symbolic parsers known to us, on different test sets. I will finally raise some of our current questions on modelling issues for parsing French, discussing several ways of combining generative and discriminative models for parsing.
Online product reviews are becoming increasingly available, and are being used more and more frequently by consumers in order to choose among competing products. Tools that rank competing products in terms of the satisfaction of consumers who have purchased the product before are thus also becoming popular. We tackle the problem of rating (i.e., attributing a numerical score of satisfaction to) consumer reviews based on their textual content. We here focus on multi-facet review rating, i.e., on the case in which the review of a product (e.g., a hotel) must be rated several times, according to several aspects of the product (for a hotel: cleanliness, dining facilities, centrality of location, etc.). We explore several aspects of the problem, with special emphasis on how to generate vectorial representations of the text by means of POS tagging, sentiment analysis, and feature selection for ordinal regression learning. We present the results of experiments conducted on a dataset of more than 15,000 reviews that we have crawled from a popular hotel review site.
Minimum Error Rate Training (MERT) is the most commonly used method for parameter tuning in Statistical Machine Translation. In this talk we analyze and compare three different open-source implementations of MERT: the old and new MERTs in Moses, and ZMert in Joshua. The analysis and comparison will mainly focus on the software engineering side, including their code structure, extensibility and interoperability.
Giza++, an implementation of IBM Model 4, has been the dominant toolkit for word alignment and for training various types of SMT systems due to its high alignment quality. However, IBM Model 4 does not allow efficient parameter estimation procedures, and cannot provide the posterior statistics that can be useful in phrase extraction. Consequently, word alignment and phrase extraction have to be carried out in a pipeline. Encouragingly, over the past two or three years, a trend has emerged whereby Giza++ can be replaced by simple and efficient models. In this talk, we describe two such refined HMM models that consistently achieve competitive performance when evaluated on different language pairs and varied data sizes. More interestingly, these efficient models offer an opportunity to perform word alignment and phrase extraction simultaneously, and the results are shown to be promising.
This talk will describe work I will present at CICLing 2009. Given the recent shift in focus of the field of Machine Translation (MT), it is becoming apparent that the incorporation of syntax is the way forward for the current state of the art in MT. Parallel treebanks are a relatively recent innovation and appear to be ideal candidates for MT training material. In this talk, I will describe how we exploit a large automatically built parallel treebank to extract a set of linguistically motivated phrase pairs. We show that adding these phrase pairs to the translation model of a baseline phrase-based statistical MT (PB-SMT) system leads to significant improvements in translation quality. I will then describe further experiments on incorporating parallel treebank information within the PB-SMT framework, such as word alignments. Finally, I will discuss the potential of parallel treebanks in other paradigms of MT.
In recent years, the area of sentiment analysis in text has become a focus of attention in the fields of theoretical and computational linguistics, investigating the production and processing of affective contours in text, the textual corollary of emotional prosody in speech. Research has drawn on text from many domains, ranging from online film reviews to newspaper editorials to Dow Jones News Service headlines, and much has focused on supposedly unequivocal, cross-domain markers of affect polarity in text, such as the terms "good" and "bad". This talk takes a step back from the applications of classifying text according to emotional criteria and aims to look at what we mean by emotion, how emotion is represented in available lexical resources, and how lexicalised emotion is distributed in English in general and in sub- or special languages of English. This discussion does not solve the problem of how to identify sentiment in text, but it goes some way to establishing what we might be looking for and whether we might expect to find it.
Many tasks in NLP are mappings to output spaces representing complex structures. Yet, machine learning methods cannot learn mappings to structures with the level of complexity found in NLP (e.g. complete dependency graphs, or full translations). The consequence is that the larger tasks are partially solved at more local levels by machine learning classifiers, and in a second stage at the global level by a search or inference method that finds the most likely output structure. In this presentation I present constraint satisfaction inference, a theory-neutral inference method that accepts heterogeneous partial solutions to a structured prediction problem, using weighted constraint satisfaction as a means to quantify the success of global solutions. The approach is exemplified on two NLP tasks: dependency parsing and machine translation, using memory-based learning for the local classifiers. This presentation is based on collaborative work with Sander Canisius.
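The inference stage described above can be sketched in miniature: local classifiers propose candidate values per variable, and weighted soft constraints score complete assignments. This is a brute-force toy over a tiny space (real weighted constraint satisfaction solvers search far more cleverly), and the toy variables, weights and single-root constraint are illustrative assumptions, not the talk's actual formulation.

```python
from itertools import product

def csp_inference(candidates, constraints):
    """Return the complete assignment maximising the total weight of
    satisfied soft constraints. `candidates` maps each variable to its
    candidate local values (e.g. per-token dependency heads);
    `constraints` is a list of (weight, predicate) pairs."""
    variables = sorted(candidates)
    best, best_score = None, float("-inf")
    for values in product(*(candidates[v] for v in variables)):
        assignment = dict(zip(variables, values))
        score = sum(w for w, ok in constraints if ok(assignment))
        if score > best_score:
            best, best_score = assignment, score
    return best

# toy dependency example: two tokens each choose a head (0 = root);
# weights mimic local classifier confidences, plus a global constraint
candidates = {"t1": [0, 2], "t2": [0, 1]}
constraints = [
    (0.6, lambda a: a["t1"] == 2),                      # local preference
    (0.7, lambda a: a["t2"] == 1),                      # local preference
    (1.0, lambda a: [a["t1"], a["t2"]].count(0) == 1),  # exactly one root
]
print(csp_inference(candidates, constraints))  # -> {'t1': 0, 't2': 1}
```

Note how the global single-root constraint overrides one local preference: that interaction between heterogeneous partial solutions and global well-formedness is the point of the method.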
Language is not only used to transfer information about facts but also about beliefs, judgements and evaluations. Words, phrases and syntactic constructions expressing these are called subjective elements (Wiebe et al. 2004). In text classification, for example, we can use these subjective elements in order to classify sentences or texts, such as reviews or blogs, according to their overall sentiment. Most of the work previously done in this field has focused on simple bag-of-words approaches (Pang, Lee & Vaithyanathan 2002), but attempts have been made to include structurally more complex features such as parse trees (Matsumoto et al. 2005).
We are interested in two research questions: What kind of parse trees are useful in practice for this particular task? What is the best way to map the rich structure of a parse tree into a vector of feature/value pairs? This talk will give a brief introduction to subjectivity, polarity classification and the exploitation of syntactic information in this task, with a focus on tree kernels (Collins & Duffy 2002, Vishwanathan & Smola 2002).
Please feel free to give comments and ask questions during the talk! Since this is work in progress we're hoping for a lively discussion.
In this talk I will describe an open source tool for automatic induction of transfer rules. Transfer rule induction is carried out on pairs of dependency structures and their node alignment to produce all rules consistent with the node alignment. I will describe our definition for a consistent transfer rule, as well as an efficient algorithm for rule induction, and give details of how the tool can be used to train a Transfer-Based SMT system.
Despite a wealth of literature on statistical translation, many tradeoffs in the design of large-scale systems are not well understood. I introduce new theoretical and empirical techniques to identify the common elements and isolate the differences of competing systems, and assess the performance of individual components. First, I present a theoretical framework for search space analysis based on semiring parsing, using it to derive some surprising conclusions about phrase-based models and simplify the construction of new models. Next, I describe an empirical study on induction errors, which occur when good translations are absent from model search spaces. The results show that a common pruning heuristic drastically increases induction error, and prove that the high-probability regions of phrase-based and hierarchical model search spaces are nearly identical. Finally, I will outline new efforts to capitalize on these discoveries. This talk represents joint work with Michael Auli, Hieu Hoang, and Philipp Koehn.
Accurate text classification systems are successfully constructed using supervised learning and a large collection of labelled examples. There are, however, many real world domains where the task of labelling data is difficult or expensive. In such situations, labelled examples are scarce and supervised learning cannot be successfully applied.
This talk will present active learning methods that are capable of producing high-accuracy classifiers from small amounts of training data. I will describe history-based active learning and adaptive pre-filtering, both of which utilise historical information to help select and label only the most informative examples to form the training data. In so doing, significant reductions in the number of labelled examples required to induce an accurate classifier can be achieved. Reducing the effort in constructing an accurate classifier allows for machine learning solutions to be applied to many more tasks and domains.
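The core selection step in pool-based active learning can be sketched as follows. This is a generic uncertainty-sampling sketch, not the history-based or adaptive pre-filtering methods of the talk: examples whose current classifier score is closest to the decision boundary are sent for labelling first.

```python
def select_most_informative(pool, predict_proba, batch_size=2):
    """Rank unlabelled examples by how close the classifier's
    positive-class probability is to 0.5 (maximum uncertainty) and
    return the most uncertain ones for human labelling."""
    ranked = sorted(pool, key=lambda x: abs(predict_proba(x) - 0.5))
    return ranked[:batch_size]

# hypothetical scores from a current classifier over an unlabelled pool
scores = {"d1": 0.95, "d2": 0.52, "d3": 0.10, "d4": 0.47}
print(select_most_informative(scores, scores.get))  # -> ['d2', 'd4']
```

Confidently classified examples (d1, d3) are skipped; labelling effort is spent only where the classifier is genuinely unsure, which is how the large reductions in labelled-data requirements arise.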
GATE, the General Architecture for Text Engineering, is a framework and graphical development environment which enables users to develop and deploy language engineering components and resources in a robust fashion. The GATE architecture enables the user not only to develop a number of successful applications for various language processing tasks (such as Information Extraction), but also to build and annotate corpora and carry out evaluations of the applications generated. In addition, GATE integrates existing mature open-source technologies for Information Retrieval, Machine Learning, Ontology Support and Parsing, amongst others, into its framework. Finally, GATE can be used to develop applications and resources in multiple languages, based on its thorough Unicode support.
To date, data-driven MT has mainly considered and explored static data. The linear combination of weighted feature functions has become the principal methodology for the integration of heterogeneous feature systems. Despite this, it is still unclear how dynamically generated user data can be obtained, represented and interpreted in such a hybrid setting. The talk reviews past developments in data-driven MT and process-oriented translation research and proposes a methodology to harvest and investigate activity data of readers, translators and post-editors. We aim at elaborating a fine-grained user model which allows for a suitable interpretation and exploitation of the dynamic user data.
A large-scale grammar is not only one of the primary resources for many NLP applications but is also necessary to understand, define and represent the linguistic phenomena of the language in question in more formal ways. We aim at building a large-scale grammar for Turkish in the Lexical Functional Grammar (LFG) formalism, paying close attention to computational aspects, such as the percentage of successful parses, without leaving aside interesting linguistic problems to be solved, such as the representation of valency changes in verbs, i.e., causatives and passives. The grammar is being implemented using segments of complex words as parsing units in order to incorporate, in a manageable way, the complex morphology and the syntactic relations mediated by morphological units, and to handle lexical representations of very productive derivations. This talk focuses on the major steps of the work accomplished, highlighting the approach used and its consequences.
A long-standing goal of computational linguistics is to build a system for answering natural language queries. An ideal QA system is able to determine whether the answer to a particular question can be inferred from another piece of text. For example, the system should recognize that the answer to a question such as Did Hillary win the nomination? is given by a sentence such as Hillary failed to win the nomination. None of the current search engines is capable of delivering a simple NO answer in such cases. But change is coming. Much progress has been made in computing textual inferences in recent years, much of it inspired by work presented at the Pascal RTE (Recognizing Textual Entailment) workshops.
Local textual inference is in many respects a good test bed for computational semantics. It is task-oriented. It abstracts away from particular meaning representations and inference procedures. It allows for systems that make purely linguistic inferences, but it also allows for systems that bring in world knowledge and statistical reasoning. Because shallow statistical approaches have plateaued, there is a clear need for deeper processing. Success in this domain might even pay off in real money, in addition to academic laurels, because it will enable search engines to evolve beyond keyword queries.
The system I will describe in this talk is the Bridge system (a bridge from language to logic) developed at the Palo Alto Research Center by the Natural Language Theory and Technology group. I will first give a brief overview of the system and then focus on the way textual inferences are computed.
The inference algorithm operates on two AKRs (Abstract Knowledge Representations), one for a passage, the other for the question. It aligns the terms in the two representations, computes specificity relations between the aligned terms, and removes query facts that are entailed by the passage facts. If all the query facts are eliminated, the system will respond YES. If a conflict is detected, the system will respond NO. If some query facts remain at the end, the response is UNKNOWN. In some rare cases (John didn't wait to speak. Did John speak?) the response will be AMBIGUOUS, indicating that on one reading the answer is YES and on another it is NO.
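The YES/NO/UNKNOWN decision logic described above can be sketched in a heavily simplified form. The real Bridge system aligns AKR terms and computes specificity relations; here each "fact" is just a (predicate, polarity) pair, and the Hillary example facts are hand-written stand-ins for what the system would derive from the sentences.

```python
def entailment_response(passage_facts, query_facts):
    """Remove query facts entailed by the passage; a polarity clash
    yields NO, leftover query facts yield UNKNOWN, none left yields YES.
    Facts are (predicate_string, polarity_bool) pairs."""
    remaining = []
    for pred, pol in query_facts:
        if (pred, pol) in passage_facts:
            continue                       # entailed: eliminate this query fact
        if (pred, not pol) in passage_facts:
            return "NO"                    # conflict detected
        remaining.append((pred, pol))
    return "YES" if not remaining else "UNKNOWN"

# "Hillary failed to win the nomination" lexically entails the negation
# of "win(Hillary, nomination)", so the query conflicts with the passage
passage = {("fail(Hillary, win-nomination)", True),
           ("win(Hillary, nomination)", False)}
query = {("win(Hillary, nomination)", True)}
print(entailment_response(passage, query))  # -> NO
```

The AMBIGUOUS case would require tracking multiple readings per sentence, which this flat-fact sketch deliberately omits.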
The linguistic phenomena illustrated include hyponymy (hop => move), converse relations (win vs. lose, buy vs. sell), lexical entailments (kill => die), relations between simple predicates and their embedded complements (forget that S => S, forget to S => not S), and similar relations involving phrasal constructions (take the trouble to S => S, waste an opportunity to S => not S).
For a practical QA system, logical entailment and presupposition are important notions, but they are not sufficient to characterize all the inferences that a human reader will make. The Pascal RTE data contains many examples where there is no logical entailment between the sentences but the annotators have indicated otherwise. In ordinary conversation we make such "errors" all the time. John is very happy that he had a chance to read your paper does not actually entail John read your paper, but in the absence of any contrary evidence, the hearer would certainly conclude that John had read the paper. Characterizing such "invited inferences" is an interesting challenge for semantic and pragmatic theory, and essential for practical applications.
With the abundance of semi-structured textual data on the Internet in recent years, there is an ever-increasing demand to model public sentiment towards given topics. This talk covers DCU's system for mining sentiment based on the extraction of syntactic, lexical and surface features in the context of information retrieval. Our system is evaluated as part of the Text REtrieval Conference (TREC) Blog Track, where opinion finding and polarised opinion finding tasks are set and evaluated for the community.
I will introduce our work on incorporating Lexical Syntax into Phrase-based SMT systems using two approaches. The first is based on n-gram language modelling, while the second is based on incremental dependency language modelling. I will also briefly report on a recent trend of modelling SMT as a direct translation model.
Word Sense Disambiguation (WSD) constitutes an intermediate task in natural language applications and serves to ameliorate their performance. However, different applications have varying disambiguation needs which should have an impact on the choice of the method and of the sense inventory used. The problems posed by the exploitation of predefined semantic resources and the inappropriateness of application-independent WSD methods in some contexts have fostered the development of unsupervised, often application-oriented, sense induction and WSD methods.
We present here such a data-driven method of sense induction that operates in a bilingual context. The senses of an ambiguous source language word are identified by combining distributional and translation information coming from a parallel training corpus. This information serves to cluster the translation equivalents of the ambiguous word according to their semantic similarity. The created clusters are then projected onto the source word and serve to determine its senses, to distinguish them according to their status, and to identify their relations. The proposed method is unsupervised and fully data-driven; it is thus language-independent and enables the elaboration of sense inventories relevant to the domains represented in the corpus.
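The clustering step can be sketched as follows. This is a deliberately simple stand-in for the method described (greedy single-link clustering over cosine similarity); the toy context vectors for the English translations of French "voler" are invented for illustration, and the real system combines richer distributional and translation information.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse context vectors (dict: feature -> count)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def cluster_translations(vectors, threshold=0.5):
    """Greedy single-link clustering of the translation equivalents of an
    ambiguous source word; each resulting cluster stands for one induced sense."""
    clusters = []
    for word, vec in vectors.items():
        for cl in clusters:
            if any(cosine(vec, vectors[w]) >= threshold for w in cl):
                cl.append(word)
                break
        else:
            clusters.append([word])
    return clusters

# hypothetical context vectors for English translations of French "voler"
vectors = {
    "steal": {"bank": 3, "money": 4, "thief": 2},
    "rob":   {"bank": 4, "money": 3, "victim": 1},
    "fly":   {"plane": 5, "sky": 3, "bird": 2},
}
print(cluster_translations(vectors))  # -> [['steal', 'rob'], ['fly']]
```

Projected back onto the source word, the two clusters correspond to the two induced senses of "voler" (to steal vs. to fly), and the cluster members double as the lexical selection candidates for each sense.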
The inventory built in this way is exploited by a WSD method, in order to assign a sense to new instances of ambiguous words in context, and by a lexical selection method for suggesting their most adequate translation. We show, as well, how this sense inventory allows for a semantics-sensitive evaluation of the disambiguation and lexical selection results.
Idiom processing in Machine Translation (MT) is a difficult task, as simply storing idioms in the dictionary does not suffice for accurate translation. We conduct experiments within the German-to-English EBMT system METIS-II, an innovative approach in that it makes use of a TL corpus instead of parallel corpora. Within METIS-II we focus on pattern matching/identification of idioms both with and without gaps. Firstly, we discuss the semantic and syntactic properties of idioms and, secondly, we examine idioms' translation equivalence on the basis of these properties. Focusing on idiom syntax, various syntactic patterns are furnished and processed by means of the German topological field model. The evaluation, reaching more than 80% precision and recall, shows that idiom processing in MT is not an unsolvable problem.
Last update: 1st October 2010