NCLT Seminar Series 2005/2006
The (final!) NCLT Seminar will be presented by Yanjun Ma on Wednesday, 23rd August, at 4pm in Room L2.21
Thank you to everyone who participated in this year's seminar series!
The schedule of presenters for the 2005/2006 series is as follows:
Feedback is often cited as one of the benefits of Computer Assisted Language Learning (CALL). The idea that learners can receive immediate feedback on their errors in a non-threatening environment, as often as they like, seems to make good pedagogical sense. However, the fact is that learners generally do not avail of feedback, and it is an under-utilised component of CALL. Although there are many reasons why this is the case, one factor to consider is the quality of the feedback provided. On the surface it might seem easy to provide feedback, but the reality is quite different. In order to provide good-quality, appropriate and personalised feedback, one must draw on the disciplines of Computational Linguistics (CL), Second Language Acquisition (SLA), Language Pedagogy, Artificial Intelligence (AI), Software Engineering (SE) and Human Computer Interaction (HCI). This talk reviews feedback in CALL and looks at how these disciplines can help improve the quality of the feedback provided.
Traditionally, rich, constraint-based grammatical resources have been hand-coded. Scaling such resources beyond toy fragments to unrestricted, real text is knowledge-intensive, time-consuming and expensive. We have developed a method for treebank-based, wide-coverage, deep, constraint-based grammar acquisition. The resulting PCFG-based LFG approximations parse the Penn-II treebank with wider coverage (measured in terms of complete spanning parse) and parsing results comparable to or better than those achieved by the best hand-crafted grammars, with, we believe, considerably less grammar development effort. In this talk, I will briefly describe our approach and outline some recent experiments to compare our automatically induced English parsing resources against the hand-crafted XLE LFG parser and RASP dependency parser.
In this show & tell presentation I want to give you an overview of the existing tools and resources I have reused in my Ph.D. work and of the tools and resources I have developed for plurilingual learning of French, Italian and Spanish. Not all the tools and resources are restricted to French, Italian and Spanish, so my presentation may also be interesting for those of you who are working on other languages.
While experimental results show that the Tree-DOP and Tree-DOT models achieve excellent parse and translation accuracy, the power of these models is nevertheless limited by the corpus representations they assume. It is known that these representations, which reflect surface syntactic phenomena only, do not adequately describe many aspects of human language. The Lexical-Functional Grammar (LFG) formalism, on the other hand, is known to be beyond context-free. There is clear motivation for the use of parsing and translation models which employ this type of linguistic analysis, hence the LFG-DOP and LFG-DOT models.
Parse-annotated corpora (Treebanks) are crucial in the development of machine-learning and statistics-based parsing systems. Such systems induce the grammatical information used for parsing from treebanks like the Penn-II Treebank.
Discriminative techniques have been successfully applied to flat classification tasks such as text categorization or word-sense disambiguation. It has recently been shown that very good results can also be obtained by using such techniques to solve more structured NLP tasks such as parsing and machine translation: discriminative techniques now compete with or exceed generative techniques.
In this presentation, I will introduce some of the linguistic motivation behind the LTAG formalism: how the need to lexicalise CFGs leads to the TAG formalism, and why LTAGs are well suited to the treatment of long-distance dependencies. I will then introduce the basics of a compositional semantic model based on TAG grammars, and I will try to explain why it is necessary to deal with both the constituency structure (the so-called derived tree in TAG) and the derivation structure (the derivation tree) to obtain a proper predicate-argument structure.
This talk investigates the semantic/pragmatic interface in the treatment of presupposition and assertion in Discourse Representation Theory (DRT). It is argued that, as a formal semantic theory, DRT needs pragmatic enrichment in its treatment of presupposition. The pragmatic enhancement is achieved through defining presupposition in terms of beliefs, and re-interpreting the Gricean maxims in terms of ‘belief constraints’ on the making of presupposition and assertion. Additionally, presupposition is interpreted as a property of the communicative behaviour of the speaker and hearer. A series of checks is devised, which differentiate speaker generation of presupposition and assertion from hearer recognition. Further, recent advances in DRT are expanded in order to account for varying strengths of beliefs (weak and strong belief) as well as dialogue acts triggered by assertions. A new emerging Discourse Representation Structure (DRS) creates a compatible representation of speaker and hearer cognitive states. The linguistic content (presupposition and assertion) is linked with the beliefs and intentions of agents in dialogue in order to move DRT towards being ‘more pragmatic’.
The core component of my approach to detecting ungrammatical sentences is a model that can predict the probability of the most likely parse of a grammatical sentence without using all the information available to the probabilistic parser. Because data-driven, probabilistic parsers are extremely robust, they fail to reject ungrammatical input. The output of such a model can provide a threshold on the parse probability that allows us to distinguish grammatical from ungrammatical sentences. In this talk I will present results for a range of models I have studied so far. Features like sentence length, the number of nodes in the parse tree and character trigrams prove to be very useful for getting a prediction within the right order of magnitude, and show that lexical information is important. However, ordinary probabilistic language models that focus on token frequencies do not perform well unless combined with the former models. I also present a model that uses the probabilities of the terminal rules of a PCFG, although the parser to be approximated is history-based. Combining this model with the previous models again improves results.
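The thresholding idea can be sketched as follows; the feature weights, the margin, and the function names are invented for illustration and are not taken from the talk:

```python
# Toy sketch: predict the expected log parse probability of a grammatical
# sentence from shallow features, then flag sentences whose actual parser
# score falls far below that prediction. All weights here are made up.

def predict_logprob(sent_length, n_nodes, weights=(-2.5, -1.0, 3.0)):
    """Linear predictor of the log probability of the most likely parse."""
    w_len, w_nodes, bias = weights
    return w_len * sent_length + w_nodes * n_nodes + bias

def looks_ungrammatical(actual_logprob, sent_length, n_nodes, margin=10.0):
    """A sentence is suspect if its parse score falls more than `margin`
    below what the model predicts for a grammatical sentence of its size."""
    return actual_logprob < predict_logprob(sent_length, n_nodes) - margin
```

In practice the weights would be fitted on parsed grammatical text, and further feature groups (character trigrams, language-model scores, PCFG terminal-rule probabilities) would add further terms to the predictor.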
The research described in this talk is part of a project whose aim is to induce multilingual probabilistic Lexical-Functional Grammar (LFG) resources from treebanks. For Spanish we use the Cast3LB treebank (Civit and Martí, 2004).
The talk aims to situate the Arabic language within the framework of previous seminal work that adapted a tagged lexicon for French to Arabic, and of future work aiming to induce Treebank-based LFG (Lexical Functional Grammar) resources for Arabic. The first part of the talk will provide an outline of the main characteristics of the Arabic morphosyntactic system. The second part will account for the previous work done under the scope of the Lexical Markup Framework, with some illustrative samples. Finally, the presentation will highlight some of the main features that should be considered when inducing Treebank-based LFG resources for Arabic.
Part 1: I will talk about the GALE ("Global Autonomous Language Exploitation") project, which is the largest DARPA-funded research project. GALE has many areas, including Speech Recognition, Machine Translation and Information Extraction. I will talk about the different tasks in GALE.
In German there are nine two-way prepositions which can govern either the accusative or the dative. These two-way prepositions can combine with a verb stem to form so-called particle verbs (also called separable prefix verbs).
Authoring learning content in XML-based e-learning specification languages like IMS-LD (Koper & Tattersall, 2005) is a tedious and error-prone task. A number of graphical editors have been designed to support this step in setting up an e-learning system. This paper describes DBAT-LD (Dialogue-Based Authoring of Learning Designs), which is a chat bot (natural language dialogue system) that interacts with authors of units of learning and secures the content of the dialogue in an XML-based target format of choice. DBAT-LD can be geared towards different specifications, thereby contributing to interoperability in e-learning. In the case of IMS-LD, DBAT-LD elicits and reconstructs an account of learning activities along with the pedagogical support activities and delivers a Level A description of a unit of learning.
Cross-lingual information retrieval (CLIR) allows users to query mixed-language collections or to probe for documents written in an unfamiliar language. There are several approaches to implementation of CLIR. Perhaps the most popular is to use a dictionary to translate the queries into the target language and then use mono-lingual retrieval. As with other CLIR language pairs, in Chinese-English CLIR the accuracy of dictionary-based query translation is limited by two major factors: the presence of out-of-vocabulary (OOV) words and translation ambiguity.
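The dictionary-based approach, and the way both limiting factors surface in it, can be illustrated with a minimal sketch; the tiny dictionary and the `translate_query` helper are invented here, not taken from any real CLIR system:

```python
# Toy dictionary-based query translation (Chinese -> English).
# Ambiguity shows up as multiple translations per term; OOV words
# are simply passed through untranslated as a fallback.

BILINGUAL_DICT = {
    "银行": ["bank", "shore"],        # ambiguous entry: all senses kept
    "利率": ["interest rate"],
}

def translate_query(terms):
    """Translate each query term; return (translated terms, OOV terms)."""
    translated, oov = [], []
    for term in terms:
        if term in BILINGUAL_DICT:
            translated.extend(BILINGUAL_DICT[term])
        else:
            oov.append(term)
            translated.append(term)   # pass-through for OOV words
    return translated, oov
```

The translated term list would then be handed to a standard monolingual retrieval engine; disambiguation and OOV handling (e.g. transliteration or web mining) are where the real research effort lies.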
Data-Oriented Parsing (DOP) is a hybrid, language-independent, parsing formalism. Combining rules, statistics and linguistics, all parsing knowledge is learned from existing texts. However, the expressive power of the DOP model is limited by the corpus representations it assumes. DOP makes use of context-free phrase-structure trees which characterise phrasal and sentential syntax, but cannot reflect linguistic phenomena at deeper levels. The integration of Lexical Functional Grammar (LFG), which is known to be beyond context-free, enables a more linguistically detailed description of language.
A pervasive type of task in NLP is one in which a sequence at one level is mapped to a sequence at another: letters to phonemes, sequences of words to part-of-speech tags, constituent markers and labels, entity markers, etc. There is no data-driven (machine learning or statistical) model that can learn to perform the full sequence-to-sequence mapping in one go; rather, most approaches decompose the task into local decisions and then post-process these local decisions into a global sequence. Probabilistic approaches are often considered the ideal method, since they produce output token distributions at their local decisions, and strong search algorithms such as Viterbi search are available to search through these lattices of output token distributions to find a likely global output sequence.
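Viterbi search over such a lattice can be sketched as follows, assuming per-position log-probability distributions over output tags and a transition score table; the toy tagset in the comments is invented:

```python
# Viterbi search: find the highest-scoring global tag sequence given
# local per-position scores and pairwise transition scores (all in log space).

import math

def viterbi(obs_scores, trans_scores, tags):
    """obs_scores: list (one dict per position) of {tag: local log score};
    trans_scores: {(prev_tag, tag): log transition score}."""
    # best[i][t] = score of the best sequence ending in tag t at position i
    best = [{t: obs_scores[0].get(t, -math.inf) for t in tags}]
    back = []
    for i in range(1, len(obs_scores)):
        col, ptr = {}, {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p] + trans_scores.get((p, t), -math.inf))
                 for p in tags),
                key=lambda x: x[1])
            col[t] = score + obs_scores[i].get(t, -math.inf)
            ptr[t] = prev
        best.append(col)
        back.append(ptr)
    # follow back-pointers from the best final tag
    last = max(best[-1], key=best[-1].get)
    seq = [last]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))
```

For example, with a determiner/noun tagset where the first position strongly prefers D and the transition D→N is likely, the decoder recovers the sequence ["D", "N"] even if no single local decision is certain.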
It is a widely known problem that automatic evaluation metrics for machine translation like BLEU or NIST, based on string matching, are insensitive to admissible lexical and syntactic differences between the translation and reference. This is the underlying reason for using a number of reference texts, to increase the chances that the translation will match a part of at least one of them. A number of attempts have been made to remedy these shortcomings.
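The insensitivity is easy to see in the core of BLEU, modified (clipped) n-gram precision; this minimal sketch uses an invented example in which a legitimate synonym is scored as an error:

```python
# Clipped n-gram precision of a candidate translation against one reference,
# the building block of BLEU. Pure string matching: synonyms score zero.

from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """candidate, reference: token lists; returns clipped n-gram precision."""
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))
```

Here `ngram_precision("the autumn was mild".split(), "the fall was mild".split())` is 0.75, even though "autumn" is an admissible alternative to "fall"; using multiple reference texts, as noted above, is the standard workaround.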
This talk will present three series of works that share a common goal: to develop large-coverage parsers that are both efficient and linguistically relevant, without the help of hand-crafted resources. Indeed, I will focus on French, for which there are currently no existing resources comparable to the PTB (both in terms of size and richness), and for which syntactic lexicons are still an active area of research.
The IXA Group (ixa.si.ehu.es) was created in 1988 with the aim of promoting the modernization of Basque by means of developing basic computational resources for it.
Previous work (Groves & Way, 2005) has demonstrated that while a Marker-based EBMT system is capable of outperforming a phrase-based SMT system trained on reasonably large data sets, a hybrid 'example-based' SMT system, incorporating marker chunks and SMT sub-sentential alignments, is capable of outperforming both baseline translation models for French-English translation.
In this presentation we will show that similar gains are to be had from constructing a hybrid ‘statistical EBMT’ system capable of outperforming the baseline system of (Way & Gough, 2005). Using the Europarl (Koehn, 2005) training and test sets we show that this time around, although all ‘hybrid’ variants of the EBMT system fall short of the quality achieved by the baseline PBSMT system, merging elements of the marker-based and SMT data, as in (Groves & Way, 2005), to create a hybrid ‘example-based SMT’ system outperforms the baseline SMT and EBMT systems from which it is derived. Furthermore, we provide further evidence in favour of hybrid systems by adding an SMT target language model to all EBMT system variants and demonstrate that this too has a positive effect on translation quality.
Traditionally, transfer rules for machine translation have been hand-coded. This hand-coding takes time and uses valuable resources. In this talk I present a method for automatically inducing such transfer rules from aligned bilingual corpora. The sentence-aligned corpora are annotated with LFG f-structure information. I discuss a proposed method for automatically extracting the transfer rules by making generalisations about the structure of source and target language sentences.
Towards Parsing Unrestricted Text into PropBank Predicate-Argument Structures View slides
I explore a novel approach to the identification of semantic roles (such as agent, patient or instrument) in unrestricted text. Current approaches to the identification of semantic relationships (predicate-argument relations) make use of machine-learning techniques (such as Support Vector Machines, Random Forests, and others) applied to the syntactic tree structures generated by a natural language parser. Such parsers are commonly trained on corpora such as the Penn Treebank. In this project, the Penn Treebank data used to train a history-based generative lexicalized parser is augmented with semantic relationship data derived from PropBank. In this way, the parser itself performs the labelling of semantic roles and relations, constituting an integrated approach to semantic parsing, as opposed to the ‘pipeline’ approach employed by current techniques.
Re-ranking Documents Segments to Improve Access to Relevant Content in Information Retrieval View slides
This project processes a ranked list of search results in the following manner. It splits each document in the list into sub-sections based on changes in subject matter within the document, using a method known as TextTiling. It then infers links between the generated sub-sections based on their contents. Finally, it re-ranks the sub-sections based on the links inferred between them, in a manner similar to Google's PageRank. While techniques for re-ranking the documents returned by an Information Retrieval system have been implemented, there is no existing method for re-ranking sub-sections of documents returned by an Information Retrieval system. It is hoped that doing this greatly improves access to relevant content. An evaluation of the relevance of the system implemented in this project yields promising results, and further work is proposed to improve the relevance achieved by the system.
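The re-ranking step can be sketched as a standard PageRank iteration over the inferred links; the link graph, damping factor and iteration count below are invented for illustration, and dangling sub-sections simply lose their outgoing mass in this simplified version:

```python
# PageRank-style scoring over links inferred between document sub-sections.
# Nodes are sub-section indices 0..n-1; links maps a node to the nodes it
# points to. Sub-sections with more incoming links end up ranked higher.

def pagerank(links, n, damping=0.85, iters=50):
    """Return a relevance score per node after `iters` power iterations."""
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for src, targets in links.items():
            if targets:
                share = damping * scores[src] / len(targets)
                for t in targets:
                    new[t] += share
        scores = new
    return scores
```

For example, with three sub-sections where 0 and 1 link to each other and 2 links only outward, sub-section 0 (two incoming links) is ranked above 1, and 2 (no incoming links) comes last.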
The Development of a Phonological Awareness Test to Screen for Dyslexia View slides
This project involves the development of a computer based test which will discriminate between preschool children in terms of their phonological discrimination ability. A low phonological discrimination ability is strongly associated with dyslexia and this test, known as ‘Foomy’ may be developed to the stage where it can identify likely dyslexics at a very early age and so enable adults to take remedial action. This paper details the development of the test from a technical viewpoint as well as from a psycholinguistic one. It details the piloting of the test and analyses the results attained from administering it to 104 children in the 3-5 year age group. The project succeeded in producing a test which discriminates between the children and it delivered results which are normally distributed. Its usefulness in detecting children with dyslexia has yet to be explored.
Morphological Analysis and Generation of Spanish using Xerox Finite-State Tools View slides
Morphological analysis is an integral element in many natural language processing tasks. Finite-state techniques have been successfully applied to computational morphology and are considered the leading method for doing so. This talk will describe an online morphological analyser and generator for Spanish implemented using the Xerox Finite-State Tools. The core of the system is a lexicon, compiled as a finite-state transducer, which handles all regular inflectional and derivational forms of Spanish. Irregularities are dealt with using replace rules, encoded as regular expressions, which make necessary alterations to surface forms. Both these elements are composed together to create a single two-level finite-state transducer. The system achieves approximately 85% coverage on unrestricted Spanish text based on evaluation performed using the 1,000 most frequent word forms in a general Spanish corpus of 100,000 words.
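The analyse/generate duality of such a transducer can be illustrated with a toy Python lookup (not xfst); the sample Spanish entries and tag strings are invented, and irregulars and replace rules are omitted:

```python
# Toy stand-in for a morphological transducer: a lexicon of verb stems plus
# inflectional endings. analyse() maps surface form -> lexical string(s);
# generate() runs the same mapping in the opposite direction.

LEXICON = {"habl": "hablar+Verb", "cant": "cantar+Verb"}
ENDINGS = {"o": "+PresInd+1P+Sg", "as": "+PresInd+2P+Sg", "a": "+PresInd+3P+Sg"}

def analyse(surface):
    """Return all lexical analyses of a surface form."""
    return [LEXICON[stem] + tags
            for ending, tags in ENDINGS.items()
            if surface.endswith(ending)
            for stem in [surface[: -len(ending)]]
            if stem in LEXICON]

def generate(lexical):
    """Return the surface form for a lexical string, or None."""
    for stem, lex in LEXICON.items():
        for ending, tags in ENDINGS.items():
            if lexical == lex + tags:
                return stem + ending
    return None
```

In the real system both directions come for free from a single finite-state transducer, with replace rules composed on top to handle irregular surface alternations.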
Mary Hearne: Disambiguation Strategies for Data-Oriented Translation View slides
Declan Groves: View slides
Bart Mellebeek: A Syntactic Skeleton for Statistical Machine Translation View slides
Sign languages are the first and preferred languages of the Deaf Community worldwide. As with other minority languages, they are often poorly resourced and in many cases lack political and social recognition. As with speakers of minority languages, Deaf people are often required to access documentation or communicate in a language that is not natural to them. In an attempt to alleviate this problem I am developing an example-based machine translation system to allow Deaf people to access information in the language of their choice. In this presentation, I will give an overview of my previous system and the issues that arose from its development, I will then discuss my current work in terms of system and corpus development. Finally, I will conclude by addressing the problems of using traditional machine translation evaluation metrics for sign languages.
Due to limited budgets and an ever-diminishing time-frame for the production of subtitles for movies released in cinema and DVD, there is a compelling case for a technology-based translation solution for subtitles (O’Hagan, 2003; Carroll, 2004; Gambier, 2005). Our research focuses on an EBMT tool that produces fully automated translations, which in turn can be edited if required. We will seed the EBMT system with a corpus consisting of existing human translations from DVD to automatically produce high quality subtitles for audio-visual content. To our knowledge this is the first time that any EBMT approach has been used with DVD subtitle translation. Schäler et al. (2003) propose that “… the time is ripe for the transformation of EBMT into demonstrators, and eventually viable products”. We attempt to answer their call with an EBMT approach to the translation of subtitles.
Previous work (McCarthy 2004, Cahill 2004, etc.) has shown that an automatic treebank-based annotation algorithm can be used efficiently to acquire wide-coverage, deep, constraint-based grammars such as LFG grammar approximations for English. In this talk the methodology is extended to the task of Chinese LFG acquisition. Chinese differs drastically from English, which causes some difficulties for the current annotation method. The first part of the talk will present some main characteristics of Chinese grammar and the problems they pose within the LFG framework. Then preliminary results of the f-structure annotation of the Penn Chinese Treebank will be presented and analysed. Finally, some alternative methods to improve the results are proposed.
Andy Way: EBMT of the Basque language View slides
Basque is both a minority and a highly inflected language with free order of sentence constituents. Machine Translation of Basque is thus both a real need and a test bed for MT techniques. In this paper, we present a modular Data-Driven MT system which includes different chunkers as well as chunk aligners which can deal with the free order of sentence constituents of Basque. We conducted Basque to English translation experiments, evaluated on a large corpus (270,000 sentence pairs). The experimental results show that our system significantly outperforms state-of-the-art approaches according to several common automatic evaluation metrics.
Karolina Owczarzak: Wrapper Syntax for Example-Based Machine Translation View slides
TransBooster is a wrapper technology designed to improve the performance of wide-coverage machine translation systems. Using linguistically motivated syntactic information, it automatically decomposes source language sentences into shorter and syntactically simpler chunks, and recomposes their translation to form target language sentences. This generally improves both the word order and lexical selection of the translation. To date, TransBooster has been successfully applied to rule-based MT, statistical MT, and multi-engine MT. This paper presents the application of TransBooster to Example-Based Machine Translation. In an experiment conducted on test sets extracted from Europarl and the Penn II Treebank, we show that our method can raise the BLEU score by up to 3.8% relative to the EBMT baseline. We also conduct a manual evaluation, showing that TransBooster-enhanced EBMT produces better output in terms of fluency and accuracy.
Bart Mellebeek: Multi-Engine Machine Translation by Recursive Sentence Decomposition View slides
In this talk, we present a novel approach to combine the outputs of multiple MT engines into a consensus translation. In contrast to previous Multi-Engine Machine Translation (MEMT) techniques, we do not rely on word alignments of output hypotheses, but prepare the input sentence for multi-engine processing. We do this by using a recursive decomposition algorithm that produces simple chunks as input to the MT engines. A consensus translation is produced by combining the best chunk translations, selected through majority voting, a trigram language model score and a confidence score assigned to each MT engine. We report statistically significant relative improvements of up to 9% BLEU score in experiments (English -> Spanish) carried out on an 800-sentence test set extracted from the Penn-II Treebank.
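The chunk-selection step can be sketched as follows; the engine names, confidence scores, and tie-breaking order (majority vote first, then engine confidence, omitting the trigram language model score) are simplifications invented for illustration:

```python
# Pick the consensus translation of one chunk from several MT engines:
# majority voting over the candidate strings, with a per-engine confidence
# score used only to break ties.

from collections import Counter

def select_chunk(candidates, engine_confidence):
    """candidates: {engine: translation}; engine_confidence: {engine: score}."""
    votes = Counter(candidates.values())
    top = max(votes.values())
    tied = [t for t, v in votes.items() if v == top]
    if len(tied) == 1:
        return tied[0]
    # tie-break: keep the translation backed by the most confident engine
    return max(tied, key=lambda t: max(engine_confidence[e]
                                       for e, tr in candidates.items()
                                       if tr == t))
```

The full system applies this selection per chunk produced by the recursive decomposition, then recomposes the winning chunk translations into the output sentence.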
The goal of NLP is to design software which allows computers to perform useful tasks such as translating from one language to another, producing summaries of articles, extracting relevant information from disparate sources, understanding spoken and written requests expressed in the natural language of the user, checking the grammar in a piece of text and helping people to learn a second language. In all of these tasks, parsing, or the analysis of the syntactic structure of sentences, has a helpful role to play. A parser's ability to analyse sentential structure depends on its access to grammatical information, whether in the form of hand-crafted grammar rules, or rules induced automatically from a large collection of previously parsed sentences. The parsing task is made simpler by assuming that the sentences which will be encountered by the parser conform to the structural standards of the language in question. Given the human propensity to err and given the fact that many people are forced to produce sentences in a language other than their native one, the reasonableness of this assumption may be questioned. A parser must be able to produce accurate analyses for sentences which are deviant according to human standards, yet which are routinely interpreted correctly by humans. State-of-the-art probabilistic parsers are generally robust to errors, and they will return analyses for most ungrammatical sentences. However, these robust analyses are not necessarily correct because they do not always reflect the meanings of the ungrammatical sentences. Moreover, the analyses produced by current probabilistic parsers do not explicitly recognize that an error has occurred nor offer potential corrections, something which is vital if the parser is being used to guide grammar checking or computer-aided language learning.
Building on current research on chunking and alignment, we propose a model for chunk alignment between Chinese and English. First, we use a high-quality Chinese-English dictionary to obtain reliable links between Chinese and English words, which we call anchor word alignment. While the recall of this word alignment is relatively low, it achieves very high precision, which is useful for chunk alignment. Next, bilingual chunking is carried out with the marker hypothesis as its theoretical background, yielding a set of marker phrases headed by marker words. However, we cannot guarantee that each Chinese marker phrase can be aligned with one or more English marker phrases. We find that adding a baseNP chunker helps: the boundaries of baseNPs can be used to further divide the marker phrases into smaller fragments, so that more Chinese chunks can be aligned with English chunks, and vice versa. Finally, we align the identified chunks in two steps: non-ambiguous chunks are aligned first using heuristics, and the remaining ambiguous phrases are then ranked with a log-linear model. Experiments show that this model yields many equivalent chunks, but the usefulness of these chunks in a machine translation system remains to be tested in the near future.
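The anchor word alignment step can be sketched as follows, with a toy dictionary invented here; keeping only unambiguous one-to-one links is what trades recall for precision:

```python
# Anchor word alignment: link a Chinese token to an English token only when
# the dictionary licenses exactly one pairing in this sentence, favouring
# precision over recall. The dictionary entries are toy examples.

ZH_EN_DICT = {"中国": {"china"}, "经济": {"economy", "economic"}}

def anchor_align(zh_tokens, en_tokens):
    """Return (zh_index, en_index) links supported by the dictionary,
    keeping only unambiguous one-to-one links."""
    links = []
    for i, zh in enumerate(zh_tokens):
        matches = [j for j, en in enumerate(en_tokens)
                   if en.lower() in ZH_EN_DICT.get(zh, set())]
        if len(matches) == 1:          # drop ambiguous or missing links
            links.append((i, matches[0]))
    return links
```

These high-precision anchor links then constrain the subsequent chunk alignment, since a chunk pair that separates an anchored word pair can be ruled out.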
Last update: 1st October 2010