NCLT Seminar Series 2010/2011
The NCLT seminar series usually takes place on Wednesdays from 4-5 pm in Room
L2.21 (School of Computing).
Jon Dehdari, NCLT, Dublin City University and Ohio State University (USA)
We investigate how morphological features in the form of POS tags impact parsing performance, using Arabic as our test case. The large, fine-grained tagset of the Penn Arabic Treebank (457 tags) is difficult for parsers to handle, ultimately due to data sparsity. However, ad-hoc conflations of treebank tags run the risk of discarding potentially useful parsing information. The main contribution of this paper is to describe methods to automatically detect which feature combinations help parsing. We first identify 15 individual feature sets from the Penn Arabic Treebank tagset. Including or excluding each of these feature sets results in 2^15 = 32,768 possible combinations, so we then apply heuristic techniques to identify the combination achieving the highest parsing performance. Our results show a statistically significant improvement of 1.8% over the baseline provided by the Bies-Bikel Arabic POS mapping.
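Since exhaustively training a parser on all 32,768 combinations is impractical, a heuristic search is needed. The sketch below is purely illustrative (not the authors' actual method): a greedy forward search over stand-in feature sets, with a toy `parsing_score` function standing in for a full train-and-evaluate cycle on the treebank.

```python
FEATURE_SETS = [f"feat{i}" for i in range(15)]  # stand-ins for the 15 treebank feature sets

def parsing_score(active):
    """Toy stand-in for training and evaluating a parser with the given
    tag features: rewards a hidden 'useful' subset, penalises tagset size."""
    useful = {"feat1", "feat3", "feat7"}
    return len(useful & set(active)) - 0.1 * len(active)

def greedy_select(features, score):
    """Forward greedy search: in each round, add the feature set that most
    improves the score; stop when no addition helps. Explores O(n^2)
    candidates instead of all 2^n combinations."""
    selected = []
    best = score(selected)
    while True:
        candidates = [(score(selected + [f]), f)
                      for f in features if f not in selected]
        if not candidates:
            break
        top_score, top_f = max(candidates)
        if top_score <= best:
            break
        best, selected = top_score, selected + [top_f]
    return set(selected), best

chosen, best_score = greedy_select(FEATURE_SETS, parsing_score)
```

With this toy scorer the search recovers exactly the hidden useful subset; a real search would plug parser F-scores into the same loop.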
Robert Smith, CNGL, School of Computing, DCU
With ever-increasing computing power and advances in 3D animation technologies, it is no surprise that 3D avatars for sign language (SL) generation are advancing too. Traditionally these avatars have been driven by somewhat expensive and inflexible motion capture technologies, which is perhaps the reason avatars feature in only a few user interfaces (UIs). SL synthesis is a competing technology that is less costly and more versatile, and may prove to be the answer to the current lack of access for the Deaf in HCI. This paper outlines the current state of the art in SL synthesis for HCI and how we propose to advance it by improving avatar quality and realism, with a view to improving communication and computer interaction for the Deaf community as part of a wider localisation project.
Walid Magdy, CNGL, School of Computing, DCU
Arabic is known to be a highly inflected language that poses many challenges for building technologies around it. This talk will take a deeper look at the Arabic language, starting with some background on the language and its history. The talk will then focus on three main dimensions of the challenges this language presents. The first, and most famous, is the morphological nature of the language, which creates challenges for technologies such as natural language processing (NLP) and information retrieval (IR). The second is the orthographic nature of Arabic letters, which leads to significant challenges in recognizing printed Arabic text using OCR. The last is the phonetic nature of the Arabic language, which makes it special for technologies such as speech recognition. A quick summary will be provided of some of the solutions to each of these challenges.
Leonardo Campillos, NCLT, Dublin City University and Universidad Autónoma de Madrid (Spain)
Corpus linguistics has enriched pedagogic material for language teaching and research in Second Language Acquisition, but mainly with text data. Nowadays spoken corpora are growing and yielding data for the descriptive analysis of language, and they pose the challenge of how they can be incorporated into courseware. In this talk, I will first present an application of a spoken corpus for teaching Spanish as a foreign language, integrated in a hypertext environment. Real spoken samples have been classified according to their difficulty level and their grammatical, communicative or lexical contents, and they can be used for listening comprehension along with activities developed for each text. Secondly, I will talk about ongoing research on a spoken learner corpus for Spanish, in which 40 students with 9 different mother tongues participated. Transcriptions of learner speech are being annotated with XML error tags in order to retrieve the erroneous utterances and carry out error analysis.
Jinhua Du, CNGL, School of Computing, DCU
Syntactic reordering on the source side is an effective way of handling word order differences. The DE construction is a flexible and ubiquitous syntactic structure in Chinese which is a major source of error in translation quality. In this talk, I will present a new classifier model, a discriminative probabilistic latent variable model (DPLVM), to classify the DE construction in Chinese, improving the accuracy of the classification and hence the translation quality. We also propose a new feature which can automatically learn the reordering rules to a certain extent. The experimental results show that the MT systems using the data reordered by our proposed model outperform the baseline systems by 6.42% and 3.08% relative points in terms of BLEU score on PB-SMT and hierarchical phrase-based MT respectively. In addition, we analyse the impact of DE annotation on word alignment and on the SMT phrase table.
Yifan He, CNGL, School of Computing, DCU
We propose a translation recommendation framework to integrate Statistical Machine Translation (SMT) output with Translation Memory (TM) systems. The framework recommends SMT outputs to a TM user when it predicts that SMT outputs are more suitable for post-editing than the hits provided by the TM. We describe an implementation of this framework using an SVM binary classifier. We exploit methods to fine-tune the classifier and investigate a variety of features of different types. We rely on automatic MT evaluation metrics to approximate human judgements in our experiments. Experimental results show that our system can achieve 0.85 precision at 0.89 recall, excluding exact matches. Furthermore, it is possible for the end-user to achieve a desired balance between precision and recall by adjusting confidence levels.
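The precision/recall trade-off via confidence levels amounts to thresholding the classifier's confidence before recommending an SMT output over the TM hit. The following is a minimal sketch with made-up confidences and gold labels, not the actual SVM system:

```python
def recommend(scores, threshold):
    """Recommend the SMT output whenever classifier confidence >= threshold."""
    return [s >= threshold for s in scores]

def precision_recall(predicted, gold):
    """Precision/recall of 'recommend SMT' decisions against gold labels."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum((not p) and g for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# toy confidences; gold True = the SMT output really is better than the TM hit
conf = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
gold = [True, True, False, True, False, False]

# raising the threshold trades recall for precision
p_low, r_low = precision_recall(recommend(conf, 0.5), gold)
p_high, r_high = precision_recall(recommend(conf, 0.75), gold)
```

Here the higher threshold yields perfect precision at the cost of recall, which is the knob the end-user adjusts in the framework.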
Rejwanul Haque, CNGL, School of Computing, DCU
Statistical machine translation (SMT) models have recently begun to include source context modeling, under the assumption that the proper lexical choice of the translation for an ambiguous word can be determined from the context in which it appears. Various types of lexical and syntactic features have been explored as effective source context to improve phrase selection in SMT. In the present work, we introduce lexico-syntactic descriptions in the form of supertags as source-side context features in the state-of-the-art hierarchical phrase-based SMT (HPB) model. These features enable us to exploit source similarity in addition to target similarity, as modelled by the language model. In our experiments two kinds of supertags are employed: those from lexicalized tree-adjoining grammar (LTAG) and combinatory categorial grammar (CCG). We use a memory-based classification framework that enables the efficient estimation of these features. Despite the differences between the two supertagging approaches, they give similar improvements. We evaluate the performance of our approach on an English-to-Dutch translation task, and report statistically significant improvements of 4.48% and 6.3% in BLEU score when adding CCG and LTAG supertags, respectively, as context-informed features.
Djamé Seddah, Université Paris-Sorbonne, France
Statistical parsing of morphologically-rich languages (MRLs) has long been under-represented in the literature; only with the recent availability of treebanks for languages such as German, Arabic, Hebrew or French have general issues related to sparse lexicons, free word order or treebank idiosyncrasies started to emerge, helping to explain why existing parsing models trained on these data perform worse than their Wall Street Journal counterparts. Some linguists would say that French, despite its rich verbal inflection, some limited forms of free word order and its "insane" past participle agreement rules, can hardly be considered an MRL when compared to German or Arabic. Nevertheless, its morphological properties, especially when expressed in a small treebank, make it a good candidate for exploring different ways of coping with one of MRLs' most striking issues: very sparse lexicons. In this talk, we present the effects of data-driven lemmatization of French on a parsing model highly optimized for English, the Charniak parser, and show that while the morphological clustering brought about by the lemmatization process does bring some benefits, its effects are counterbalanced by the mechanical increase in out-of-vocabulary words, especially when compared to other forms of word clustering.
Hierarchical Pitman-Yor Language Model for Machine Translation
Tsuyoshi Okita, CNGL, School of Computing, DCU
The hierarchical Pitman-Yor process-based smoothing method applied to language modelling was proposed by Goldwater and by Teh; the performance of this smoothing method has been shown to be comparable with the modified Kneser-Ney method in terms of perplexity. Although this method was presented four years ago, no paper has reported that this language model indeed improves translation quality in the context of Machine Translation (MT). This is important for the MT community since an improvement in perplexity does not always lead to an improvement in BLEU score; for example, success in word alignment as measured by Alignment Error Rate (AER) does not often lead to an improvement in BLEU. This paper reports, in the context of MT, that an improvement in perplexity really does lead to an improvement in BLEU score. In our experiments, HPYLM improved over the baseline by 1.03 BLEU points absolute and 6% relative on a 50k-sentence EN--JP task, which was statistically significant.
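For reference, the predictive probability of the hierarchical Pitman-Yor language model, in roughly the standard notation of Teh (2006), interpolates discounted counts with the lower-order distribution:

```latex
P(w \mid u) \;=\; \frac{c_{uw} - d_{|u|}\, t_{uw}}{\theta_{|u|} + c_{u\cdot}}
\;+\; \frac{\theta_{|u|} + d_{|u|}\, t_{u\cdot}}{\theta_{|u|} + c_{u\cdot}}\; P(w \mid \pi(u))
```

where c_{uw} is the count of word w in context u, t_{uw} the number of tables for w in the underlying Chinese restaurant process, d_{|u|} and \theta_{|u|} the discount and strength parameters for contexts of length |u|, and \pi(u) the context with its earliest word dropped. Interpolated Kneser-Ney arises as a special case (roughly, when t_{uw} is restricted to at most 1), which helps explain the comparable perplexities.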
Prof. Antonio Moreno Sandoval, Universidad Autónoma de Madrid (Spain)
Since 1990 the Laboratorio de Lingüística Informática at UAM has been a pioneer in compiling language resources (LRs) for Spanish, in both spoken and written form. The most important resources are the UAM Spanish Treebank and the Spanish C-ORAL-ROM corpus. We will briefly compare the main features of both corpora and the linguistic analyses extracted from them. In the last 10 years we have extended the methodology to other languages such as Arabic, Chinese or Japanese; to other linguistic varieties (child spontaneous speech); and to different applications (CALL, multimodal Information Retrieval).
Hala Al-Maghout, CNGL, School of Computing, DCU
We present a method to incorporate target-language syntax in the form of Combinatory Categorial Grammar (CCG) into a Hierarchical Phrase-based MT system. We adopt the approach followed by Syntax-Augmented Machine Translation (SAMT) of attaching syntactic categories to nonterminals in hierarchical rules, but instead of using a constituency grammar, we take advantage of the rich syntactic information and flexible structures of CCG. We present results on the Chinese-English IWSLT DIALOG data and compare them with the Moses SAMT4 and Moses phrase-based systems. Our results show relative BLEU score increases of 5.47% and 1.18% over the Moses SAMT4 and phrase-based systems, respectively. We analyse the reasons behind this improvement and find that our approach has better coverage than the SAMT approach. Furthermore, CCG-based syntactic categories attached to nonterminals in hierarchical rules prove to be less sparse and generalize better than syntactic categories extracted according to the SAMT method.
Özlem Çetinoğlu, CNGL, School of Computing, DCU
This talk is about recent progress on the English LFG parsing technologies developed at DCU. The first part of the talk focuses on the latest improvements to the existing LFG pipeline system. We give results on the Parc700 and long-distance dependency gold standards. The second part introduces LFG-inspired dependencies. We convert the LFG pipeline output into a dependency representation, which enables us to train dependency parsers. We compare dependency parsers to the LFG parsing pipeline, which uses constituency parsers, and observe that the difference between the constituency and dependency parsers is small. Since dependency parsers are much faster than constituency parsers, LFG-inspired dependencies can be preferred for applications where speed is the most important concern. Finally, the talk briefly explains how to use the LFG technologies.
Antonio Toral, CNGL, School of Computing, DCU
This talk deals with the application of automatically acquired Named Entities (NEs) from Wikipedia to MT. The distributional properties of NEs (a very low number of occurrences per single NE and a very high number of different NEs), together with their dynamic nature, make manual approaches to including them in dictionaries impractical. Furthermore, attempts to learn their translations from parallel corpora suffer from their long-tail distribution. A study is performed on the Apertium English<->Spanish RBMT system in which the performance of the system (i) without any NEs, (ii) with hand-tagged NEs and (iii) with automatically acquired NEs is compared. Subsequently, the implications of a similar procedure for SMT are discussed.
Hanna Bechara, CNGL, School of Computing, DCU
The objective of this research is to investigate whether using statistical machine translation methods in the post-editing phase can improve translation quality. So far, post-editing techniques have been applied to the output of rule-based MT systems. We set out to use statistical and machine-learning-based MT systems in both stages. My talk will go over two sets of experiments that use a phrase-based SMT system (Moses) to post-edit the output of the same system, both with and without context information.
Antonio Toral, CNGL, School of Computing, DCU
This talk presents an Italian to Catalan RBMT system automatically built by combining the linguistic data of the existing pairs Spanish-Catalan and Spanish-Italian. A lightweight manual postprocessing is carried out in order to fix inconsistencies in the automatically derived dictionaries and to add very frequent words that are missing according to a corpus analysis. The system is evaluated on the KDE4 corpus and outperforms Google Translate by approximately ten absolute points in terms of both TER and GTM.
Joachim Wagner, CNGL, School of Computing, DCU
Until recently, researchers had to fill out a form to get access to computation nodes at the Irish Centre for High-End Computing (ICHEC, www.ichec.ie). DCU bought a share of 8 nodes of the "stokes" cluster, which DCU researchers can now use directly without an ICHEC project application. In this brief talk I will show how to use these computation nodes, what other resources are available at ICHEC, and how our own cluster 'maia' will be expanded over the next month.
Johannes Leveling, CNGL, School of Computing, DCU
The aim of this research is to explore query expansion techniques beyond the typical two-stage process of blind relevance feedback. For the participation of Dublin City University (DCU) in the Relevance Feedback (RF) track of INEX 2010, we investigated the relation between the length of relevant text passages and the number of RF terms. In our experiments, relevant passages are segmented into non-overlapping windows of fixed length which are sorted by similarity with the query. In each retrieval iteration, we extend the current query with the most frequent terms extracted from these word windows. In different experiments the number of feedback terms corresponds to a constant number, a number proportional to the length of relevant passages, and a number inversely proportional to the length of relevant passages, respectively. Results show a significant increase in MAP for INEX 2008 training data and improved precisions at early recall levels for the 2010 topics as compared to the baseline Rocchio feedback.
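The windowing step described above can be sketched in a few lines. This is an illustrative simplification, not DCU's actual configuration: the window size, the overlap-based similarity and the term counts below are all toy assumptions.

```python
from collections import Counter

def windows(tokens, size):
    """Segment a relevant passage into non-overlapping fixed-length windows."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def expansion_terms(query, passage, size=4, n_windows=2, n_terms=3):
    """Rank windows by term overlap with the query, then return the most
    frequent terms from the top-ranked windows (excluding query terms)."""
    q = set(query)
    ranked = sorted(windows(passage, size),
                    key=lambda w: len(q & set(w)), reverse=True)
    counts = Counter(t for w in ranked[:n_windows] for t in w if t not in q)
    return [t for t, _ in counts.most_common(n_terms)]

passage = ("xml retrieval with focused elements xml elements score "
           "well cooking recipes and gardening tips here").split()
terms = expansion_terms(["xml", "retrieval"], passage)
```

Making `n_terms` proportional (or inversely proportional) to the passage length, as in the experiments, only changes how many of the ranked terms are kept.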
Junhui Li, CNGL, School of Computing, DCU
Given a sentence and a predicate (either a verb or a noun) in it, the task of shallow semantic parsing is to recognize and map all the word sequences in the sentence into their corresponding semantic arguments (roles) or non-arguments. As a particular case of shallow semantic parsing, the well-defined task of semantic role labeling (SRL) has been drawing more and more attention due to its importance in deep natural language processing applications. Previous research has shown that state-of-the-art SRL systems depend heavily on the quality of parse trees, and that the performance of nominal SRL lags significantly behind that of verbal SRL. These two issues become more apparent when the Chinese language is considered. This presentation will first show how to improve the performance of nominal SRL with various kinds of verbal evidence, and then explore joint syntactic and semantic parsing to further improve the performance of both syntactic parsing and SRL.
Anton Bryl, CNGL, School of Computing, DCU
It is usual, when designing a modification to an existing MT system, to evaluate different variants of the system on the same dataset in order to see whether the changes lead to any improvement. The objective of the present work is to see how much the use of MERT influences the reliability of such comparison results. We run the comparison of the same two systems on the same training and testing sets, but with 16 different, though uniformly extracted, development sets of four different sizes (200 to 1000 sentences). We show that, due to the data-dependence of MERT, different devsets lead to vastly different comparison results. This suggests that a comparison of two systems which use MERT, performed on a single devset, is in the general case insufficient. Methods of statistical significance evaluation, such as bootstrap resampling, help to make sure that the results are not due to the randomness of the testset, but they in no way address the randomness of the devset and therefore offer no solution to the problem in question.
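For concreteness, paired bootstrap resampling over the testset can be sketched as follows. Note, as the abstract argues, that this probes testset randomness only: the devset used by MERT is held fixed. Sentence-level scores are used here as a simplification of corpus-level BLEU, which in practice is recomputed from resampled per-sentence n-gram statistics.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Resample test sentences with replacement and count how often
    system A's total score exceeds system B's on the same resample."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples

# if A is uniformly better, A wins on every resample
p = paired_bootstrap([2.0] * 10, [1.0] * 10)  # p == 1.0
```

A high `p` licenses the claim "A beats B on this testset", but says nothing about whether the ranking would survive retuning both systems on a different devset.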
Jie Jiang, CNGL, School of Computing, DCU
Inspired by previous source-side syntactic reordering methods for SMT, this talk focuses on using automatically learned syntactic reordering patterns with functional words which indicate structural reorderings between the source and target language. This approach takes advantage of phrase alignments and source-side parse trees for pattern extraction, and then filters out those patterns without functional words. Word lattices transformed by the generated patterns are fed into PBSMT systems to incorporate potential reorderings from the inputs. Experiments are carried out on a medium-sized corpus for a Chinese-English SMT task. The proposed method outperforms the baseline system by 1.38% relative on a randomly selected testset and 10.45% relative on the NIST 2008 testset in terms of BLEU score. Furthermore, a system with just 61.88% of the patterns filtered by functional words obtains a comparable performance with the unfiltered one on the randomly selected testset, and achieves 1.74% relative improvements on the NIST 2008 testset.
Sergio Penkale, CNGL, School of Computing, DCU
Unstructured models of translation such as word- and phrase-based models benefit from strong translation-equivalence models, but are unable to capture long-distance reorderings. In contrast, syntax-based systems such as Data-Oriented Translation (DOT) define rich structural models which aid in the construction of target sentences, but suffer from a weak lexical equivalence model. In this paper we propose a new log-linear DOT model which is able to exploit features of the complete source sentence. We introduce a source lexical feature to this new model and make the first attempt at incorporating a language model (LM) into a DOT system. We investigate different estimation methods for our feature, reporting on their empirical performance. We report a 38.82% relative BLEU improvement over the DOT baseline when incorporating both our lexical feature and the LM on an English-to-Spanish Europarl translation task, and provide insights into why this approach works.
Xiaofeng Wu, School of Computing, DCU
Automatic Document Summarization (ADS) is one of the subfields of Natural Language Processing (NLP). It can be defined as a technology for summarizing documents with the help of a computer, or for representing the original documents with short but comprehensive texts according to the demands of users. Research in ADS is of both theoretical and application-oriented value.
In this talk I will first give an introduction to ADS and the research work I did in my Ph.D. In this introduction, abstractive versus extractive methods, the various features and machine learning algorithms used, and how to evaluate summaries and deal with redundancy will be briefly discussed. Then I'll discuss a few thoughts about how and why I want to use Integer Linear Programming (ILP) in cross-lingual document summarization, which has a strong relationship with MT; how I plan to combine sentence compression with sentence extraction under the ILP framework; and how to obtain a compression quality prediction, which is a very hard problem.
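As a toy illustration of ILP-style extractive summarization, the core 0/1 selection problem is: maximise total sentence relevance subject to a length budget. A real system would hand this program to an ILP solver; the sketch below solves it exactly by enumeration, which is feasible only for small inputs, and all numbers are illustrative.

```python
from itertools import combinations

def select_sentences(sentences, budget):
    """Exact solution of the 0/1 selection problem by enumerating subsets:
    maximise total relevance subject to total length <= budget."""
    best, best_score = (), 0.0
    for r in range(1, len(sentences) + 1):
        for subset in combinations(range(len(sentences)), r):
            length = sum(sentences[i][1] for i in subset)
            score = sum(sentences[i][0] for i in subset)
            if length <= budget and score > best_score:
                best, best_score = subset, score
    return list(best), best_score

# (relevance, length-in-words) per candidate sentence
cands = [(3.0, 10), (2.0, 6), (1.5, 5), (1.0, 4)]
picked, total = select_sentences(cands, budget=15)
```

Combining extraction with compression, as proposed in the talk, would add variables for compressed variants of each sentence to the same program, with constraints ensuring at most one variant per sentence is chosen.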
Last update: 28th April 2011