NCLT Seminar Series 2011/2012

The NCLT seminar series usually takes place on Wednesdays from 4-5 pm in Room L2.21 (School of Computing).
The schedule of presenters will be added below as they are confirmed. Please contact Antonio Toral if you have any queries about the NCLT 2011/2012 Seminar Series.
We compare the use of edited text in the form of newswire and unedited text in the form of discussion forum posts as sources of training material in a self-training experiment involving the Brown reranking parser and a test set of sentences from an online sports discussion forum. We find that grammars induced from the two automatically parsed corpora achieve similar Parseval F-scores, with the grammars induced from the discussion forum material being slightly superior. An error analysis reveals that the two types of grammars do behave differently.
We investigate how morphological features in the form of part-of-speech tags impact parsing performance, using Arabic as our test case. The large, fine-grained tagset of the Penn Arabic Treebank (498 tags) is difficult for parsers to handle, ultimately because of data sparsity. However, ad-hoc conflation of treebank tags runs the risk of discarding potentially useful parsing information. The main contribution of this paper is to describe several automated, language-independent methods that search for the optimal feature combination to help parsing. We first identify 15 individual features in the Penn Arabic Treebank tagset. Including or excluding each of these features yields 32,768 combinations, so we then apply heuristic techniques to identify the combination achieving the highest parsing performance. Our results show a statistically significant improvement of 2.86% for vocalized text and 1.88% for unvocalized text, compared with the baseline provided by the Bikel-Bies Arabic POS mapping (and an improvement of 2.14% using product models for vocalized text, 1.65% for unvocalized text), giving state-of-the-art results for Arabic constituency parsing.
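As an illustration of the kind of heuristic search involved, the sketch below performs greedy hill-climbing over the 15 binary features; it is not the exact procedure used in the work, and the parse_fscore() function is a hypothetical stand-in for retagging the treebank with the chosen feature subset, retraining the parser and scoring on a development set.

```python
# Illustrative greedy hill-climbing over 15 binary morphological features.
# parse_fscore(feature_subset) is an assumed callable that retags the treebank,
# retrains the parser and returns a Parseval F-score on a development set.

FEATURES = [f"feat{i}" for i in range(1, 16)]  # placeholder names for the 15 features

def greedy_feature_search(parse_fscore):
    """Add one feature at a time, keeping it only if the F-score improves."""
    selected = set()
    best = parse_fscore(selected)       # baseline: coarse tags, no extra features
    improved = True
    while improved:
        improved = False
        for feat in FEATURES:
            if feat in selected:
                continue
            candidate = selected | {feat}
            score = parse_fscore(candidate)
            if score > best:
                best, selected, improved = score, candidate, True
    return selected, best
```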
This paper reports on an evaluation experiment focusing on statistical machine translation (MT) software integrated into a more complex system for the synchronization of multilingual information contained in wiki sites. The experiment focused on the translation of wiki entries from German and Dutch into English, carried out by ten media professionals (editors, journalists and translators) working at two major media organizations, who post-edited the MT output. The investigation concerned in particular the adequacy of MT to support the translation of wiki pages, and the results include both its success rate (i.e. MT effectiveness) and the associated confidence of the users (i.e. their satisfaction). Special emphasis is laid on the post-editing effort required to bring the output to publishable standard. The results show that overall the users were satisfied with the system and regarded it as a potentially useful tool to support their work; in particular, they found that the post-editing effort required to attain translated wiki entries in English of publishable quality was lower than translating from scratch.
The popularity of web-based mapping services such as Google Earth/Maps and Microsoft Virtual Earth (Bing) has led to an increasing awareness of the importance of location data and its incorporation into both web-based search applications and the databases that support them. In the past, attention to location data had been primarily limited to geographic information systems (GIS), where locations correspond to spatial objects and are usually specified geometrically. However, in web-based applications, the location data often corresponds to place names and is usually specified textually.
We investigate the problem of parsing the noisy language of social media. We evaluate four Wall-Street-Journal-trained statistical parsers (Berkeley, Brown, Malt and MST) on a new dataset containing 1,000 phrase structure trees for sentences from microblogs (tweets) and discussion forum posts. We compare the four parsers on their ability to produce Stanford dependencies for these Web 2.0 sentences. We find that the parsers have a particular problem with tweets and that a substantial part of this problem is related to POS tagging accuracy. We attempt three retraining experiments involving Malt, Brown and an in-house Berkeley-style parser and obtain a statistically significant improvement for all three parsers.
This talk reports experiments on adapting components of a Statistical Machine Translation (SMT) system for the task of translating online user forum data from Symantec. User-generated forum data is monolingual, and differs from available bitext MT training resources in a number of important respects. For this reason, adaptation techniques are important to achieve optimal results. We investigate the use of mixture modelling to adapt our models for this specific task. Individual models, created from different in-domain and out-of-domain data sources, are combined using linear and log-linear weighting methods for the different components of an SMT system. The results show that language model adaptation has a more pronounced effect on translation quality than translation model adaptation. Surprisingly, linear combination outperforms log-linear combination of the models. The best adapted systems provide a statistically significant improvement of 1.78 absolute BLEU points (6.85% relative) and 2.73 absolute BLEU points (8.05% relative) over the baseline system for English-German and English-French translations, respectively.
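As a quick reminder of the two combination schemes, the minimal sketch below contrasts linear and log-linear interpolation of two language model estimates; the probabilities and weights are toy values, not those of the Symantec systems.

```python
import math

def linear_mix(probs, weights):
    """Linear interpolation: p(w|h) = sum_i lambda_i * p_i(w|h)."""
    return sum(lam * p for lam, p in zip(weights, probs))

def loglinear_mix(probs, weights):
    """Log-linear combination: p(w|h) proportional to prod_i p_i(w|h)^lambda_i (unnormalised)."""
    return math.exp(sum(lam * math.log(p) for lam, p in zip(weights, probs)))

# Toy example: in-domain (forum) and out-of-domain language model probabilities.
p_in, p_out = 0.02, 0.004
weights = [0.7, 0.3]                          # in practice tuned on a development set
print(linear_mix([p_in, p_out], weights))     # 0.0152
print(loglinear_mix([p_in, p_out], weights))  # ~0.0123 (unnormalised score)
```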
The Hierarchical Phrase-Based (HPB) Machine Translation (MT) system extracts a synchronous Context-Free Grammar (CFG) from a parallel corpus without using any syntactic information. Nonterminals in HPB rules act as placeholders which are replaced by other phrases during decoding. In the baseline HPB system, no syntactic constraints are imposed on nonterminal replacement during decoding. Methods such as Syntax-Augmented Machine Translation (SAMT) try to constrain the phrases allowed to replace nonterminals in HPB rules by labelling them with syntactic labels extracted using a phrase-structure grammar. However, the effect of such constraints is limited because their application does not cover all levels of the derivation, for two reasons. First, these syntactic constraints are applied to hierarchical rules only and do not include glue grammar rules, which perform monotone phrase concatenation in the HPB SMT system. Second, phrases which fail to receive a syntactic label do not undergo any syntactic constraint during decoding. In my current work, I will try to address these two problems by using Combinatory Categorial Grammar (CCG) to control glue grammar-based phrase concatenation. In addition, I will try to increase the coverage of CCG-based syntactic labels by using composite syntactic labels which consist of two or more CCG categories. This work is still in progress, so I would appreciate feedback from group members, especially the parsing specialists.
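The sketch below only illustrates the labelling idea, under simplifying assumptions (the span_label and can_substitute helpers are invented for this example): a span gets a single CCG category if it is a constituent, a composite label built from adjacent sub-span categories otherwise, and a phrase may fill a nonterminal slot only when the labels are compatible.

```python
# Toy illustration of CCG-based (possibly composite) span labels used to
# constrain nonterminal substitution; not an actual HPB decoder component.

def span_label(ccg_chart, i, j):
    """Return a label for span [i, j): a CCG category if the span is a constituent,
    otherwise a composite label built from two adjacent sub-span categories."""
    if (i, j) in ccg_chart:
        return ccg_chart[(i, j)]
    for k in range(i + 1, j):
        left, right = ccg_chart.get((i, k)), ccg_chart.get((k, j))
        if left and right:
            return f"{left}+{right}"
    return None                              # no label: unconstrained, as in baseline HPB

def can_substitute(slot_label, phrase_label):
    """Allow a phrase to replace a nonterminal only if the labels agree."""
    return slot_label is None or slot_label == phrase_label

# Toy CCG chart for "the cat sat": "the cat" is an NP, "sat" is S\NP.
chart = {(0, 2): "NP", (2, 3): r"S\NP"}
print(span_label(chart, 0, 3))               # composite label "NP+S\NP" for the whole span
print(can_substitute("NP", "NP"))            # True
```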
Everyone is concerned with health topics. Thus, there is a proliferation of health-related textual data: any kind of author
The very diverse and heterogeneous landscape of huge collections of digital and digitized resources (publications, datasets, multimedia files, processing tools, services and applications) has drastically transformed the requirements for their publication, archiving, discovery and long-term maintenance. Digital repositories provide the infrastructure for describing and documenting, storing, preserving, and making this information publicly available in an open, user-friendly and trusted way. Repositories represent an evolution of the digital libraries paradigm towards open access, advanced search capabilities and large-scale distributed architectures.
In text-based image retrieval, the Incomplete Annotation Problem (IAP) can greatly degrade retrieval effectiveness. A standard method used to address this problem is pseudo relevance feedback (PRF), which updates user queries by adding feedback terms selected automatically from top-ranked documents in a prior retrieval run. PRF assumes that the target collection provides enough feedback information to select effective expansion terms. This is often not the case in image retrieval, since images often have only short metadata annotations, leading to the IAP. Our work proposes the use of an external knowledge resource (Wikipedia) in the process of refining user queries. In our method, Wikipedia documents strongly related to the terms in a user query ("definition documents") are first identified by title matching between the query and the titles of Wikipedia articles. These definition documents are used as indicators to re-weight the feedback documents from an initial search run on a Wikipedia abstract collection using the Jaccard coefficient. The new weights of the feedback documents are combined with the scores given by the different indicators. Query-expansion terms are then selected based on these new weights for the feedback documents. Our method is evaluated on the ImageCLEF WikipediaMM image retrieval task using text-based retrieval on the document metadata fields. The results show significant improvement compared to standard PRF methods.
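A minimal sketch of the re-weighting step is given below, under simplifying assumptions about how the scores are combined (the exact formulation in the work may differ): feedback documents from the initial run are boosted according to their Jaccard overlap with the definition documents before expansion terms are selected.

```python
# Sketch: re-weight feedback documents by Jaccard overlap with Wikipedia
# "definition documents", then pick expansion terms; data structures assumed.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def reweight_feedback(feedback_docs, definition_docs):
    """feedback_docs: list of (terms, retrieval_score); definition_docs: list of term lists."""
    reweighted = []
    for terms, score in feedback_docs:
        overlap = max((jaccard(terms, d) for d in definition_docs), default=0.0)
        reweighted.append((terms, score * (1.0 + overlap)))   # boost docs close to a definition
    return reweighted

def expansion_terms(reweighted, k=10):
    """Weight each candidate term by the new document weights and keep the top k."""
    weights = {}
    for terms, w in reweighted:
        for t in terms:
            weights[t] = weights.get(t, 0.0) + w
    return sorted(weights, key=weights.get, reverse=True)[:k]
```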
At first glance, treebank-induced grammars seem to be unsuitable for grammar checking, as they massively over-generate and fail to reject ungrammatical input due to their high robustness. In this talk, I give an overview of my research on applying such grammars to automatically judge the grammaticality of an input string. I show evidence that grammaticality is reflected in the generative probability of the best parse, discuss the training and evaluation of classifiers, and present results for 9 selected methods, including baseline XLE-grammar and n-gram methods as well as machine learning-based methods.
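As a toy illustration of the classification setup (the feature names and the threshold below are invented for this example, not the classifiers evaluated in the talk), the best-parse log probability can be length-normalised and thresholded:

```python
# Toy grammaticality decision from a parser's best-parse log probability.

def grammaticality_features(tokens, best_parse_logprob):
    n = max(len(tokens), 1)
    return {
        "logprob_per_token": best_parse_logprob / n,   # length-normalised generative probability
        "length": n,
    }

def classify(features, threshold=-6.0):
    """Toy rule: a very low per-token log probability suggests ungrammaticality."""
    return "grammatical" if features["logprob_per_token"] > threshold else "ungrammatical"

print(classify(grammaticality_features("the cat sat on the mat".split(), -24.0)))  # grammatical
```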
This presentation gives an overview of the DCU participation in the SMS-based FAQ Retrieval task at FIRE, the Forum for Information Retrieval Evaluation. The objective in the SMS-based FAQ retrieval task is to find answers in a collection of frequently asked questions (FAQ) given an SMS question in "text-speak". DCU submitted experimental runs for the monolingual English subtask. The DCU approach to this problem consists of first transforming the noisy SMS queries into a normalised, corrected form. The normalised queries are then used to retrieve a ranked list of FAQ results by combining the results from three slightly different retrieval mechanisms. Finally, using information from the retrieval results, out-of-domain (OOD) queries are identified and tagged. The results of our best run on the final test set are the best among 13 participating groups. We retrieved correct results for 70% of in-domain queries, identified 85.6% of out-of-domain queries correctly, and obtained an MRR score of 0.896.
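The sketch below shows the shape of such a pipeline only; the lexicon entries, the CombSUM-style fusion and the out-of-domain threshold are assumptions for illustration, not DCU's actual components.

```python
# Toy SMS normalisation, score fusion of several retrieval runs, and OOD tagging.

TEXTSPEAK = {"u": "you", "r": "are", "gr8": "great", "plz": "please"}

def normalise(sms_query):
    """Dictionary-based replacement of common "text-speak" tokens."""
    return " ".join(TEXTSPEAK.get(tok, tok) for tok in sms_query.lower().split())

def fuse(runs):
    """runs: list of {faq_id: score} from different retrieval mechanisms (CombSUM)."""
    fused = {}
    for run in runs:
        for faq_id, score in run.items():
            fused[faq_id] = fused.get(faq_id, 0.0) + score
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

def is_out_of_domain(fused_results, threshold=0.5):
    """If even the best fused score is low, tag the query as out-of-domain."""
    return not fused_results or fused_results[0][1] < threshold

query = normalise("plz tell me how r refunds processed")
runs = [{"faq12": 0.6, "faq7": 0.2}, {"faq12": 0.5}, {"faq7": 0.3}]
print(query, fuse(runs), is_out_of_domain(fuse(runs)))
```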
We propose a topical relevance model (TRLM), a generalisation of the relevance model (RLM) that aims to alleviate the limitations of a standard RLM by exploiting the topical structure of pseudo-relevant documents so as to boost intra-topical co-occurrences and down-weight inter-topical ones. TRLM provides a framework to estimate a set of underlying hypothetical relevance models for each information need expressed in a query. Latent Dirichlet allocation (LDA) of the pseudo-relevant documents is used in the probability estimations. It is not only the pseudo-relevant documents that may be topically structured, but also a massive query itself, which expresses a set of largely diverse information needs, such as those used in associative document search, e.g. patent prior art search or legal search. Two variants of TRLM are thus proposed, one for the standard ad-hoc search scenario and the other for associative document search. The first variant, called the unifaceted TRLM (uTRLM), assumes that a query expresses a single overall information need encapsulating a set of related sub-information needs. The second variant, called the multifaceted TRLM (mTRLM), is built on the assumption that a query explicitly expresses several different information needs. Results show that uTRLM significantly outperforms RLM for ad-hoc search, and mTRLM outperforms both RLM and uTRLM in patent prior art search. TRLM is also shown to be more robust than RLM in filtering out noise from non-relevant feedback documents.
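To make the modelling intuition concrete, here is a rough sketch in the spirit of the standard relevance model estimate and a topic-conditioned variant; the document representations, the uniform topic mixing and the handling of P(z|D) are simplifications for illustration, not the exact TRLM estimator presented in the talk.

```python
# Sketch: standard RLM estimate and a topic-conditioned variant where evidence
# is accumulated per LDA topic. Documents are dicts: {"p_w": {w: P(w|D)}, "p_q": P(Q|D)}.

def relevance_model(docs, vocab):
    """Standard RLM: P(w|R) proportional to sum_D P(w|D) * P(Q|D)."""
    pwr = {w: 0.0 for w in vocab}
    for doc in docs:
        for w in vocab:
            pwr[w] += doc["p_w"].get(w, 0.0) * doc["p_q"]
    total = sum(pwr.values()) or 1.0
    return {w: p / total for w, p in pwr.items()}

def topical_relevance_model(docs, vocab, n_topics, p_z_given_d):
    """Topic-conditioned variant: per-topic accumulation boosts intra-topical
    co-occurrences; the per-topic models are then mixed (uniformly, for simplicity)."""
    per_topic = [{w: 0.0 for w in vocab} for _ in range(n_topics)]
    for d, doc in enumerate(docs):
        for z in range(n_topics):
            for w in vocab:
                per_topic[z][w] += p_z_given_d[d][z] * doc["p_w"].get(w, 0.0) * doc["p_q"]
    mixed = {w: 0.0 for w in vocab}
    for topic in per_topic:
        total = sum(topic.values()) or 1.0
        for w in vocab:
            mixed[w] += topic[w] / (total * n_topics)
    return mixed
```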
This presentation gives an overview of the Khresmoi EC FP7 project, which aims to develop a multilingual, multimodal search and access system for biomedical information and documents. The system is targeted at three use cases: the general public, general practitioners and radiologists, who understand different languages at different levels and have different medical information requirements. DCU's contributions to this project lie in the areas of translation, retrieval, collaborative components, and evaluation technique development.
Domain adaptation is becoming increasingly topical and of industry interest. We are no longer looking for catch-all solutions to MT, but working towards developing specific systems for niche markets and specific domains. This presentation will give an overview of the CNGL D3c demonstrator project. We address the specific area of domain tuning to the language and content style of Symantec forum data, where the most pertinent language resources are unavailable. I will discuss our work in this area so far.
Chiang's hierarchical phrase-based (HPB) translation model advances the state of the art in statistical machine translation by expanding conventional phrases to hierarchical phrases -- phrases that contain sub-phrases. However, the original HPB model is prone to over-generation due to its lack of linguistic knowledge. In this talk, I will give an overview of my research on syntax-augmented machine translation. I will present a simple but effective translation model, called the Head-Driven HPB (HD-HPB) model, which incorporates head information in translation rules to better capture syntax-driven information in a derivation. An extensive set of experiments on Chinese-English translation over four NIST MT test sets shows that our HD-HPB model significantly outperforms Chiang's model.
The regression-based machine translation (RegMT) approach provides a learning framework for machine translation, separating the learning models for training, training instance selection, feature representation, and decoding. We introduce sparse regression as a better model than L2-regularized regression for statistical machine translation and demonstrate that sparse regression models achieve better performance in predicting target features, estimating word alignments, creating phrase tables, and generating translation outputs. We develop training instance selection algorithms that not only make RegMT computationally more scalable but also improve the performance of standard SMT systems. We also develop evaluation techniques for measuring the performance of the RegMT model, SMT systems, and the quality of the translations.
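A minimal sketch of the regression framework is given below, using scikit-learn's Lasso as a stand-in for the sparse regression and toy data; it illustrates the idea of mapping source n-gram features to target n-gram features, not the RegMT implementation itself.

```python
# Sparse regression from source n-gram features to target n-gram features (toy data).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Lasso

src = ["the house is small", "the house is big", "the book is small"]
tgt = ["das haus ist klein", "das haus ist gross", "das buch ist klein"]

src_vec = CountVectorizer(ngram_range=(1, 2)).fit(src)
tgt_vec = CountVectorizer(ngram_range=(1, 2)).fit(tgt)
X = src_vec.transform(src).toarray()
Y = tgt_vec.transform(tgt).toarray()

# One L1-regularised regression per target feature keeps the mapping sparse.
models = [Lasso(alpha=0.01, max_iter=10000).fit(X, Y[:, j]) for j in range(Y.shape[1])]

# Predict which target features should be active for an unseen source sentence.
x_new = src_vec.transform(["the book is big"]).toarray()
scores = np.array([m.predict(x_new)[0] for m in models])
active = [f for f, s in zip(tgt_vec.get_feature_names_out(), scores) if s > 0.3]
print(active)   # expected to include features such as "buch" and "gross"
```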
User-generated content (UGC) has transformed the way that information is handled online. This user-centred paradigm shift has given rise to Web 2.0 applications, where users generate, share and consume information. UGC is a valuable resource that can be exploited for different purposes, such as opinion mining, targeted advertising or information retrieval. However, UGC analysis can be challenging because of the informal features present in this new kind of textual communication. Emoticons, colloquial language, slang, misspellings and abbreviations are more frequent than in standard texts, making UGC less accessible and harder to understand for both people and Natural Language Processing applications. In this talk we explain some techniques to classify and normalise UGC and discuss future work to advance this research topic.
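As a toy example of the kind of lexical normalisation involved (the lexicon, emoticon map and elongation rule below are invented for illustration, not the techniques evaluated in the talk):

```python
# Toy UGC normalisation: emoticon mapping, elongation squeezing, slang lookup.
import re

SLANG = {"thx": "thanks", "u": "you", "r": "are", "gr8": "great", "b4": "before"}
EMOTICONS = {":)": "<smile>", ":(": "<sad>", ":D": "<laugh>"}

def normalise_ugc(text):
    out = []
    for tok in text.split():
        if tok in EMOTICONS:
            out.append(EMOTICONS[tok])
            continue
        low = tok.lower()
        low = re.sub(r"(.)\1{2,}", r"\1\1", low)   # "soooo" -> "soo"
        out.append(SLANG.get(low, low))
    return " ".join(out)

print(normalise_ugc("Thx u r gr8 :) soooo happy"))
# -> "thanks you are great <smile> soo happy"
```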
We propose a graph-based method for one-to-one word alignment. Instead of assuming independence among word alignment decisions, as in the IBM1 model, we observe that word alignment decisions are dependent on each other and that such dependencies should be modelled to improve performance. Specifically, we propose a graph representation which, like the IBM1 model, captures co-occurrence information for word alignment, but can also model dependencies among all word alignment decisions in a corpus. We propose a PageRank-style evidence propagation framework to exploit dependencies in the graph for better performance. Experimental results demonstrate both good alignment quality and high run-time efficiency. Our results show that, with an appropriate design of the graph structure, our method can surpass IBM1 by 1.3% on AER and by 1.8% on BLEU (p<0.05).
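A toy version of the propagation step is sketched below (the graph construction in the actual model is richer; the dice() prior and the adjacency format are simplifications for illustration): candidate alignment links receive a co-occurrence prior and then exchange evidence PageRank-style with links from the same sentence pairs.

```python
# Toy evidence propagation over a graph of candidate word-alignment links.

def dice(cooc, src_count, tgt_count):
    """Co-occurrence prior for a candidate link (source word, target word)."""
    return 2.0 * cooc / (src_count + tgt_count)

def propagate(nodes, edges, init, damping=0.85, iters=30):
    """nodes: link ids; edges: {link: neighbouring links (e.g. from the same
    sentence pair)}; init: {link: prior evidence}. Personalised-PageRank style."""
    rank = dict(init)
    for _ in range(iters):
        new = {}
        for n in nodes:
            incoming = sum(rank[m] / max(len(edges.get(m, [])), 1)
                           for m in edges.get(n, []))
            new[n] = (1 - damping) * init[n] + damping * incoming
        rank = new
    return rank

# Two links that co-occur in the same sentence pair reinforce each other.
links = ["haus~house", "das~the"]
prior = {"haus~house": dice(3, 4, 5), "das~the": dice(10, 12, 15)}
adjacency = {"haus~house": ["das~the"], "das~the": ["haus~house"]}
print(propagate(links, adjacency, prior))
```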
In this paper we present a hybrid statistical machine translation (SMT)-example-based MT (EBMT) system that shows significant improvement over both SMT and EBMT baseline systems. First we present a runtime EBMT system using a subsentential translation memory (TM). The EBMT system is then combined with an SMT system for effective hybridization of the pair of systems. The hybrid system shows significant improvement in translation quality (0.82 and 2.75 absolute BLEU points) over the baseline SMT system for two different language pairs, English–Turkish (En–Tr) and English–French (En–Fr). However, the EBMT approach suffers from significant time complexity issues for a runtime approach. We explore two methods to make the system scalable at runtime. First, we use a heuristic-based approach. Second, we use an IR-based indexing technique to speed up the time-consuming matching procedure of the EBMT system. The index-based matching procedure substantially improves run-time speed without affecting translation quality.
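A minimal sketch of the IR-style shortcut is shown below, with assumed data structures and difflib's SequenceMatcher standing in for the actual fuzzy matcher: an inverted index over the TM source sides narrows the candidate set before the expensive matching step.

```python
# Inverted-index candidate retrieval before fuzzy matching against the TM.
from collections import defaultdict
from difflib import SequenceMatcher

def build_index(tm_sources):
    index = defaultdict(set)
    for i, sent in enumerate(tm_sources):
        for tok in set(sent.split()):
            index[tok].add(i)
    return index

def candidates(index, query, top_n=50):
    """Rank TM entries by the number of query tokens they share."""
    counts = defaultdict(int)
    for tok in set(query.split()):
        for i in index.get(tok, ()):
            counts[i] += 1
    return sorted(counts, key=counts.get, reverse=True)[:top_n]

def best_match(tm_sources, index, query):
    """Fuzzy-match only the retrieved candidates instead of the whole TM."""
    best_i, best_score = None, 0.0
    for i in candidates(index, query):
        score = SequenceMatcher(None, query, tm_sources[i]).ratio()
        if score > best_score:
            best_i, best_score = i, score
    return best_i, best_score
```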
We describe the development of a test collection for the investigation of speech retrieval beyond the identification of relevant content. This collection focuses on satisfying user information needs for queries associated with specific types of speech acts. The collection is based on an archive of Internet video from the video sharing platform blip.tv, and was provided by the MediaEval benchmarking initiative. A crowdsourcing approach was used to identify segments in the video data which contain speech acts, to create a description of the video containing the act, and to generate search queries designed to re-find this speech act. We describe and reflect on our experiences with crowdsourcing this test collection using the Amazon Mechanical Turk platform. We highlight the challenges of constructing this dataset, including the selection of the data source, the design of the crowdsourcing task, and the specification of queries and relevant items.
Last update: 3rd April 2012