National Centre for Language Technology

Dublin City University, Ireland

NCLT Seminar Series 2011/2012

The NCLT seminar series usually takes place on Wednesdays from 4-5 pm in Room L2.21 (School of Computing).

The schedule of presenters will be added below as they are confirmed. Please contact Antonio Toral if you have any queries about the NCLT 2011/2012 Seminar Series.

Time and venue | Speaker(s) | Title(s)
October 4th 2011; 15:00, The Gallery (The Helix) | Jennifer Foster, Jon Dehdari | Comparing the Use of Edited and Unedited Text in Parser Self-Training; Morphological Features for Parsing Morphologically-rich Languages: A Case of Arabic
October 12th 2011; 16:00, L2.21 | Federico Gaspari | User-focused task-oriented MT evaluation for wikis: a case study
October 20th 2011; 16:00, L2.21 | Hanan Samet | Place-based Information Systems: Textual Location Identification and Visualization
October 27th 2011; 16:00, L2.21 | Özlem Çetinoğlu | From News to Comment: Resources and Benchmarks for Parsing the Language of Web 2.0
November 2nd 2011; 16:00, L2.21 | Pratyush Banerjee | Domain Adaptation in SMT of User-Forum Data using Component Level Mixture Modelling
November 15th 2011; 16:00, L2.21 | Hala Al-Maghout | Extending syntactic constraints in syntax-augmented HPB SMT system
November 23rd 2011; 16:00, L2.21 | Lorraine Goeuriot | How can NLP contribute to the medical domain? Some examples
November 30th 2011; 16:00, L2.21 | John Judge | META-SHARE - An Open Resource Sharing Infrastructure for Language Technologies
January 11th 2012; 16:00, L2.21 | Jinming Min | External Query Reformulation for Text-based Image Retrieval
January 18th 2012; 16:00, L2.21 | Joachim Wagner | Detecting Grammatical Errors with Treebank-Induced, Probabilistic Parsers
January 25th 2012; 16:00, L2.21 | Johannes Leveling | DCU@FIRE 2011: SMS-Based FAQ Retrieval
February 15th 2012; 16:00, L2.21 | Debasis Ganguly | Topical Relevance Models
February 22nd 2012; 16:00, L2.21 | Liadh Kelly | Khresmoi – Medical information analysis and retrieval
February 29th 2012; 16:00, L2.21 | Qun Liu (Chinese Academy of Sciences) | Progress in Machine Translation Technologies: the next 5 Years Ahead
March 14th 2012; 16:00, L2.21 | Sara Morrissey | Domain and Personalised Tuning for Machine Translation: the CNGL demonstrator
March 21st 2012; 16:00, L2.21 | Junhui Li | Head-Driven Hierarchical Phrase-based Translation
March 28th 2012; 16:00, L2.21 | Ergun Biçici | The Regression Model of Machine Translation
April 4th 2012; 11:00, L2.21 | Alejandro Mosquera (University of Alicante) | Studying the informal features of user-generated content
April 11th 2012; 16:00, L2.21 | Xiaofeng Wu | AlignRank: A Graph-Based Word Alignment Approach Using Evidence Propagation
April 18th 2012; 16:00, L2.21 | Sandipan Dandapat | Combining EBMT, SMT, TM and IR Technologies for Quality and Scale
May 16th 2012; 16:00, L2.21 | Maria Eskevich | Creating a Data Collection for Evaluating Rich Speech Retrieval
May 23rd 2012; 16:00, L2.21 | EAMT dry-runs |
May 30th 2012; 16:00, L2.21 | (to be confirmed) |
June 6th 2012; 16:00, L2.21 | Stephen Doherty | (to be confirmed)
June 13th 2012; 16:00, L2.21 | (to be confirmed) |
June 20th 2012; 16:00, L2.21 | (to be confirmed) |
June 27th 2012; 16:00, L2.21 | (to be confirmed) |

Comparing the Use of Edited and Unedited Text in Parser Self-Training (IWPT Short Paper)

Jennifer Foster (joint work with Özlem Çetinoğlu, Joachim Wagner and Josef van Genabith)

We compare the use of edited text in the form of newswire and unedited text in the form of discussion forum posts as sources of training material in a self-training experiment involving the Brown reranking parser and a test set of sentences from an online sports discussion forum. We find that grammars induced from the two automatically parsed corpora achieve similar Parseval f-scores, with the grammars induced from the discussion forum material being slightly superior. An error analysis reveals, however, that the two types of grammars do behave differently.


Morphological Features for Parsing Morphologically-rich Languages: A Case of Arabic (SPMRL Long Paper)

Jon Dehdari (joint work with Lamia Tounsi and Josef van Genabith)

We investigate how morphological features in the form of part-of-speech tags impact parsing performance, using Arabic as our test case. The large, fine-grained tagset of the Penn Arabic Treebank (498 tags) is difficult for parsers to handle, ultimately due to data sparsity. However, ad-hoc conflation of treebank tags runs the risk of discarding potentially useful parsing information. The main contribution of this paper is to describe several automated, language-independent methods that search for the optimal feature combination to help parsing. We first identify 15 individual features from the Penn Arabic Treebank tagset. Including or excluding these features yields 32,768 possible combinations, so we then apply heuristic techniques to identify the combination achieving the highest parsing performance. Our results show a statistically significant improvement of 2.86% for vocalized text and 1.88% for unvocalized text, compared with the baseline provided by the Bikel-Bies Arabic POS mapping (and an improvement of 2.14% using product models for vocalized text, 1.65% for unvocalized text), giving state-of-the-art results for Arabic constituency parsing.
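
As a rough, illustrative sketch of the kind of heuristic feature-combination search described above (not the actual method or feature set of the paper), the snippet below hill-climbs over subsets of binary morphological-feature flags; the feature names are placeholders, and parse_fscore is a stub that would, in a real setup, wrap retraining and scoring a parser on a tagset built from the chosen features.

```python
import random

# Placeholder names -- the paper identifies 15 morphological features in the
# Penn Arabic Treebank tagset; these are illustrative, not the actual list.
FEATURES = ["person", "number", "gender", "case", "mood", "aspect", "voice",
            "state", "definiteness", "proclitic", "enclitic", "pronoun_type",
            "adjective_type", "noun_type", "vocalization"]

def parse_fscore(active_features):
    """Stub: in a real setup this would retrain and evaluate a parser on a
    tagset reduced to `active_features` and return its labelled F-score.
    Here it is simulated deterministically for illustration."""
    random.seed(hash(frozenset(active_features)) % (2 ** 32))
    return 70.0 + 10.0 * random.random()

def greedy_search(features):
    """Hill-climb over feature subsets: repeatedly toggle the single feature
    whose inclusion/exclusion most improves the (simulated) parsing F-score."""
    current = set()
    best = parse_fscore(current)
    improved = True
    while improved:
        improved = False
        for f in features:
            candidate = current ^ {f}          # toggle one feature
            score = parse_fscore(candidate)
            if score > best:
                best, current, improved = score, candidate, True
    return current, best

if __name__ == "__main__":
    subset, score = greedy_search(FEATURES)
    print("selected features:", sorted(subset), "F-score:", round(score, 2))
```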


User-focused task-oriented MT evaluation for wikis: a case study (JEC paper)

Federico Gaspari (joint work with Antonio Toral and Sudip Kumar Naskar)

This paper reports on an evaluation experiment focusing on statistical machine translation (MT) software integrated into a more complex system for the synchronization of multilingual information contained in wiki sites. The experiment focused on the translation of wiki entries from German and Dutch into English, carried out by ten media professionals (editors, journalists and translators) working at two major media organizations, who post-edited the MT output. The investigation concerned in particular the adequacy of MT to support the translation of wiki pages, and the results include both its success rate (i.e. MT effectiveness) and the associated confidence of the users (i.e. their satisfaction). Special emphasis is laid on the post-editing effort required to bring the output to publishable standard. The results show that overall the users were satisfied with the system and regarded it as a potentially useful tool to support their work; in particular, they found that the post-editing effort required to attain translated wiki entries in English of publishable quality was lower than that of translating from scratch.


Place-based Information Systems: Textual Location Identification and Visualization

Hanan Samet

The popularity of web-based mapping services such as Google Earth/Maps and Microsoft Virtual Earth (Bing) has led to an increasing awareness of the importance of location data and its incorporation into both web-based search applications and the databases that support them. In the past, attention to location data had been primarily limited to geographic information systems (GIS), where locations correspond to spatial objects and are usually specified geometrically. However, in web-based applications, the location data often corresponds to place names and is usually specified textually.

An advantage of such a specification is that the same specification can be used regardless of whether the place name is to be interpreted as a point or a region. Thus the place name acts as a polymorphic data type in the parlance of programming languages. However, its drawback is that it is ambiguous. In particular, a given specification may have several interpretations, not all of which are names of places. For example, "Jordan" may refer to a person as well as a place. Moreover, there is additional ambiguity when the specification has a place name interpretation. For example, "Jordan" can refer to a river or a country, while there are a number of cities named "London".

In this talk we examine the extension of GIS concepts to textually specified location data and review search engines that we have developed to retrieve documents where the similarity criterion is not based solely on exact match of elements of the query string but also on spatial proximity. Thus we want to take advantage of spatial synonyms so that, for example, a query seeking a rock concert in Dublin would be satisfied by a result finding a rock concert in Leixlip or Maynooth. We have applied this idea in the STEWARD (Spatio-Textual Extraction on the Web Aiding Retrieval of Documents) system for finding documents on the website of the Department of Housing and Urban Development. This system relies on the presence of a document tagger that automatically identifies spatial references in text, PDF, Word, and other unstructured documents. The thesaurus for the document tagger is a collection of publicly available data sets forming a gazetteer containing the names of places in the world. Search results are ranked according to the extent to which they satisfy the query, which is determined in part by the prevalent spatial entities present in the document.

The same ideas have also been adapted by us to collections of news articles as well as Twitter tweets, resulting in the NewsStand and TwitterStand systems, respectively, which will be demonstrated along with the STEWARD system in conjunction with a discussion of some of the underlying issues that arose and the techniques used in their implementation. Future work involves applying these ideas to spreadsheet data.
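
As a toy illustration of the spatial-synonym idea (not the actual STEWARD implementation), the sketch below ranks documents by the proximity of their place names to the query location, using a tiny gazetteer with approximate coordinates:

```python
from math import radians, sin, cos, asin, sqrt

# Toy gazetteer with approximate (latitude, longitude) coordinates;
# a real system would use a full public gazetteer.
GAZETTEER = {
    "Dublin":   (53.35, -6.26),
    "Leixlip":  (53.37, -6.49),
    "Maynooth": (53.38, -6.59),
    "London":   (51.51, -0.13),
}

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def spatial_score(query_place, doc_places, scale_km=50.0):
    """Score a document by the proximity of its closest place name to the
    query location; nearby places score high, distant ones decay towards 0."""
    q = GAZETTEER[query_place]
    return max(1.0 / (1.0 + haversine_km(q, GAZETTEER[p]) / scale_km)
               for p in doc_places)

docs = {
    "concert listing A": ["Leixlip"],
    "concert listing B": ["Maynooth"],
    "concert listing C": ["London"],
}
for name, places in sorted(docs.items(),
                           key=lambda kv: -spatial_score("Dublin", kv[1])):
    print(name, round(spatial_score("Dublin", places), 3))
```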

Biography

Hanan Samet (http://www.cs.umd.edu/~hjs/) is a Professor of Computer Science at the University of Maryland, College Park and is a member of the Institute for Advanced Computer Studies. He is also a member of the Computer Vision Laboratory at the Center for Automation Research, where he leads a number of research projects on the use of hierarchical data structures for database applications involving spatial data. He has a Ph.D. from Stanford University. He is the author of the recent book "Foundations of Multidimensional and Metric Data Structures", published by Morgan Kaufmann, San Francisco, CA, in 2006 (http://www.mkp.com/multidimensional), an award winner in the 2006 best book in Computer and Information Science competition of the Professional and Scholarly Publishers (PSP) Group of the Association of American Publishers (AAP), and of the first two books on spatial data structures, "Design and Analysis of Spatial Data Structures" and "Applications of Spatial Data Structures: Computer Graphics, Image Processing and GIS", published by Addison-Wesley, Reading, MA, 1990. He is the founding chair of ACM SIGSPATIAL, a recipient of the 2009 UCGIS Research Award and the 2010 CMPS Board of Visitors Award at the University of Maryland, a Fellow of the ACM, IEEE, AAAS, and IAPR (International Association for Pattern Recognition), and an ACM Distinguished Speaker.


From News to Comment: Resources and Benchmarks for Parsing the Language of Web 2.0 (IJCNLP paper)

Özlem Çetinoğlu (joint work with Jennifer Foster, Joachim Wagner, Joseph Le Roux, Joakim Nivre, Deirdre Hogan and Josef van Genabith)

We investigate the problem of parsing the noisy language of social media. We evaluate four Wall-Street-Journal-trained statistical parsers (Berkeley, Brown, Malt and MST) on a new dataset containing 1,000 phrase structure trees for sentences from microblogs (tweets) and discussion forum posts. We compare the four parsers on their ability to produce Stanford dependencies for these Web 2.0 sentences. We find that the parsers have a particular problem with tweets and that a substantial part of this problem is related to POS tagging accuracy. We attempt three retraining experiments involving Malt, Brown and an in-house Berkeley-style parser and obtain a statistically significant improvement for all three parsers.


Domain Adaptation in SMT of User-Forum Data using Component Level Mixture Modelling

Pratyush Banerjee

This talk reports experiments on adapting components of a Statistical Machine Translation (SMT) system for the task of translating online user forum data from Symantec. User-generated forum data is monolingual, and differs from available bitext MT training resources in a number of important respects. For this reason, adaptation techniques are important to achieve optimal results. We investigate the use of mixture modelling to adapt our models for this specific task. Individual models, created from different in-domain and out-of-domain data sources, are combined using linear and log-linear weighting methods for the different components of an SMT system. The results show that language model adaptation has a more profound effect on translation quality than translation model adaptation. Surprisingly, linear combination outperforms log-linear combination of the models. The best adapted systems provide a statistically significant improvement of 1.78 absolute BLEU points (6.85% relative) and 2.73 absolute BLEU points (8.05% relative) over the baseline system for English-German and English-French translations, respectively.
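
As a minimal numerical illustration of the two combination schemes mentioned above (the probabilities and weights are made up for the example), linear interpolation sums weighted component probabilities, while log-linear combination multiplies weighted component probabilities:

```python
import math

def linear_mixture(probs, weights):
    """Linear interpolation: p(x) = sum_i w_i * p_i(x), weights summing to 1."""
    return sum(w * p for w, p in zip(weights, probs))

def loglinear_mixture(probs, weights):
    """Log-linear combination: p(x) proportional to prod_i p_i(x)**w_i
    (unnormalised here; in an SMT decoder the normalisation constant does
    not affect the ranking of hypotheses)."""
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, probs)))

# Illustrative per-word probabilities of the same word under an in-domain
# (user-forum) and an out-of-domain (general bitext) language model.
component_probs = [0.020, 0.001]
weights = [0.7, 0.3]

print("linear:    ", linear_mixture(component_probs, weights))
print("log-linear:", loglinear_mixture(component_probs, weights))
```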


Extending syntactic constraints in syntax-augmented HPB SMT system

Hala Almaghout

The Hierarchical Phrase-Based (HPB) Machine Translation (MT) system extracts a synchronous Context-Free Grammar (CFG) from a parallel corpus without using any syntactic information. Nonterminals in HPB rules act as placeholders which are replaced by other phrases during decoding. In the baseline HPB system, there are no syntactic constraints imposed on nonterminal replacement during decoding. Methods such as Syntax-Augmented Machine Translation (SAMT) try to constrain the phrases allowed to replace nonterminals in HPB rules by labelling them with syntactic labels extracted using a phrase-structure grammar. However, the effect of such constraints is limited because their application does not cover all levels of the derivation, for two reasons. First, these syntactic constraints are applied to hierarchical rules only and do not cover glue grammar rules, which perform monotone phrase concatenation in the HPB SMT system. Second, phrases which fail to receive a syntactic label do not undergo any syntactic constraint during decoding. In my current work, I will approach these two problems by using Combinatory Categorial Grammar (CCG) to control glue grammar-based phrase concatenation. In addition, I will try to increase the coverage of CCG-based syntactic labels by using composite syntactic labels which consist of two or more CCG categories. This work is still in progress, so I would appreciate feedback from group members, especially the parsing specialists.


How can NLP contribute to the medical domain? Some examples

Lorraine Goeuriot

Everyone is concerned with health topics. Thus, there is a proliferation of health-related textual data: any kind of author (e.g., patients, students, general practitioners, researchers) can write about any kind of topic (e.g., disease, treatment, drug) in any language. What kind of knowledge can we extract from such an amount of data? And how can this contribute to the medical domain? In this talk, I will present two projects. The first one focuses on the compilation of trilingual comparable corpora in specialised domains, especially the medical one. These resources are mainly used to build multilingual terminologies, which are highly needed in specialised domains. The second project is an opinion mining system for health-related user-generated content. Focusing first on drug-related texts, our lexicon-based system provides a summarised view of the opinions expressed on different aspects of the drugs.


META-SHARE - An Open Resource Sharing Infrastructure for Language Technologies

John Judge

The very diverse and heterogeneous landscape of huge amounts of digital and digitized resource collections (publications, datasets, multimedia files, processing tools, services and applications) has drastically transformed the requirements for their publication, archiving, discovery and long-term maintenance. Digital repositories provide the infrastructure for describing and documenting, storing, preserving, and making this information publicly available in an open, user-friendly and trusted way. Repositories represent an evolution of the digital libraries paradigm towards open access, advanced search capabilities and large-scale distributed architectures.

META-SHARE aims at providing such an open, distributed, secure, and interoperable infrastructure for the Language Technology domain. Open, since the infrastructure is conceived as an ever-evolving, scalable resource base including free and for-a-fee resources and services; distributed because it will consist of networked repositories/data centres accessible through common interfaces; interoperable, because the resource base will be standards-compliant, trying to overcome format, terminological and semantic differences; secure, since it will guarantee legally sound governance, legal compliance and secure access to licensable resources.

META-SHARE builds a multi-layer infrastructure that will:
* make available quality documented LRs and related metadata over the network,
* ensure that such LRs and metadata are properly managed, preserved and maintained,
* provide a set of services to all META-SHARE members and users,
* promote the use of widely accepted standards for language resource building, ensuring the maximum possible interoperability of LRs,
* allow associated third parties to export their LRs over the META-SHARE network,
* allow potential users to easily, and in a legally safe way, acquire the LRs they request for their own purposes.

The targeted resources and technologies of META-SHARE, in order of priority, include:
* language data, such as written and spoken corpora,
* language-related data, including and/or associated to other media and modalities where written and spoken natural language plays an important role,
* language processing and annotation tools and technologies,
* services through the use of language processing tools and technologies,
* evaluation tools, metrics and protocols, services addressing assessment and evaluation,
* service workflows by combining and orchestrating interoperable services.

META-SHARE aims to become a useful infrastructure for providers and users of language resources and technologies, as well as LT integrators/vendors, language professionals (translators, interpreters, localization experts), national and international data centres and repositories of LRs and technologies, and national and international LT policy makers and other LR & LT funders and sponsors.

META-SHARE will be a freely available facility, supported by a large user and developer community, based on distributed networked repositories accessible through common interfaces. Users (consumers, providers or aggregators) will have single sign-on accounts and will be able to access everything within the repositories network.

Language resources and their metadata will reside at the members' repositories. Only the metadata are exported, to be available for harvesting and for populating the network's inventory, which will include metadata-based descriptions of all LRs in the network. Software for building one's own repository will be made available by META-SHARE itself, free of charge.

In META-SHARE the requested LRs are just a few clicks away.


External Query Reformulation for Text-based Image Retrieval

Jinming Min

In text-based image retrieval, the Incomplete Annotation Problem (IAP) can greatly degrade retrieval effectiveness. A standard method used to address this problem is pseudo-relevance feedback (PRF), which updates user queries by adding feedback terms selected automatically from top-ranked documents in a prior retrieval run. PRF assumes that the target collection provides enough feedback information to select effective expansion terms. This is often not the case in image retrieval, since images often only have short metadata annotations, leading to the IAP. Our work proposes the use of an external knowledge resource (Wikipedia) in the process of refining user queries. In our method, Wikipedia documents strongly related to the terms in the user query ("definition documents") are first identified by title matching between the query and the titles of Wikipedia articles. These definition documents are used as indicators to re-weight the feedback documents from an initial search run on a Wikipedia abstract collection, using the Jaccard coefficient. The new weights of the feedback documents are combined with the scores given by the different indicators. Query-expansion terms are then selected based on these new weights for the feedback documents. Our method is evaluated on the ImageCLEF WikipediaMM image retrieval task using text-based retrieval on the document metadata fields. The results show significant improvement compared to standard PRF methods.
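
The following self-contained sketch illustrates the Jaccard-based re-weighting step in simplified form (the document contents, scores and combination weight are illustrative, not those of the actual system): feedback documents are re-scored by their term overlap with the definition documents, and candidate expansion terms are then ranked by the re-weighted documents that contain them.

```python
def jaccard(a, b):
    """Jaccard coefficient between two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def reweight_feedback(feedback_docs, definition_docs, alpha=0.5):
    """Combine each feedback document's original retrieval score with its
    maximum Jaccard similarity to any definition document."""
    new_weights = {}
    for doc_id, (score, terms) in feedback_docs.items():
        overlap = max(jaccard(terms, d) for d in definition_docs)
        new_weights[doc_id] = alpha * score + (1 - alpha) * overlap
    return new_weights

def expansion_terms(feedback_docs, new_weights, k=3):
    """Rank candidate expansion terms by the total re-weighted mass of the
    feedback documents they occur in."""
    scores = {}
    for doc_id, (_, terms) in feedback_docs.items():
        for t in terms:
            scores[t] = scores.get(t, 0.0) + new_weights[doc_id]
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Illustrative toy data: (initial score, bag of terms) per feedback document.
feedback = {
    "d1": (0.9, ["eiffel", "tower", "paris", "night"]),
    "d2": (0.8, ["tower", "bridge", "london"]),
    "d3": (0.7, ["eiffel", "tower", "construction", "iron"]),
}
definitions = [["eiffel", "tower", "paris", "iron", "lattice"]]

weights = reweight_feedback(feedback, definitions)
print(expansion_terms(feedback, weights))
```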


Detecting Grammatical Errors with Treebank-Induced, Probabilistic Parsers

Joachim Wagner

At first glance, treebank-induced grammars seem unsuitable for grammar checking as they massively over-generate and fail to reject ungrammatical input due to their high robustness. In this talk, I give an overview of my research on applying such grammars to automatically judge the grammaticality of an input string. I show evidence that grammaticality is reflected in the generative probability of the best parse, discuss the training and evaluation of classifiers, and present results for 9 selected methods, including baseline XLE-grammar and n-gram methods as well as machine-learning-based methods.


DCU@FIRE 2011: SMS-Based FAQ Retrieval

Johannes Leveling

This presentation gives an overview of the DCU participation in the SMS-based FAQ Retrieval task at FIRE, the Forum for Information Retrieval Evaluation. The objective of the SMS-based FAQ retrieval task is to find answers in a collection of frequently asked questions (FAQ) given an SMS question in "text-speak". DCU submitted experimental runs for the monolingual English subtask. The DCU approach to this problem consists of first transforming the noisy SMS queries into a normalised, corrected form. The normalised queries are then used to retrieve a ranked list of FAQ results by combining the results from three slightly different retrieval mechanisms. Finally, using information from the retrieval results, out-of-domain (OOD) queries are identified and tagged. The results of our best run on the final test set are the best among 13 participating groups. We retrieved results for 70% of in-domain queries correctly, identified 85.6% of out-of-domain queries correctly, and obtained an MRR score of 0.896.
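
As a minimal illustration of the normalisation step (not the DCU system itself), the sketch below maps a tiny sample of "text-speak" tokens to canonical forms before retrieval:

```python
import re

# Tiny illustrative lookup of "text-speak" forms; a real normaliser would use
# a much larger dictionary plus spelling correction and language-model scoring.
TEXTSPEAK = {
    "hw": "how", "r": "are", "u": "you", "2": "to", "gr8": "great",
    "pls": "please", "thx": "thanks", "wat": "what", "d": "the",
}

def normalise_sms(sms):
    """Lower-case, tokenise and replace known text-speak tokens so that the
    query better matches the formally written FAQ collection."""
    tokens = re.findall(r"[a-z0-9']+", sms.lower())
    return " ".join(TEXTSPEAK.get(t, t) for t in tokens)

print(normalise_sms("hw do i renew d passport pls"))
# -> "how do i renew the passport please"
```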


Topical Relevance Models

Debasis Ganguly

We propose the topical relevance model (TRLM), a generalisation of the relevance model (RLM) that aims to alleviate the limitations of a standard RLM by exploiting the topical structure of pseudo-relevant documents, boosting intra-topical co-occurrences and down-weighting inter-topical ones. TRLM provides a framework to estimate a set of underlying hypothetical relevance models for each information need expressed in a query. Latent Dirichlet allocation (LDA) of the pseudo-relevant documents is used in the probability estimations. It is not only the pseudo-relevant documents that may be topically structured, but also a massive query itself, which may express a set of largely diverse information needs, as in associative document search, e.g. patent prior-art search or legal search. Two variants of TRLM are thus proposed, one for the standard ad-hoc search scenario and the other for associative document search. The first variant, called the unifaceted TRLM (uTRLM), assumes that a query expresses a single overall information need encapsulating a set of related sub-information needs. The second variant, called the multifaceted TRLM (mTRLM), is built on the assumption that a query explicitly expresses several different information needs. Results show that uTRLM significantly outperforms RLM for ad-hoc search, and that mTRLM outperforms both RLM and uTRLM in patent prior-art search. TRLM is shown to be more robust than RLM in filtering out noise from non-relevant feedback documents.
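
For orientation, the standard relevance model scores a word by its expected probability under the pseudo-relevant feedback documents; a topical variant along the lines described above can additionally marginalise over LDA topics. The formulas below are a sketch of this idea, not necessarily the exact estimators used in this work:

```latex
% Standard relevance model over the pseudo-relevant feedback set F for query Q:
P(w \mid R) \;\approx\; \sum_{D \in F} P(w \mid D)\, P(D \mid Q)

% A topical variant marginalising over LDA topics z (a sketch of the idea only):
P(w \mid R) \;\approx\; \sum_{D \in F} \sum_{z} P(w \mid z)\, P(z \mid D)\, P(D \mid Q)
```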


Khresmoi – Medical information analysis and retrieval

Liadh Kelly

This presentation gives an overview of the Khresmoi EC FP7 project which aims to develop a multilingual, multimodal search and access system for biomedical information and documents. The system is targeted at three use cases: general public, general practitioners and radiologists, understanding different languages at different levels, and with different medical information requirements. DCU’s contributions to this project are in the translation, retrieval, collaborative components, and evaluation technique development spaces.


Domain and Personalised Tuning for Machine Translation: the CNGL demonstrator

Sara Morrissey

Domain adaptation is becoming increasingly topical and of industry interest. We are no longer looking for catch-all solutions to MT, but working towards developing specific systems for niche markets and specific domains. This presentation will give an overview of the CNGL D3c demonstrator project. We address the specific area of domain tuning to language and content style for Symantec forum data, where the most pertinent language resources are unavailable. I will discuss our work in this area so far.


Head-Driven Hierarchical Phrase-based Translation

Junhui Li

Chiang's hierarchical phrase-based (HPB) translation model advances the state of the art in statistical machine translation by expanding conventional phrases to hierarchical phrases -- phrases that contain sub-phrases. However, the original HPB model is prone to over-generation due to its lack of linguistic knowledge. In this talk, I will give an overview of my research on syntactically augmented machine translation. I will present a simple but effective translation model, called the Head-Driven HPB (HD-HPB) model, which incorporates head information in translation rules to better capture syntax-driven information in a derivation. An extensive set of experiments on Chinese-English translation on four NIST MT test sets shows that our HD-HPB model significantly outperforms Chiang's model.


The Regression Model of Machine Translation

Ergun Biçici

The regression-based machine translation (RegMT) approach provides a learning framework for machine translation, separating the learning models for training, training instance selection, feature representation, and decoding. We introduce sparse regression as a better model than L2-regularized regression for statistical machine translation and demonstrate that sparse regression models achieve better performance in predicting target features, estimating word alignments, creating phrase tables, and generating translation outputs. We develop training instance selection algorithms that not only make RegMT computationally more scalable but also improve the performance of standard SMT systems. We also develop evaluation techniques for measuring the performance of the RegMT model, of SMT systems, and of the quality of the translations.
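
A compact sketch of the core regression idea, predicting target-side n-gram features from source-side n-gram features with a sparse (L1-regularised) learner; the toy parallel data, feature mapping and hyper-parameters are illustrative, and scikit-learn's Lasso and Ridge stand in for the RegMT models:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Lasso, Ridge

# Toy parallel data; a real setup would select training instances per test
# sentence and use much richer n-gram or string-kernel feature mappings.
source = ["the cat sleeps", "the dog sleeps", "the cat eats", "a dog eats"]
target = ["le chat dort", "le chien dort", "le chat mange", "un chien mange"]

src_vec = CountVectorizer(ngram_range=(1, 2)).fit(source)
tgt_vec = CountVectorizer(ngram_range=(1, 2)).fit(target)
X = src_vec.transform(source).toarray()
Y = tgt_vec.transform(target).toarray()

# Sparse (L1) regression vs. L2-regularised regression for mapping source
# feature vectors to target feature vectors.
lasso = Lasso(alpha=0.01, max_iter=10000).fit(X, Y)
ridge = Ridge(alpha=1.0).fit(X, Y)

test = src_vec.transform(["the dog eats"]).toarray()
features = np.array(tgt_vec.get_feature_names_out())
for name, model in [("lasso (L1)", lasso), ("ridge (L2)", ridge)]:
    pred = model.predict(test)[0]
    top = features[np.argsort(pred)[::-1][:4]]
    print(name, "top predicted target features:", top.tolist())
```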


Studying the informal features of user-generated content

Alejandro Mosquera

User-generated content (UGC) has transformed the way that information is handled on-line. This user-focused paradigm shift has given rise to Web 2.0 applications, where users generate, share and consume information. UGC is a valuable resource that can be exploited for different purposes, such as opinion mining, directed advertising or information retrieval. However, UGC analysis can be challenging because of the informal features present in this new kind of textual communication. Emoticons, colloquial language, slang, misspellings and abbreviations occur more frequently than in standard texts, making such content less accessible and harder to understand for both people and Natural Language Processing applications. In this talk we explain some techniques to classify and normalise UGC and discuss future work to enhance this research topic.
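
As a small illustration of the kind of surface cues involved (the patterns and word list below are a tiny, made-up sample, not the features used in this research), one can count informal phenomena such as emoticons, character repetition and slang tokens:

```python
import re

EMOTICON = re.compile(r"[:;=]-?[\)\(DPp]")
REPEATED_CHARS = re.compile(r"(\w)\1{2,}")           # e.g. "sooooo"
# Tiny illustrative sample of slang/abbreviations; a real system would use
# larger lexicons plus spell-checking against a standard dictionary.
SLANG = {"lol", "omg", "gr8", "u", "thx", "pls"}

def informality_features(text):
    """Count a few informal surface phenomena in a piece of UGC."""
    tokens = re.findall(r"[\w']+", text.lower())
    return {
        "emoticons": len(EMOTICON.findall(text)),
        "char_repetition": len(REPEATED_CHARS.findall(text)),
        "slang_tokens": sum(t in SLANG for t in tokens),
        "all_caps_words": sum(w.isupper() and len(w) > 1 for w in text.split()),
    }

print(informality_features("OMG this phone is sooooo gr8 :D thx u"))
```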


AlignRank: A Graph-Based Word Alignment Approach Using Evidence Propagation

Xiaofeng Wu

We propose a graph-based method for one-to-one word alignment. Instead of assuming independence among word alignment decisions, as in the IBM1 model, we observe that word alignment decisions are dependent on each other and that such dependencies should be modelled to improve performance. Specifically, we propose a graph representation which, like the IBM1 model, captures co-occurrence information for word alignment, but which can also model dependencies among all word alignment decisions in a corpus. We propose a PageRank-style evidence propagation framework to exploit dependencies in the graph for better performance. Experimental results demonstrate both good alignment quality and high run-time efficiency. Our results show that, with an appropriate design of the graph structure, our approach can surpass IBM1 by 1.3% on AER and by 1.8% on BLEU (p<0.05).
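
The sketch below illustrates the general mechanics of PageRank-style evidence propagation over a graph of candidate alignment links; the graph construction and initial evidence used here are deliberately simplistic and are not the paper's actual design:

```python
from collections import defaultdict
from itertools import product

# Toy sentence-aligned corpus (source, target).
corpus = [
    ("the cat", "le chat"),
    ("the dog", "le chien"),
    ("a cat", "un chat"),
]

# Candidate alignment links = all (source word, target word) pairs that
# co-occur in some sentence pair; initial evidence = co-occurrence count.
evidence = defaultdict(float)
edges = defaultdict(set)
for src, tgt in corpus:
    links = list(product(src.split(), tgt.split()))
    for l in links:
        evidence[l] += 1.0
    # Links seen in the same sentence pair are connected: they can pass
    # evidence to each other during propagation.
    for a, b in product(links, links):
        if a != b:
            edges[a].add(b)

# PageRank-style propagation: repeatedly mix each link's own evidence with
# the evidence received from its neighbours.
scores = dict(evidence)
damping = 0.85
for _ in range(20):
    new = {}
    for node in scores:
        received = sum(scores[nb] / len(edges[nb]) for nb in edges[node])
        new[node] = (1 - damping) * evidence[node] + damping * received
    scores = new

for link, s in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(link, round(s, 3))
```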


Combining EBMT, SMT, TM and IR Technologies for Quality and Scale

Sandipan Dandapat

In this paper we present a hybrid statistical machine translation (SMT)-example-based MT (EBMT) system that shows significant improvement over both SMT and EBMT baseline systems. First we present a runtime EBMT system using a subsentential translation memory (TM). The EBMT system is then combined with an SMT system for effective hybridization of the pair of systems. The hybrid system shows significant improvement in translation quality (0.82 and 2.75 absolute BLEU points) over the baseline SMT system for two different language pairs, English–Turkish (En–Tr) and English–French (En–Fr). However, the EBMT approach suffers from significant time complexity issues at runtime. We explore two methods to make the system scalable at runtime. First, we use a heuristic-based approach. Second, we use an IR-based indexing technique to speed up the time-consuming matching procedure of the EBMT system. The index-based matching procedure substantially improves run-time speed without affecting translation quality.
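
A minimal sketch of how an inverted index over the translation-memory source sides can narrow the candidate set before fuzzy matching; the TM entries and the Dice-style similarity are illustrative, not the system's actual matcher:

```python
from collections import defaultdict

# Toy translation memory: source segment -> target segment.
tm = {
    "click the start button": "cliquez sur le bouton démarrer",
    "click the stop button": "cliquez sur le bouton arrêter",
    "restart the computer": "redémarrez l'ordinateur",
}

# Build an inverted index from words to the TM entries containing them,
# so that matching only needs to score entries sharing at least one word.
index = defaultdict(set)
for src in tm:
    for w in src.split():
        index[w].add(src)

def fuzzy_match(sentence, top_k=1):
    """Retrieve candidate TM entries via the index, then rank them by a
    simple word-overlap (Dice) similarity."""
    words = set(sentence.split())
    candidates = set().union(*(index[w] for w in words if w in index))
    def dice(src):
        s = set(src.split())
        return 2 * len(words & s) / (len(words) + len(s))
    return sorted(((dice(c), c, tm[c]) for c in candidates), reverse=True)[:top_k]

print(fuzzy_match("click the reset button"))
```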


Creating a Data Collection for Evaluating Rich Speech Retrieval (LREC dry-run)

Maria Eskevich

We describe the development of a test collection for the investigation of speech retrieval beyond the identification of relevant content. This collection focuses on satisfying user information needs for queries associated with specific types of speech acts. The collection is based on an archive of Internet video from the video-sharing platform blip.tv, and was provided by the MediaEval benchmarking initiative. A crowdsourcing approach was used to identify segments in the video data which contain speech acts, to create a description of the video containing the act, and to generate search queries designed to re-find this speech act. We describe and reflect on our experiences with crowdsourcing this test collection using the Amazon Mechanical Turk platform. We highlight the challenges of constructing this dataset, including the selection of the data source, the design of the crowdsourcing task, and the specification of queries and relevant items.




Dublin City University   Last update: 3rd April 2012