There is clear commercial demand for high-quality Machine Translation (MT). For localization purposes, a software company such as Symantec needs to deliver helpful content to its customers in their native languages. However, MT evaluation via automatic metrics is only possible when a reference translation is available. In the more realistic setting where no such reference is available, reliable techniques for estimating the quality of translation system output are needed.
As more and more customers move away from traditional call centres and corporate websites in favour of self-service via dedicated discussion forums, there is a growing need for machine translation of User-Generated Content (UGC). Because UGC is an unedited mix of writing styles containing spelling mistakes, abbreviations and non-standard punctuation, it poses a particular challenge for Natural Language Processing (NLP) tools that have been trained on well-formed text.
The aim of the ConfidentMT project is to develop Confidence Estimation (CE), also known as Quality Estimation (QE), methods to measure the reliability of MT output in the context of UGC about Symantec products. The CE methods will be applied across a range of MT systems (such as Rule-Based, Example-Based, Phrase-Based SMT and Syntax-Enhanced SMT) and the results will be used to inform the optimal combination of MT systems.
We have created two data sets as part of the ConfidentMT project:
- SymForum: an English/French data set for quality estimation of machine translated Norton forum text
- Foreebank: an English/French data set for evaluating syntactic parser accuracy on Norton forum text and for measuring the effect of grammatical noise on parsing
Note: to access either data set on the Symantec website, click on the + icon next to the title of the paper describing it.
The SymForum data set is available for downloading here.
It is described in detail in the following publication, which should be cited if you use the data set in your research:
Rasoul Kaljahi, Jennifer Foster and Johann Roturier, 2014. Syntax and Semantics in Quality Estimation of Machine Translation. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), Doha, Qatar. Paper.
The Foreebank data set is split into two components: the DCU side, which contains phrase structure trees without their leaves, is available for downloading here; the Symantec side, which contains the sentences themselves, is available for downloading here. A script for combining the yields with their trees is contained in the DCU side of the data set.
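The combining step can be pictured with a minimal sketch. This is not the script shipped with the DCU side, and the file format here is an assumption made purely for illustration: each tree is taken to be a single-line bracketed string in which a preterminal stripped of its leaf appears as "(TAG)", and each sentence is taken to be a list of tokens in the same order. The function name fill_leaves and the example tree are hypothetical; consult the actual download for the real format.

```python
import re
from typing import List

# ASSUMPTION: a preterminal whose leaf was removed is written "(TAG)".
# The real Foreebank format may differ; this is illustrative only.
EMPTY_PRETERMINAL = re.compile(r"\(([^\s()]+)\)")


def fill_leaves(tree: str, tokens: List[str]) -> str:
    """Re-attach tokens, left to right, to the empty preterminals of a tree."""
    slots = EMPTY_PRETERMINAL.findall(tree)
    if len(slots) != len(tokens):
        raise ValueError(
            f"tree has {len(slots)} leaf slots but sentence has {len(tokens)} tokens"
        )
    remaining = iter(tokens)
    # A callable replacement avoids any escape processing of the token text.
    return EMPTY_PRETERMINAL.sub(
        lambda m: f"({m.group(1)} {next(remaining)})", tree
    )


if __name__ == "__main__":
    leafless = "(S (NP (PRP)) (VP (VBP) (NP (NNP))))"  # hypothetical example
    print(fill_leaves(leafless, ["I", "love", "Norton"]))
```

The length check matters in practice: if the tokenisation of the sentence file does not match the number of leaf slots in the tree, silently pairing them would yield a misaligned tree, so the sketch raises an error instead.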
The Foreebank data set is described in detail in the following publication, which should be cited if you use it in your research:
Rasoul Kaljahi, Jennifer Foster, Johann Roturier, Corentin Ribeyre, Teresa Lynn and Joseph Le Roux, 2015. Foreebank: Syntactic Analysis of Customer Support Forums. In EMNLP. Paper. Poster.