NCLT Seminar Series 2007/2008

The NCLT seminar series takes place on Wednesdays from 4-5 pm in Room L2.21 (School of Computing).
The seminar will comprise a mixture of research talks and, this year, a round of tutorials based on chapters from the new (draft) edition of the Jurafsky and Martin book, Speech and Language Processing, which can be found here.
View current chapter allocation
The schedule of presenters for the 2007/2008 series (Semester 1) will be added below as they are confirmed:
In presenting J&M's Chapter 7 on phonetics, we will quickly run through articulatory phonetics and phonology, only pausing where necessary (as dictated by those in attendance). I intend to spend more time on acoustic phonetics as this is almost a prerequisite for appreciating the technical aspects of both ASR (automatic speech recognition) and waveform generation in speech synthesis. In summary, we'll look more closely at understanding time and frequency representations of the speech signal and how digital signal processing plays its part. I plan to conduct the session in a fairly informal and interactive fashion. The primary presentation source will be the pdf of the chapter - so please bring a printout along - but I will have some supporting material.
Acyclic finite state automata are widely used in Natural Language Processing to represent and store large data sets such as dictionaries. Our work studies the internal structure of acyclic automata; more precisely, we are interested in finding structures inside a finite state automaton, which we call sub-automata. We propose an O(n³) algorithm to compute all sub-automata of a given automaton. This study can be used in applications whose aim is to decompose a very large FSA into smaller ones, to discover frequently occurring data, and to reduce memory consumption. The second part of our work is devoted to applying our algorithm to the compression and indexing of automata that represent electronic dictionaries. We propose a compression algorithm that reduces the memory required to store the automata while preserving efficient access to the data. The main propositions are, on the one hand, the application of the directed acyclic word graph, originally designed for indexing text, to index the sub-automata, and, on the other hand, a heuristic to select the most promising substructures to factorise. The best candidates for factorisation are those which increase memory storage efficiency and reduce the size of the initial automaton.
NLP research often demands resources not available on a single desktop PC. Training statistical models can be very memory-intensive, corpus processing very CPU-intensive, and some tasks require large amounts of temporary disk space. As many users share the same machines for their experiments, there have been resource conflicts in the past (for example "disk full"). To address these needs and problems, five new machines have been bought and organised into a cluster over the last six months. The resources of the cluster are managed centrally and allocated exclusively for experiments. In this talk I will give an overview of the cluster, show how to use it, and outline the plan for integrating the old machines into the cluster and adding more new machines.
In my presentation I will focus on the first part of (Jurafsky & Martin, 2007: Chapter 6): Hidden Markov Models. HMMs are probabilistic sequence classifiers which compute a probability distribution over possible label sequences and are applied to a wide range of NLP tasks such as speech recognition, tagging, chunking, word sense disambiguation, and so forth. I will talk about different aspects of their application (evaluation, decoding, training), and introduce the Forward, Viterbi and Forward-Backward algorithms.
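The Viterbi decoding step mentioned in the abstract can be sketched in a few lines. The following is a minimal illustration, not the chapter's own code; the two-tag toy model (N vs. V) and all probabilities are invented for the example.

```python
# Minimal Viterbi decoder for an HMM: find the most probable hidden-state
# sequence for an observation sequence by dynamic programming.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for `obs`."""
    # V[t][s] = (best probability of a path ending in state s at time t, backpointer)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = (prob, prev)
    # Follow backpointers from the best final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

# Hypothetical two-tag example: N(oun) vs. V(erb).
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"they": 0.5, "can": 0.1, "fish": 0.4},
          "V": {"they": 0.1, "can": 0.6, "fish": 0.3}}
tags = viterbi(["they", "can", "fish"], ["N", "V"], start_p, trans_p, emit_p)
```

The Forward algorithm has the same table structure but sums over predecessor states instead of maximising; Forward-Backward adds a second pass for training.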
This talk gives a short introduction to Maximum Entropy models and their use for classification (e.g. document classification) and sequence labelling (e.g. POS tagging). Maximum entropy models belong to the family of log-linear, or exponential, classifiers. They solve multi-class classification problems using multinomial logistic regression. MaxEnt is based on the idea of building a probabilistic model which satisfies constraints learned from the training data but otherwise makes no additional assumptions. That is, a MaxEnt model is the most uniform distribution which is consistent with the constraints. MaxEnt models in themselves are classifiers. Maximum Entropy Markov Models (MEMMs) can be used for sequence labelling. They work by using the Viterbi algorithm to find the best sequence of labels given the conditional probability distributions for each element in the sequence. Their advantage over HMMs is the ability to condition on arbitrary features. For example, for POS tagging, features such as suffixes, capitalisation or surrounding punctuation can be used, which are difficult to encode in HMMs.
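The classification step of a trained MaxEnt model can be sketched as follows: exponentiate a weighted sum of the active features for each class, then normalise. This is only an illustration of the log-linear form; the feature names and weights below are invented, and real models learn weights from training data.

```python
import math

def maxent_prob(features, weights, classes):
    """P(class | features) under a log-linear model.

    `weights` maps (feature, class) pairs to real-valued weights;
    features absent from the map contribute weight 0."""
    scores = {
        c: math.exp(sum(weights.get((f, c), 0.0) for f in features))
        for c in classes
    }
    z = sum(scores.values())  # normalisation constant Z
    return {c: s / z for c, s in scores.items()}

# Hypothetical POS-tagging features for the word "running":
# only the suffix feature carries weight for the VBG class.
weights = {("suffix=ing", "VBG"): 1.0}
probs = maxent_prob(["suffix=ing", "capitalised=no"], weights, ["VBG", "NN"])
```

Note how arbitrary, overlapping features (suffix, capitalisation) simply become entries in the weight map, which is the flexibility the abstract contrasts with HMM emissions.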
We use existing tools to automatically build two parallel treebanks from existing parallel corpora. We then show that combining the data extracted from both the treebanks and the corpora into a single translation model can improve the translation quality in a baseline phrase-based statistical machine translation system.
This paper is a contribution to the ongoing discussion on treebank annotation schemes and their impact on PCFG parsing results. We provide a thorough comparison of two German treebanks: the TIGER treebank and the TüBa-D/Z. We use simple statistics on sentence length and vocabulary size, and more refined methods such as perplexity and its correlation with PCFG parsing results, as well as a Principal Components Analysis. Finally we present a qualitative evaluation of a set of 100 sentences from the TüBa-D/Z, manually annotated in the TIGER as well as in the TüBa-D/Z annotation scheme, and show that even the existence of a parallel subcorpus does not support a straightforward and easy comparison of both annotation schemes.
I will present an overview of the chapter "Parsing with Context-Free Grammars" from the new edition of Jurafsky and Martin's "Speech and Language Processing". The chapter covers full parsing with CFGs -- including CKY, Earley and agenda-based (chart) parsing -- as well as partial parsing (in particular machine learning-based base-phrase chunking). Since much of the material should be quite familiar to many people I will put a particular focus on the new additions to the chapter, and will provide some additional examples not covered in the book.
I will present an overview of statistical parsing, based on a draft Chapter 14 of the new edition of Jurafsky and Martin's ``Speech and Language Processing''. The chapter covers the following topics: PCFGs, using PCFGs for syntactic disambiguation and language modelling, probabilistic CKY, obtaining rule probabilities, PCFG limitations, lexicalised history-based generative parsing, discriminative parsing, parser evaluation and the human parsing mechanism. I will cover all but the last two topics and also include a very brief overview of the dependency parsing field.
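The probabilistic CKY algorithm listed among the chapter's topics can be sketched compactly for a PCFG in Chomsky Normal Form. The toy grammar below is invented for illustration; a real parser would also keep backpointers to recover the best tree rather than just its probability.

```python
def pcky(words, lexicon, rules):
    """Probability of the best derivation of `words` from each nonterminal.

    lexicon: {(A, word): prob} for unary A -> word rules;
    rules:   {(A, B, C): prob} for binary A -> B C rules."""
    n = len(words)
    # table[i][j] maps nonterminal -> best probability over span words[i:j]
    table = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for (A, word), p in lexicon.items():
            if word == w:
                table[i][i + 1][A] = max(table[i][i + 1].get(A, 0.0), p)
    for span in range(2, n + 1):          # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):     # all split points
                for (A, B, C), p in rules.items():
                    if B in table[i][k] and C in table[k][j]:
                        cand = p * table[i][k][B] * table[k][j][C]
                        if cand > table[i][j].get(A, 0.0):
                            table[i][j][A] = cand
    return table[0][n]

# Hypothetical two-word grammar.
lexicon = {("NP", "she"): 0.5, ("VP", "eats"): 0.4}
rules = {("S", "NP", "VP"): 1.0}
chart = pcky(["she", "eats"], lexicon, rules)
```

The max over split points is what distinguishes this Viterbi-style variant from the inside algorithm, which would sum instead.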
In this talk, I will present an overview of the Jurafsky & Martin chapter on Machine Translation (Chapt. 25, 2007). I will briefly outline the history of MT as a field of research and where the different approaches fit in. However, I will focus most of the talk on the basics of word- and phrase-based statistical MT, which is the dominant approach to MT both in the research arena and in Jurafsky & Martin's chapter. I aim to outline how the IBM word-alignment models (implemented in Giza++) work, how phrase-pair induction currently works and the basics of decoding for SMT. However, it's a long chapter and each of these topics involves a lot of detail so I'm not sure how much material I'll get through -- Yanjun will also present some of this chapter later on in the seminar series.
In current Statistical Machine Translation (SMT) systems, the alignment model is trained in a stage prior to the translation model. Consequently, alignment model parameters are not tuned as a function of the translation task, but only indirectly. The speaker will present a framework for discriminative training of alignment models with automated translation metrics as the maximisation criterion. Thus, no link labels at the word level are needed. First, the n-gram-based machine translation system will be introduced. Then the difficulties of word alignment evaluation and its correlation with machine translation quality will be discussed. After this, the alignment system will be described. Finally, the speaker will present the minimum-translation-error alignment training method on small corpora, and its extension to large corpora (the alignment model coefficients are tuned on a small part of the corpus and used to align the whole corpus).
This talk will focus on generative word alignment models in Statistical Machine Translation (SMT). I will first give a formal definition of word alignment and introduce the evaluation methods and mainstream approaches for this task. The talk will then cover HMM word alignment models, including the first-order HMM word alignment model (Vogel, Ney and Tillmann 96) and the zero-order word alignment models (IBM models 1 and 2). I will also go through how to use the EM algorithm for unsupervised parameter estimation (cf. Mary's talk). Fertility-based models, including IBM models 3, 4 and 5, will follow. I will introduce heuristics (hill-climbing) for the approximate parameter estimation of these more complicated models. The advantages and limitations of each of these models will also be illustrated during the talk. Finally, I will give a brief introduction to the implementation of these models, GIZA++, and its limitations in use. An overview of some recent attempts to improve generative models will end this talk.
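The EM training of the simplest of these models, IBM Model 1, can be sketched as follows. The two-sentence corpus is invented for illustration; real training (e.g. in GIZA++) also models alignment to a NULL word and builds the higher models on top of these estimates.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate word translation probabilities t(f|e) with EM.

    pairs: list of (foreign_words, english_words) sentence pairs."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Uniform initialisation over co-occurring word pairs.
    t = {(f, e): 1.0 / len(f_vocab) for fs, es in pairs for f in fs for e in es}
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for fs, es in pairs:
            for f in fs:
                # E-step: distribute f's count over the English words
                # in proportion to the current t(f|e).
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts.
        t = {pair: count[pair] / total[pair[1]] for pair in count}
    return t

# Tiny hypothetical German-English corpus.
pairs = [(["das", "Haus"], ["the", "house"]),
         (["das", "Buch"], ["the", "book"])]
t = ibm_model1(pairs)
```

Because "das" co-occurs with "the" in both sentence pairs while "Haus" and "Buch" each occur only once, EM concentrates probability mass on t(das|the) over the iterations, with no word-level links ever supplied.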
Last update: 1st October 2010