Probabilistic, Information-Theoretic Models for Etymological Alignment
Hannes Wettig
Abstract
This thesis starts out by reviewing Bayesian reasoning and Bayesian
network models. We present results related to discriminative learning of
Bayesian network parameters. Along the way, we explicitly identify a
number of problems arising in Bayesian model class selection. This leads
us to information theory and, more specifically, the minimum description
length (MDL) principle. We look at its theoretical foundations and
practical implications. The MDL approach provides elegant solutions to
the problem of model class selection and enables us to objectively
compare any set of models, regardless of their parametric
structure. Finally, we apply these methods to problems arising in
computational etymology. We develop model families for the task of
sound-by-sound alignment across kindred languages. Fed with linguistic
data in the form of cognate sets, our methods provide information about
the correspondence of sounds, as well as the history and ancestral
structure of a language family. As a running example we take the family
of Uralic languages.