Information-theoretic modeling of etymological sound change
Hannes Wettig, Javad Nouri, Kirill Reshetnikov and Roman Yangarber
Extended Abstract
We present work on the UraLink Project, funded jointly by the Academy of Finland and by the Russian Fund for the Humanities, which aims to develop computational methods for analyzing etymological data on the Uralic language family. The Project is conducted jointly with several partners, who provide us with Uralic etymological databases.
One striking feature of the databases is a large amount of (a) uncertainty, marked explicitly inside entries of each database, and (b) conflicting statements or theories across the databases. This alone points to an urgent need to quantify the uncertainty in etymological data. Uncertainty is a natural and inescapable aspect
of the very complex task of establishing genetic relationships
among languages by positing cognate sets. However,
having rigorous methods for analyzing and quantifying
the uncertainty can provide essential insights into
the quality of the databases. Among other results, we
seek measures of internal consistency of the datasets that
have been compiled by linguists.
We approach the problem as follows. Our starting
point is two databases—compilations of cognate sets,
done at different times by different scholars. We treat
each data set in turn and try to
1. establish a complete alignment of all lexical items
contained in the data set—an alignment that is globally
optimal, and
2. simultaneously discover the rules of correspondence
that govern the alignments.
There is a duality to the problem posed in this way: on the one hand, if we had a complete and accurate alignment
of the entire data, we could simply read off the rules of
correspondence from the aligned data. On the other hand,
if we had a complete set of rules of correspondence, we
could construct the alignment, simply by following the
rules. We have neither. This suggests an iterative method:
we begin with some random alignment, and proceed to
discover incrementally better rules from the aligned data,
while at the same time obtaining an improved alignment.
Our methods are automatic, using the non-aligned data
only, with no additional parameters and no supervision.
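To make the iteration concrete, the following sketch (our own simplified rendering in Python, not the exact procedure of the models described below) alternates between estimating symbol-pair costs from the current alignments and re-aligning the word pairs under those costs; the alignment routine align_fn, the random initial alignments, and the convergence threshold are placeholders.

```python
import math
from collections import Counter

def reestimate(alignments, smoothing=1.0):
    """Turn the current aligned symbol pairs into per-pair costs in bits:
    frequent correspondences become cheap, rare ones expensive."""
    counts = Counter(pair for alignment in alignments for pair in alignment)
    total = sum(counts.values()) + smoothing * len(counts)
    return {pair: -math.log2((c + smoothing) / total) for pair, c in counts.items()}

def iterate(word_pairs, initial_alignments, align_fn, max_rounds=100, tol=1e-3):
    """Alternate re-estimation of pair costs and re-alignment of the word pairs.

    word_pairs         -- list of (source_word, target_word) cognate pairs
    initial_alignments -- one list of aligned symbol pairs per word pair,
                          e.g. produced by a random initial alignment
    align_fn(s, t, costs) -- returns the cheapest alignment of s and t under costs
    """
    alignments, prev_total = initial_alignments, math.inf
    for _ in range(max_rounds):
        costs = reestimate(alignments)
        alignments = [align_fn(s, t, costs) for (s, t) in word_pairs]
        # Total code length of the data under the current costs; pairs never
        # seen before get an arbitrary fixed penalty in this toy version.
        total = sum(costs.get(pair, 20.0) for a in alignments for pair in a)
        if prev_total - total < tol:      # stop once the code length stops improving
            break
        prev_total = total
    return alignments, costs
```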
We make the assumption that the "true" rules of regular sound correspondence that underlie the relationships within the family are violated very infrequently; however, we admit in principle the possibility that they may be violated. This suggests a probabilistic (rather than deterministic) approach, since it allows us to model in a continuous fashion the probability that a given rule holds. It also suggests that the information-theoretic Minimum
Description Length Principle (MDL) is an appropriate
way to model our data: it seems natural to try to encode
as much of the information in the dataset as possible
via the rules, and then code the (hopefully infrequent) exceptions
explicitly. MDL establishes a link between the quality of our "theory" and the code length of the data: if we can find very crisp rules that describe the sound correspondences among the languages, we will be able to encode the data compactly (short code length) by writing down the rules. If the rules are too weak—not sufficiently descriptive—we will need to spend a lot of code length on writing down the exceptions. If the rules are overly strong—with conditions that are too rigid and apply in very few instances—we will need to spend a lot of code length on writing down the rules themselves. MDL finds the optimal balance between the two undesirable extremes.
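As a toy illustration of this trade-off (a sketch under our own assumptions, not the actual coding scheme used by the models), suppose the "rules" are simply a set of permitted symbol correspondences, each costing a fixed number of bits to write down, while every aligned pair not covered by a rule must be spelled out as an exception:

```python
import math

def two_part_code_length(aligned_pairs, rules,
                         bits_per_rule=10.0, bits_per_exception=12.0):
    """Toy two-part MDL cost: L(rules) + L(data | rules).

    aligned_pairs -- list of (source_symbol, target_symbol) tokens in the data
    rules         -- set of (source_symbol, target_symbol) correspondences that
                     we choose to write down as rules
    The per-item bit prices are arbitrary placeholders, not the model's real code.
    """
    model_cost = bits_per_rule * len(rules)
    # A pair covered by a rule only needs to say which rule fired; an uncovered
    # pair (an exception) must be spelled out in full.
    bits_per_covered = math.log2(len(rules)) if len(rules) > 1 else 1.0
    data_cost = sum(bits_per_covered if p in rules else bits_per_exception
                    for p in aligned_pairs)
    return model_cost + data_cost
```

With too few rules the exception term dominates; with too many, the cost of the rules themselves dominates; the two-part code length is minimized somewhere in between.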
Our presentation is structured as follows. In the first part, we introduce a group of models, starting with a baseline that is simplified both to facilitate the presentation of the key ideas and to provide a worst-case performance benchmark. In the second part, we introduce several means of evaluating the performance of the models, discuss what they actually help us prove and achieve, and indicate directions for future enhancements.
The Baseline Model makes the following initial simplifications:
- Pairwise alignment: we align words from only two languages at a time. We provisionally call them the "source" and "target" languages, although the Baseline Model is actually symmetric. The more advanced models are directional.
- We align symbols in a 1-1 fashion: one source symbol may correspond only to one target symbol, or to the empty symbol, to model insertions and deletions.
- We ignore the context of the aligned symbol pair.
- Symbols are treated as atoms, not analyzed in terms of their distinctive features.
We assume that sounds are identified with alphabet symbols; in the Uralic data, it is fairly simple to transform the written form into the phonetic form.
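For concreteness, a 1-1 alignment with an empty symbol can be computed by standard dynamic programming once per-pair costs are available; the sketch below (our own illustration, with a placeholder cost function and a placeholder EMPTY marker, not necessarily the algorithm used in the models) finds the cheapest alignment of two words in this sense.

```python
import math

EMPTY = "."   # placeholder marker for the empty symbol (insertions/deletions)

def align(source, target, cost):
    """Find the cheapest 1-1 alignment of two words.

    source, target -- sequences of symbols
    cost(a, b)     -- code length, in bits, of aligning symbol a with symbol b,
                      where either a or b (but not both) may be EMPTY
    """
    n, m = len(source), len(target)
    INF = math.inf
    best = [[INF] * (m + 1) for _ in range(n + 1)]   # best[i][j]: cost of aligning prefixes
    back = [[None] * (m + 1) for _ in range(n + 1)]  # back-pointers for the trace-back
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            here = best[i][j]
            if here == INF:
                continue
            if i < n and j < m:   # source[i] aligned with target[j]
                step = here + cost(source[i], target[j])
                if step < best[i + 1][j + 1]:
                    best[i + 1][j + 1] = step
                    back[i + 1][j + 1] = (i, j, (source[i], target[j]))
            if i < n:             # source[i] aligned with the empty symbol (deletion)
                step = here + cost(source[i], EMPTY)
                if step < best[i + 1][j]:
                    best[i + 1][j] = step
                    back[i + 1][j] = (i, j, (source[i], EMPTY))
            if j < m:             # empty symbol aligned with target[j] (insertion)
                step = here + cost(EMPTY, target[j])
                if step < best[i][j + 1]:
                    best[i][j + 1] = step
                    back[i][j + 1] = (i, j, (EMPTY, target[j]))
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):       # trace back the chosen symbol pairs
        i, j, pair = back[i][j]
        pairs.append(pair)
    return best[n][m], list(reversed(pairs))
```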
After introducing the Baseline Model, we present the iterative learning algorithm, which allows us to obtain a complete alignment of the observed data. The "rules" obtained by this model are very simple: they are embodied in a probability distribution over all possible symbol-pair alignments. To favor a small number of wide-coverage rules, the algorithm reduces the entropy of this distribution as far as possible.
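As a small illustration of this objective (a simplification of the idea rather than the model's exact cost function), the entropy of the empirical symbol-pair distribution measures how concentrated the correspondences are, and is essentially the per-pair code length:

```python
import math
from collections import Counter

def pair_distribution_entropy(alignments):
    """Entropy, in bits per aligned symbol pair, of the empirical pair distribution:
    the more concentrated the correspondences, the lower this per-pair code length."""
    counts = Counter(pair for alignment in alignments for pair in alignment)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```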
We then relax the simplifying assumptions of the Baseline Model in turn. First, we extend the model to align multiple symbols. Then, we align multiple languages simultaneously. Finally, we model each symbol as a vector of its phonetic features, and take information from the context of the symbol into account to predict the symbol more accurately, which cuts down the code length. Each extension yields an improvement in code length over the previous model.
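The last extension can be sketched roughly as follows (our own simplified rendering, with a made-up feature inventory and a placeholder context function): each symbol is coded one phonetic feature at a time, and each feature is predicted from a context key, so that sharper predictions translate directly into shorter codes.

```python
import math
from collections import Counter, defaultdict

# Hypothetical feature decomposition for a handful of symbols; a real model
# would use a full phonetic feature inventory.
FEATURES = {
    "p": {"voiced": 0, "place": "labial", "manner": "stop"},
    "b": {"voiced": 1, "place": "labial", "manner": "stop"},
    "t": {"voiced": 0, "place": "dental", "manner": "stop"},
    "d": {"voiced": 1, "place": "dental", "manner": "stop"},
}

def feature_code_length(symbols, context_fn):
    """Code each symbol feature by feature; the cost of a feature value is
    -log2 of its adaptive probability given a context key.

    context_fn(position, feature_name) must return a hashable context key,
    e.g. the previously coded symbol, or the corresponding source-side feature.
    """
    counts = defaultdict(Counter)        # context key -> value counts seen so far
    total_bits = 0.0
    for pos, sym in enumerate(symbols):
        for feat, value in FEATURES[sym].items():
            key = (feat, context_fn(pos, feat))
            seen = counts[key]
            alphabet = {f[feat] for f in FEATURES.values()}   # possible values
            p = (seen[value] + 1) / (sum(seen.values()) + len(alphabet))
            total_bits += -math.log2(p)
            seen[value] += 1              # update the adaptive counts
    return total_bits
```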
We next address the question: how do we measure what the models tell us about the quality of the data? We first verify that the models behave sensibly by showing that they outperform standard data compressors; this follows the principle that we can claim to have discovered the regularities in the data if and only if we can compress it effectively. We then show how the alignment costs (code lengths) can be turned into good measures of language distance. By using algorithms for building genealogical trees, we show that our code lengths yield family trees that strongly resemble trees constructed by linguists.
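A minimal sketch of that last step, assuming we already have a pairwise compression cost for every language pair; the normalization (NCD-style) and the average-linkage clustering below are illustrative choices on our part, not necessarily those used in the presented work.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform

def distance_matrix(languages, cost):
    """Symmetric language-distance matrix from pairwise compression costs.

    cost(a, b) -- code length of compressing the aligned data of languages a and b
                  together; cost(a, a) is the cost of compressing a on its own.
    """
    n = len(languages)
    d = np.zeros((n, n))
    for i, a in enumerate(languages):
        for j, b in enumerate(languages):
            if i < j:
                ncd = (cost(a, b) - min(cost(a, a), cost(b, b))) / max(cost(a, a), cost(b, b))
                d[i, j] = d[j, i] = ncd
    return d

def build_tree(languages, cost):
    """Cluster the languages into a genealogical tree (average linkage)."""
    condensed = squareform(distance_matrix(languages, cost))
    return to_tree(linkage(condensed, method="average"))
```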
We examine closely some of the complex rules of correspondence among different languages that the algorithm discovers from the data, confirm that these rules agree with rules posited by linguists, and introduce several methods for demonstrating that they are objectively sound.
The main contribution of this work is a set of methods
that allow us to obtain quantitative measures of the goodness
of etymological data sets, and to compare different
sets by using these metrics. The methods are objective
in the sense that they rely only on the data provided—the
posited relationships—and not on any additional assumptions.
We do not aim to construct etymological databases
from scratch, or from raw linguistic data. Rather, we use a dataset to study what rules it already inherently contains, what phylogenies it inherently induces, and how consistent it is. Although the methods are explored using Uralic
data, they can be easily applied to many other language
families.
We discuss the implications of these methods and how
they can support etymological research. One way to view our approach is as an attempt to combine the best of both worlds: we allow humans to do what they do best, positing semantic links and proposing hypotheses, and allow machines to do what they do best, checking the consistency of these hypotheses over large data sets.