One striking feature of the databases is the large amount of (a) uncertainty, marked explicitly within individual entries, and (b) conflicting statements or theories across the databases. This alone indicates a pressing need to quantify uncertainty in etymological data. Uncertainty is a natural and inescapable aspect of the very complex task of establishing genetic relationships among languages by positing cognate sets. However, rigorous methods for analyzing and quantifying this uncertainty can provide essential insights into the quality of the databases. Among other results, we seek measures of the internal consistency of the datasets that linguists have compiled.

We approach the problem as follows. Our starting point is two databases—compilations of cognate sets, done at different times by different scholars. We treat each data set in turn and try to

1. establish a complete alignment of all lexical items contained in the data set—an alignment that is globally optimal, and

2. simultaneously discover the rules of correspondence that govern the alignments.

There is a duality to the problem posed in this way: on one hand, if we had a complete and accurate alignment of the entire data, we could simply read off the rules of correspondence from the aligned data. On the other hand, if we had a complete set of rules of correspondence, we could construct the alignment simply by following the rules. We have neither. This suggests an iterative method: we begin with some random alignment, and proceed to discover incrementally better rules from the aligned data, while at the same time obtaining an improved alignment. Our methods are automatic, using only the non-aligned data, with no additional parameters and no supervision.

We assume that the “true” rules of regular sound correspondence that underlie the relationships within the family are violated very infrequently; however, we admit in principle the possibility that they may be violated. This suggests a probabilistic (rather than deterministic) approach, since that allows us to model in a continuous fashion the probability that a given rule holds. It also suggests that the information-theoretic Minimum Description Length (MDL) principle is an appropriate way to model our data: it seems natural to encode as much of the information in the dataset as possible via the rules, and then to code the (hopefully infrequent) exceptions explicitly. MDL establishes a link between the quality of our “theory” and the code length of the data: if we can find very crisp rules that describe the sound correspondences among the languages, we will be able to encode the data compactly (with a short code length) by writing down the rules. If the rules are too weak—not sufficiently descriptive—we will need to spend a lot of code length on writing down the exceptions. If the rules are overly strong—with conditions that are too rigid and apply in very few instances—we will need to spend a lot of code length on writing down the rules themselves.
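The iterative scheme can be sketched concretely. The following is a minimal illustration of the idea, not the actual system: it alternates between reading a symbol-pair distribution off the current alignment and re-aligning each word pair by dynamic programming, where the cost of an aligned pair is its code length (negative log-probability). The function names, the smoothing scheme, and the toy word pairs are our own assumptions for this sketch.

```python
import math
from collections import Counter

EMPTY = "-"  # empty symbol, modelling insertions and deletions

def align(src, tgt, cost):
    """Globally optimal 1-1 alignment of two words by dynamic
    programming; cost(a, b) is the code length of an aligned pair."""
    n, m = len(src), len(tgt)
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == math.inf:
                continue
            steps = []
            if i < n and j < m:                       # substitution
                steps.append((i + 1, j + 1, (src[i], tgt[j])))
            if i < n:                                 # deletion
                steps.append((i + 1, j, (src[i], EMPTY)))
            if j < m:                                 # insertion
                steps.append((i, j + 1, (EMPTY, tgt[j])))
            for ni, nj, pair in steps:
                c = best[i][j] + cost(*pair)
                if c < best[ni][nj]:
                    best[ni][nj] = c
                    back[ni][nj] = (i, j, pair)
    pairs, (i, j) = [], (n, m)
    while (i, j) != (0, 0):                           # backtrack
        i, j, pair = back[i][j]
        pairs.append(pair)
    return best[n][m], pairs[::-1]

def learn(word_pairs, rounds=10):
    """Alternate between reading rules off the alignment and
    re-aligning under those rules, tracking total code length."""
    counts = Counter()
    for s, t in word_pairs:                # arbitrary initial
        s, t = list(s), list(t)            # alignment: pad the
        s += [EMPTY] * (len(t) - len(s))   # shorter word
        t += [EMPTY] * (len(s) - len(t))
        counts.update(zip(s, t))
    code_len = math.inf
    for _ in range(rounds):
        total, kinds = sum(counts.values()), len(counts) + 1
        def cost(a, b):                    # smoothed -log2 probability
            return -math.log2((counts[(a, b)] + 0.5)
                              / (total + 0.5 * kinds))
        new_counts, code_len = Counter(), 0.0
        for s, t in word_pairs:
            c, pairs = align(s, t, cost)
            code_len += c
            new_counts.update(pairs)
        counts = new_counts
    return code_len, counts

# toy data with a hypothetical k ~ h correspondence
pairs = [("kala", "hala"), ("kota", "hota"), ("kivi", "hivi")]
code_len, rules = learn(pairs)
```

On this toy input the loop converges to an alignment in which the k ~ h pair accounts for three counts, i.e. the “rule” with the widest coverage alongside a ~ a; in the real setting the same alternation is driven by the total code length rather than a fixed round count.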
MDL finds the optimal balance between these two undesirable extremes.

Our presentation is structured as follows. In the first part we introduce a group of models, starting with a baseline that is simplified both to ease the presentation of the key ideas and to provide a worst-case performance benchmark. In the second part we introduce several means of evaluating the performance of the models, discuss what they actually help us establish, and indicate directions for future enhancements.

The Baseline Model makes the following initial simplifications:

Pairwise alignment: we align words from only two languages at a time. We provisionally call them the “source” and “target” languages, although the Baseline Model is in fact symmetric; the more advanced models are directional.

We align symbols in a 1-1 fashion: one source symbol may correspond to at most one target symbol, or to the empty symbol, which models insertions and deletions.

We ignore the context of the aligned symbol pair.

Symbols are treated as atoms, not analyzed in terms of their distinctive features. We assume that sounds are identified with alphabet symbols; in the Uralic data, it is fairly simple to transform the written form into a phonetic form.

After introducing the Baseline Model, we present the iterative learning algorithm, which allows us to obtain a complete alignment of the observed data. The “rules” obtained by this model are very simple: they are embodied in a probability distribution over all possible symbol-pair alignments. To ensure that there is a small number of wide-coverage rules, the algorithm reduces the entropy of this distribution as far as possible.

We then relax the simplifying assumptions of the Baseline Model in turn. First we extend it to align multiple symbols. Then we align multiple languages simultaneously. Finally, we model each symbol as a vector of its phonetic features, and take information from the symbol’s context into account to predict the symbol more accurately (which cuts down the code length). Each extension yields an improvement in code length over the previous model.

We next address the question: how do we measure what the models tell us about the quality of the data? We first verify that the models behave sensibly by showing that they outperform standard data compressors; this follows the principle that we can claim to have discovered the regularities in the data if and only if we can compress it effectively. We then show how the alignment costs (code lengths) can be turned into good measures of language distance. Using algorithms for building genealogical trees, we show that our code lengths yield family trees that strongly resemble trees constructed by linguists.
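The compression-based view of language distance can be illustrated with off-the-shelf tools rather than our models: the normalized compression distance (Cilibrasi and Vitányi) turns the compressed sizes of two datasets and of their concatenation into a distance in [0, 1]. The sketch below uses a general-purpose compressor as a crude stand-in for model-based code lengths; the word lists are toy examples of our own, not database entries.

```python
import zlib

def c(text: str) -> int:
    """Compressed size in bytes: a crude stand-in for the
    model-based code lengths discussed above."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: near 0 for closely
    related data, near 1 for unrelated data."""
    return (c(x + y) - min(c(x), c(y))) / max(c(x), c(y))

# toy word lists (illustrative only, not real database entries)
list_a = "kala kota kivi vesi kasi pesa "
list_b = "kala koda kivi vesi kasi pesa "
list_c = "zzq xwv mnp rrt bbk lls ddf gg "

# related lists compress well together, so their distance is smaller
assert ncd(list_a, list_b) < ncd(list_a, list_c)
```

A pairwise matrix of such distances is exactly the kind of input that standard tree-building algorithms consume; in our setting the distances come from the learned models’ code lengths instead of a generic compressor.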

We examine closely some of the complex rules of correspondence that the algorithm discovers from the data among different languages, confirm that they agree with rules posited by linguists, and introduce several methods for demonstrating that the rules are objectively sound.

The main contribution of this work is a set of methods that allow us to obtain quantitative measures of the goodness of etymological data sets, and to compare different sets using these metrics. The methods are objective in the sense that they rely only on the data provided—the posited relationships—and not on any additional assumptions. We do not aim to construct etymological databases from scratch, or from raw linguistic data. Rather, we use a dataset to study what rules it already inherently contains, what phylogenies it inherently induces, and how internally consistent it is. Although the methods are developed on Uralic data, they can readily be applied to other language families.

We discuss the implications of these methods and how they can support etymological research. One way to view our approach is as an attempt to combine the best of both worlds: humans do what they do best—positing semantic links and proposing hypotheses—while machines do what they do best—checking the consistency of these hypotheses over large data sets.