Etymon Project

Research Focus:

In the Etymon Project, we develop computational methods for modeling language evolution and the relationships among languages within language families. The methods are based on information theory and the Minimum Description Length (MDL) Principle. Initially, the methods were applied to the Uralic language family, viz., the languages genetically related to Finnish. The methods are generally applicable; we are currently exploring their application to the following language families:

Supported by:

Language evolution and etymology:

Unsupervised Learning of Morphology:

Transliteration:

Resources:

People:

  • Mian Du, PhD student
  • Suvi Hiltunen, MSc (2010-2012)
  • Guowei Lv, MSc student (2012)
  • Javad Nouri, MSc student
  • Kirill Reshetnikov,
    Russian Academy of Sciences,
    Institute of Linguistics, Moscow
  • Arto Vihavainen, MSc (2010-2011)
  • Marjaana Välisalo, MSc student (2008)
  • Hannes Wettig, PhD student (completed 2013)
  • Roman Yangarber: Project Lead

Collaboration:

Cover: Family network

Cover: Fin-Est alignment

Publications: conference and journal papers, book chapters, dissertations

  1. Modeling language evolution with codes that utilize context and phonetic features   (pdf)
    Javad Nouri, Roman Yangarber
    In Proceedings of CoNLL: 2016 Conference on Computational Natural Language Learning
    (2016) Berlin, Germany

  2. From alignment of etymological data to phylogenetic inference via population genetics   (pdf)
    Javad Nouri, Jukka Sirén, Jukka Corander, Roman Yangarber
    In Proceedings of CogACLL: the 7th Workshop on Cognitive Aspects of Computational Language Learning, co-located with ACL-2016
    (2016) Berlin, Germany

  3. Minimum Description Length Models for Unsupervised Learning of Morphology   (Master's Thesis)
    Javad Nouri
    (2016) University of Helsinki, Department of Computer Science

  4. A novel method for evaluation of morphological segmentation   (pdf)
    Javad Nouri, Roman Yangarber
    In Proceedings of LREC: 10th International Conference on Language Resources and Evaluation
    (2016) Portorož, Slovenia

  5. Measuring Language Closeness by Modeling Regularity   (pdf)
    Javad Nouri, Roman Yangarber
    In Proceedings of the EMNLP 2014 Workshop on Language Technology for Closely Related Languages and Language Variants
    (2014) Doha, Qatar

  6. Cognate discovery and alignment in computational etymology   (Master's Thesis)
    Guowei Lv
    (2014) University of Helsinki, Department of Computer Science

  7. MDL-based Models for Transliteration Generation   (pdf)
    Javad Nouri, Lidia Pivovarova, Roman Yangarber
    SLSP 2013: International Conference on Statistical Language and Speech Processing
    Springer Verlag, Lecture Notes in Artificial Intelligence (LNAI), Volume 7978 (2013) Tarragona, Spain

  8. Information-theoretic modeling of etymological sound change   (abstract)
    Hannes Wettig, Javad Nouri, Kirill Reshetnikov and Roman Yangarber
    Invited chapter in Approaches to measuring linguistic differences (Lars Borin, Anju Saxena, eds.) Trends in Linguistics Series, Volume 265.
    (2013) Mouton de Gruyter

  9. Probabilistic, Information-Theoretic Models for Etymological Alignment   (Ph.D. Thesis)
    Hannes Wettig
    (2013) University of Helsinki, Department of Computer Science

  10. Information-theoretic Methods for Analysis and Inference in Etymology   (pdf)
    Hannes Wettig, Javad Nouri, Kirill Reshetnikov and Roman Yangarber
    In Proceedings of WITMSE-2012: the 5th Workshop on Information-theoretic Methods in Science and Engineering (Steven de Rooij, Wojciech Kotłowski, Jorma Rissanen, Petri Myllymäki, Teemu Roos & Kenji Yamanishi, eds.)
    (2012) Amsterdam, the Netherlands

  11. Minimum Description Length Modeling of Etymological Data   (Master's Thesis)
    Suvi Hiltunen
    (2012) University of Helsinki, Department of Computer Science

  12. Using Context and Phonetic Features in Models of Etymological Sound Change   (pdf)
    Hannes Wettig, Kirill Reshetnikov and Roman Yangarber
    In Conference of the European Chapter of the Association for Computational Linguistics (EACL) Workshop on Visualization of Linguistic Patterns and Uncovering Language History from Multilingual Resources
    (2012) Avignon, France

  13. MDL-based models for alignment of etymological data   (pdf)
    Hannes Wettig, Suvi Hiltunen, Roman Yangarber
    RANLP-2011: Conference on Recent Advances in Natural Language Processing
    (2011) Hissar, Bulgaria

  14. MDL-based modeling of etymological sound change in the Uralic language family
    Hannes Wettig, Suvi Hiltunen, Roman Yangarber
    WITMSE-2011: The 4th Workshop on Information Theoretic Methods in Science and Engineering
    (2011) Helsinki, Finland

  15. Probabilistic models for alignment of etymological data   (pdf)
    Hannes Wettig, Roman Yangarber
    Nodalida-2011: Nordic Conference on Computational Linguistics
    (2011) Riga, Latvia

  16. Hidden Markov models for induction of morphological structure of natural language
    Hannes Wettig, Suvi Hiltunen, Roman Yangarber
    WITMSE-2010: Workshop on Information Theoretic Methods in Science and Engineering
    (2010) Tampere, Finland

  17. A Database of the Uralic language family for etymological research   
    Yangarber, R., Salmenkivi, M., Välisalo, M.
    University of Helsinki, Technical Report Series C; C-2008-38.
    (2008) Helsinki, Finland

Example reconstruction of Turkic family, based on StarLing data:

Cover: Turkic language family