Anni Eskelinen


2024

From Discrete to Continuous Classes: A Situational Analysis of Multilingual Web Registers with LLM Annotations
Erik Henriksson | Amanda Myntti | Saara Hellström | Selcen Erten-Johansson | Anni Eskelinen | Liina Repo | Veronika Laippala
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

In corpus linguistics, registers – language varieties suited to different contexts – have traditionally been defined by their situations of use, yet recent studies reveal significant situational variation within registers. Previous quantitative studies, however, have been limited to English, leaving this variation in other languages largely unexplored. To address this gap, we apply a quantitative situational analysis to a large multilingual web register corpus, using large language models (LLMs) to annotate texts in English, Finnish, French, Swedish, and Turkish for 23 situational parameters. Using clustering techniques, we identify six situational text types, such as “Advice”, “Opinion” and “Marketing”, each characterized by distinct situational features. We explore the relationship between these text types and traditional register categories, finding partial alignment, though no register maps perfectly onto a single cluster. These results support the quantitative approach to situational analysis and are consistent with earlier findings for English. Cross-linguistic comparisons show that language accounts for only a small part of situational variation within registers, suggesting registers are situationally similar across languages. This study demonstrates the utility of LLMs in multilingual register analysis and deepens our understanding of situational variation within registers.

Building Question-Answer Data Using Web Register Identification
Anni Eskelinen | Amanda Myntti | Erik Henriksson | Sampo Pyysalo | Veronika Laippala
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This article introduces a resource-efficient method for developing question-answer (QA) datasets by extracting QA pairs from web-scale data using machine learning (ML). Our method benefits from recent advances in web register (genre) identification and consists of two ML steps with an additional post-processing step. First, using XLM-R and the multilingual CORE web register corpus series with categories such as QA Forum, we train a multilingual classifier to retrieve documents that are likely to contain QA pairs from web-scale data. Second, we develop a NER-style token classifier to identify the QA text spans within these documents. To this end, we experiment with training on a semi-synthetic dataset built on top of the English LFQA, a small set of manually cleaned web QA pairs in English and Finnish, and a Finnish web QA pair dataset cleaned using ChatGPT. The evaluation of our pipeline demonstrates its capability to efficiently retrieve a substantial volume of QA pairs. While the approach is adaptable to any language given the availability of language models and extensive web data, we showcase its efficiency in English and Finnish, developing the first open, non-synthetic and non-machine-translated QA dataset for Finnish – Turku WebQA – comprising over 200,000 QA pairs.

2023

FinGPT: Large Generative Models for a Small Language
Risto Luukkonen | Ville Komulainen | Jouni Luoma | Anni Eskelinen | Jenna Kanerva | Hanna-Mari Kupari | Filip Ginter | Veronika Laippala | Niklas Muennighoff | Aleksandra Piktus | Thomas Wang | Nouamane Tazi | Teven Scao | Thomas Wolf | Osma Suominen | Samuli Sairanen | Mikko Merioksa | Jyrki Heinonen | Aija Vahtola | Samuel Antao | Sampo Pyysalo
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) excel in many tasks in NLP and beyond, but most open models have very limited coverage of smaller languages and LLM work tends to focus on languages where nearly unlimited data is available for pretraining. In this work, we study the challenges of creating LLMs for Finnish, a language spoken by less than 0.1% of the world population. We compile an extensive dataset of Finnish combining web crawls, news, social media and eBooks. We pursue two approaches to pretrain models: 1) we train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI. For model evaluation, we introduce FIN-bench, a version of BIG-bench with Finnish tasks. We also assess other model qualities such as toxicity and bias. Our models and tools are openly available at https://turkunlp.org/gpt3-finnish.

Toxicity Detection in Finnish Using Machine Translation
Anni Eskelinen | Laura Silvala | Filip Ginter | Sampo Pyysalo | Veronika Laippala
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Due to the popularity of social media platforms and the sheer amount of user-generated content online, the automatic detection of toxic language has become crucial in the creation of a friendly and safe digital space. Previous work has mostly focused on English, leaving many lower-resource languages behind. In this paper, we present novel resources for toxicity detection in Finnish by introducing two new datasets: a machine-translated toxicity dataset for Finnish based on the widely used English Jigsaw dataset, and a smaller test set of Suomi24 discussion forum comments originally written in Finnish and manually annotated following the definitions of the labels that were used to annotate the Jigsaw dataset. We show that machine translating the training data to Finnish provides better toxicity detection results than using the original English training data and zero-shot cross-lingual transfer with XLM-R, even with our newly annotated dataset from Suomi24.