Litkey Corpus

Welcome to the Litkey Corpus, a longitudinal corpus of picture story descriptions produced by German primary school children in grades 2 to 4. Litkey is short for “Literacy as the key to social participation: Psycholinguistic perspectives on orthography instruction and literacy acquisition” and refers to the research project in which the corpus was created. Here you can find a general description of the aims of the project and its four strands. In the strand Corpus Analysis, we use the Litkey Corpus to investigate the relationship between the spelling errors of beginning writers and the orthographic properties of words. You can find out more about the strand Corpus Analysis here.

Quick Overview of the Litkey Corpus

The texts in the Litkey Corpus were collected by Frieg (2014) between 2010 and 2012. At ten testing points in total, children in grades 2 to 4 were asked to write down a story shown in a sequence of six pictures, so there are up to ten texts per child. All school classes were located in urban areas of North Rhine-Westphalia. About 86% of the children were born in Germany, but for only about 52% was German reported to be among the first languages acquired in the family. Most of the children, about 63%, reported being multilingual. The following table gives an overview of the size of the Litkey Corpus:

No. of texts	1922
No. of children	251
Avg. no. of texts per child	7.66 (± 2.08)
No. of tokens	212,505
No. of types	6,364
No. of alphabetic tokens	189,394
No. of words with spelling errors	37,446
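
A few derived figures can be computed directly from the table above. The following is a minimal sketch, not part of the corpus tooling. It assumes that the count of words with spelling errors refers to tokens and is best related to the number of alphabetic tokens. That reading is an assumption, as are all variable names.

```python
# Figures copied from the overview table above.
n_texts = 1922
n_children = 251
n_tokens = 212_505
n_types = 6_364
n_alpha_tokens = 189_394
n_error_words = 37_446

# Average number of texts per child (the table reports 7.66).
texts_per_child = n_texts / n_children

# Share of alphabetic tokens with a spelling error.
# Assumption: error words are counted per token, not per type.
error_share = n_error_words / n_alpha_tokens

# Type-token ratio over all tokens.
type_token_ratio = n_types / n_tokens

print(f"texts per child:  {texts_per_child:.2f}")   # 7.66
print(f"error share:      {error_share:.1%}")       # 19.8%
print(f"type-token ratio: {type_token_ratio:.3f}")  # 0.030
```

Under that assumption, roughly one in five alphabetic tokens contains a spelling error.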

In the Litkey project, all texts were transcribed manually and enriched with a target hypothesis which corrects orthographic errors only. The original and target spelling are aligned character-wise. Furthermore, rich annotations were added semi-automatically, which include:

information on the target word
- part of speech (POS)
- phonemes, syllables, morphemes
- key orthographic features
- lexical properties like type and lemma frequency
information related to a spelling error, e.g.
- error category based on the Litkey scheme comprising 80 categories
- whether the pronunciation of the word is affected by the spelling error
- whether the correct spelling can be inferred from a related word form
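
To illustrate what a character-wise alignment of original and target spelling can look like, here is a minimal sketch based on Python's standard difflib module. It is not the alignment procedure used in the Litkey project. The function name align_chars and the example misspellings are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def align_chars(orig: str, target: str) -> list[tuple[str, str]]:
    """Align an original spelling with its target spelling.

    Returns (orig_part, target_part) pairs. Matching characters are
    paired one to one; differing stretches are kept together as a
    single pair, where either side may be the empty string.
    """
    pairs: list[tuple[str, str]] = []
    matcher = SequenceMatcher(a=orig, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical characters are aligned one by one.
            pairs.extend(zip(orig[i1:i2], target[j1:j2]))
        else:
            # Replacement, insertion, or deletion: keep as one unit.
            pairs.append((orig[i1:i2], target[j1:j2]))
    return pairs


# Hypothetical misspellings of "Vater" and "kommt".
print(align_chars("Fater", "Vater"))
print(align_chars("komt", "kommt"))
```

Each pair in which the two sides differ marks a position where an error annotation could attach, for example the missing consonant in the second example.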

The Litkey Corpus is available in different formats:

Litkey-XML: LearnerXML (XML-based format)
Litkey-Tab: table-based view of the tokens in the corpus
Litkey-DB: table-based view of the types in the corpus
Litkey-ANNIS: access via the corpus search tool ANNIS (Krause & Zeldes, 2016)
Litkey-csv: CSV files of aligned original and target tokens
Litkey-orig: original texts as plain text files

More Information and Access to the Corpus

On the following pages you can find more information about the Litkey Corpus:

Documentation: Information about the processing steps and the annotations of the corpus
Access/Citing: Access to the different formats of the Litkey Corpus, license information etc.
Publications: Publications related to the Litkey Corpus
Contact: People working on the Litkey Corpus and contact information

References

Frieg, H. (2014). Sprachförderung im Regelunterricht der Grundschule: Eine Evaluation der Generativen Textproduktion (Dissertation). Ruhr-Universität Bochum. Retrieved from http://www-brs.ub.ruhr-uni-bochum.de/netahtml/HSS/Diss/FriegHendrike/diss.pdf

Krause, T., & Zeldes, A. (2016). ANNIS3: A new architecture for generic corpus query and visualization. Digital Scholarship in the Humanities, 31(1), 118–139.