Litkey Corpus
Welcome to the Litkey Corpus, a longitudinal corpus of picture story descriptions produced by German primary school children from grades 2 to 4. Litkey is short for “Literacy as the key to social participation: Psycholinguistic perspectives on orthography instruction and literacy acquisition” and refers to the research project in which the corpus was created. Here you can find a general description of the aims of the project and the four project strands that are involved.
In the strand Corpus Analysis, we investigate the relationship between spelling errors of beginning writers and the orthographic properties of words based on the Litkey Corpus. You can find out more about the strand Corpus Analysis here.
Quick Overview of the Litkey Corpus
The texts in the Litkey Corpus were collected by Frieg (2014) between 2010 and 2012. At ten testing points in total, children in grades 2 to 4 were asked to write down a story shown in a sequence of six pictures, so there are up to ten texts per child. All school classes were located in urban areas of North Rhine-Westphalia. About 86% of the children were born in Germany, but German was reported to be among the first languages acquired in the family for only about 52%. Most of the children, about 63%, reported being multilingual. The following table gives an overview of the size of the Litkey Corpus:
Measure | Value |
---|---|
No. of texts | 1,922 |
No. of children | 251 |
Avg. no. of texts per child | 7.66 (± 2.08) |
No. of tokens | 212,505 |
No. of types | 6,364 |
No. of alphabetic tokens | 189,394 |
No. of words with spelling errors | 37,446 |
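The figures in the table can be related to one another with a few lines of arithmetic. The sketch below only reuses the numbers from the table; the variable names are illustrative, and treating the words with spelling errors as a share of the alphabetic tokens is an assumption made here, not something the table states. The first ratio, 1922 / 251, rounds to the 7.66 texts per child reported above.

```python
# Figures copied from the overview table above.
N_TEXTS = 1922
N_CHILDREN = 251
N_TOKENS = 212_505
N_TYPES = 6_364
N_ALPHA_TOKENS = 189_394
N_ERROR_WORDS = 37_446


def derived_figures():
    """Return a few ratios derived from the table (illustrative only)."""
    return {
        # Mean number of texts per child; the table reports 7.66.
        "texts_per_child": N_TEXTS / N_CHILDREN,
        # Share of all tokens that are alphabetic.
        "alpha_share": N_ALPHA_TOKENS / N_TOKENS,
        # ASSUMPTION: misspelled words are counted among alphabetic tokens.
        "error_share_of_alpha": N_ERROR_WORDS / N_ALPHA_TOKENS,
        # Types divided by tokens over the whole corpus.
        "type_token_ratio": N_TYPES / N_TOKENS,
    }


for name, value in derived_figures().items():
    print(f"{name}: {value:.4f}")
```

Under the stated assumption, roughly one in five alphabetic tokens contains a spelling error.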
In the Litkey project, all texts were transcribed manually and enriched with a target hypothesis which corrects orthographic errors only. The original and target spelling are aligned character-wise. Furthermore, rich annotations were added semi-automatically, which include:
- information on the target word
  - POS
  - phonemes, syllables, morphemes
  - key orthographic features
  - lexical properties like type and lemma frequency
- information related to a spelling error, e.g.
  - error category based on the Litkey scheme comprising 80 categories
  - whether the pronunciation of the word is affected by the spelling error
  - whether the correct spelling can be inferred from a related word form
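To give a rough idea of what a character-wise alignment between an original and a target spelling looks like, here is a minimal sketch using Python's standard `difflib` module. This is not the alignment procedure used in the Litkey project; it is a generic edit-based alignment, and the function name `align_chars` and the example words are made up for illustration.

```python
from difflib import SequenceMatcher


def align_chars(orig: str, target: str) -> list[tuple[str, str]]:
    """Align two spellings as (original segment, target segment) pairs.

    Matching characters are paired one by one; insertions, deletions and
    substitutions are kept as whole segments, with "" marking a gap.
    """
    pairs = []
    matcher = SequenceMatcher(None, orig, target)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical stretch: pair the characters one to one.
            pairs.extend(zip(orig[i1:i2], target[j1:j2]))
        else:
            # 'replace', 'delete' or 'insert': keep the differing segments.
            pairs.append((orig[i1:i2], target[j1:j2]))
    return pairs


# Hypothetical misspelling "faren" for the target word "fahren".
for o, t in align_chars("faren", "fahren"):
    print(f"{o or '-'} -> {t or '-'}")
```

In this example the missing "h" shows up as a gap on the original side, which is the kind of position an error annotation could then be attached to.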
The Litkey Corpus is available in different formats:
- Litkey-XML: LearnerXML (XML-based format)
- Litkey-Tab: table-based view of the tokens in the corpus
- Litkey-DB: table-based view of the types in the corpus
- Litkey-ANNIS: access via the corpus search tool ANNIS (Krause & Zeldes, 2016)
- Litkey-csv: CSV files of aligned original and target tokens
- Litkey-orig: original texts as plain text files
More Information and Access to the Corpus
On the following pages you can find more information about the Litkey Corpus:
- Documentation: Information about the processing steps and the annotations of the corpus
- Access/Citing: Access to the different formats of the Litkey Corpus, license information etc.
- Publications: Publications related to the Litkey Corpus
- Contact: People working on the Litkey Corpus and contact information
References
Frieg, H. (2014). Sprachförderung im Regelunterricht der Grundschule: Eine Evaluation der Generativen Textproduktion (Dissertation). Ruhr-Universität Bochum. Retrieved from http://www-brs.ub.ruhr-uni-bochum.de/netahtml/HSS/Diss/FriegHendrike/diss.pdf
Krause, T., & Zeldes, A. (2016). ANNIS3: A new architecture for generic corpus query and visualization. Digital Scholarship in the Humanities, 31(1), 118–139.