
Introduction

The Prague Dependency Treebank – Consolidated 1.0 (hereafter PDT-C) is a richly annotated, genre-diversified language resource. It is a consolidated release of the existing PDT corpora of Czech data, uniformly annotated using the standard PDT scheme (although not all of the annotation is manual, as detailed below).

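PDT corpora included in PDT-C:

  • PDT (written texts);
  • PCEDT, Czech part (translated texts);
  • PDTSC (spoken data);
  • PDT-Faust (user-generated texts).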

The difference from the separately published original treebanks can be briefly described as follows:

  • it is published in one package, to allow easier data handling for all the datasets;
  • the data is enhanced with manual linguistic annotation at the morphological layer, and a new version of the morphological dictionary is enclosed;
  • a common valency lexicon for all four original parts is enclosed;
  • a number of errors found during the process of manual morphological annotation have been corrected.

Layers of annotations.  The PDT-annotation scheme has a multi-layer architecture:

  • morphological layer (m-layer): all tokens of the sentence get a lemma and morphological tag,
  • surface syntax layer (analytical, a-layer): a dependency tree capturing surface syntactic relations such as subject, object, adverbial, etc.,
  • deep syntax layer (tectogrammatical, t-layer): a representation capturing deep syntactic relations, ellipses, valency, topic-focus articulation, and coreference. As the PDT scenario develops further, additional semantic annotations (bridging relations, discourse, genre specification, multi-word expressions, etc.) are being added to the original annotation scheme.

In addition to the three main annotation layers of the PDT scenario listed above, there is also the raw text layer (w-layer), where the text is segmented into documents and paragraphs and individual tokens are assigned unique identifiers. The spoken data has an additional audio and speech recognition layer (z-layer). In the spoken data (as opposed to the written corpora), the w-layer is in fact also an “annotated” layer, namely the manually provided transcription of the audio signal.

In order not to lose any piece of the original information, tokens (nodes) at a lower layer are explicitly referenced from the corresponding closest (immediately higher) layer. These links allow for tracing every unit of annotation all the way down to the original text, or to the transcript and audio (in the spoken data). 
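The linking between layers can be pictured with a small sketch. It is only an illustration under stated assumptions: the `Unit` class, its field names, the identifiers and the `trace_to_text` helper are hypothetical and do not reproduce the actual PDT-C file format. Each unit stores the identifiers of the units it refers to on the next lower layer, and a trace follows those links down to the w-layer.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Unit:
    """One annotation unit on some layer (hypothetical structure, not the PDT-C format)."""
    uid: str
    layer: str                      # "w", "m", "a" or "t"
    form: str = ""                  # surface text; filled only on the w-layer in this sketch
    lower: List[str] = field(default_factory=list)  # ids on the next lower layer


def trace_to_text(uid: str, index: Dict[str, Unit]) -> List[str]:
    """Follow the links downwards until the w-layer and collect the token forms."""
    unit = index[uid]
    if unit.layer == "w":
        return [unit.form]
    forms: List[str] = []
    for ref in unit.lower:
        forms.extend(trace_to_text(ref, index))
    return forms


# Toy data: two tokens from the example sentence ("ve stadiu").
# Letting one t-node point to two a-nodes is an assumption made for illustration.
units = [
    Unit("w1", "w", form="ve"),
    Unit("w2", "w", form="stadiu"),
    Unit("m1", "m", lower=["w1"]),
    Unit("m2", "m", lower=["w2"]),
    Unit("a1", "a", lower=["m1"]),
    Unit("a2", "a", lower=["m2"]),
    Unit("t1", "t", lower=["a1", "a2"]),
]
index = {u.uid: u for u in units}

print(trace_to_text("t1", index))  # ['ve', 'stadiu']
```

Because every unit points only to the layer immediately below it, no layer needs to know about the raw text directly, yet any unit can still be resolved to its original tokens.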

Example sentence from PDT 3.5

Sarančata jsou doposud ve stadiu larev a pohybují se pouze lezením. V tomto období je účinné bojovat proti nim chemickými postřiky, ale dožívající družstva ani soukromí rolníci nemají na jejich nákup potřebné prostředky.

Example sentences from PDT-C 1.0, with tectogrammatical annotation including coreference links (blue and brown arrows), MWEs (red stripes) and discourse annotation (orange arrows and attributes/labels). Lit.: Grasshoppers are still in the larval stage and move only by crawling. At this time of the year, it is efficient to fight them using chemicals, but neither the ailing cooperatives nor private farmers can afford them.

 

In the current PDT-C 1.0 release, the lowest, morphological layer has been annotated fully manually. The basic phenomena of the highest, deep syntactic layer (structure, functions, verbal valency) have also been annotated manually in all four datasets. Manual annotation of the surface syntactic layer is contained only in the PDT dataset of written texts. The additional semantic features in the PDT dataset have also been annotated manually. Table 1 presents an overview of the types of annotation at the three annotation layers in each dataset and of the manner in which each annotation was carried out.

| Dataset / Type of annotation | PDT, written | PCEDT (Czech), translated | PDTSC, spoken | PDT-Faust, user-generated |
|---|---|---|---|---|
| Audio | non-applicable | non-applicable | provided | non-applicable |
| ASR Transcription | non-applicable | non-applicable | provided | non-applicable |
| Transcript | non-applicable | non-applicable | manually | non-applicable |
| Translation | non-applicable | manually | non-applicable | manually |
| Morphological layer | | | | |
| Speech reconstruction | non-applicable | non-applicable | manually | non-applicable |
| Lemmatization | manually | manually | manually | manually |
| Tagging | manually | manually | manually | manually |
| Surface syntactic layer | | | | |
| Dependency structure | manually | automatically | automatically | automatically |
| Syntactic function | manually | automatically | automatically | automatically |
| Clause segmentation | automatically | not annotated | not annotated | not annotated |
| Deep syntactic layer | | | | |
| Deep syntactic structure | manually | manually | manually | manually |
| Deep syntactic function | manually | manually | manually | manually |
| Verbal valency | manually | manually | manually | manually |
| Nominal valency | manually | not annotated | not annotated | not annotated |
| Grammatemes | manually | not annotated | not annotated | not annotated |
| Coreference grammatical | manually | manually | manually | not annotated |
| Coreference textual | manually | manually | manually | not annotated |
| Bridging relation | manually | not annotated | not annotated | not annotated |
| Topic-focus articulation | manually | not annotated | not annotated | not annotated |
| Discourse | manually | not annotated | not annotated | not annotated |
| Genre specification | manually | not annotated | not annotated | not annotated |
| Quotation | manually | not annotated | not annotated | not annotated |
| Multiword expressions | manually | not annotated | not annotated | not annotated |

Table 1: Overview of various types of annotation and their realization in the datasets

Volume of the data

The data volume is given in Table 2. Altogether, the consolidated treebank contains 3,885,591 tokens with manual morphological annotation and 2,245,945 t-nodes with manual deep syntactic annotation (manual annotation of the surface syntactic layer is contained only in the dataset of written texts and it consists of 1,503,741 a-nodes).

 

| Layer | PDT, written | PCEDT (Czech), translated | PDTSC, spoken | PDT-Faust, user-generated | Total |
|---|---|---|---|---|---|
| Morphological layer (number of m-forms) | 1,957,150 | 1,152,289 | 742,316 | 33,836 | 3,885,591 |
| Surface syntactic layer (number of a-nodes) | 1,503,741 | 1,152,289 | 742,316 | 33,837 | 3,432,183 |
| Deep syntactic layer (number of t-nodes) | 675,034 | 932,334 | 608,472 | 30,105 | 2,245,945 |

Table 2. Volume of the datasets (number of tokens on the respective layers)
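The totals in Table 2 can be cross-checked with a few lines of arithmetic. The sketch below simply re-enters the per-dataset counts from the table and compares their sums with the stated totals. The variable and function names are illustrative only and are not part of any PDT-C tooling.

```python
# Per-dataset counts copied from Table 2 (names of the variables are illustrative).
counts = {
    "m-forms": {"PDT": 1_957_150, "PCEDT": 1_152_289, "PDTSC": 742_316, "PDT-Faust": 33_836},
    "a-nodes": {"PDT": 1_503_741, "PCEDT": 1_152_289, "PDTSC": 742_316, "PDT-Faust": 33_837},
    "t-nodes": {"PDT": 675_034, "PCEDT": 932_334, "PDTSC": 608_472, "PDT-Faust": 30_105},
}

# Totals as stated in the last column of Table 2.
stated_totals = {"m-forms": 3_885_591, "a-nodes": 3_432_183, "t-nodes": 2_245_945}


def layer_total(layer: str) -> int:
    """Sum the counts of all four datasets for one layer."""
    return sum(counts[layer].values())


for layer, stated in stated_totals.items():
    computed = layer_total(layer)
    status = "ok" if computed == stated else "MISMATCH"
    print(f"{layer}: computed {computed:,}, stated {stated:,} ({status})")
```

Only the PDT column of the a-node row corresponds to manual surface syntactic annotation. The remaining a-node counts come from the automatically annotated datasets listed in Table 1.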