The Prague Dependency Treebank

The Prague Dependency Treebank (PDT) contains a large amount of Czech text with complex, interlinked morphological, syntactic and semantic annotation; in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level. Newer versions also contain multiword expression annotation, discourse relation annotation, and various other additions and corrections added since the first full release of PDT 2.0.

PDT is based on the long-standing Praguian linguistic tradition, adapted to the current research needs of computational linguistics. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation and language analysis are included. Extensive documentation (in English) is provided as well.

The Prague Dependency Treebank - Consolidated 1.0 (PDT-C)

A richly annotated and genre-diversified language resource, The Prague Dependency Treebank – Consolidated 1.0 (hereafter PDT-C) is a consolidated release of the existing PDT corpora of Czech data, uniformly annotated using the standard PDT scheme (although not everything is annotated manually, as described in detail here).
PDT-corpora included in PDT-C:

The difference from the separately published original treebanks can be briefly described as follows:

  • it is published in one package, to allow easier data handling for all the datasets;
  • the data is enhanced with manual linguistic annotation at the morphological layer, and a new version of the morphological dictionary is enclosed;
  • a common valency lexicon for all four original parts is enclosed;
  • a number of errors found during the process of manual morphological annotation have been corrected.

Reference Publication

Hajič Jan, Bejček Eduard, Hlaváčová Jaroslava, Mikulová Marie, Straka Milan, Štěpánek Jan, Štěpánková Barbora: Prague Dependency Treebank - Consolidated 1.0. In: Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), European Language Resources Association, Marseille, France, ISBN 979-10-95546-34-4, pp. 5208-5218, 2020. (pdf)

The Prague Dependency Treebank 3.5

The Prague Dependency Treebank 3.5 contains the same texts as the previous versions since 2.0; there are 49,431 annotated sentences (over 800 thousand nodes) on all layers, from tectogrammatical to words, and additional sentences on the analytical (surface dependency syntax) and morphological layers of annotation (approx. 1.8 million words in total).

The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied Linguistics under various projects between 1996 and 2018 on the original texts. There are other members of the "family" of the Prague Dependency Treebanks, available separately and described elsewhere; search for "Prague Dependency Treebank" in the LINDAT/CLARIN repository.

Quick download link and PID: http://hdl.handle.net/11234/1-2621. The data is provided under the CC BY-NC-SA 4.0 licence. For proper citation(s) and more details, see below.

From PDT 1.0 to PDT 3.5

The first version of PDT was published by LDC in 2001. Since then, various branches of PDT have been developed, adding more annotation. Most importantly, PDT 2.0 added the tectogrammatical layer, which distinguishes the PDT family of treebanks from most other dependency treebanks available. As of January 2018, PDT 3.5 is the current version, encompassing all previous versions, corrections and additional annotation. The history of the PDT editions is briefly listed below.

  • PDT 1.0
    • Words, Tokenization
    • Morphology (13 categories (features): POS, number, gender, case, negation, ...)
    • (Surface) Dependency syntax ("analytical layer"), dependency relations
  • Added in PDT 2.0
    • Tectogrammatical annotation (deep syntax, valency), including valency dictionary PDT-Vallex
    • Coreference (pronominal/textual, grammatical)
    • Information structure
    • Grammatemes (tense, modalities, number, ...)
  • Added in PDT 2.5
    • Multiword expressions
    • Pair/group meaning
    • Clause segmentation (on analytical layer)
  • Added in PDiT 1.0
    • Extended textual coreference
    • Bridging anaphora
    • Discourse relations marked by explicit connectives
  • Added in PDT 3.0
    • Revision of several grammatemes
    • Revision of sentence modality annotation
    • Replacement of t_lemma #Benef
    • Genres of documents
    • Pronominal textual coreference of 1st and 2nd person
    • Updated discourse relations marked by explicit connectives
  • Added in PDiT 2.0
    • Annotation of secondary connectives and senses (semantico-pragmatic discourse relations) they express
    • Updated annotation of discourse relations marked by primary connectives:
      • fixes of various individual errors
      • missing connectives filled in (except for relations of 'specification')
      • relations marked with discourse type 'other' changed to the nearest other type
      • fixes in strange low-count connectives
  • Added in PDT 3.5
    • Consolidated documentation, authorship, licence
    • New and separate item in LINDAT/CLARIN repository

Download

To download the data, please visit the PDT 3.5 item in the LINDAT/CLARIN repository.

Search

To search the treebank, please use the PML-TQ (PML Tree Query) service at LINDAT/CLARIN. Please note that the service currently searches PDT 3.0; except for the discourse annotation added later in PDiT 2.0, the data are identical. (PDT 3.5 in PML-TQ is coming soon.)

Cite

To properly acknowledge this resource, please cite the following data item in the LINDAT/CLARIN repository:

For LREC papers (separate language resources references):


@languageresource{lrPDT35,
 title = {Prague Dependency Treebank 3.5},
 author = {Haji\v{c}, Jan and Bej\v{c}ek, Eduard and B\'{e}mov\'{a}, Alevtina 
 and Bur\'{a}\v{n}ov\'{a}, Eva and Haji\v{c}ov\'{a}, Eva and Havelka, Ji\v{r}\'{\i} 
 and Homola, Petr and K\'{a}rn\'{\i}k, Ji\v{r}\'{\i} and Kettnerov\'{a}, V\'{a}clava 
 and Klyueva, Natalia and Kol\'{a}\v{r}ov\'{a}, Veronika and Ku\v{c}ov\'{a}, Lucie 
 and Lopatkov\'{a}, Mark\'{e}ta and Mikulov\'{a}, Marie and M\'{\i}rovsk\'{y}, Ji\v{r}\'{\i}
 and Nedoluzhko, Anna and Pajas, Petr and Panevov\'{a}, Jarmila 
 and Pol\'{a}kov\'{a}, Lucie and Rysov\'{a}, Magdal\'{e}na and Sgall, Petr 
 and Spoustov\'{a}, Johanka and Stra\v{n}\'{a}k, Pavel and Synkov\'{a}, Pavl\'{\i}na 
 and {\v{S}}ev\v{c}\'{\i}kov\'{a}, Magda and {\v{S}}t\v{e}p\'{a}nek, Jan and Ure\v{s}ov\'{a}, Zde\v{n}ka
 and Vidov\'{a} Hladk\'{a}, Barbora and Zeman, Daniel and Zik\'{a}nov\'{a}, {\v{S}}\'{a}rka 
 and {\v{Z}}abokrtsk\'{y}, Zden\v{e}k},
 url = {http://hdl.handle.net/11234/1-2621},
 publisher={Institute of Formal and Applied Linguistics, LINDAT/CLARIN, Charles University},
 address={Prague, Czech Republic}, 
 lindat={http://hdl.handle.net/11234/1-2621},
 year = {2018} }

For general papers and citations:


@misc{11234/1-2621,
 title = {Prague Dependency Treebank 3.5},
 author = {Haji\v{c}, Jan and Bej\v{c}ek, Eduard and B\'{e}mov\'{a}, Alevtina 
 and Bur\'{a}\v{n}ov\'{a}, Eva and Haji\v{c}ov\'{a}, Eva and Havelka, Ji\v{r}\'{\i} 
 and Homola, Petr and K\'{a}rn\'{\i}k, Ji\v{r}\'{\i} and Kettnerov\'{a}, V\'{a}clava 
 and Klyueva, Natalia and Kol\'{a}\v{r}ov\'{a}, Veronika and Ku\v{c}ov\'{a}, Lucie 
 and Lopatkov\'{a}, Mark\'{e}ta and Mikulov\'{a}, Marie and M\'{\i}rovsk\'{y}, Ji\v{r}\'{\i}
 and Nedoluzhko, Anna and Pajas, Petr and Panevov\'{a}, Jarmila 
 and Pol\'{a}kov\'{a}, Lucie and Rysov\'{a}, Magdal\'{e}na and Sgall, Petr 
 and Spoustov\'{a}, Johanka and Stra\v{n}\'{a}k, Pavel and Synkov\'{a}, Pavl\'{\i}na 
 and {\v{S}}ev\v{c}\'{\i}kov\'{a}, Magda and {\v{S}}t\v{e}p\'{a}nek, Jan and Ure\v{s}ov\'{a}, Zde\v{n}ka
 and Vidov\'{a} Hladk\'{a}, Barbora and Zeman, Daniel and Zik\'{a}nov\'{a}, {\v{S}}\'{a}rka 
 and {\v{Z}}abokrtsk\'{y}, Zden\v{e}k},
 url = {http://hdl.handle.net/11234/1-2621},
 note = {{LINDAT}/{CLARIN} digital library at the Institute of Formal and Applied Linguistics ({{\'U}FAL}), Faculty of Mathematics and Physics, Charles University},
 copyright = {Creative Commons - Attribution-{NonCommercial}-{ShareAlike} 4.0 International ({CC} {BY}-{NC}-{SA} 4.0)},
 year = {2018} }

For "plaintext" reference:

(Hajič et al., 2018)

Hajič, J., Bejček, E., Bémová, A., Buráňová, E., Hajičová, E., Havelka, J., Homola, P., Kárník, J., Kettnerová, V., Klyueva, N., Kolářová, V., Kučová, L., Lopatková, M., Mikulová, M., Mírovský, J., Nedoluzhko, A., Pajas, P., Panevová, J., Poláková, L., Rysová, M., Sgall, P., Spoustová, J., Straňák, P., Synková, P., Ševčíková, M., Štěpánek, J., Urešová, Z., Vidová Hladká, B., Zeman, D., Zikánová, Š. and Žabokrtský, Z. (2018). Prague Dependency Treebank 3.5. Institute of Formal and Applied Linguistics, LINDAT/CLARIN, Charles University, LINDAT/CLARIN PID: http://hdl.handle.net/11234/1-2621.

For footnote references, the following is sufficient in LaTeX papers:

\url{http://hdl.handle.net/11234/1-2621}
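
As an illustration only, the citation forms above could be combined in a paper roughly as follows. This is a minimal sketch, not a prescribed setup: it assumes the @misc entry above has been saved to a file named references.bib (a hypothetical file name), and the `plain` bibliography style and the `fontenc`/`url` packages are arbitrary choices made for the example.

```latex
% Minimal sketch: citing PDT 3.5 via the @misc entry and via a footnote.
% Assumption: the @misc{11234/1-2621, ...} entry is stored in references.bib.
\documentclass{article}
\usepackage[T1]{fontenc} % proper rendering of Czech accented letters
\usepackage{url}         % provides \url for the handle in the footnote

\begin{document}

% Full bibliographic citation through the BibTeX key of the @misc entry.
Our experiments use the Prague Dependency Treebank 3.5~\cite{11234/1-2621}.

% Lightweight alternative: a footnote carrying only the persistent identifier.
The data are available from the LINDAT/CLARIN
repository.\footnote{\url{http://hdl.handle.net/11234/1-2621}}

\bibliographystyle{plain}   % assumption: any standard style can be used
\bibliography{references}   % assumption: references.bib (hypothetical name)

\end{document}
```

For LREC submissions, the @languageresource entry would be cited by its key lrPDT35 instead; how that entry type is processed depends on the conference style files, which are not covered by this sketch.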

Short overview of the original PDT 2.0 attributes and their values

Slides and video recordings from the Prague Treebanking for Everyone: A two-day tutorial, Vilem Mathesius Lecture Series 21.