
Documentation

This page describes the Prague Discourse Treebank 3.0 (PDiT 3.0) and summarizes the changes made to the annotation of discourse relations since their previous publication in the Prague Dependency Treebank 1.0 - Consolidated (PDT-C 1.0, 2020). For details on previous versions of the underlying Prague Dependency Treebank (PDT) and on the separate releases of the Prague Discourse Treebank (PDiT), please refer to their respective documentation.

1 Introduction

The new version of PDiT (3.0) comes with two major changes:

  • revision of discourse annotation based on the previous work on the lexicon of Czech discourse connectives (CzeDLex 1.0, Mírovský et al. 2021)
  • in addition to the traditional Prague data format and formalism, the discourse annotation has been transformed to the format and sense taxonomy of the Penn Discourse Treebank 3.0

2 Annotation extent

The annotation covers intra- and inter-sentential discourse relations marked by explicitly expressed primary or secondary connectives. Whereas primary connectives are grammaticalized, mostly one-word expressions (like a “and”, ale “but”, když “when”, protože “because”), secondary connectives are not (yet) fully grammaticalized expressions (cf. z tohoto důvodu “for this reason”, za těchto podmínek “under these conditions”, kvůli tomu “due to this” etc.). Discourse relations annotated in the corpus hold between two spans of text (containing finite verbs) called discourse arguments. On the tectogrammatical layer of the treebank, a relation is captured by an arrow leading between two verbal nodes (or their coordinations) that represent the whole arguments (see Figure 1). Each relation is also assigned a discourse type (a semantico-pragmatic label such as reason-result, condition, purpose etc., see Table 1) and the exact extent of its discourse arguments.

Figure 1: S ohledem na toto ustanovení by se hrubé chování muselo týkat vaší osoby a nestačí pouze nevhodné zacházení s předmětem darovací smlouvy, to je darem. Z tohoto důvodu by byla vaše žaloba na vrácení daru u soudu zamítnuta.

[With regard to this provision, the abusive behaviour would have to be related to your person and an inappropriate treatment of the subject of the donation contract is not enough. For this reason, your action on the return of the donation would be rejected at the court.]

 

Table 1: List of possible discourse types
CONTRAST               | EXPANSION               | CONTINGENCY             | TEMPORAL
confrontation          | conjunction             | reason–result           | synchrony
opposition             | conjunctive alternative | pragmatic reason–result | precedence–succession
restrictive opposition | disjunctive alternative | explication             |
pragmatic contrast     | instantiation           | condition               |
concession             | specification           | pragmatic condition     |
correction             | equivalence             | purpose                 |
gradation              | generalization          |                         |
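
For readers who process the data programmatically, Table 1 can be held as a small lookup structure. The following is only a sketch: the name DISCOURSE_TYPES and the helper class_of are hypothetical, and the strings are the human-readable labels from the table (written with a plain hyphen), not necessarily the values stored in the discourse_type attribute.

```python
# Table 1 as a lookup: top-level class -> discourse types.
# Labels are the human-readable names from the table (plain hyphen used);
# the values actually stored in the data may be abbreviated differently.
DISCOURSE_TYPES = {
    "CONTRAST": [
        "confrontation", "opposition", "restrictive opposition",
        "pragmatic contrast", "concession", "correction", "gradation",
    ],
    "EXPANSION": [
        "conjunction", "conjunctive alternative", "disjunctive alternative",
        "instantiation", "specification", "equivalence", "generalization",
    ],
    "CONTINGENCY": [
        "reason-result", "pragmatic reason-result", "explication",
        "condition", "pragmatic condition", "purpose",
    ],
    "TEMPORAL": ["synchrony", "precedence-succession"],
}


def class_of(discourse_type):
    """Return the top-level class of a discourse type, or None if unknown."""
    for top_class, types in DISCOURSE_TYPES.items():
        if discourse_type in types:
            return top_class
    return None
```

For example, class_of("gradation") returns "CONTRAST", in line with the first column of the table.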

3 Changes from the previous release

3.1 Changes in the annotation

Previous work on the lexicon of Czech discourse connectives, CzeDLex, revealed inconsistencies in the annotation of approx. 20 expressions (e.g. potom “then”, rovněž “also”, mimo jiné “besides”) that can function as discourse connectives – in some contexts they had not been marked as discourse connectives although they actually signalled a discourse relation. All occurrences of these expressions not marked as discourse connectives have been checked and the mistakes corrected.

In addition, approx. 300 individual contexts requiring correction were listed during the work on the lexicon. All of them have been checked and the annotation revised.

The second part of the revisions resulted from an analysis of pragmatic relations (Poláková and Synková, 2021), which had revealed some inconsistencies as well. Based on this analysis, pragmatic relations have been completely revised and their annotation unified. As pragmatic condition had been annotated very rarely, a probe was conducted to see whether some pragmatic conditions had been annotated as condition by mistake. Since the probe did find such cases, all occurrences of the condition relation have been checked and the cases of pragmatic condition have been reannotated.

The analysis of pragmatic relations also showed the need to revise the explication relation, as there were many borderline cases with the (pragmatic) reason-result relation and the specification relation that were problematic in both previous versions of the data. All explication relations have therefore been checked and the annotation revised.

The third part of the revisions consisted of an analysis of annotators' comments – it revealed, among other things, that these comments often signalled contexts where the automatic part of the discourse annotation had not recognized the discourse connective correctly. Relevant comments have been selected from the complete list and the annotation has been revised.

Altogether, approx. 3600 contexts have been manually checked, resulting in an annotation of approx. 400 new relations, deletion of approx. 50 relations and corrections of approx. 850 relations.

3.2 Transformation of Prague discourse types to Penn senses

The transformation process consisted of two separate parts: (i) generation of the plain text form of the arguments and connectives from their representation on the tectogrammatical layer, and (ii) transformation of Prague discourse types to Penn senses.

(i) The numerous issues in extracting the plain text forms of the arguments fall into two categories: (a) annotation inconsistencies in various parts of the data (on the deep-syntactic layer, on the surface-syntactic layer, in the discourse annotation), and (b) the complex nature of the deep-syntactic layer of annotation (reconstructed nodes/parts of the trees that take part in discourse relations, the necessity to combine information from several annotation layers). Although we took great care in tuning the plain text generation of the arguments, we could not check and fix errors in all 21 thousand discourse relations.

(ii) Most of the relations could be transformed automatically, as a single Penn counterpart corresponded to the Prague discourse type. However, in many cases there was more than one option; special attention was paid to the following PDTB 3.0 relations: Similarity, negative Condition, negative Result and Contrast.

Similarity from conjunction
 
In PDiT, this sense was captured under the relation of conjunction. The following Czech connectives were identified as connectives of Similarity: the single-word connectives “obdobně” (similarly) and “podobně” (similarly), their complex variants such as “podobně i” (similarly also) or “podobně jako” (similarly as), and complex connectives containing the word “stejně” (equally, still) such as “stejně tak” or “stejně jako” (both meaning likewise).
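
The connective-based split described above can be sketched as a simple test on the connective string. This is an illustration only, not the conversion script used for the release: the function names are hypothetical, the connective is assumed to be available as a whitespace-separated string, and the returned labels "Similarity" and "Conjunction" are illustrative (in particular, treating Conjunction as the default counterpart of the Prague conjunction type is an assumption of this sketch).

```python
# Connectives named in the text as markers of Similarity.
SIMILARITY_CORE = {"obdobně", "podobně"}


def is_similarity_connective(connective):
    """True for "obdobně"/"podobně" (alone or inside a complex connective)
    and for complex connectives containing "stejně"."""
    words = connective.lower().split()
    if any(word in SIMILARITY_CORE for word in words):
        return True
    # "stejně" counts only as part of a complex connective ("stejně tak").
    return "stejně" in words and len(words) > 1


def sense_for_conjunction(connective):
    """Illustrative split of Prague 'conjunction' relations (labels assumed)."""
    return "Similarity" if is_similarity_connective(connective) else "Conjunction"
```

With these assumptions, sense_for_conjunction("stejně jako") yields "Similarity", while sense_for_conjunction("a") yields "Conjunction".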
 
Negative Condition from condition
 
In the analysis of negative Condition as annotated for English, we focused especially on the specific connectives used for this relation and searched for their counterparts in Czech. We found the following Czech connectives that were originally annotated as pure condition in PDiT: “jinak” (the counterpart of English “otherwise” and “lest”), “nebo” or “buď_nebo” (the counterparts of English “or” and “either_or”) and “aniž” (the counterpart of English constructions containing “without”).
 

The most challenging case was the connective “unless” (the most frequent connective for negative Condition in the PDTB). Czech does not have a direct counterpart for this English connective. The connective “unless” contains negation in its sense, but it does not simply mean “if not”. Nevertheless, the presence of negation in the Czech sentence was a basic requirement in the search for Czech counterparts of English sentences with “unless”. The reliable cases that could be marked as negative Condition automatically were those in which a connective expressing a condition (“pokud”, “když”, “-li”, all meaning “if”) and a connective such as “tedy” (so), “ovšem” or “však” (both meaning “however”) occurred together in a sentence containing negation. However, the second connective (like “tedy” (so) in the Example) is expressed explicitly rather rarely. We therefore looked for other tendencies that characterize the relation of negative Condition in Czech. The decisive factor turned out to be the order of the discourse arguments in combination with a particular connective: a large portion of the cases evaluated as negative Condition contained the connective “pokud” or “-li” (both meaning “if”) in the second argument.
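
The "reliable" pattern for counterparts of “unless” can be written down as a small predicate. The sketch below is a simplified illustration under stated assumptions: the function name is hypothetical, the connectives found in the sentence are assumed to be given as a collection of strings, and the detection of negation is assumed to happen elsewhere and is passed in as a flag. It does not cover the argument-order tendency, which was evaluated separately.

```python
# Connectives expressing a condition (all meaning "if").
CONDITION_CONNECTIVES = {"pokud", "když", "-li"}
# Second connectives that make the pattern reliable.
SECOND_CONNECTIVES = {"tedy", "ovšem", "však"}


def is_reliable_negative_condition(connectives, has_negation):
    """True if a conditional connective and one of the second connectives
    co-occur in a sentence that contains negation."""
    found = {c.lower() for c in connectives}
    has_condition = bool(found & CONDITION_CONNECTIVES)
    has_second = bool(found & SECOND_CONNECTIVES)
    return has_negation and has_condition and has_second
```

For instance, a negated sentence with “pokud” and “tedy” passes the test, whereas the same connectives in a sentence without negation do not.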

Negative Result from reason-result
 
The relation of negResult was specifically introduced for the lexico-syntactic construction “too X to Y”. This construction corresponds to the Czech complex connectives “na to, aby” or “k tomu, aby” that occur together with an adjunct expressing manner by specifying extent or intensity of the event or a circumstance, such as “příliš” (too).
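
As a rough illustration of this pattern, the check below looks for one of the two complex connectives together with an extent adjunct among the tokens of the sentence. It is a sketch under assumptions: the function name and the token-list input are hypothetical, “příliš” is the only adjunct listed because it is the only one named in the text, and the example sentence in the usage note is constructed for illustration and does not come from the corpus.

```python
# Complex connectives named in the text for the "too X to Y" construction.
NEG_RESULT_CONNECTIVES = {"na to, aby", "k tomu, aby"}
# Adjuncts of extent or intensity; only "příliš" is given in the text.
EXTENT_ADJUNCTS = {"příliš"}


def looks_like_neg_result(connective, tokens):
    """True if the connective is one of the listed complex connectives and
    the sentence tokens contain an adjunct of extent such as "příliš"."""
    has_connective = connective.lower() in NEG_RESULT_CONNECTIVES
    has_adjunct = any(token.lower() in EXTENT_ADJUNCTS for token in tokens)
    return has_connective and has_adjunct
```

For the constructed sentence “Je příliš mladý na to, aby řídil.” (He is too young to drive.), calling looks_like_neg_result("na to, aby", ["Je", "příliš", "mladý", "na", "to", ",", "aby", "řídil"]) returns True.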
 
Contrast from restrictive opposition
 
Another issue concerned the relation of restrictive opposition. We primarily converted restrictive opposition to the PDTB3 sense Exception, but in some cases to Contrast. We assumed the relation of Contrast in cases where the restrictive opposition was not accompanied by the functor RESTR (restriction). First, we manually evaluated the intra-sentential relations of restriction in complex sentences in which the subordinate clause did not contain the functor RESTR. We found that most cases of Contrast appeared in sentences with the connectives “však” (however) and “(i)když” (although). In the next step, we therefore limited our analysis to these connectives and extended the search to inter-sentential relations as well. Altogether, we found 114 occurrences of this type of sentence and manually annotated 86 of them as Contrast.
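
The selection of candidates for manual evaluation can be sketched as follows. The sketch only reproduces the filter described above (no RESTR functor, one of the listed connectives); the final decision between Exception and Contrast was made manually, so the function (a hypothetical name) only flags candidates. “(i)když” is expanded to the two forms “když” and “i když”, which is an assumption about the notation.

```python
# Connectives to which the search for Contrast was limited.
CONTRAST_CANDIDATE_CONNECTIVES = {"však", "když", "i když"}


def is_contrast_candidate(has_restr_functor, connective):
    """Flag a restrictive opposition relation for manual Contrast review."""
    if has_restr_functor:
        return False  # RESTR present: keep the default sense (Exception)
    return connective.lower() in CONTRAST_CANDIDATE_CONNECTIVES


# Share of reviewed candidates annotated as Contrast (figures from the text).
contrast_share = 86 / 114
```

According to the figures given above, roughly three quarters of the 114 candidates (86, i.e. about 75 %) ended up as Contrast.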

4 List of discourse-related annotation attributes in PDiT 3.0

Discourse-related annotation is captured mostly in a structured attribute discourse at the start node of the relation; additional annotation is captured in the attributes discourse_groups and discourse_special.

  • discourse/target_node.rf – id of the target node, or undefined if there is no target node (e.g. no hypertheme in a list structure)

  • discourse/type – the type of an arrow, two possible values: discourse (discourse relation), list (list entry)

  • discourse/start_range – start range of a discourse arrow; possible values: n, where n (a non-negative integer) is the number of trees to the right of the current tree that belong to the argument in addition to the node and its subtree (0 means just the node and its subtree); group (an arbitrary set of nodes; see the attributes discourse/start_group_id and discourse_groups below); forward (the node with its subtree plus an unspecified number of the following trees); backward (the node with its subtree plus an unspecified number of the preceding trees)

  • discourse/target_range – target range of a discourse arrow; same possible values as for discourse/start_range

  • discourse/start_group_id – identifier of a group of nodes (positive integer) where the start_range of the arrow is set to "group"; individual nodes belonging to the group keep the group identifier in the attribute discourse_groups

  • discourse/target_group_id – identifier of a group of nodes (positive integer) where the target_range of the arrow is set to "group"; individual nodes belonging to the group keep the group identifier in the attribute discourse_groups

  • discourse/discourse_type – type of discourse semantic relation, such as cond (textual condition)

  • discourse/is_secondary – set to 1 if the relation is expressed by a secondary connective

  • discourse/is_negated – set to 1 if the relation is expressed by a negated secondary connective

  • discourse/comment – further specifies the discourse type for some relations expressed by secondary connectives; three possible values: Regard, Conclusion, Entailment.

  • discourse/t-connectors.rf – list of ids of nodes from the tectogrammatical layer that represent the discourse connective (or the core of the secondary discourse connective)

  • discourse/a-connectors.rf – list of ids of nodes from the analytical layer that represent the discourse connective (or the core of the secondary discourse connective)

  • discourse/t-connectors_ext.rf – list of ids of nodes from the tectogrammatical layer that represent the whole ("extended") secondary discourse connective

  • discourse/a-connectors_ext.rf – list of ids of nodes from the analytical layer that represent the whole ("extended") secondary discourse connective

  • discourse_groups – list of identifiers of groups the given node belongs to

  • discourse_special – three possible values for three special roles of the phrase represented by the node and its subtree: heading (replaces attribute is_heading from PDiT 1.0), metatext and caption.

  • sense_PDTB3 – a transformation of the discourse type to a Penn Discourse Treebank 3.0 sense.

  • sense_PDTB3_manual – a manually filled-in sense for cases when automatic transformation of the discourse type to a Penn Discourse Treebank 3.0 sense would fail; this value was then used in sense_PDTB3
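
The possible values of discourse/start_range and discourse/target_range listed above can be interpreted with a few lines of code. This is a minimal sketch that assumes the attribute value has already been read from the data as a string or an integer; the function name parse_range and the returned tuple shape are hypothetical and not part of any PDiT tooling.

```python
# Non-numeric range values defined for start_range / target_range.
SPECIAL_RANGES = {"group", "forward", "backward"}


def parse_range(value):
    """Return (kind, extra_trees) for a start_range/target_range value.

    kind is "trees" for a numeric value n (extra_trees = n, the number of
    additional trees to the right), otherwise "group", "forward" or
    "backward" (extra_trees = None).
    """
    text = str(value).strip()
    if text in SPECIAL_RANGES:
        return (text, None)
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"unexpected range value: {value!r}")
    return ("trees", int(text))
```

For example, parse_range("0") gives ("trees", 0), meaning just the node and its subtree, and parse_range("group") gives ("group", None), in which case the group identifier has to be looked up in discourse/start_group_id or discourse/target_group_id.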

References

Poláková, L., Synková, P.: Pragmatické aspekty v popisu textové koherence. Naše řeč, 4, 2021, pp. 225-242.

Rysová, M., Rysová, K.: Secondary Connectives in the Prague Dependency Treebank. In Hajičová, Eva; Nivre, Joakim (eds.): Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015). Uppsala, Sweden: Uppsala University, 2015, pp. 291–299. ISBN 978-91-637-8965-6. WWW: http://www.aclweb.org/anthology/W/W15/W15-2132.pdf

Rysová, M.: Diskurzní konektory v češtině (Od centra k periferii). Ph.D. thesis, Charles University in Prague, Prague, Czechia, 268 pp., Oct 2015.

Rysová, M., Rysová, K.: The Centre and Periphery of Discourse Connectives. In Aroonmanakun, Wirote; Boonkwan, Prachya; Supnithi, Thepchai (eds.): Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing (PACLIC 28). Bangkok, Thailand: Department of Linguistics, Faculty of Arts, Chulalongkorn University, 2014, pp. 452–459. ISBN 978-616-551-887-1. WWW: http://aclweb.org/anthology/Y14-1052