
This web page is dedicated to ÚFAL activities related to the Penn Discourse Treebank. For now, it gives information on transforming the PDTB data into the PML format.

Below we give instructions on how to transform the data of the Penn Treebank (PTB) and the Penn Discourse Treebank (PDTB) into the PML format, so that the data can be queried with the PML-Tree Query. You need to have the tree editor TrEd installed, along with the extensions for the PTB and the PDTB-2.0 (in TrEd, go to Setup->Manage Extensions->Get New Extensions, and search for ptb and pdtb2).

The transformation steps are:

  1. transformation of the PTB data into the PML format
    • The transformation script penn2pml.pl is part of the ptb TrEd extension. A single file from the PTB, e.g. wsj_0001.mrg, can be transformed by:
      • penn2pml.pl wsj_0001.mrg
    • However, you do not need to use the script directly for individual files. Instead, you can go to the directory tools/01_PTB_to_PML of the pdtb2 TrEd extension, set the path to the original mrg/wsj data of the PTB in the Makefile, and run:
      • make transform
  2. transformation of the native PDTB format .pdtb into the COLUMN format
    • The transformation script convert.pl is part of the original PDTB-2.0 distribution. A fixed version (correcting an error in the transformation of senses of AltLexes) is part of the pdtb2 TrEd extension, and the transformation itself is performed as part of the next step.
  3. adding the data from the PDTB COLUMN format into the PTB files in the PML format
    • Transformation scripts are a part of the pdtb2 TrEd extension - in the directory tools/02_PDTB_to_PML-PTB, set the path to the original PDTB-2.0 distribution in the Makefile, and run:
      • make merge
  4. unifying non-terminals and terminals in a single node type
    • Transformation scripts are a part of the pdtb2 TrEd extension - in the directory tools/03_untype_nodes, run:
      • make untype
  5. adding information about genres of documents (optional)
    • Transformation scripts are a part of the pdtb2 TrEd extension - in the directory tools/04_add_genres, run:
      • make genres_ad
  6. adding redundancy to the files (optional)
    • Transformation scripts are a part of the pdtb2 TrEd extension - in the directory tools/05_add_redundancy, run:
      • make redundancy
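
The mandatory steps above (1, 3 and 4; step 2 runs as part of step 3) can be chained in a small shell script. The following is only a sketch under stated assumptions: the variable PDTB2_EXT and its default location of the pdtb2 extension are hypothetical and must be adjusted to your installation, and the paths to the PTB and PDTB-2.0 data must already be set in the respective Makefiles. The helper names run_step and run_pipeline are introduced here for illustration and are not part of the extension.

```shell
#!/bin/sh
# Sketch only: runs the three mandatory make targets in order.
# PDTB2_EXT is an ASSUMED location of the pdtb2 TrEd extension; adjust it.
PDTB2_EXT="${PDTB2_EXT:-$HOME/.tred.d/extensions/pdtb2}"
# DRY_RUN=1 (the default in this sketch) only prints the commands;
# set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# run_step DIR TARGET: run "make TARGET" inside tools/DIR of the extension.
run_step() {
    dir="$PDTB2_EXT/tools/$1"
    if [ "$DRY_RUN" = "1" ]; then
        echo "cd $dir && make $2"
    else
        # Subshell keeps the caller's working directory unchanged.
        (cd "$dir" && make "$2") || return 1
    fi
}

# run_pipeline: steps 1, 3 and 4; stops at the first failing step.
run_pipeline() {
    run_step 01_PTB_to_PML transform &&
    run_step 02_PDTB_to_PML-PTB merge &&
    run_step 03_untype_nodes untype
}

run_pipeline
```

By default the script only prints what it would do, which makes it easy to check the paths before anything is executed. The optional steps 5 and 6 are left out on purpose; they can be added with further run_step calls once the mandatory steps succeed.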

After this transformation, you can open the data in the tree editor TrEd (the final data are in the directory tools/03_untype_nodes/data). To actually search in the data, you also need to install the TrEd extension for the PML-Tree Query (in TrEd, go to Setup->Manage Extensions->Get New Extensions, and search for pmltq).