Title |
An Efficient and Flexible Format for Linguistic and Semantic Annotation |
Authors |
Špela Vintar (DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany); Paul Buitelaar (DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany); Bärbel Ripplinger (Eurospider Information Technology AG, Schaffhauserstrasse 18, CH-8006 Zürich, Switzerland); Bogdan Sacaleanu (DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany); Diana Raileanu (DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany); Detlef Prescher (DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany) |
Session |
WP4: Corpus Annotation |
Abstract |
The paper describes an XML annotation format and tool developed within the MUCHMORE project. The annotation scheme was designed specifically for Cross-Lingual Information Retrieval in the medical domain, so as to allow both efficient and flexible access to layers of information. We use a parallel English-German corpus of medical abstracts and annotate it with linguistic information (tokenisation, part-of-speech tagging, lemmatisation and decomposition, phrase recognition, grammatical functions) as well as semantic information from various sources. The annotation of medical terms/concepts, semantic types and semantic relations is based on the Unified Medical Language System (UMLS). Additionally, we use EuroWordNet as a general-language resource to annotate word senses and to compare domain-specific and general language use. A further major aim of the project is to complement existing ontological resources by extracting new terms and new semantic relations. We present the annotation scheme, which is conceptually related to stand-off annotation, and describe our tool for automatic semantic annotation. |
Keywords |
Flexible format, Tools |