The official code for paper "Enhancing Language Representation with Constructional Information for Natural Language Understanding"
🔗 Data • Tutorial • Guideline • Quick Start • Related Work • FAQ❓
Note
This repository is still under construction and will take some time to complete.
Construction Grammar (CxG) is a branch of cognitive linguistics. It holds that grammar is a meaningful continuum of lexicon, morphology, and syntax, and it defines constructions as linguistic patterns that pair form with meaning. Because the meaning of a construction is attached to the pattern itself rather than to specific words, constructional information is hard for PLMs to learn and demands large amounts of training data, which can lead to failures on NLU tasks.
This motivates us to incorporate construction grammar into PLMs. We therefore propose a preliminary framework, HyCxG (Hypergraph network of Construction Grammar), which enhances language representation with constructional information through a three-stage solution. First, we extract constructions from the input sentence and select the discriminative ones. Next, a Relational Hypergraph Attention Network is applied to attach the constructional information to the words. Finally, the resulting representation is fine-tuned on a variety of downstream tasks.
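Conceptually, each selected construction acts as a hyperedge that connects all the tokens it covers, and the hypergraph attention operates over these hyperedges. The sketch below is purely illustrative (the span-based input format and the function name are our assumptions, not the repository's API); it shows how a token-by-hyperedge incidence matrix could be built from selected construction spans:

```python
import numpy as np

def build_incidence(num_tokens, construction_spans):
    # construction_spans: list of (start, end) token indices (end exclusive),
    # one per selected construction. Each construction becomes one hyperedge
    # connecting every token inside its span.
    H = np.zeros((num_tokens, len(construction_spans)), dtype=np.float32)
    for e, (start, end) in enumerate(construction_spans):
        H[start:end, e] = 1.0
    return H

# Example: a 7-token sentence with two selected constructions
# covering tokens [0, 3) and [2, 6).
print(build_incidence(7, [(0, 3), (2, 6)]))
```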
The contents of this repository are organized as follows:
- HyCxG contains the complete code of the HyCxG framework.
- Data contains all the datasets used in this work together with processing scripts. Most of the datasets are downloaded from our mirror source, and data processing scripts for several baseline models are also provided.
- Tutorial includes some tutorials for HyCxG and related resources to our work.
- Guideline (Under construction) illustrates the information about baseline models & FAQ.
1 Experimental environment setup
We adopt Python 3.8.5 as the base environment. You can create the environment and install the dependencies with the following commands:
conda create -n hycxg_env python=3.8.5
source activate hycxg_env
pip install -r requirements.txt
2 Prepare the dataset
We provide a data download script in the data folder. You can fetch the data directly with the following commands:
cd data
bash data_pipeline.sh
After downloading the data, please move each data folder (e.g., JSONABSA_MAMS) to the HyCxG/dataset
directory.
3 Prepare the data for components
Before running the code, you need to download the data required by the components (e.g., construction lists). The download scripts are located under HyCxG/dataset and HyCxG/Tokenizer, respectively. You can obtain the data directly with the following commands:
cd HyCxG/dataset
bash download_vocab.sh
cd ../Tokenizer
bash download_cxgdict.sh
4 Run HyCxG
We provide example commands for running HyCxG in HyCxG/run_hycxg.sh.
HyCxG relies on the following open-source packages (a usage sketch follows this list):
- c2xg for extracting constructions from sentences
- simanneal, a convenient simulated annealing framework for solving combinatorial optimization problems
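To give a concrete sense of how simanneal can be used for construction selection, here is a minimal, self-contained sketch that picks a subset of candidate construction spans by rewarding token coverage and penalizing overlap. The objective, the span encoding, and the class name are simplifying assumptions for illustration, not the exact formulation used in HyCxG:

```python
import random
from simanneal import Annealer

class ConstructionSelector(Annealer):
    # Toy selector: state is a 0/1 mask over candidate construction spans.

    def __init__(self, state, spans):
        self.spans = spans  # list of (start, end) token indices, end exclusive
        super().__init__(state)

    def move(self):
        # Flip the selection bit of one random candidate.
        i = random.randrange(len(self.state))
        self.state[i] = 1 - self.state[i]

    def energy(self):
        # Lower energy is better: reward covered tokens, penalize overlaps.
        counts = {}
        for keep, (start, end) in zip(self.state, self.spans):
            if keep:
                for t in range(start, end):
                    counts[t] = counts.get(t, 0) + 1
        covered = len(counts)
        overlap = sum(c - 1 for c in counts.values())
        return -(covered - 2 * overlap)

spans = [(0, 3), (2, 6), (4, 7), (8, 10)]
selector = ConstructionSelector([1] * len(spans), spans)
selector.steps = 2000
best_mask, best_energy = selector.anneal()
print(best_mask, best_energy)
```

Running this prints the selected 0/1 mask and its final energy; in HyCxG the candidates and the objective come from the construction extraction and selection stages rather than this toy setup.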
If you find our work helpful, please cite our paper "Enhancing Language Representation with Constructional Information for Natural Language Understanding":
@inproceedings{xu2023enhancing,
title = "Enhancing Language Representation with Constructional Information for Natural Language Understanding",
author = "Xu, Lvxiaowei and
Wu, Jianwang and
Peng, Jiawei and
Gong, Zhilin and
Cai, Ming and
Wang, Tianxiang",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2023",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.258",
pages = "4685--4705",
}
If you have any questions about the code, feel free to open an Issue or contact xlxw@zju.edu.cn.