Jia Xu

2023

pdf bib abs
Probabilistic Robustness for Data Filtering
Yu Yu | Abdul Rafae Khan | Shahram Khadivi | Jia Xu
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

We introduce our probabilistic robustness rewarded data optimization (PRoDO) approach as a framework to enhance the model’s generalization power by selecting training data that optimizes our probabilistic robustness metrics. We use proximal policy optimization (PPO) reinforcement learning to approximately solve the computationally intractable training subset selection problem. The PPO’s reward is defined as our (𝛼,𝜖, 𝛾)-Robustness that measures performance consistency over multiple domains by simulating unknown test sets in real-world scenarios using a leaving-one-out strategy. We demonstrate that our PRoDO effectively filters data that lead to significantly higher prediction accuracy and robustness on unknown-domain test sets. Our experiments achieve up to +17.2% increase of accuracy (+25.5% relatively) in sentiment analysis, and -28.05 decrease of perplexity (-32.1% relatively) in language modeling.In addition, our probabilistic (𝛼,𝜖, 𝛾)-Robustness definition serves as an evaluation metric with higher levels of agreement with human annotations than typical performance-based metrics.

2022

pdf bib abs
Analyzing Encoded Concepts in Transformer Language Models
Hassan Sajjad | Nadir Durrani | Fahim Dalvi | Firoj Alam | Abdul Khan | Jia Xu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose a novel framework ConceptX, to analyze how latent concepts are encoded in representations learned within pre-trained lan-guage models. It uses clustering to discover the encoded concepts and explains them by aligning with a large set of human-defined concepts. Our analysis on seven transformer language models reveal interesting insights: i) the latent space within the learned representations overlap with different linguistic concepts to a varying degree, ii) the lower layers in the model are dominated by lexical concepts (e.g., affixation) and linguistic ontologies (e.g. Word-Net), whereas the core-linguistic concepts (e.g., morphology, syntactic relations) are better represented in the middle and higher layers, iii) some encoded concepts are multi-faceted and cannot be adequately explained using the existing human-defined concepts.

pdf bib abs
SIT at MixMT 2022: Fluent Translation Built on Giant Pre-trained Models
Abdul Khan | Hrishikesh Kanade | Girish Budhrani | Preet Jhanglani | Jia Xu
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes the Stevens Institute of Technology’s submission for the WMT 2022 Shared Task: Code-mixed Machine Translation (MixMT). The task consisted of two subtasks, subtask 1 Hindi/English to Hinglish and subtask 2 Hinglish to English translation. Our findings lie in the improvements made through the use of large pre-trained multilingual NMT models and in-domain datasets, as well as back-translation and ensemble techniques. The translation output is automatically evaluated against the reference translations using ROUGE-L and WER. Our system achieves the 1st position on subtask 2 according to ROUGE-L, WER, and human evaluation, 1st position on subtask 1 according to WER and human evaluation, and 3rd position on subtask 1 with respect to ROUGE-L metric.

pdf bib abs
Measuring Robustness for NLP
Yu Yu | Abdul Rafae Khan | Jia Xu
Proceedings of the 29th International Conference on Computational Linguistics

The quality of Natural Language Processing (NLP) models is typically measured by the accuracy or error rate of a predefined test set. Because the evaluation and optimization of these measures are narrowed down to a specific domain like news and cannot be generalized to other domains like Twitter, we often observe that a system reported with human parity results generates surprising errors in real-life use scenarios. We address this weakness with a new approach that uses an NLP quality measure based on robustness. Unlike previous work that has defined robustness using Minimax to bound worst cases, we measure robustness based on the consistency of cross-domain accuracy and introduce the coefficient of variation and (epsilon, gamma)-Robustness. Our measures demonstrate higher agreements with human evaluation than accuracy scores like BLEU on ranking Machine Translation (MT) systems. Our experiments of sentiment analysis and MT tasks show that incorporating our robustness measures into learning objectives significantly enhances the final NLP prediction accuracy over various domains, such as biomedical and social media.

pdf bib abs
Byte-based Multilingual NMT for Endangered Languages
Mengjiao Zhang | Jia Xu
Proceedings of the 29th International Conference on Computational Linguistics

Multilingual neural machine translation (MNMT) jointly trains a shared model for translation with multiple language pairs. However, traditional subword-based MNMT approaches suffer from out-of-vocabulary (OOV) issues and representation bottleneck, which often degrades translation performance on certain language pairs. While byte tokenization is used to tackle the OOV problems in neural machine translation (NMT), until now its capability has not been validated in MNMT. Additionally, existing work has not studied how byte encoding can benefit endangered language translation to our knowledge. We propose a byte-based multilingual neural machine translation system (BMNMT) to alleviate the representation bottleneck and improve translation performance in endangered languages. Furthermore, we design a random byte mapping method with an ensemble prediction to enhance our model robustness. Experimental results show that our BMNMT consistently and significantly outperforms subword/word-based baselines on twelve language pairs up to +18.5 BLEU points, an 840% relative improvement.

pdf bib abs
Can Data Diversity Enhance Learning Generalization?
Yu Yu | Shahram Khadivi | Jia Xu
Proceedings of the 29th International Conference on Computational Linguistics

This paper introduces our Diversity Advanced Actor-Critic reinforcement learning (A2C) framework (DAAC) to improve the generalization and accuracy of Natural Language Processing (NLP). We show that the diversification of training samples alleviates overfitting and improves model generalization and accuracy. We quantify diversity on a set of samples using the max dispersion, convex hull volume, and graph entropy based on sentence embeddings in high-dimensional metric space. We also introduce A2C to select such a diversified training subset efficiently. Our experiments achieve up to +23.8 accuracy increase (38.0% relatively) in sentiment analysis, -44.7 perplexity decrease (37.9% relatively) in language modeling, and consistent improvements in named entity recognition over various domains. In particular, our method outperforms both domain adaptation and generalization baselines without using any target domain knowledge.

2021

pdf bib abs
Grouping Words with Semantic Diversity
Karine Chubarian | Abdul Rafae Khan | Anastasios Sidiropoulos | Jia Xu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Deep Learning-based NLP systems can be sensitive to unseen tokens and hard to learn with high-dimensional inputs, which critically hinder learning generalization. We introduce an approach by grouping input words based on their semantic diversity to simplify input language representation with low ambiguity. Since the semantically diverse words reside in different contexts, we are able to substitute words with their groups and still distinguish word meanings relying on their contexts. We design several algorithms that compute diverse groupings based on random sampling, geometric distances, and entropy maximization, and we prove formal guarantees for the entropy-based algorithms. Experimental results show that our methods generalize NLP models and demonstrate enhanced accuracy on POS tagging and LM tasks and significant improvements on medium-scale machine translation tasks, up to +6.5 BLEU points. Our source code is available at https://github.com/abdulrafae/dg.

2020

pdf bib abs
Coding Textual Inputs Boosts the Accuracy of Neural Networks
Abdul Rafae Khan | Jia Xu | Weiwei Sun
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Natural Language Processing (NLP) tasks are usually performed word by word on textual inputs. We can use arbitrary symbols to represent the linguistic meaning of a word and use these symbols as inputs. As “alternatives” to a text representation, we introduce Soundex, MetaPhone, NYSIIS, logogram to NLP, and develop fixed-output-length coding and its extension using Huffman coding. Each of those codings combines different character/digital sequences and constructs a new vocabulary based on codewords. We find that the integration of those codewords with text provides more reliable inputs to Neural-Network-based NLP systems through redundancy than text-alone inputs. Experiments demonstrate that our approach outperforms the state-of-the-art models on the application of machine translation, language modeling, and part-of-speech tagging. The source code is available at https://github.com/abdulrafae/coding_nmt.

2019

This paper describes the systems of the CUNY-PKU team in SemEval 2019 Task 1: Cross-lingual Semantic Parsing with UCCA. We introduce a novel model by applying a cascaded MLP and BiLSTM model. Then, we ensemble multiple system-outputs by reparsing. In particular, we introduce a new decoding algorithm for building the UCCA representation. Our system won the first place in one track (French-20K-Open), second places in four tracks (English-Wiki-Open, English-20K-Open, German-20K-Open, and German-20K-Closed), and third place in one track (English-20K-Closed), among all seven tracks.

2018

pdf bib abs
Assessing Quality Estimation Models for Sentence-Level Prediction
Hoang Cuong | Jia Xu
Proceedings of the 27th International Conference on Computational Linguistics

This paper provides an evaluation of a wide range of advanced sentence-level Quality Estimation models, including Support Vector Regression, Ride Regression, Neural Networks, Gaussian Processes, Bayesian Neural Networks, Deep Kernel Learning and Deep Gaussian Processes. Beside the accurateness, our main concerns are also the robustness of Quality Estimation models. Our work raises the difficulty in building strong models. Specifically, we show that Quality Estimation models often behave differently in Quality Estimation feature space, depending on whether the scale of feature space is small, medium or large. We also show that Quality Estimation models often behave differently in evaluation settings, depending on whether test data come from the same domain as the training data or not. Our work suggests several strong candidates to use in different circumstances.

pdf bib abs
Hunter NMT System for WMT18 Biomedical Translation Task: Transfer Learning in Neural Machine Translation
Abdul Khan | Subhadarshi Panda | Jia Xu | Lampros Flokas
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the submission of Hunter Neural Machine Translation (NMT) to the WMT’18 Biomedical translation task from English to French. The discrepancy between training and test data distribution brings a challenge to translate text in new domains. Beyond the previous work of combining in-domain with out-of-domain models, we found accuracy and efficiency gain in combining different in-domain models. We conduct extensive experiments on NMT with transfer learning. We train on different in-domain Biomedical datasets one after another. That means parameters of the previous training serve as the initialization of the next one. Together with a pre-trained out-of-domain News model, we enhanced translation quality with 3.73 BLEU points over the baseline. Furthermore, we applied ensemble learning on training models of intermediate epochs and achieved an improvement of 4.02 BLEU points over the baseline. Overall, our system is 11.29 BLEU points above the best system of last year on the EDP 2017 test set.

2017

2014

2011

pdf bib
DFKI Hybrid Machine Translation System for WMT 2011 - On the Integration of SMT and RBMT
Jia Xu | Hans Uszkoreit | Casey Kennington | David Vilar | Xiaojun Zhang
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf bib
Generating Virtual Parallel Corpus: A Compatibility Centric Method
Jia Xu | Weiwei Sun
Proceedings of Machine Translation Summit XIII: Papers

pdf bib
Parallel Corpus Refinement as an Outlier Detection Algorithm
Kaveh Taghipour | Shahram Khadivi | Jia Xu
Proceedings of Machine Translation Summit XIII: Papers

pdf bib
Enhancing Chinese Word Segmentation Using Unlabeled Data
Weiwei Sun | Jia Xu
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

2010

Evaluation of automatic translation output is a difficult task. Several performance measures like Word Error Rate, Position Independent Word Error Rate and the BLEU and NIST scores are widely use and provide a useful tool for comparing different systems and to evaluate improvements within a system. However the interpretation of all of these measures is not at all clear, and the identification of the most prominent source of errors in a given system using these measures alone is not possible. Therefore some analysis of the generated translations is needed in order to identify the main problems and to focus the research efforts. This area is however mostly unexplored and few works have dealt with it until now. In this paper we will present a framework for classification of the errors of a machine translation system and we will carry out an error analysis of the system used by the RWTH in the first TC-STAR evaluation.

pdf bib
Partitioning Parallel Documents Using Binary Segmentation
Jia Xu | Richard Zens | Hermann Ney
Proceedings on the Workshop on Statistical Machine Translation