Kenji Araki

Also published as: K. Araki

2022

Creation of Polish Online News Corpus for Political Polarization Studies
Joanna Szwoch | Mateusz Staszkow | Rafal Rzepka | Kenji Araki
Proceedings of the LREC 2022 workshop on Natural Language Processing for Political Sciences

In this paper we describe a Polish news corpus as an attempt to create a filtered, organized and representative set of texts coming from contemporary online press articles from two major Polish TV news providers: commercial TVN24 and state-owned TVP Info. The process consists of web scraping, data cleaning and formatting. A random sample was selected from the prepared data to perform a classification task. The random forest achieved the best prediction results out of all considered models. We believe that this dataset is a valuable contribution to existing Polish language corpora, as online news is considered to be formal and relatively mistake-free and therefore a reliable source of correct written language, unlike other online platforms such as blogs or social media. Furthermore, to our knowledge, such a corpus from this period of time has not been created before. In the future we would like to expand this dataset with articles coming from other online news providers and repeat the classification task on a bigger scale, utilizing other algorithms. Our data analysis outcomes might be a relevant basis to improve research on political polarization and propaganda techniques in media.

2020

Can Existing Methods Debias Languages Other than English? First Attempt to Analyze and Mitigate Japanese Word Embeddings
Masashi Takeshita | Yuki Katsumata | Rafal Rzepka | Kenji Araki
Proceedings of the Second Workshop on Gender Bias in Natural Language Processing

It is known that word embeddings exhibit biases inherited from the corpus, and those biases reflect social stereotypes. Recently, many studies have been conducted to analyze and mitigate biases in word embeddings. Unsupervised Bias Enumeration (UBE) (Swinger et al., 2019) is one approach to analyzing biases for English, and Hard Debias (Bolukbasi et al., 2016) is the common technique to mitigate gender bias. These methods focused on English or, to a lesser extent, on Indo-European languages. However, it is not clear whether these methods can be generalized to other languages. In this paper, we apply these analyzing and mitigating methods, UBE and Hard Debias, to Japanese word embeddings. Additionally, we examine whether these methods can be used for Japanese. We experimentally show that UBE and Hard Debias cannot be sufficiently adapted to Japanese embeddings.

2019

Word Embedding-Based Automatic MT Evaluation Metric using Word Position Information
Hiroshi Echizen’ya | Kenji Araki | Eduard Hovy
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We propose a new automatic evaluation metric for machine translation. Our proposed metric is obtained by adjusting the Earth Mover’s Distance (EMD) to the evaluation task. The EMD measure is used to obtain the distance between two probability distributions consisting of some signatures having a feature and a weight. We use word embeddings, sentence-level tf-idf, and cosine similarity between two word embeddings, respectively, as the features, weight, and the distance between two features. Results show that our proposed metric can evaluate machine translation based on word meaning. Moreover, for distance, cosine similarity and word position information are used to address word-order differences. We designate this metric as Word Embedding-Based automatic MT evaluation using Word Position Information (WE_WPI). A meta-evaluation using the WMT16 metrics shared task set indicates that our WE_WPI achieves the highest correlation with human judgment among several representative metrics.

2018

Comparison of Pun Detection Methods Using Japanese Pun Corpus
Motoki Yatsu | Kenji Araki
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Automatic Evaluation of Commonsense Knowledge for Refining Japanese ConceptNet
Seiya Shudo | Rafal Rzepka | Kenji Araki
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

In this paper we present two methods for automatic common sense knowledge evaluation for Japanese entries in the ConceptNet ontology. Our proposed methods utilize a text-mining approach: one with relation clue words and WordNet synonyms, and one without. Both methods were tested with a blog corpus. The system based on our proposed methods reached a relatively high precision score for three relations (MadeOf, UsedFor, AtLocation), which is comparable with previous research using commercial search engines and simpler input. We analyze errors and discuss problems of common sense evaluation, both manual and automatic, and propose ideas for further improvements.

This research focuses on text processing in the sphere of English-language social media. We introduce two database resources. The first, CECS (Casual English Conversion System) database, a lexicon-type resource of 1,255 entries, was constructed for use in our experimental system for the automated normalization of casual, irregularly-formed English used in communications such as Twitter. Our rule-based approach primarily aims to avoid problems caused by user creativity and individuality of language when Twitter-style text is used as input in Machine Translation, and to aid comprehension for non-native speakers of English. Although the database is still under development, we have so far carried out two evaluation experiments using our system which have shown positive results. The second database, CEGS (Casual English Generation System) phoneme database contains sets of alternative spellings for the phonemes in the CMU Pronouncing Dictionary, designed for use in a system for generating phoneme-based casual English text from regular English input; in other words, automatically producing humanlike creative sentences as an AI task. This paper provides an overview of the necessity, method, application and evaluation of both resources.

2010

Automatic Evaluation Method for Machine Translation Using Noun-Phrase Chunking
Hiroshi Echizen-ya | Kenji Araki
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

2009

Evaluation of a System for Noun Concepts Acquisition from Utterances about Images (SINCA) Using Daily Conversation Data
Yuzu Uchida | Kenji Araki
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

2008

A Multi-Lingual Dictionary of Dirty Words
Jonas Sjöbergh | Kenji Araki
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present a multi-lingual dictionary of dirty words. We have collected about 3,200 dirty words in several languages and built a database of these. The language with the most words in the database is English, though there are several hundred dirty words in, for instance, Japanese too. Words are classified by their general meaning, such as what part of the human anatomy they refer to. Words can also be assigned a nuance label to indicate whether a word is a cute word used when speaking to children, a very rude word, a clinical word, etc. The database is available online and will hopefully be enlarged over time. It has already been used in research on, for instance, automatic joke generation and emotion detection.

What is poorly Said is a Little Funny
Jonas Sjöbergh | Kenji Araki
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We implement several different methods for generating jokes in English. The common theme is to intentionally produce poor utterances by breaking Grice's maxims of conversation. The generated jokes are evaluated and compared to human-made jokes. They are in general quite weak jokes, though there are a few high-scoring jokes and many jokes that score higher than the most boring human joke.

A Complete and Modestly Funny System for Generating and Performing Japanese Stand-Up Comedy
Jonas Sjöbergh | Kenji Araki
Coling 2008: Companion volume: Posters

A Casual Conversation System Using Modality and Word Associations Retrieved from the Web
Shinsuke Higuchi | Rafal Rzepka | Kenji Araki
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

Modifying SO-PMI for Japanese Weblog Opinion Mining by Using a Balancing Factor and Detecting Neutral Expressions
Guangwei Wang | Kenji Araki
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

OMS-J: An Opinion Mining System for Japanese Weblog Reviews Using a Combination of Supervised and Unsupervised Approaches
Guangwei Wang | Kenji Araki
Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT)

Automatic evaluation of machine translation based on recursive acquisition of an intuitive common parts continuum
Hiroshi Echizen-ya | Kenji Araki
Proceedings of Machine Translation Summit XI: Papers

Semi-supervised Algorithm for Human-Computer Dialogue Mining
Calkin S. Montero | Kenji Araki
Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing

Recreating Humorous Split Compound Errors in Swedish by Using Grammaticality
Jonas Sjöbergh | Kenji Araki
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

2006

Is It Correct? – Towards Web-Based Evaluation of Automatic Natural Language Phrase Generation
Calkin S. Montero | Kenji Araki
Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions

2005

Detecting the Countability of English Compound Nouns Using Web-based Models
Jing Peng | Kenji Araki
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

Automatic Acquisition of Bilingual Rules for Extraction of Bilingual Word Pairs from Parallel Corpora
Hiroshi Echizen-ya | Kenji Araki | Yoshio Momouchi
Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition

2003

Effectiveness of automatic extraction of bilingual collocations using recursive chain-link-type learning
Hiroshi Echizen-ya | Kenji Araki | Yoshio Momouchi | Koji Tochinai
Proceedings of Machine Translation Summit IX: Papers

2002

Evaluation of Direct Speech Translation Method Using Inductive Learning for Conversations in the Travel Domain
Koji Murakami | Makoto Hiroshige | Kenji Araki | Koji Tochinai
Proceedings of the ACL-02 Workshop on Speech-to-Speech Translation: Algorithms and Systems

A Word Segmentation Method with Dynamic Adapting to Text Using Inductive Learning
Zhongjian Wang | Kenji Araki | Koji Tochinai
COLING-02: The First SIGHAN Workshop on Chinese Language Processing

Study of Practical Effectiveness for Machine Translation Using Recursive Chain-link-type Learning
Hiroshi Echizen-ya | Kenji Araki | Yoshio Momouchi | Koji Tochinai
COLING 2002: The 19th International Conference on Computational Linguistics

2000

Effectiveness of layering translation rules based on transition networks in machine translation using inductive learning with genetic algorithms
Hiroshi Echizen-ya | Kenji Araki | Yoshio Momouchi | Koji Tochinai
Proceedings of the International Conference on Machine Translation and Multilingual Applications in the new Millennium: MT 2000

1999

Example-based machine translation of part-of-speech tagged sentences by recursive division
Tantely Andriamanankasina | Kenji Araki | Koji Tochinai
Proceedings of Machine Translation Summit VII

Example-Based Machine Translation can be applied to languages for which resources such as dictionaries and reliable syntactic analyzers are hardly available, because it can learn from new translation examples. However, difficulties still remain in the translation of sentences which are not fully covered by the matching sentence. To solve that problem, we present in this paper a translation method which recursively divides a sentence and translates each part separately. In addition, we evaluate an analogy-based word-level alignment method which predicts word correspondences between source and translation sentences of new translation examples. The translation method was implemented in a French-Japanese machine translation system, and spoken-language texts were used as examples. Promising translation results were obtained, and the effectiveness of the alignment method in the translation was confirmed.

Sub-Sentential Alignment Method by Analogy
Tantely Andriamanankasina | Kenji Araki | Koji Tochinai
Proceedings of the 13th Pacific Asia Conference on Language, Information and Computation

A Study of Performance Evaluation for GA-ILMT Using Travel English
Hiroshi Echizen-ya | Kenji Araki | Yoshio Momouchi | Koji Tochinai
Proceedings of the 13th Pacific Asia Conference on Language, Information and Computation