NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval

Uri Katz, Matan Vetzler, Amir Cohen, Yoav Goldberg

Abstract

Recognizing entities in texts is a central need in many information-seeking scenarios, and indeed, Named Entity Recognition (NER) is arguably one of the most successful examples of a widely adopted NLP task and corresponding NLP technology. Recent advances in large language models (LLMs) appear to provide effective solutions (also) for NER tasks that were traditionally handled with dedicated models, often matching or surpassing the abilities of the dedicated models. Should NER be considered a solved problem? We argue to the contrary: the capabilities provided by LLMs are not the end of NER research, but rather an exciting beginning. They allow taking NER to the next level, tackling increasingly more useful, and increasingly more challenging, variants. We present three variants of the NER task, together with a dataset to support them. The first is a move towards more fine-grained—and intersectional—entity types. The second is a move towards zero-shot recognition and extraction of these fine-grained types based on entity-type labels. The third, and most challenging, is the move from the recognition setup to a novel retrieval setup, where the query is a zero-shot entity type, and the expected result is all the sentences from a large, pre-indexed corpus that contain entities of these types, and their corresponding spans. We show that all of these are far from being solved. We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types, to facilitate research towards all of these three goals.

Anthology ID: 2023.findings-emnlp.218
Original: 2023.findings-emnlp.218v1
Version 2: 2023.findings-emnlp.218v2
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3340–3354
URL: https://aclanthology.org/2023.findings-emnlp.218
DOI: 10.18653/v1/2023.findings-emnlp.218
Cite (ACL): Uri Katz, Matan Vetzler, Amir Cohen, and Yoav Goldberg. 2023. NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3340–3354, Singapore. Association for Computational Linguistics.
Cite (Informal): NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval (Katz et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.218.pdf
