Wikidata:Embedding Project
This page is a work in progress, not an article or policy, and may be incomplete and/or unreliable.
Please offer suggestions on the talk page.
The Wikidata Embedding Project is an initiative led by Wikimedia Deutschland in collaboration with Jina.AI and DataStax. The project aims to enhance Wikidata's search functionality by integrating vector-based semantic search. By employing advanced machine learning models and scalable vector databases, the project seeks to support the open-source community in building innovative AI applications on top of Wikidata's multilingual and inclusive knowledge graph, while making its extensive data more accessible and contextually relevant for users across the globe.
Overview
The Wikidata Embedding Project aims to enhance how people access and engage with Wikidata's vast knowledge base. By implementing vector-based semantic search, the project makes finding relevant information easier and more contextually meaningful for everyone. The current search method, CirrusSearch, is limited by its focus on keyword matching, which often fails to capture the meaning behind a search query. SPARQL, on the other hand, offers precise and accurate data retrieval, but its steep learning curve and complexity make it challenging for many users. The vector-based approach bridges this gap, combining the accessibility of keyword search with context-aware results.
Beyond improving search for Wikidata, this project also encourages the open-source AI/ML community to build innovative solutions on top of a structured and publicly accessible knowledge graph. By making the tools and data open-source, the project empowers developers to create new AI-driven applications that leverage Wikidata’s inclusive and accessible knowledge. Potential applications include source-attribution generative AI, named entity recognition (NER) and disambiguation (NED), hybrid semantic and graph-based search, data visualisation, and multilingual consistency detection…
Goals
The primary objectives of the Wikidata Embedding Project are:
- Supporting the AI/ML Community:
Contribute to the open-source machine learning community by making the project’s tools and models freely accessible. By providing an open-source, scalable vector database, we encourage developers to build AI and ML projects on top of Wikidata’s structured knowledge base.
- Building a Scalable Vector Database for Enhanced Search:
Develop a robust, scalable vector database that enhances search functionality, enabling efficient querying of Wikidata entities through text similarity and question-answering capabilities. This database opens new possibilities for AI-driven applications, offering a flexible and powerful resource for accessing Wikidata’s comprehensive data.
- Promoting Global Access and Community Collaboration:
Allow AI/ML projects built on the vector database to inherit Wikidata's commitment to inclusivity, transparency, and community involvement. By integrating Wikidata, users are encouraged to contribute actively, correct inaccuracies, and collaborate to continually improve data quality. With support for multiple languages, the project ensures that these tools and data resources are accessible to a global audience.
Partners and Collaboration
The project involves strategic partnerships with leading organisations in the AI and machine learning space:
- Jina.AI: Jina.AI is providing a powerful open-source embedding model that supports 100+ languages and can handle up to 8192 tokens.
- DataStax: DataStax is providing a scalable vector database, allowing the storage and retrieval of Wikidata entities through vector similarity.
Setup
The Wikidata Embedding Project involves transforming structured data from Wikidata into vectorised representations that facilitate semantic search and contextual relevance. This setup process occurs in several key stages; a code sketch of the first stage follows the list:
- Transforming Wikidata Entities into Text: Each Wikidata entity consists of structured information, including labels, descriptions, aliases, and claims. Labels serve as the primary name for each entity, while descriptions provide brief contextual information. Aliases offer alternative names, and claims define additional properties and relationships to other entities. To create a text representation of each entity, these components are combined into a coherent string that preserves the entity’s essential information. As an example, consider the Wikidata entity for Douglas Adams (Q42). A potential transformation could be:
Douglas Adams, English science fiction writer and humorist (1952–2001), also known as Douglas N. Adams, Douglas Noël Adams, Douglas Noel Adams, DNA. Attributes include:
- instance of: "human"
- sex or gender: "male"
- occupation: "playwright",
"Screenwriter",
"novelist (start time: 1979)"
- notable work: "The Hitchhiker's Guide to the Galaxy pentalogy",
"Dirk Gently series",
"The Private Life of Genghis Khan"
- date of birth: "1952 Mar 11"
- place of birth: "Cambridge"
…
(In this example, the entity's label, description, aliases, property labels, statement values, and qualifiers are combined into a single string.)
- Converting Text into Vectors: Once entities are transformed into text, they are passed through an embedding model. This model generates vector representations, or embeddings, of the text, capturing the semantic meaning and context of each entity.
- Storing Vectors in a Vector Database: The generated vectors are stored in a vector database. This database is optimised for storing and retrieving high-dimensional vectors, enabling efficient and scalable vector search capabilities. Each vector is linked to its corresponding Wikidata entity, allowing for fast retrieval of relevant entities during search.
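To make the first stage concrete, here is a minimal Python sketch of the entity-to-text transformation. It is illustrative only: the function names are invented for this page, it uses Wikidata's public Special:EntityData endpoint rather than the project's data pipeline, and it skips the claims (property labels, statement values, qualifiers) shown in the Douglas Adams example, since rendering those requires resolving property and item IDs to labels. The project's actual preparation code is in the repository linked under Source Code.
<syntaxhighlight lang="python">
import requests

def fetch_entity(qid: str) -> dict:
    """Fetch the raw JSON for one Wikidata entity, e.g. Q42 (Douglas Adams)."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()["entities"][qid]

def entity_to_text(entity: dict, lang: str = "en") -> str:
    """Combine label, description, and aliases into one coherent string.

    A complete implementation would also render the entity's claims,
    as in the Douglas Adams example above; that step is omitted here.
    """
    label = entity["labels"][lang]["value"]
    description = entity["descriptions"][lang]["value"]
    aliases = [alias["value"] for alias in entity.get("aliases", {}).get(lang, [])]
    text = f"{label}, {description}"
    if aliases:
        text += ", also known as " + ", ".join(aliases)
    return text + "."

print(entity_to_text(fetch_entity("Q42")))
# e.g. "Douglas Adams, English science fiction writer and humorist, also known as ..."
</syntaxhighlight>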
Inference
During the inference phase, user queries are processed to retrieve relevant Wikidata entities based on vector similarity, which measures how closely the meaning of the query aligns with the stored entity vectors. This process involves the following steps:
- Embedding User Queries: When a user inputs a query (as a question or a statement), it is passed through the embedding model to generate a vector representation. This vector captures the semantic meaning of the query, placing it in the same vector space as the Wikidata entities.
- Vector Similarity Search: The query vector is then compared to the vectors stored in the DataStax vector database. Using vector similarity metrics, the database identifies and retrieves entities that are most relevant to the query. This approach allows the system to return results based on conceptual closeness rather than just keyword matching.
For example, if a user inputs the query "Who wrote Hitchhiker's Guide to the Galaxy?", the vector database identifies that the embedding for the entity "Douglas Adams" has the highest similarity to the query. This is because the text representation of the entity "Douglas Adams" contains the relevant information that answers the question. The sketch below walks through these two steps.
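Two stand-ins are assumed for illustration: an off-the-shelf sentence-transformers model in place of the project's Jina.AI embedding model, and a small in-memory dictionary in place of the DataStax vector database.
<syntaxhighlight lang="python">
import numpy as np
from sentence_transformers import SentenceTransformer

# Stand-in for the production embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Stand-in for the vector database: a handful of precomputed entity vectors.
entity_texts = {
    "Q42": "Douglas Adams, English science fiction writer and humorist, "
           "notable work: The Hitchhiker's Guide to the Galaxy pentalogy.",
    "Q1": "universe, totality of space and all its contents.",
}
entity_vectors = {qid: model.encode(text) for qid, text in entity_texts.items()}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query: str, top_k: int = 1) -> list[tuple[str, float]]:
    """Step 1: embed the query. Step 2: rank entities by vector similarity."""
    query_vector = model.encode(query)
    scores = [(qid, cosine(query_vector, vec)) for qid, vec in entity_vectors.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]

print(search("Who wrote Hitchhiker's Guide to the Galaxy?"))
# Expected: Q42 (Douglas Adams) ranks highest.
</syntaxhighlight>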
Embedding Model
An embedding model is a type of machine learning model that transforms text into a continuous, high-dimensional vector representation. These vectors serve as a numerical encoding of the semantic meaning of the text, positioning similar concepts close together within the vector space even when different words are used to express them. This makes embedding models particularly powerful for similarity comparison and advanced semantic search, allowing the system to group together concepts with similar meanings.
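As a brief demonstration of that property (again with an off-the-shelf sentence-transformers model standing in for the project's actual model), two paraphrases that share almost no vocabulary still land close together in the vector space, while an unrelated sentence does not:
<syntaxhighlight lang="python">
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Two phrasings of the same concept, plus an unrelated sentence.
embeddings = model.encode([
    "author of humorous science fiction novels",
    "writer of comedic sci-fi books",
    "recipe for tomato soup",
])

print(embeddings.shape)  # (3, 384): one fixed-length vector per input text

# The paraphrases score far higher with each other than with the outlier.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
</syntaxhighlight>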
Vector Database
A vector database is a specialised data storage system designed to efficiently handle high-dimensional vector data. Unlike traditional databases that store structured data in tables, vector databases are optimised to store, manage, and retrieve the complex vector representations produced by machine learning models. They support similarity search algorithms, such as cosine similarity and Euclidean distance, which identify vectors that are conceptually close within the vector space.
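Both metrics are easy to state directly; the NumPy sketch below implements them alongside a brute-force nearest-neighbour lookup. Note that real vector databases avoid this exhaustive scan, using approximate nearest-neighbour indexes to keep search fast at scale.
<syntaxhighlight lang="python">
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 for identical directions; near 0.0 for unrelated vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance; smaller means more similar."""
    return float(np.linalg.norm(a - b))

def top_k(query: np.ndarray, stored: np.ndarray, k: int = 5) -> np.ndarray:
    """Brute-force search: indices of the k stored vectors most similar
    to the query by cosine similarity. A production vector database
    replaces this O(n) scan with an approximate nearest-neighbour index."""
    scores = stored @ query / (np.linalg.norm(stored, axis=1) * np.linalg.norm(query))
    return np.argsort(scores)[::-1][:k]

# Toy example with random 8-dimensional vectors.
rng = np.random.default_rng(0)
stored = rng.normal(size=(100, 8))
print(top_k(stored[7], stored, k=3))  # index 7 itself should rank first
</syntaxhighlight>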
Demo
Coming soon!
Get Involved
Are you interested in contributing or learning more about our project? We'd love to hear from you! Reach out to us for more information or collaboration opportunities:
- Philippe Saadé, AI/ML Project Manager, Wikimedia Deutschland
- Lydia Pintscher, Portfolio Lead Product Manager for Wikidata, Wikimedia Deutschland
- Jonathan Fraine, Head of Engineering, Co-Head of Software Development, Wikimedia Deutschland
Presentations & Blog Posts
- Presentation at AI_dev Open Source GenAI & ML Summit
- Date: June 19, 2024
- Event: AI_dev Open Source GenAI & ML Summit
- Title: Wikidata Knowledge Graph to Enable Equitable and Validated Generative AI
- Speakers: Jonathan Fraine & Lydia Pintscher, Wikimedia Deutschland
- Press Release
- Date: September 17, 2024
- Title: Wikidata and Artificial Intelligence: Simplified Access to Open Data for Open Source Projects
- Contact: Corinna Schuster, Wikimedia Deutschland
- Press Release
- Date: December 3, 2024
- Title: Wikimedia Deutschland Launches AI Knowledge Project in Collaboration with DataStax Built with NVIDIA AI
- Contact: Regan Schiappa, DataStax & Zarah Ziadi, Wikimedia Deutschland
- Blog
- Date: December 3, 2024
- Title: Build Equitable and Validated Generative AI with Wikidata and DataStax Leveraging NVIDIA Technologies
- Writer: Cedrick Lunven, DataStax
Source Code
You can find the code for data preparation, experimentation, and evaluation in this GitHub repository.
Updates
You can find weekly summaries and status updates on Wikidata at this link.