
Write a client that consumes the RDF update stream from https://stream.wikimedia.org/ and updates a triple store
Open, Needs Triage, Public

Description

In the WDQS code we consume the RDF update stream from Kafka using the KafkaStreamConsumer class. A similar implementation should be written to work on top of HTTP EventStreams.
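For orientation, a minimal sketch of reading such a stream over HTTP with the JDK's built-in HttpClient. EventStreams speaks Server-Sent Events, and the stream name `rdf-streaming-updater.mutation` used here is an assumption that would need to be checked against the streams actually exposed by stream.wikimedia.org:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EventStreamReader {
    public static void main(String[] args) throws Exception {
        // Hypothetical stream name; verify against the EventStreams registry.
        URI uri = URI.create(
                "https://stream.wikimedia.org/v2/stream/rdf-streaming-updater.mutation");
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Accept", "text/event-stream")
                .build();
        // Server-Sent Events: each event is a block of "field: value" lines;
        // the JSON payload arrives on "data:" lines, keep-alives start with ":".
        HttpResponse<java.util.stream.Stream<String>> response =
                client.send(request, HttpResponse.BodyHandlers.ofLines());
        response.body()
                .filter(line -> line.startsWith("data:"))
                .map(line -> line.substring("data:".length()).trim())
                .forEach(json -> {
                    // Hand the JSON payload to the StreamConsumer implementation here.
                    System.out.println(json);
                });
    }
}
```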

The features it must provide are:

  • implement StreamConsumer
  • offset handling and persistence (which is provided out of the box when consuming directly from Kafka)
    • it knows what to do on the first run (infer the initial offset, possibly by asking the triple store itself: select (min(?date) as ?start) { wikibase:Dump schema:dateModified ?date } LIMIT 1; see the sketch after this list)
    • it knows how to resume operations
  • Adapt the existing main, or add a new one, to run it based on a set of parameters
  • Use the same batching/compression technique (see PatchAccumulator)
  • ideally populate the same set of metrics
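A sketch of the first-run offset inference mentioned above, issuing the dump-date query from the description against a SPARQL endpoint. The plain-HTTP approach and the localhost endpoint URL are assumptions for illustration; the real implementation would presumably reuse the existing RDF repository client:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class InitialOffsetProbe {
    // Query from the task description: the dump's schema:dateModified gives a
    // safe lower bound for where to start consuming the stream.
    private static final String QUERY =
            "PREFIX wikibase: <http://wikiba.se/ontology#> "
            + "PREFIX schema: <http://schema.org/> "
            + "SELECT (MIN(?date) AS ?start) { wikibase:Dump schema:dateModified ?date }";

    public static String fetchStartDate(String endpoint) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String url = endpoint + "?query=" + URLEncoder.encode(QUERY, StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "text/csv") // trivial to parse for a single binding
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // CSV body: header line "start", then one line with the timestamp.
        String[] lines = response.body().split("\r?\n");
        return lines.length > 1 ? lines[1] : null;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical local Blazegraph endpoint; adjust to the target store.
        System.out.println(fetchStartDate("http://localhost:9999/bigdata/namespace/wdq/sparql"));
    }
}
```

For resuming, EventStreams supports historical consumption when reconnecting (e.g. via a `since` timestamp parameter or the SSE `Last-Event-ID` header), so the persisted offset, or the date inferred above, can be passed on reconnect.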

AC:

  • a triple store compatible with SPARQL 1.1 Update operations and loaded with a munged Wikidata dump can be updated outside of the WMF infrastructure using HTTP event streams.
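For context, a minimal sketch of pushing an accumulated patch to any SPARQL 1.1 Update compliant store over the standard protocol. The endpoint URL and the example update are placeholders; the actual client would build the update statement from the batched/compressed patch:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SparqlUpdateSender {
    public static void send(String endpoint, String update) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // SPARQL 1.1 Protocol: POST the update body with the dedicated media type.
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/sparql-update")
                .POST(HttpRequest.BodyPublishers.ofString(update))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new IllegalStateException("Update failed: " + response.statusCode());
        }
    }

    public static void main(String[] args) throws Exception {
        // Illustrative update against a hypothetical local endpoint.
        send("http://localhost:9999/bigdata/namespace/wdq/sparql",
             "PREFIX schema: <http://schema.org/> "
             + "INSERT DATA { <http://example.org/s> schema:name \"example\" }");
    }
}
```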