Update snapshots service to produce and upload chunks of snapshots
Closed, Resolved · Public · 8 Estimated Story Points

Description

To allow downloading chunks of a snapshot in parallel and querying chunk metadata, we need to update our snapshots service (export handler) to generate a tar and metadata for each chunk.

Refer to RfC-chuncked-snapshots for details.

To do

  • Generate and upload chunk tar.gz and JSON files

The export handler generates and uploads to S3, for example, snapshots/enwiki_namespace_0.json and snapshots/enwiki_namespace_0.tar.gz for project enwiki and namespace 0.
Each tar.gz contains several NDJSON files, split according to the uncompressed file size limit. We want each of those NDJSON files uploaded to S3 as chunks, as follows (a key-construction sketch follows the listing):

chunks/enwiki_namespace_0/chunk_0.json
chunks/enwiki_namespace_0/chunk_0.tar.gz
chunks/enwiki_namespace_0/chunk_1.json
chunks/enwiki_namespace_0/chunk_1.tar.gz
.
.
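
For illustration only, a minimal Go sketch of how those keys might be constructed (the helper name and signature are assumptions, not existing code; only the key layout comes from this task):

package main

import "fmt"

// chunkKeys returns the S3 keys for a chunk's metadata and archive,
// following the chunks/<identifier>/chunk_<n>.{json,tar.gz} layout above.
// The helper is hypothetical.
func chunkKeys(identifier string, n int) (jsonKey, tarKey string) {
	prefix := fmt.Sprintf("chunks/%s/chunk_%d", identifier, n)
	return prefix + ".json", prefix + ".tar.gz"
}

func main() {
	j, t := chunkKeys("enwiki_namespace_0", 0)
	fmt.Println(j) // chunks/enwiki_namespace_0/chunk_0.json
	fmt.Println(t) // chunks/enwiki_namespace_0/chunk_0.tar.gz
}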

  • Update the chunks field of the snapshot metadata.

In the export handler, we need to update the chunks field of the snapshot metadata to reflect the chunks present for a snapshot (see the struct sketch after the example).

For the above example:

{
    "identifier": "enwiki_namespace_0",
    "version": "637a1410d4e803c0b5ca04ecc6890815",
    "date_modified": "2023-12-21T02:40:14.475051666Z",
    "is_part_of": {
        "identifier": "enwiki"
    },
    "in_language": {
        "identifier": "en"
    },
    "namespace": {
        "identifier": 0
    },
    "size": {
        "value": 123374.514e0,
        "unit_text": "MB"
    },
    "chunks": ["enwiki_namespace_0_chunk_0", "enwiki_namespace_0_chunk_1", …]
}
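
A rough Go sketch of the corresponding metadata struct (all type and field names here are assumptions mirroring the JSON above; only the Chunks field is new):

package snapshot // hypothetical package

import "time"

type Ref struct {
	Identifier string `json:"identifier"`
}

type NamespaceRef struct {
	Identifier int `json:"identifier"`
}

type Size struct {
	Value    float64 `json:"value"`
	UnitText string  `json:"unit_text"`
}

type Snapshot struct {
	Identifier   string       `json:"identifier"`
	Version      string       `json:"version"`
	DateModified time.Time    `json:"date_modified"`
	IsPartOf     Ref          `json:"is_part_of"`
	InLanguage   Ref          `json:"in_language"`
	Namespace    NamespaceRef `json:"namespace"`
	Size         Size         `json:"size"`
	// Chunks lists the chunk identifiers generated for this snapshot; the
	// export handler appends one entry per uploaded chunk.
	Chunks []string `json:"chunks"`
}
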
  • Switch to enable chunking

The snapshots handler is also used to generate batches, but we are only aiming to generate chunks for snapshots. To enable/disable chunking, add a new field enable_chunking to ExportRequest in protos/snapshots.proto.

Update the protos submodule for the scheduler and snapshots services. Set enable_chunking to false for the batches DAG and to true for the snapshots DAG (see the handler sketch below).
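
A minimal sketch of how the export handler might branch on the flag (handler, helper, and generated-type names are all assumptions; only the enable_chunking field itself is specified by this task):

// Sketch only: pb stands for the package generated from protos/snapshots.proto
// once the enable_chunking field exists; ExportHandler, exportSnapshot, and
// exportChunks are hypothetical names.
func (h *ExportHandler) Export(ctx context.Context, req *pb.ExportRequest) (*pb.ExportResponse, error) {
	// Always produce the full snapshot tar.gz and metadata JSON.
	if err := h.exportSnapshot(ctx, req); err != nil {
		return nil, err
	}
	// Produce per-chunk artifacts only when chunking is enabled, so the
	// batches DAG (enable_chunking = false) keeps its current behavior.
	if req.GetEnableChunking() {
		if err := h.exportChunks(ctx, req); err != nil {
			return nil, err
		}
	}
	return &pb.ExportResponse{}, nil
}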

  • Optional: some code refactoring

The team decided this is entirely optional. If refactoring takes time and makes completing this task complex, opt to put good comments in the code instead.
Note that snapshot generation and upload involve steps and functionality similar to chunk generation and upload. If possible, consider using common interfaces such as the following (these are just examples; a streaming-upload sketch follows):

// TarWriter takes a buffer, creates a tar header from it, then writes the
// tar header and the buffer's data using the tar writer.
type TarWriter interface {
	TarWriter(buf *bytes.Buffer, trw *tar.Writer) error
}

// Uploader reads from a pipe and uploads the data to an S3 bucket under the given key.
type Uploader interface {
	Uploader(upl *s3manager.Uploader, prr *nio.PipeReader, bkt string, key string) error
}
.
.
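
As a concrete illustration of the shared upload step, both paths could stream a tar.gz through a pipe straight to S3 (a sketch using the stdlib io.Pipe instead of nio; all names are illustrative):

package export // hypothetical package

import (
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// uploadStream starts an S3 upload that consumes everything written to the
// returned pipe writer, so archive bytes never touch the local disk. The
// error channel yields the upload result once the writer is closed.
func uploadStream(upl *s3manager.Uploader, bkt, key string) (*io.PipeWriter, <-chan error) {
	pr, pw := io.Pipe()
	errc := make(chan error, 1)
	go func() {
		_, err := upl.Upload(&s3manager.UploadInput{
			Bucket: aws.String(bkt),
			Key:    aws.String(key),
			Body:   pr,
		})
		errc <- err
	}()
	return pw, errc
}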

Sync with @ROdonnell-WMF, as he has already done some draft refactoring and implementation.

QA / Acceptance criteria

  • Dev deployment and testing

After dev deployment, verify that the right number of chunks is uploaded to S3 and that the chunk and snapshot metadata are updated accordingly (see the verification sketch below).
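
For example, a sketch of the count check (bucket name, prefix layout, and helper name are assumptions):

package qa // hypothetical package

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/s3"
)

// countChunkArchives counts the chunk tar.gz objects under a snapshot's
// chunk prefix; the result should equal len(chunks) in the snapshot metadata.
func countChunkArchives(client *s3.S3, bkt, identifier string) (int, error) {
	n := 0
	err := client.ListObjectsV2Pages(&s3.ListObjectsV2Input{
		Bucket: aws.String(bkt),
		Prefix: aws.String("chunks/" + identifier + "/"),
	}, func(page *s3.ListObjectsV2Output, last bool) bool {
		for _, obj := range page.Contents {
			if strings.HasSuffix(aws.StringValue(obj.Key), ".tar.gz") {
				n++
			}
		}
		return true // continue paging
	})
	return n, err
}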

Event Timeline

prabhat updated the task description.

I refactored some of the code for the uploader and compression but didn't create an interface for the tar method.

I will hold off on dev testing until SC is in Prod.

Sounds good. Refactoring is best effort and optional.

MR: Added chunking "lite"

I want to add more unit tests to improve coverage of the compressed files and review uncovered lines of code, rather than push this to dev and rush it through.

There is a second MR with heavier refactoring. We could use some of its code structures and folders to make the code more maintainable, but that is a long-term effort.

An issue: we won't have the download file size or MD5 hash in S3. Maybe I should add the PutObjectInput for each chunk?

JArguello-WMF changed the point value for this task from 5 to 8. Jul 31 2024, 1:04 PM