Some things to work out:
- Where on stat1008 should the canonical repo live, or does that matter? (So far various engineers have run the pipeline from their home directories; that seems fine as long as the repo is up-to-date.)
- How often should we re-run the pipeline for an existing dataset? Re-runs are needed because we currently query datasets by page title rather than page ID, so renamed pages (e.g. T274195) would otherwise drop out of the results. Should we automate the process via a cronjob? (A sketch of what that could look like follows this list.)
- What QA, if any, should be done after generating a new dataset? run-pipeline.sh generates the datasets but does not publish them to analytics.wikimedia.org; publishing is intentionally a separate step. (A possible pre-publish sanity check is also sketched below.)
- [your questions / concerns here]
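
If we do go the cronjob route, the automation could be as small as the entry below. This is only a sketch: the repo path, schedule, and log location are placeholders, not decisions (and depend on the "where should the canonical repo live" question above).

```
# Hypothetical crontab entry on stat1008: re-run the pipeline monthly.
# Repo path, schedule, and log file are placeholders.
0 4 1 * * cd /srv/pipeline-repo && ./run-pipeline.sh >> "$HOME/pipeline-rerun.log" 2>&1
```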
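On the QA question, since publishing is already a deliberate separate step, a lightweight pre-publish sanity check might be enough to start with. The sketch below assumes the generated datasets land as files in an output directory; the glob and size threshold are made up for illustration.

```bash
#!/bin/bash
# Hypothetical pre-publish QA check: fail if any generated dataset is
# missing or suspiciously small. Output glob and threshold are placeholders.
set -euo pipefail

MIN_BYTES=1024
for f in output/*.tsv; do
    size=$(stat -c%s "$f")
    if [ "$size" -lt "$MIN_BYTES" ]; then
        echo "QA FAIL: $f is only ${size} bytes" >&2
        exit 1
    fi
done
echo "QA OK: all datasets present and non-trivial"
```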