[go: up one dir, main page]

Page MenuHomePhabricator

Investigation: is it feasible to drop the mapdata group ID hash magic?
Open, Needs TriagePublic

Description

Mapframe and maplink tags can be assigned an explicit name (like "see"), and if one is lacking then they will be given an automatically-assigned name which is calculated from the hash of fully-parsed GeoJSON contents. We have experiences some weaknesses caused by this magic (TBD: link tasks).

Is it possible to assign a plain, autoincrement numeric ID instead?

Open questions:

  • Explore changes needed to drop production of hash IDs:
  • Where is back-compatibility required?
    • Static image URL with old group IDs. groups are passed to the mapdata library and included in the mapdata API, so we can push compat back to the mapdata API for this case.
    • mapdata API when requesting old group IDs
    • mapdata API when requesting all groups?
      • Could the consumer silently know old group IDs?
      • How often is the mapdata API called without group IDs?
        • Around 0.01%–0.1% of requests have no mpdgroups parameter.
      • Sample these requests—they mostly come from "node-fetch"
      • what is this caller and does it assume hash IDs?
    • mapdata API when requesting by titles and not revids?
      • How often is the API called without revids?
        • Around 2% of requests are by title.
        • Almost all of these are coming from kartotherian
        • What is still calling kartotherian with title?
  • What to do about template changes which can cause renumbering of maps for a given revision?
    • This could cause maps to appear in the wrong place in an article.
    • Could be converted back into the existing bug (missing map) by using a more sophisticated ID?
    • Random (version 4) UUID changes on every parse, which makes it unsuitable
  • mapdata API should accept additional parameters which cause variation in parsed output, eg. isMobile. (can be follow-up work)
  • Consider producer rollback.
    • Compatiblity ahead of producer change. These are separate patches.
    • Toggle producer with configuration.

If successful, consider how this can be applied to solve T328772: Design safe migration path for geoshape expansion.

Event Timeline

Change 887779 had a related patch set uploaded (by Awight; author: Awight):

[mediawiki/extensions/Kartographer@master] [POC] Drop hashing for group IDs

https://gerrit.wikimedia.org/r/887779

Change 887782 had a related patch set uploaded (by Awight; author: Awight):

[mediawiki/extensions/Kartographer@master] [POC] Transparently map legacy group hash IDs to new ID

https://gerrit.wikimedia.org/r/887782

Change 887805 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Kartographer@master] [POC] Hash calculated from canonical, partly normalized GeoJSON

https://gerrit.wikimedia.org/r/887805

I left a few comments on Gerrit. Most notably:

This means the only identifier will be the position on the page, right? Here is the problem: What if I edit the page and move a <mapframe> down? This will swap their identifiers without a chance for anybody to understand what just happened.
We could make the revision id a strict requirement in all cases, effectively making it part of the group id. This would improve the situation but not solve it entirely, unfortunately. New scenario: What if the article starts with an infobox template, but that template is later changed to contain a <mapframe>? This will affect old revisions and the total number of <mapframe> they show, which affects the number here.

An edit as simple as "add a new map on top of the article" will shift the proposed ids down. Out of a sudden the same number identifies a different map. Adding the revision id to the mix improves the situation but can't fundamentally solve the issue. There is no canonical rendering of an old revision we could reliably use to derive the auto-incremented numbers from.

User-provided group names are fundamentally different. They consolidate data from across the entire page. They don't identify individual maps.

The best alternative proposal I have at the moment is what's demoed in https://gerrit.wikimedia.org/r/887805.

This means the only identifier will be the position on the page, right? Here is the problem: What if I edit the page and move a <mapframe> down? This will swap their identifiers without a chance for anybody to understand what just happened.

Thanks, the feedback is appreciated! What use case are you thinking of, in which automatic renumbering will cause problems? We already have this in the extreme when using hashes: changing any tag content at all will cause the group ID to be renumbered. The idea of the internal ID is that it's never exposed to the user and is not referenced manually, so nobody needs to understand how they change. They only exist in order for the maps server to pull a subset of the page data when rendering static thumbnails.

We already have this in the extreme when using hashes: changing any tag content at all will cause the group ID to be renumbered.

I'm not sure how this relates. The hashes are guaranteed to be unique per map and guaranteed to change when an edit is made. The behavior of sequence numbers is pretty much the opposite.

The idea of the internal ID is that it's never exposed […]

I wonder what the point is then? When we still need to give all the same guarantees about the hash we can as well use that.

The POC patches I see so far don't touch the fundamental issue: Depending on skin, mobile domain, language variant and whatnot (i.e. ParserOptions) the parser behaves different, the wikitext in titles and descriptions creates different HTML, which ends in the expanded GeoJSON and becomes part of the hash. While the POC patches move the hash calculation to a different place it uses the same GeoJSON as before.

The mapdata API is currently hard-coded to use ParserOptions::newFromAnon(), which sounds fine, but doesn't necessarily match the ParserOptions that have been used to create the HTML the user is currently looking at. The mapdata API might find something in this "anon" parser cache slot, but will learn that the group hashes found in this slot are different from the ones provided in the snapshot image url.

As long as we expose these hashes but can't guarantee the parser cache slot found by the mapdata API is the same as the one where the hashes came from, nothing changes.

Exposing the proposed sequence numbers in the snapshot image urls sounds like it should work in many cases. Unfortunately it's flawed for other reasons, as I tried to explain above.

We already have this in the extreme when using hashes: changing any tag content at all will cause the group ID to be renumbered.

I'm not sure how this relates. The hashes are guaranteed to be unique per map and guaranteed to change when an edit is made. The behavior of sequence numbers is pretty much the opposite.

Whether a hash or a sequence is used will be opaque to editors and users. So as far as I can tell, they are functionally equivalent in every way except this one detail—and the only guarantee given by using a hash is that you will never accidentally see a different version map than the rest of the page contents. This is a nice guarantee, but it's also given by the revision_id+map_sequence_id as you point out, which is the image URL produced by Kartographer.

The POC patches I see so far don't touch the fundamental issue: Depending on skin, mobile domain, language variant and whatnot (i.e. ParserOptions) the parser behaves different, the wikitext in titles and descriptions creates different HTML, which ends in the expanded GeoJSON and becomes part of the hash. While the POC patches move the hash calculation to a different place it uses the same GeoJSON as before.

That's correct, however there's one important edge case failure which disappears immediately when using sequence numbers and that's exactly T328772: Design safe migration path for geoshape expansion which is the motivation behind doing this work. The variation is based on server software configuration and not on a request-specific parameter. Additionally, sequential IDs will give us the same easy migration path to solving all of the parameterized variations you list. In future work, if we wire the language variant through into the snapshot image URL and back through the mapdata API, then we're free to respond with transformed map contents without breaking the brittle hashing system.

This is a nice guarantee, but it's also given by the revision_id+map_sequence_id […]

Unfortunately not. There is no canonical representation of an old revision we can use to reliably count, number and identify maps on a page. One of many events that affect old revisions is when an old template is deleted. It also disappears from old revisions. When the template contained a map other maps on the page shift up.

This makes the proposed sequential ids as brittle as the hashes, I'm afraid.

The only canonical representation of an old revision is it's raw wikitext. Unfortunately this is not accessibly the moment a <mapframe> tag is parsed. But we have easy access to the raw, unprocessed text inside the tag – where we expect GeoJSON. In theory a hash based on this should never change. Not over time and not even when the page is edited as long as no change is made to the <mapframe>. That's the essence of the existing hash system. Unfortunately the implementation is flawed. https://gerrit.wikimedia.org/r/887805 is a possible way to fix it.

This is a nice guarantee, but it's also given by the revision_id+map_sequence_id […]

Unfortunately not. There is no canonical representation of an old revision we can use to reliably count, number and identify maps on a page. One of many events that affect old revisions is when an old template is deleted. It also disappears from old revisions. When the template contained a map other maps on the page shift up.

Ah, very interesting. Now I see that a random number or UUID would serve the purpose, there's no reason at all to use hashes. There's no point at which the hash is independently calculated and compared or looked up from content, this is just an identifier.

But changing old revision content is a huge pain. Perhaps we can consider this as a corner case, especially considering that viewing maps in old revisions wasn't even possible until last year. I think we can safely say that the most common case of this will be a small change to map templates, where a sequential ID will allow successfully serving the tweaked map. In the corner case of adding or removing a map, there's a chance that maps may move around or break until the cache is purged, when viewing stale HTML content for a page.

Sure, we as developers would use UUIDs. I'm sure this was considered when the system was designed. But where do we store this UUID when the <mapframe> source code is not even in the article, but somewhere deep in a template transclusion tree? How do we force users to generate UUIDs that actually are unique? How can UUIDs be generated from a template so users of the template don't need to be bothered with this, but the UUID changes only when a new template usage is added to an article and not for existing usages? So many questions. 😥️

Change 888215 had a related patch set uploaded (by Awight; author: Awight):

[mediawiki/extensions/Kartographer@master] [refactor] Move group filtering into a function

https://gerrit.wikimedia.org/r/888215

where do we store this UUID

Any IDs will be stored in the same place as now: in ParserOutput under the extension property. We could consider storing additional mappings to hash ID etc. as a step during migration, but this doesn't seem to be necessary.

How do we force users to generate UUIDs that actually are unique?

Automatic IDs are generated during parse and opaque to users, so no problem here. Users never see this ID.

UUID [should only] change [...] when a new template usage is added to an article

That's more of a problem, yeah. A random ID would change on every parse which explodes the cache mismatch problem. Revision ID + sequential ID isn't quite good enough because the revision's content and even map count can change. Parsoid assigns each element a unique ID, but this is only valid for one rendering of a page.

Change 888215 merged by jenkins-bot:

[mediawiki/extensions/Kartographer@master] [refactor] Move group filtering into a function

https://gerrit.wikimedia.org/r/888215

I'm afraid we keep moving the problem around without tackling it.

The way this is described now means that these ids (no matter if UUIDs or sequence numbers) can change any time. Any time a page is saved. Any time a preview is generated. Any time a parser cache entry gets stale or is purged and the page re-rendered. This creates all the same problems as before: The HTML source a user looks at can easily be different from what the mapdata API will find in the parser cache. The HTML can be older (e.g. the page was purged and re-parsed in the meantime) or from a different parser cache slot (because of language variant, skin, mobile domain and whatnot). While the mapdata API will find some map data in the parser cache, the ids can be different for the same (and even more) reasons as before.

I don't see an elegant way out of this sub-revision case, but perhaps we can could down the already rare case by some large percentage by using a simple checksum of the final number of mapframes in the revision. This number is returned by the mapdata API and compared to a count embedded in the page HTML, and if the numbers don't match then we can * invalidate all map thumbnails on the page, fall back to dynamic maps, prompt the user to reload the page...

I realize also that I've been unfair to hashed IDs, they aren't directly to blame for the language variant and mobile image problems. In both cases, the hash is correctly varying and would work correctly if we also passed the necessary information through to the kartotherian snapshot endpoint.

Possible plan:

  • Add backwards compatibility code that keeps the old hashes working for a while.
  • Note we plan to change the hashes anyway because of parse time expansion. So we need this either way.
  • Make the hashes stable, only based on the raw wikitext, but not on language variant and whatnot.
  • We are now in a situation where users will see maps with the wrong language variant, wrong thumbsize and whatnot. This is already a massive improvement compared to the broken maps from before.
  • Make sure the mapdata API is called with all necessary information to be able to deliver the map in the correct variant.

That should be it.

If we want we can use another scheme for the new hashes, e.g. _v2:0123456789abcdef. But we don't need to, I believe. One benefit of sticking to the existing scheme is that some hashes (not sure how many, my gut feeling is 25%) won't change.

Change 889948 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Kartographer@master] [POC] No per-user thumbsize when parsing wikitext in GeoJSON

https://gerrit.wikimedia.org/r/889948

Possible plan:

  • Add backwards compatibility code that keeps the old hashes working for a while.
  • Note we plan to change the hashes anyway because of parse time expansion. So we need this either way.
  • Make the hashes stable, only based on the raw wikitext, but not on language variant and whatnot.
  • We are now in a situation where users will see maps with the wrong language variant, wrong thumbsize and whatnot. This is already a massive improvement compared to the broken maps from before.
  • Make sure the mapdata API is called with all necessary information to be able to deliver the map in the correct variant.

This seems reasonable to me. Any extension should be able to access the raw text of the contents of that extension tag /before it has been treated as wikitext or parsed in any way/, and that's a very reasonable thing to hash.

Another alternative is to /require/ the user to give a unique name to every map. You could enforce that with parser errors or warnings when a page is saved, adding a linter message to have volunteers name existing maps, create a bot to do the migration, etc. When you insert a map with VE or whatever a default UUID could be assigned and written into the wikitext as the name if the editor doesn't pick a better one. Eventually you'll turn on enforcement and the map will display as a big red error message or some such if there's not a user-defined name (or if the name isn't unique).