Topic on User talk:Hjfocs

Soweego bot adding invalid fandom pages

Summary by Hjfocs

[soweego 2] MusicBrainz (Q14005) URLs validation: bad ID extraction & percent-encoded URLs

BrokenSegue (talkcontribs)

The bot seems to be adding "lyrics" as a Fandom article ID (P6262) value, which is invalid. Can some extra filtering be added to prevent this from happening in the future?

BrokenSegue (talkcontribs)
Hjfocs (talkcontribs)

Many thanks for reporting this: I've just stopped the bot, will look into those bad IDs, and will delete those that were uploaded.

Hjfocs (talkcontribs)

I got rid of additional bad IDs, so they shouldn't be added anymore. The bot has now been restarted. While deleting the uploaded ones, I noticed you had already taken care of them, thanks again for your action! One question: what was your approach? I see you used QuickStatements, and I was wondering how. Cheers!

BrokenSegue (talkcontribs)

I used a SPARQL query to find all items that linked to the bad Fandom page, and then a one-line bash script to convert the results into a QuickStatements file that removes them.
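For readers who want to reproduce this kind of cleanup, here is a minimal sketch of the conversion step, written in Python rather than bash. It is not the script actually used: the function name and inputs are hypothetical, and it assumes QuickStatements V1 syntax, where a leading "-" on a tab-separated line removes the matching statement and string values are wrapped in double quotes.

```python
def removal_commands(item_uris, prop="P6262", bad_value="lyrics"):
    """Turn item URIs (e.g. from a SPARQL result) into QuickStatements V1 removal lines.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    lines = []
    for uri in item_uris:
        # Keep only the trailing QID of an entity URI such as .../entity/Q42
        qid = uri.rstrip("/").rsplit("/", 1)[-1]
        if not (qid.startswith("Q") and qid[1:].isdigit()):
            continue  # skip anything that is not an item ID
        # Assumed V1 syntax: "-" prefix removes the statement
        lines.append(f'-{qid}\t{prop}\t"{bad_value}"')
    return lines


if __name__ == "__main__":
    for line in removal_commands(["http://www.wikidata.org/entity/Q42"]):
        print(line)
```

The resulting lines could then be pasted into QuickStatements as a batch.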

The thing I don't get about this error by your bot is how it parsed out just "lyrics" as a fandom article link. That isn't the correct format for fandom articles but it is the correct format for Fandom wiki ID (P4073).

Hjfocs (talkcontribs)

I like your workflow!

The bot tries its best to parse URLs into proper Wikidata IDs, but there's a jungle of URL variations, bad regexp matches, multiple matching groups, and the like, so it's not perfect. In this specific case:

  • a couple of hundred Fandom URLs available in MusicBrainz (the total is on the order of 10^5) matched the second regexp in Property:P6262#P8966;
  • that regexp has 2 matching groups;
  • the bot ignored the URL match replacement value (P8967) qualifier stated on the regexp;
  • it took the first matching group as the ID value.
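
The failure mode in the list above can be illustrated with a small sketch. The pattern and replacement template below are hypothetical stand-ins, not the actual values stored on P6262, and the function names are made up for illustration; the point is only the difference between taking the first group and applying a replacement template.

```python
import re

# Hypothetical pattern with 2 groups (wiki name, article title); NOT the real P8966 value
PATTERN = re.compile(r"https?://([a-z0-9-]+)\.fandom\.com/wiki/(.+)")
# Hypothetical replacement template, standing in for a P8967 qualifier
REPLACEMENT = r"\1:\2"


def extract_first_group(url):
    """Buggy behaviour: ignore the replacement template, return group 1 only."""
    m = PATTERN.fullmatch(url)
    return m.group(1) if m else None


def extract_with_replacement(url):
    """Intended behaviour: build the ID from the replacement template."""
    m = PATTERN.fullmatch(url)
    return m.expand(REPLACEMENT) if m else None


if __name__ == "__main__":
    url = "https://lyrics.fandom.com/wiki/Some_Artist"
    print(extract_first_group(url))       # the wiki name only
    print(extract_with_replacement(url))  # wiki name and article title combined
```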
BrokenSegue (talkcontribs)
Lockal (talkcontribs)

External id builder for Fandom is still broken.

I also have a general request: could you validate extracted identifiers with ? It would solve the problem. Also, as far as I know, MusicBrainz has no validators there, so it would protect Wikidata against ill-formed (accidentally or intentionally) data. --Lockal (talk) 08:02, 21 September 2021 (UTC)
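
A minimal sketch of what such validation could look like, assuming a format regular expression is available for the target property (for instance from a format constraint on the property; this is an assumption, not necessarily the validator meant above). The pattern and function name are hypothetical.

```python
import re

# Hypothetical format pattern for a "wiki:Article" style identifier;
# a real check would use the pattern published for the property in question.
ID_FORMAT = re.compile(r"[a-z0-9.-]+:[^\s:][^\s]*")


def is_valid_id(value, pattern=ID_FORMAT):
    """Return True only if the whole extracted value matches the format pattern."""
    return pattern.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["lyrics:Some_Artist", "lyrics"]:
        print(candidate, is_valid_id(candidate))
```

Values that fail the check would simply be skipped instead of uploaded.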

Lockal (talkcontribs)

Similar problem with NicoNicoPedia ID (P6900) in - the extractor should remove URL encoding (the same applies to Fandom.com and, I suppose, to any other identifier).
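
As a sketch of the decoding step, assuming the extracted identifier is a percent-encoded UTF-8 string, Python's standard library can undo the encoding. The function name is hypothetical.

```python
from urllib.parse import unquote


def decode_id(raw_id):
    """Undo percent-encoding in an extracted identifier (UTF-8 assumed)."""
    return unquote(raw_id)


if __name__ == "__main__":
    print(decode_id("%E3%81%82"))
```

Identifiers without any percent-encoded bytes pass through unchanged.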

Hjfocs (talkcontribs)

Thanks a lot for reporting these issues, really appreciated. I'll file them in my tracker and take appropriate action soon.