Topic on User talk:Hjfocs

Soweego bot adding invalid fandom pages

10 comments • 16:49, 29 September 2021 3 years ago

Summary by Hjfocs

[soweego 2] MusicBrainz (Q14005) URLs validation: bad ID extraction & percent-encoded URLs

BrokenSegue (talkcontribs)

The bot seems to be adding "lyrics" as a Fandom article ID (P6262) value, which is invalid. Can some extra filtering be done to prevent this from happening in the future?

21:45, 28 August 2021 3 years ago

BrokenSegue (talkcontribs)

Example of this happening: https://www.wikidata.org/w/index.php?title=Q344822&diff=1485652992&oldid=1479804413

21:45, 28 August 2021 3 years ago

Hjfocs (talkcontribs)

Many thanks for reporting this: I've just stopped the bot, will look into those bad IDs, and will delete those that were uploaded.

13:52, 30 August 2021 3 years ago

Hjfocs (talkcontribs)

I got rid of additional bad IDs, so they shouldn't be added anymore. The bot has now restarted. While deleting the uploaded ones, I noticed you had already taken care of them, thanks again for your action! One question: how did you do it? I see you used QuickStatements, and was wondering how. Cheers!

14:41, 30 August 2021 3 years ago

BrokenSegue (talkcontribs)

I used a SPARQL query to find all items that linked to the bad Fandom page and then a one-line Bash script to convert the results into a QuickStatements file that removes those statements.
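A minimal sketch of that workflow, written in Python rather than Bash. The actual query and script are not shown in the thread, so the SPARQL text and the `removal_commands` helper below are assumptions for illustration. It relies on the QuickStatements V1 convention that a line prefixed with `-` removes the matching statement.

```python
# Hypothetical query: items whose Fandom article ID (P6262) is the bad value.
# The query actually used in the thread is not shown.
SPARQL = 'SELECT ?item WHERE { ?item wdt:P6262 "lyrics" }'


def removal_commands(item_uris, prop="P6262", bad_value="lyrics"):
    """Turn entity URIs from the query results into QuickStatements V1
    removal lines (tab-separated, prefixed with '-')."""
    lines = []
    for uri in item_uris:
        qid = uri.rsplit("/", 1)[-1]  # keep only the trailing QID
        lines.append(f'-{qid}\t{prop}\t"{bad_value}"')
    return lines


if __name__ == "__main__":
    results = ["http://www.wikidata.org/entity/Q344822"]
    print("\n".join(removal_commands(results)))
```

The printed lines can be pasted into a QuickStatements batch; reviewing a sample of them before running the batch is advisable.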

The thing I don't get about this error by your bot is how it parsed out just "lyrics" as a Fandom article link. That isn't the correct format for Fandom articles, but it is the correct format for Fandom wiki ID (P4073).

15:18, 30 August 2021 3 years ago

Hjfocs (talkcontribs)

I like your workflow!

The bot tries its best to parse URLs into proper Wikidata IDs, but it looks like there's a jungle of URL variations, bad regexp matches, multiple matching groups, and the like, so it's not perfect. In this specific case:

a couple of hundred Fandom URLs available in MusicBrainz (the total is on the order of 10^5) matched the second regexp in Property:P6262#P8966;
the regexp has 2 matching groups;
the bot didn't consider the URL match replacement value (P8967) qualifier stated on the regexp;
it took the first matching group as the ID value.
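The steps above can be sketched as follows. The pattern and replacement value here are hypothetical stand-ins, not the ones actually stored on Property:P6262; they only show how taking the first matching group differs from applying a replacement template to all groups.

```python
import re

# Hypothetical URL match pattern with two groups (wiki name, article title).
PATTERN = re.compile(r"^https?://([a-z0-9-]+)\.fandom\.com/wiki/(.+)$")
# Hypothetical replacement value that combines both groups.
REPLACEMENT = r"\1:\2"


def first_group_id(url):
    """Behaviour described in the thread: keep only the first group."""
    match = PATTERN.match(url)
    return match.group(1) if match else None


def replaced_id(url):
    """Alternative: build the ID from the replacement template."""
    match = PATTERN.match(url)
    return match.expand(REPLACEMENT) if match else None


if __name__ == "__main__":
    url = "https://lyrics.fandom.com/wiki/Some_Artist"
    print(first_group_id(url))
    print(replaced_id(url))
```

With this assumed pattern, the first function yields only the wiki name, while the second yields a combined wiki-and-article value.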

16:28, 30 August 2021 3 years ago

BrokenSegue (talkcontribs)

OK, sounds like you should add support for URL match replacement value (P8967). It wouldn't have fixed this case, but in general it's important.

16:56, 30 August 2021 3 years ago

Lockal (talkcontribs)

The external ID builder for Fandom is still broken.

I also have a general request: could you validate extracted identifiers with ? It would solve the problem. Also, I just know that MusicBrainz has no validators there, so it would protect Wikidata against ill-formed (accidentally or intentionally) data. --Lockal (talk) 08:02, 21 September 2021 (UTC)
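A minimal sketch of such a validation step. The validator Lockal refers to is not preserved in the text, so this assumes a per-property format regular expression; `FORMAT_PATTERNS` and the pattern for P6262 are hypothetical and only illustrate the idea of rejecting ill-formed values before upload.

```python
import re

# Hypothetical format patterns keyed by property ID (assumption, not the
# constraints actually defined on Wikidata).
FORMAT_PATTERNS = {
    "P6262": re.compile(r"[^:\s]+:\S+"),  # assumed "<wiki>:<article>" shape
}


def is_valid_id(prop, value):
    """Return True only if the whole value matches the property's pattern.
    Properties without a known pattern are treated as not validated."""
    pattern = FORMAT_PATTERNS.get(prop)
    if pattern is None:
        return False
    return pattern.fullmatch(value) is not None
```

Under this assumed pattern, a bare "lyrics" is rejected, while a value with both a wiki name and an article title passes.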

08:02, 21 September 2021 3 years ago

Lockal (talkcontribs)

Similar problem with NicoNicoPedia ID (P6900) in - extractor should remove URL encoding (same thing applies to Fandom.com) and, I suppose, any other identifier.
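A short sketch of the decoding step suggested here, using Python's standard `urllib.parse.unquote`. The `decode_id` helper name is hypothetical; whether the bot should decode before or after pattern matching is not stated in the thread.

```python
from urllib.parse import unquote


def decode_id(raw_id):
    """Remove percent-encoding from an extracted identifier (UTF-8)."""
    return unquote(raw_id)


if __name__ == "__main__":
    print(decode_id("%E5%88%9D%E9%9F%B3"))
    print(decode_id("Some%20Artist"))
```

Values without percent-escapes pass through unchanged, so the step is safe to apply to every extracted identifier.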

12:00, 21 September 2021 3 years ago

Hjfocs (talkcontribs)

Thanks a lot for reporting these issues, really appreciated. I'll file them in my tracker, and take the due actions soon.

13:44, 21 September 2021 3 years ago
