All issues pertaining to Wikisource's IA Upload tool:
- Tool homepage: https://wikitech.wikimedia.org/wiki/Tool:IA_Upload
- Tool: https://ia-upload.wmcloud.org/
- Source code: https://github.com/wikisource/ia-upload/
I do not think we should support Internet Archive collections as such. Judging by the IA identifier, however, this is really about multiple scanned objects available under a single IA identifier. I am not sure we should support uploading more than a single item at a time to Commons, but we should provide a means to upload each scanned sub-object available under a single IA identifier.
Jason Scott has suggested that the IA could be blocking us, so I've emailed info@archive.org to see if there's anything that can be done. I wouldn't be surprised if the Toolforge IPs have been blocked, considering they must see somewhat higher traffic from them. It sounds like IA is still in recovery mode, so we should be patient.
In T379402#10318391, @Chlod wrote:
FWIW, one of my tools which relies on the Internet Archive also always times out. Perhaps the Internet Archive has temporarily(?) blocked some WMCS IPs following the outage?
In T379402#10308031, @Samwilson wrote: Sorry, ignore me! ia-upload is not running on Toolforge any more, it's got its own VPS. :-/
However, the issue is the same:
$ curl -I https://archive.org/details/20231002_20231002_0537?output=json
curl: (28) Failed to connect to archive.org port 443 after 129880 ms: Couldn't connect to server
$ curl -I https://archive.org/details/history-of-telegraphy-wa-3?output=json
curl: (28) Failed to connect to archive.org port 443 after 130678 ms: Couldn't connect to server
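The curl runs above each wait more than two minutes before giving up. As a rough sketch only (not something ia-upload does), a plain TCP connection attempt with a short timeout gives a faster yes/no answer from the VPS. The function name `can_connect` and the 10-second default are assumptions made for illustration.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper: it only tests that the TCP handshake completes,
    not that TLS or HTTP work on top of it.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("archive.org")` from the affected host and from an unrelated network would help tell an IP-level block apart from a general IA outage, though a False result alone cannot say which of the two it is.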
T178197 seems to discuss programmatic methods to detect at least some (perhaps all?) of the issues mentioned here.
Since this is related to the integration of the text layer using _djvu.xml, and seems to happen when there is a mismatch in the number of pages, this is likely related to T194861: DjVu construction from original scans (JP2) selects which pages to build incorrectly resulting in misintegration of djvu.xml based text layers (and the numerous other tickets merged/closed as duplicates of that).
There haven't been any new uploads in over 30 days so you won't find anything in commons:Special:RecentChanges (e.g., the recent-uploads link at the top of the tool page).
Sorry, ignore me! ia-upload is not running on Toolforge any more, it's got its own VPS. :-/
There was a little bit more to be done: https://github.com/wikisource/ia-upload/pull/63 (finished now I think).
Are any IA items working?
@Samwilson: I am closing this as resolved based upon your aforementioned PR being merged on 2024-07-16:
For Wikisource, use the DjVu option "from original scans (JP2)" instead. This is currently preferred to uploading as PDF due to the various issues mentioned by me in T363619.
Converting from PDF to DjVu is no longer supported, sorry. We've not yet removed the option from the tool (that'll be done in T363619).
On the comment "the original PDFs can be uploaded directly", currently there are enough issues with our handling of PDFs (notably bad text layer extraction -- see T242169 -- and bad thumbnail generation -- see e.g. T224355 and linked issues, also note the related issue T339845) that DjVu is still being recommended over PDF on enWS.
This is caused by ia-upload adding pages that are marked as <addToAccessFormats>false</addToAccessFormats> in the scandata XML file, as mentioned by InductiveLoad in this comment.
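To illustrate the filtering described above, here is a minimal sketch in Python (ia-upload itself is not written in Python, and this is not its actual code). It assumes the scandata file has `<page leafNum="...">` elements that each carry an `<addToAccessFormats>` child. The sample XML, the function name `leaves_to_include`, and the choice to treat a missing flag as "include" are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed scandata sample for illustration only.
SAMPLE_SCANDATA = """<book>
  <pageData>
    <page leafNum="0">
      <pageType>Color Card</pageType>
      <addToAccessFormats>false</addToAccessFormats>
    </page>
    <page leafNum="1">
      <pageType>Title</pageType>
      <addToAccessFormats>true</addToAccessFormats>
    </page>
    <page leafNum="2">
      <pageType>Normal</pageType>
      <addToAccessFormats>true</addToAccessFormats>
    </page>
  </pageData>
</book>"""


def leaves_to_include(scandata_xml: str) -> list[int]:
    """Return leaf numbers whose addToAccessFormats flag is not 'false'."""
    root = ET.fromstring(scandata_xml)
    included = []
    for page in root.iter("page"):
        # Assumption: a page without the flag is kept rather than dropped.
        flag = page.findtext("addToAccessFormats", default="true")
        if flag.strip().lower() != "false":
            included.append(int(page.get("leafNum")))
    return included


print(leaves_to_include(SAMPLE_SCANDATA))  # prints [1, 2]
```

Building the DjVu only from the returned leaves would keep its page count in line with the access formats, which is presumably what the _djvu.xml text layer is aligned to.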
But those are different files scanned from different sources. পথের পাঁচালী.djvu is not even originally from Internet Archive.
I agree that in general there is little advantage to creating DjVus from PDFs but sometimes people prefer such formats. PDF technology has now subsumed most of the advantages DjVu previously had. Unfortunately this now means PDF is a very large and complex set of specifications and it is hard to know how any single PDF is constructed without analysis by digital tools.
I do not believe this is an IA Upload issue, as it is specific neither to IA Upload nor to DjVu; it happens with PDFs as well. This is a common issue with Commons in general. The workaround is to purge the file on Commons (and sometimes make a null edit too) to reset its media metadata. Sometimes the same has to be done on a local wiki that uses the Commons file too (e.g., on a Wikisource site).
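For anyone who wants to script the purge rather than use the web interface, the following is a hedged sketch built on the MediaWiki Action API's `action=purge` module, which expects a POST request. The helper names, the User-Agent string, and the file title `File:Example.djvu` are placeholders, and the sketch does not verify that a purge refreshes the metadata in every case.

```python
import json
import urllib.parse
import urllib.request

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def build_purge_request(title: str, api_url: str = COMMONS_API) -> urllib.request.Request:
    """Build (but do not send) a POST request that purges one page."""
    params = {
        "action": "purge",
        "titles": title,
        "forcelinkupdate": "1",  # also refresh the links tables
        "format": "json",
    }
    data = urllib.parse.urlencode(params).encode("utf-8")
    # Placeholder User-Agent; replace with one that identifies your script.
    headers = {"User-Agent": "purge-sketch/0.1 (example)"}
    return urllib.request.Request(api_url, data=data, headers=headers, method="POST")


def purge(title: str) -> dict:
    """Send the purge request and return the decoded JSON response."""
    with urllib.request.urlopen(build_purge_request(title), timeout=30) as resp:
        return json.load(resp)
```

Running `purge("File:Example.djvu")` against Commons, and then the same call with `api_url` pointed at the local Wikisource API, mirrors the two-step workaround described above.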
I too am looking forward to scandata.xml addToAccessFormats page filtering. That would get rid of the irritating color card and white card pages often included at the end of many scans (but I have seen them in the middle of book scans too).
Deleted ia-upload-prod.
Deleted! Sorry for the delay.
Please delete :)
Yep, nearly. As noted in T369881 I'm just waiting another day or so before deleting it. I had a couple of reports yesterday of things not going right with uploading to Commons, but probably everything's fine. It'll be gone by the end of the week.
The only remaining buster host in this project is ia-upload-prod.wikisource.eqiad1.wikimedia.cloud, which is currently shut down. Can it be deleted?