IA-Upload: DLI conversion failure for in.ernet.dli.2015.226478
Open, Needs Triage · Public · BUG REPORT

Description

I think this is a new failure mode caused by the unusual Digital Library of India files, which start with the PDF and have enormous >2GB JP2 archives caused by bad compression on the IA's back end.

Example:

Uploading from: https://archive.org/details/in.ernet.dli.2015.226478

Log: https://ia-upload.wmcloud.org/log/in.ernet.dli.2015.226478

[2021-11-29T17:48:01.603585+00:00] LOG.INFO: Creating DjVu for in.ernet.dli.2015.226478 from Pdf [] []
[2021-11-29T17:48:01.605676+00:00] LOG.INFO: Requesting start of conversion of in.ernet.dli.2015.226478 [] []
[2021-11-29T17:48:01.997408+00:00] LOG.INFO: Starting download to /ia-upload/jobqueue/in.ernet.dli.2015.226478/in.ernet.dli.2015.226478.djvu [] []
[2021-11-29T17:48:01.997593+00:00] LOG.DEBUG: Getting in.ernet.dli.2015.226478 [] []
[2021-11-29T17:48:02.359769+00:00] LOG.DEBUG: Can't locate djvu file, ia id is valid, perhaps conversion failed or is in progress [] []
[2021-11-29T17:48:07.360127+00:00] LOG.DEBUG: Getting in.ernet.dli.2015.226478 [] []
[2021-11-29T17:48:07.694135+00:00] LOG.DEBUG: Can't locate djvu file, ia id is valid, perhaps conversion failed or is in progress [] []
[2021-11-29T17:48:12.694512+00:00] LOG.DEBUG: Getting in.ernet.dli.2015.226478 [] []
[2021-11-29T17:48:13.021762+00:00] LOG.DEBUG: Can't locate djvu file, ia id is valid, perhaps conversion failed or is in progress [] []
[2021-11-29T17:48:18.022152+00:00] LOG.DEBUG: Getting in.ernet.dli.2015.226478 [] []

And then this repeats for thousands of lines.

Event Timeline

This depends on how you are trying to process that. That IA item does not have an existing DjVu file (it was created well after March 2016, when they stopped making those).

The create "from original scans (JP2)" option will likely fail, but with a different error message, as the issue here seems to be that the "Single Page Processed JP2" archive is a tar file named 2015.226478.An-American_jp2.tar instead of a zip file with a name matching .*_jp2.zip, as Jp2DjvuMaker currently expects. We should probably add support for tar files while also looking for the "Djvu XML" and "Single Page Processed JP2" files in the "files" metadata. Currently we fetch the complete details in JSON (which includes item and "files" metadata, among other things), then search it for, download and process files with names matching a regex pattern containing .*(_djvu.xml|_jp2.zip).
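To make the tar/zip suggestion concrete, here is a rough Python sketch of selecting the two source files from the "files" metadata. It is not the Jp2DjvuMaker code. It assumes each metadata entry is a dict with "name" and "format" keys, that the JP2 archive's format string begins with "Single Page Processed JP2", and the function name find_source_files and the sample XML filename are made up for illustration.

```python
import re

# Accept either archive type; the pattern described above only accepts ".zip".
JP2_ARCHIVE = re.compile(r".*_jp2\.(zip|tar)$")
DJVU_XML = re.compile(r".*_djvu\.xml$")


def find_source_files(files):
    """Return (jp2_archive_name, djvu_xml_name) from a list of file entries.

    Each entry is assumed to be a dict with "name" and "format" keys.
    A file is selected if either its format label or its name matches,
    so a tar archive is found just like a zip archive.
    Missing files are reported as None.
    """
    jp2 = None
    xml = None
    for entry in files:
        name = entry.get("name", "")
        fmt = entry.get("format", "")
        if jp2 is None and (
            fmt.startswith("Single Page Processed JP2") or JP2_ARCHIVE.match(name)
        ):
            jp2 = name
        elif xml is None and (fmt == "Djvu XML" or DJVU_XML.match(name)):
            xml = name
    return jp2, xml


# Hypothetical metadata: the tar name is the one from this task,
# the XML name and the exact format labels are assumptions.
sample_files = [
    {"name": "2015.226478.An-American_jp2.tar", "format": "Single Page Processed JP2 Tar"},
    {"name": "2015.226478.An-American_djvu.xml", "format": "Djvu XML"},
]
print(find_source_files(sample_files))
```

Checking the format label as well as the name means an oddly named archive would still be found. The caller would then still need to pick a zip or tar extractor based on the file extension.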

However, based on your error messages, it appears you tried the convert "from PDF" option, which uses phetools/pdf_to_djvu to do the PDF conversion. pdf_to_djvu gets the files metadata and downloads the "Djvu XML" and "Additional Text PDF" or "Text PDF" files it finds. It then does the conversion from the PDF using DjVuDigital and reinstruments the OCR layer from the XML using DjVuLibre. This should work, but I have no idea how to get any logs it might produce, so I cannot tell whether or how it fails.

IA Upload also does not seem to be able to get any further data, as the API is very rudimentary. One can only request a conversion, or attempt to get the result while waiting for the conversion to complete. IA Upload has to poll for the result, which is why you get all the "Can't locate djvu file, ia id is valid, perhaps conversion failed or is in progress" log messages every five seconds. Hopefully this will be fixed by T307956: IA-upload: Drop phetools dependency.
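As a stopgap, the polling could give up after a fixed time instead of repeating for thousands of lines. Below is a minimal Python sketch of that idea, not the actual IA Upload code. The function poll_for_djvu, its parameters, and the one-hour default limit are all hypothetical, and fetch stands in for whatever call attempts to get the conversion result (returning None while no DjVu file is available).

```python
import time


def poll_for_djvu(fetch, interval=5.0, max_wait=3600.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Call fetch() every `interval` seconds until it returns a result.

    Returns (result, attempts). Raises TimeoutError once another wait
    would exceed `max_wait` seconds, so a failed conversion cannot keep
    the job polling (and logging) indefinitely.
    """
    deadline = clock() + max_wait
    attempts = 0
    while True:
        attempts += 1
        result = fetch()
        if result is not None:
            return result, attempts
        if clock() + interval > deadline:
            raise TimeoutError(f"no DjVu file after {attempts} attempts")
        sleep(interval)


# Usage with a stand-in fetcher that succeeds on the third try.
responses = iter([None, None, b"djvu-bytes"])
result, attempts = poll_for_djvu(lambda: next(responses), sleep=lambda s: None)
print(attempts)
```

The sleep and clock parameters exist only so the loop can be exercised without real waiting. A timeout still cannot say why the conversion failed, since the API does not expose that.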