Text layer mismatch with page images on DjVu created from original scans (JP2)
Closed, Duplicate | Public | BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):
Hello, when I transfer large DjVu files (more than 500 pages), the text layer of the pages often does not match the page images.

*I download a PDF file from Gallica.
*I extract the pages and process them with Briss and Scantailor.
*I remove Gallica's two disclaimer pages.
*I run the OCR and create the DjVu with Abbyy Finereader 15.
*I check the DjVu with DjView.
*I upload the DjVu file to Commons.
*If there is a mismatch, I redo the OCR of the Finereader-created DjVu with Tesseract. This fixes the problem, but the Tesseract OCR is less accurate than Finereader's.

Examples: https://commons.wikimedia.org/wiki/File:La_Revue_Ind%C3%A9pendante,_tome_1_-_mai_%C3%A0_octobre_1884.djvu

To be precise: I check with DjView 4.11 (DjVuLibre) that the page images and the text layer produced by Abbyy Finereader match, and there is no problem at that stage.

Sorry for my poor English.

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc.:
*Abbyy Finereader 15

Event Timeline

Sorry, I prefer to express myself in French. I have just done a direct import of the book https://archive.org/details/lefolkloredupoit00unse_0 from the Internet Archive, using ia-upload on French Wikisource: https://fr.wikisource.org/wiki/Livre:Pineau_-_Le_Folk-lore_du_Poitou,_1892.djvu has a page offset between the page image and the text layer. Example on this page: https://fr.wikisource.org/wiki/Page:Pineau_-_Le_Folk-lore_du_Poitou,_1892.djvu/531

Hi @Cunegonde1, thanks for taking the time to report this!
Please follow the template and provide all of:

  • List of steps to reproduce (step by step, including full links if applicable)
  • What happens?:
  • What should have happened instead?:

in separate sections. Thanks!

I'll try to translate Cunegonde1's message:

List of steps to reproduce (step by step, including full links if applicable)
Use IA-Upload to upload to Commons the PDF file at https://archive.org/details/lefolkloredupoit00unse_0 (Internet Archive).
Select ".djvu" as the target file type.

What happens?
The text layer in the DjVu file on Commons (https://commons.wikimedia.org/wiki/File:Pineau_-_Le_Folk-lore_du_Poitou,_1892.djvu) is erroneous: page N contains the text for page N+1.

See for example in the French Wikisource: the text layer in page 100 (« Il a fallu chercher, etc. ») should be in page 101.

What should have happened instead?
The text layer in page N should of course be in page N. The problem does not appear in the original PDF file in the Internet Archive.

I don't know if this is related, but the PDF file in the Internet Archive contains 580 pages, while the DjVu file at Commons contains 582 pages: the 1st and last pages appeared during the upload. So maybe IA-Upload used the JP2 files, of which there are 582, and removed the 1st page (used for the scanner calibration?).
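To illustrate the suspected mechanism, here is a minimal Python sketch. It assumes (this is a hypothesis from this report, not confirmed behaviour of IA-Upload) that page images and text layers are paired purely by position. The page labels and helper names are made up for the example.

```python
from itertools import zip_longest


def pair_by_position(images, texts):
    """Pair each image with the text layer at the same index (the suspected behaviour)."""
    return list(zip_longest(images, texts))


def pair_after_dropping(images, texts, extra):
    """Drop the scans that have no text layer first, then pair by position."""
    kept = [image for image in images if image not in extra]
    return list(zip_longest(kept, texts))


# Hypothetical item: 5 scans, of which only 3 are real pages with OCR text.
images = ["calibration", "page 1", "page 2", "page 3", "back"]
texts = ["text 1", "text 2", "text 3"]

shifted = pair_by_position(images, texts)
aligned = pair_after_dropping(images, texts, {"calibration", "back"})
```

In `shifted`, the image "page 1" is paired with "text 2", which is the "page N shows the text of page N+1" symptom. In `aligned`, each page image gets its own text.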

I too have run into this issue, and I do not think it is so much an issue with the OCR layer being on the wrong pages per se as with Jp2DjvuMaker including extraneous pages from the JP2 set, so that the OCR layers effectively no longer match up with the resultant pages.

Issues with page counts:

I have verified that in each of these cases the correct pages to include in the final multi-page ".djvu" can be ascertained by looking at the "_scandata.xml" pageType (looking for Cover or Normal) for the ".jp2" files and/or at "_djvu.xml" for the ".djvu" pages.

Perhaps a better method is to use the addToAccessFormats tag and only include pages where this value is true, not false. One thing to note, though, is that this scandata XML is not always a separate file: in some items it is included in a scandata.zip (e.g., prohibitionprinc00wheeuoft). Even when there is a _scandata.xml file, there can be more than one such file, and it need not be named after the IA ID (e.g., volume-10_202211).
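A rough Python sketch of that filtering idea follows. The sample XML is invented for illustration: only the pageType and addToAccessFormats tags are taken from the comment above, while the `page` element, the `leafNum` attribute and the "Color Card" value are my assumptions about the scandata layout and should be checked against a real item. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample; a real _scandata.xml holds many more fields per page.
SAMPLE_SCANDATA = """<book>
  <pageData>
    <page leafNum="0"><pageType>Color Card</pageType><addToAccessFormats>false</addToAccessFormats></page>
    <page leafNum="1"><pageType>Cover</pageType><addToAccessFormats>true</addToAccessFormats></page>
    <page leafNum="2"><pageType>Normal</pageType><addToAccessFormats>true</addToAccessFormats></page>
    <page leafNum="3"><pageType>Color Card</pageType><addToAccessFormats>false</addToAccessFormats></page>
  </pageData>
</book>"""


def leaves_to_include(scandata_xml):
    """Return the leaf numbers whose addToAccessFormats value is 'true'."""
    root = ET.fromstring(scandata_xml)
    keep = []
    for page in root.iter("page"):
        flag = (page.findtext("addToAccessFormats") or "").strip().lower()
        if flag == "true":
            keep.append(int(page.get("leafNum")))
    return keep


def jp2_names(ia_id, leaves):
    """Build the zero-padded JP2 file names for the kept leaves."""
    return [f"{ia_id}_{leaf:04d}.jp2" for leaf in leaves]


kept = leaves_to_include(SAMPLE_SCANDATA)
names = jp2_names("exampleitem", kept)
```

With the sample above, only leaves 1 and 2 are kept, so the calibration scans at both ends never become pages. A real implementation would also have to locate the scandata file first (a separate file, inside scandata.zip, or under a different name), which this sketch does not attempt.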

FYI: As a side note, for these items IA used Tesseract for OCR instead of Abbyy, so this confirms the issue is not directly related to the OCR software but rather to how the OCR data is applied when the DjVu files are created.

It appears the current src/DjvuMaker/Jp2DjvuMaker.php blindly assumes that all the ".jp2" files in the derived archive should be made into pages in the final output, when that is definitely not true. Typically there are several scans (it appears the number of original and derived scans is the same) that are not OCRed and not placed into the final PDF or DjVu. Jp2DjvuMaker also seems to rename the pages. When comparing ".djvu" files created at IA (when available) with those created by Jp2DjvuMaker, IA names pages <IAID>_####.djvu (where #### is the scan number), whereas Jp2DjvuMaker names the pages <IAID>_p<n>.djvu (where <n> appears to be the resultant unpadded page number). IA scans seem to be numbered from "0000", and quite often that first scan is not the front cover and is not included as the first page, which is why the first symptom noticed might be the skewed application of OCR data layers as reported above.

It appears Jp2DjvuMaker is only looking at the derived JP2 scans and "_djvu.xml", which is probably fine; however, it should look at them more carefully to determine which scans to throw away before applying OCR data and combining the pages into the final multi-page DjVu.
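To make the two naming schemes concrete, here is a small Python sketch. The patterns `<IAID>_####.djvu` and `<IAID>_p<n>.djvu` come from the observation above; the helper names are hypothetical, and whether Jp2DjvuMaker starts counting at 0 or 1 is an assumption (1 is used here).

```python
def ia_page_name(ia_id, scan_number):
    """IA style: zero-padded scan number, e.g. <IAID>_0001.djvu."""
    return f"{ia_id}_{scan_number:04d}.djvu"


def sequential_page_name(ia_id, page_number):
    """Jp2DjvuMaker style as observed: unpadded output page number."""
    return f"{ia_id}_p{page_number}.djvu"


def name_table(ia_id, kept_scans, first_page=1):
    """List (IA name, sequential name) pairs for the scans that are kept."""
    return [
        (ia_page_name(ia_id, scan), sequential_page_name(ia_id, n))
        for n, scan in enumerate(kept_scans, start=first_page)
    ]


# Scan 0000 is skipped, so scan 0001 becomes output page 1.
table = name_table("exampleitem", [1, 2, 3])
```

Keeping the scan number in the page name, as IA does, makes it easy to see which scan a page came from; with sequential names that link is lost once any scan is skipped or wrongly included.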

Uzume renamed this task from Text layer mismatch with page images on djvu created with Abbyy finereader 15 to Text layer mismatch with page images on djvu created from original scan (JP2).Jun 23 2022, 5:24 PM
Uzume renamed this task from Text layer mismatch with page images on djvu created from original scan (JP2) to Text layer mismatch with page images on DjVu created from original scans (JP2).Jun 23 2022, 5:38 PM
Uzume updated the task description.