kcroker / dpsprep Public

Python DJVU to PDF converter which preserves OCR text and bookmark metadata (e.g. TOC)

dpsprep

This tool, initially made specifically for use with Sony's Digital Paper System (DPS), is now a general-purpose DjVu to PDF converter with a focus on small output size and the ability to preserve document outlines (e.g. TOC) and text layers (e.g. OCR).

Usage

Full example (the name of the output PDF is optional; if omitted, it is inferred from the input name):

dpsprep --pool=8 --quality=50 input.djvu output.pdf
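When driving the tool from a script, the same invocation can be assembled as an argument list. The following is only a sketch: the helper names `build_dpsprep_argv` and `inferred_pdf_name` are hypothetical, and the suffix-swapping rule is an assumption about how the output name is inferred, not a description of the actual implementation.

```python
from pathlib import Path


def build_dpsprep_argv(src, dest=None, pool=None, quality=None):
    """Assemble an argument list mirroring the command-line example."""
    argv = ["dpsprep"]
    if pool is not None:
        argv.append(f"--pool={pool}")
    if quality is not None:
        argv.append(f"--quality={quality}")
    argv.append(str(src))
    # The destination is optional on the command line, so only add it if given.
    if dest is not None:
        argv.append(str(dest))
    return argv


def inferred_pdf_name(src):
    """Assumption: the inferred name is the input file name with a .pdf suffix."""
    return Path(src).with_suffix(".pdf").name


print(build_dpsprep_argv("input.djvu", "output.pdf", pool=8, quality=50))
print(inferred_pdf_name("input.djvu"))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once dpsprep is on the `$PATH`.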

If you have OCRmyPDF installed, you can use its PDF optimizer:

dpsprep -O3 input.djvu

You can also skip translating the text layer (it is sometimes not translated well) and redo the OCR instead. Rather than launching the ocrmypdf CLI, we use its API directly and accept options in JSON format:

dpsprep --ocr '' input.djvu
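Writing JSON by hand inside shell quotes is error-prone, so it may help to generate the option string programmatically. The sketch below assumes that the keys are forwarded to OCRmyPDF's API; `language` is a parameter of `ocrmypdf.ocr()`, but whether dpsprep accepts it in this form is an assumption to check against the man file. The helper name `ocr_argv` is hypothetical.

```python
import json
import shlex


def ocr_argv(src, options):
    """Build a dpsprep argument list with JSON-encoded OCR options."""
    return ["dpsprep", "--ocr", json.dumps(options), str(src)]


# Assumed example option; consult the man file for what is actually supported.
argv = ocr_argv("input.djvu", {"language": ["eng"]})

# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
```

Passing the list directly to `subprocess.run` avoids shell quoting altogether.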

Consult the man file (online) for details; there are a lot of options to consider. See the next section for different ways to run the program.

Installation

libtiff for bitonal image compression.
libjpeg (or libjpeg-turbo) for multitonal (RGB or grayscale) compression.
OCRmyPDF and jbig2enc for PDF optimization (see the next section).

libtiff depends on libjpeg, so installing libtiff will likely install both.

For details on how these dependencies can be installed, see the GitHub Actions workflow and the dpsprep-git package for Arch Linux.
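To see which of these dependencies are visible from the current Python environment, a quick check along the following lines may be useful. It is a sketch rather than part of the tool: the function name `optional_dependency_report` is hypothetical, the check relies on Pillow's `features` module for the image codecs, and it assumes that jbig2enc installs an executable named `jbig2`.

```python
import importlib.util
import shutil


def optional_dependency_report():
    """Report which dependencies listed above can be found."""
    report = {"libtiff": False, "libjpeg": False}
    # Pillow reports which codecs it was built against.
    if importlib.util.find_spec("PIL") is not None:
        from PIL import features

        report["libtiff"] = bool(features.check("libtiff"))
        report["libjpeg"] = bool(features.check("jpg"))
    # OCRmyPDF is a Python package; jbig2enc is looked up as a binary on PATH.
    report["ocrmypdf"] = importlib.util.find_spec("ocrmypdf") is not None
    report["jbig2enc"] = shutil.which("jbig2") is not None
    return report


for name, found in optional_dependency_report().items():
    print(f"{name}: {'found' if found else 'missing'}")
```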

Note that Windows support in djvulibre-python requires 64-bit djvulibre, whereas only 32-bit Windows packages are officially distributed. If you manage to make it work, consider opening a pull request.

Once inside the cloned repository, the environment for the program can be set up by simply running poetry install. After that, the following should work:

poetry run python -m dpsprep input.djvu

The program can easily be installed as a Python module via poetry and pip:

poetry build
pip install [--user] dist/*.whl

If you are packaging this for some other package manager, consider using PEP-517 tools as shown in this PKGBUILD file.

A convenience script that can be copied or linked to any directory in $PATH can be found at ./bin/dpsprep.

Previous versions of the tool itself used to depend on third-party binaries, but this is no longer the case. The test fixtures are checked in; regenerating them (see ./fixtures/makefile), however, requires pdflatex (texlive, among others), gs (Ghostscript), pdftotext (Poppler), djvudigital (GSDjVU) and djvused (DjVuLibre). Similarly, the man file is checked in, but building it from markdown depends on ronn.

Note regarding compression

We perform compression in two stages:

The first one is the default compression provided by Pillow. For bitonal images, the PDF generation code says that, if libtiff is available, group4 compression is used.
If OCRmyPDF is installed, its PDF optimization can be used via the flags -O1 to -O3 (this involves no OCR). This allows us to use advanced techniques, including JBIG2 compression via jbig2enc.
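The first stage can be tried in isolation by letting Pillow write a bitonal page to PDF. This is a minimal sketch and not the tool's own code: the function name `bitonal_pdf` is hypothetical, the page is blank, and the function returns `None` when Pillow is not installed.

```python
import importlib.util
import io


def bitonal_pdf(width=64, height=64):
    """Return PDF bytes for a blank bitonal page, or None without Pillow."""
    if importlib.util.find_spec("PIL") is None:
        return None
    from PIL import Image

    # Mode "1" is Pillow's 1-bit (bitonal) mode; 1 means a white background.
    page = Image.new("1", (width, height), 1)
    buffer = io.BytesIO()
    page.save(buffer, format="PDF")
    return buffer.getvalue()


data = bitonal_pdf()
if data is not None:
    print(f"wrote {len(data)} bytes")
```

Inspecting the stream filter named in the resulting bytes is one way to see which compression the installed Pillow build chose.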

If manually running OCRmyPDF, note that the optimization command suggested in the documentation (setting --tesseract-timeout to 0) may ruin existing text layers. To perform only PDF optimization you can use the following undocumented tool instead:

python -m ocrmypdf.optimize

Acknowledgements

The font invisible1.ttf is taken from here. See the djvu_pages_to_text_fpdf function in ./dpsprep/text.py for how it is used.

Kevin's notes regarding the first version

I wrote this with the specific intent of converting ebooks in the DJVU format into PDFs for use with the fantastic (but pricey) Sony Digital Paper System.

DjVu technology is strikingly superior for many ebook applications, yet the Sony Digital Paper System (rev 1.3 US) only supports PDF technology: this is because its primary design purpose is not as an ereader. The device, however, is quite nearly the perfect ereader.

Unfortunately, all presently available DjVu to PDF tools seem to just dump enormous flattened TIFF images. This is ridiculous. Since PDF really can't do much better in the way it stores image data, a 5-6x bloat cannot be avoided. Beyond that, none of the existing tools preserve:

The OCR'd text content
Table of Contents or Internal links

This is kind of silly, but until Sony's Digital Paper, there was no need to move functional DjVu files to PDFs. In order to make workable PDFs from DjVu files for use on the Digital Paper System, I have implemented in one location the following procedures detailed here: