Installing wkhtmltopdf with patched Qt on Solus

barjo · Jul 13, 2020

Hi,

The wkhtmltopdf package in Solus works great for converting html to pdf, but it isn't built with patched Qt, which means it lacks a lot of important features such as table of contents, internal links, and more.

Does anyone have experience installing wkhtmltopdf with patched Qt in Solus?

Alternatively, is it feasible for Solus to support wkhtmltopdf with patched Qt?

wkhtmltopdf has its own repository dedicated to building and packaging, found here. These instructions advise that the distribution/OS support wkhtmltopdf with patched Qt, quote:

It's best if you can get the distribution/OS to support wkhtmltopdf with patched Qt, as it would enable using the native package management tools.

Potentially two packages, the existing wkhtmltopdf, and an additional patched version named wkhtmltopdf-patched or something similar, would be absolutely ideal.

Kind regards

algent · Jul 13, 2020

barjo Hi! Welcome to Solus.
I replied to you on Reddit.

As you can see from builddeps, wkhtmltopdf is already built with Qt5. Not sure what to patch there.

builddeps  :
    - pkgconfig(Qt5Gui)
    - pkgconfig(Qt5Svg)
    - pkgconfig(Qt5WebKit)
    - pkgconfig(Qt5XmlPatterns)

barjo · Jul 13, 2020

algent Hey algent, thanks for your reply.

Some of the features of wkhtmltopdf are not available with the version in eopkg.

For example, when trying to input multiple documents, I receive the error:
Error: This version of wkhtmltopdf is build against an unpatched version of QT, and does not support more then one input document.

Or when trying to enable internal links using the --enable-internal-links flag:
The switch --enable-internal-links, is not support using unpatched qt, and will be ignored.

Girtablulu · Jul 13, 2020

not gonna add random patches, the developer should upstream them

barjo · Jul 13, 2020

Girtablulu Hi,

There's nothing more the developer of wkhtmltopdf can do, since these features come from Qt.

As it stands, wkhtmltopdf being built with unpatched Qt, the majority of the features which make it a great tool are absent.

Bbrent · Jul 14, 2020

I never thought such a beautiful creature existed--a cli app that turned a web page to a pdf! No more cutting and pasting!
So I installed it and ran a command:

$ wkhtmltopdf https://www.fivehearthome.com/greek-salad-dressing/ /home/workstation/Desktop/test.pdf
Loading page (1/2)
Error: Failed to load https://www.google-analytics.com/r/collect?v=1&_v=j83&a=1837533283&t=pageview&_s=1&dl=https0A1F1Fwww.fivehearthome.com1Fgreek-salad-dressing1F&ul=en-us&de=UTF-8&dt=Homemade Greek Salad Dressing • FIVEheartHOME&sd=32-bit&sr=800x600&vp=&je=0&_u=IEBAAUQ~&jid=138859473&gjid=1765531755&cid=1632260110.1594685828&tid=UA-41846948-1&_gid=192024543.1594685828&_r=1&gtm=2ou6o0&z=252867961, with network status code 1 and http status code 0 - Connection refused
Error: Failed to load https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v2.0, with network status code 1 and http status code 0 - Connection refused
Printing pages (2/2)                                               
Done                                                           
Exit with code 1 due to network error: ConnectionRefusedError

As you can see, two simple elements are blocked in my hosts file.
As a result:

98% missing content for a salad dressing recipe as a result of a simple hosts file?
Can't interpret this result.
Maybe I didn't utilize the full range of argument commands?

barjo · Jul 14, 2020

brent The full range of arguments is not available in the version of wkhtmltopdf that we have in Solus (this is why this thread exists: it must be compiled with patched Qt to make the full range of features available).

You could try this:
chrome --headless --disable-gpu --print-to-pdf=/home/workstation/Desktop/test.pdf https://www.fivehearthome.com/greek-salad-dressing/

This is a way of converting html to pdf using Chrome's integrated html to pdf converter. If you don't have Chrome you can replace chrome with any chromium-based browser such as chromium, brave-browser, vivaldi-stable, ungoogled-chromium etc.

Compare the results and you should be able to see if the problem is with wkhtmltopdf's conversion, or the webpage itself.
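
To make the "replace chrome with any Chromium-based browser" step concrete, here is a minimal sketch. The helper name first_available is made up for illustration, and the binary names are simply the ones mentioned above; they may be called something else on your system.

```shell
#!/bin/sh
# Hypothetical helper: print the first command from the argument list
# that exists on this system; return 1 if none of them is found.
first_available() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example use (commented out so the snippet has no side effects):
# browser=$(first_available chrome chromium brave-browser vivaldi-stable ungoogled-chromium) &&
#     "$browser" --headless --disable-gpu \
#         --print-to-pdf=/home/workstation/Desktop/test.pdf \
#         https://www.fivehearthome.com/greek-salad-dressing/
```

If none of the listed browsers is installed, the helper returns a non-zero status and the conversion is skipped.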

Bbrent · Jul 14, 2020

barjo
$ vivaldi-stable --headless --disable-gpu --print-to-pdf=/home/workstation/Desktop/tester.pdf https://www.fivehearthome.com/greek-salad-dressing/
MESA-LOADER: failed to retrieve device information
MESA-LOADER: failed to open radeon (search paths /usr/lib64/dri)
failed to load driver: radeon
MESA-LOADER: failed to open kms_swrast (search paths /usr/lib64/dri)
failed to load driver: kms_swrast
MESA-LOADER: failed to open swrast (search paths /usr/lib64/dri)
failed to load swrast driver
[0713/193936.270531:INFO:headless_shell.cc(620)] Written to file /home/workstation/Desktop/testes.pdf.

but 36 pages, some pix missing as it was narrative only for the ghost pics, but by page 22 the gold, then comments, then the end.

Much better as a vivaldi tool right now, you are correct-- I get it. If one is text-driven like me then the vivaldi guts or the wkhtmltopdf tool is workable for specific instances.
I don't know about the qt stuff and if you query at the dev tracker link above you may learn more.
I can say with certainty that the crew in charge here are not careless about inclusion/exclusion in their repo. In fact, I'd say they were thorough and transparent in their explanations and have the trust of their community.
edit: code

Girtablulu · Jul 14, 2020

barjo Well, if you can't tell me what the developers mean by a patched Qt, I can't help you.

barjo · Jul 14, 2020

Girtablulu They have a repo dedicated to the packaging and building of wkhtmltopdf, found here: https://github.com/wkhtmltopdf/packaging

I don't know the exact procedure to build wkhtmltopdf with patched Qt, and the build instructions don't seem to make it clear. Not to me at least. I have contacted the developer to ask for more information about how to build wkhtmltopdf using patched Qt.

fstojkovski · Aug 31, 2020

I had some inconsistency issues with wkhtmltopdf between my Solus machine and my teammates' machines, so I did some debugging and came to a conclusion.
First of all, when I run wkhtmltopdf --help in the console, the output clearly says that it is a version without patched Qt. (screenshot below)

Then I debugged as much as I could with my colleagues and this is what I found out.
The wkhtmltopdf --version output on their machines is wkhtmltopdf 0.12.5 (with patched qt); on Solus it is wkhtmltopdf 0.12.5. That's the difference.
The next step in the debugging process was to look at the properties of the generated PDF file. The file generated on Solus is produced by Qt WebKit version 5.14.2, whereas the PDF file generated on the other machines is produced by Qt WebKit version 4.8 or something like that. (screenshot below)
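
For anyone who wants to check this in a script, here is a minimal sketch. It assumes the --version output looks exactly like the two strings quoted above; the helper name is_patched_build is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a wkhtmltopdf version string mentions
# the patched Qt build, e.g. "wkhtmltopdf 0.12.5 (with patched qt)".
is_patched_build() {
    case "$1" in
        *"with patched qt"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (commented out; requires wkhtmltopdf to be installed):
# if is_patched_build "$(wkhtmltopdf --version)"; then
#     echo "patched Qt build"
# else
#     echo "unpatched Qt build"
# fi
```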

I read the discussion above and I thought that this debugging and problem of mine should be added in this discussion.
Hopefully the problems that I, and the other members of this thread have will be solved.

barjo · Aug 31, 2020

fstojkovski I decided to stop using wkhtmltopdf and use the "Chromium Headless" method instead.

If you're interested, here's an easy way to use this method:

Download the ungoogled-chromium AppImage from here (latest version is 83.0.4103 as of writing this).
cd into the directory where the AppImage was downloaded.
Make the AppImage executable, chmod +x ungoogled-chromium..._linux.AppImage
Rename for convenience (optional), mv ungoogled-chromium..._linux.AppImage chromium.AppImage
Then you can convert your websites to PDF like this:

./chromium.AppImage --headless --disable-gpu --print-to-pdf="./result.pdf" https://example.com

This is essentially the same as going onto a page using chromium, pressing ctrl+P, and selecting 'Save as PDF'. It works very well.
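
If you do this often, the command can be wrapped in a small function. This is only a sketch: the BROWSER variable and the html_to_pdf name are assumptions for illustration, and the default path assumes the renamed AppImage sits in the current directory.

```shell
#!/bin/sh
# BROWSER is an assumed variable name. It defaults to the renamed AppImage
# from the steps above and can point at any Chromium-based browser.
BROWSER="${BROWSER:-./chromium.AppImage}"

# Hypothetical helper: convert one URL (or local .html file) to a PDF.
# Usage: html_to_pdf <url-or-file> <output.pdf>
html_to_pdf() {
    if [ "$#" -ne 2 ]; then
        echo "usage: html_to_pdf <url-or-file> <output.pdf>" >&2
        return 2
    fi
    "$BROWSER" --headless --disable-gpu --print-to-pdf="$2" "$1"
}

# Example use (commented out; needs the AppImage in the current directory):
# html_to_pdf https://example.com ./result.pdf
```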

Bbrent · Sep 1, 2020

fstojkovski Good work, over my layman head, but good. I'm going to be patient. It's a marvelous app I think that would have a lot of uses if fully functional. One of those things right now.

barjo I read a lot about the headless workaround (and the chrome/chromium iterations of it) and I don't quite understand it. Are you adding another browser to Solus (ungoogled-chromium) to accomplish the htmltopdf functions? Or is this simply a tool that can function in Firefox or another browser after it's installed? Thanks.

barjo · Sep 1, 2020

brent Yes, ungoogled-chromium is a standalone browser that you download for Solus. It is in fact a fork of Google Chromium, with all the Google parts taken out. This makes the browser more light-weight and privacy respecting. It looks and behaves just like Chromium.

As for "headless mode", this is a way to use the browser without UI components. Essentially to use the browser like a command-line tool. Firefox has a headless mode as well: firefox -headless.

Chromium has an HTML to PDF converter built into it, which is used for printing websites. Quite usefully, you can access this HTML to PDF converter through headless mode, meaning you can convert HTML documents to PDF using Chromium on the command-line. It serves the same purpose as wkhtmltopdf. While wkhtmltopdf uses Qt's WebKit rendering engine, ungoogled-chromium uses Chromium's rendering engine, which I believe is called Blink.

Since this feature is part of Chromium, it means that any Chromium-based browser (Vivaldi, Brave browser, Microsoft Edge) can be used in headless mode to perform these HTML to PDF conversions. I choose to use ungoogled-chromium for the aforementioned reasons (light-weight and privacy respecting).

Description of the flags used in my previous comment:

--headless is used to enter headless mode in Chromium (use Chromium as a command-line tool)
--disable-gpu is optional; since there are no UI components, we can disable the GPU. If not used, you'll get a warning message about GPU, but the command will still work
--print-to-pdf is used to access the HTML to PDF converter in Chromium.

Hope this helps

Bbrent · Sep 1, 2020

barjo --headless is used to enter headless mode in Chromium (use Chromium as a command-line tool)
--disable-gpu is optional; since there are no UI components, we can disable the GPU. If not used, you'll get a warning message about GPU, but the command will still work
--print-to-pdf is used to access the HTML to PDF converter in Chromium.

that helps a heckuva lot. are there MAN pages with more flags (when in headless mode)?
thanks for taking the time to explain this.

barjo · Sep 1, 2020

brent No problem. There's a man page for Chromium, but it seems to only explain a few most commonly used flags, nothing about headless mode. I've only used headless Chromium for HTML to PDF conversion, but I believe it can be used for a lot of useful things. Here is a beginner's guide to headless Chromium which explains some flags that could be useful, and here's a presentation about headless Chromium at the Google I/O event in 2018.

Bbrent · Sep 1, 2020

barjo Your primer, now 4 posts up, was simple enough. It works like a charm. I never actually have to see chrome which is nice.

barjo "./result.pdf"

A user can change "result" to any name they want, and the default folder is the folder the command is run from (downloads for me).

Bbrent · Sep 9, 2020

barjo some websites look like 2 pages but are actually 10.
Do you have any commands with page delimiters?

For example $ ./chromium.AppImage --headless --disable-gpu --print-to-pdf="./Galoshes.pdf" http://www.greatgolashes/solostar3230.html

I tried a "1-2" (for page range) on BOTH sides of the equals sign with no luck.
I'll try to look for this on my own tomorrow night but wondering if you had a command handy? Thanks.

barjo · Sep 9, 2020

brent I looked around and it seems this isn't possible with headless Chrome from the command-line.

However, the Chrome team have developed a library for Node.js, called Puppeteer, which allows a user to control headless Chrome programmatically. Using Puppeteer, you have access to the full range of options that headless Chrome provides. Included in these are all the parameters that printToPDF utilises. One of them being pageRanges, which would allow you to set a page range like "1-2". Here is the full list of printToPDF parameters.

This is less than ideal, especially if you aren't keen on programming.

There's another way, without Puppeteer or third-party applications, albeit not very elegant:

Download the webpage
Open in an editor
Trim the file
Convert to PDF

As commands:

wget https://example.com/index.html (Downloads the webpage, file will be saved as "index.html")
gedit index.html (Open the file in Gedit, use an editor of your choice)
Select the parts of the page you want to "trim" and press delete
./chromium.AppImage --headless --disable-gpu --print-to-pdf="./result.pdf" index.html (Convert the file "index.html" to a PDF named "result.pdf")
firefox result.pdf (View the PDF)

Tip: To know where to begin trimming the file, you can convert the page to PDF like you did previously and scroll to the first page you would like to cut off. Select the first line of that page and copy it to your clipboard. Then, when you download the document and open it in your editor, you can Ctrl+F (find in document) and paste this line of text into the field. Now you know the exact point in the document where you want the last page to end. Place your cursor at this point and delete all the text after it until the end of the document (Ctrl+Shift+End, then Delete). It's finicky, I know, but that's one way to do it.

Another Tip: The webpage passed to print-to-pdf must be an HTML file. If you download the webpage and the file doesn't end with ".html", then mv file file.html.
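
The trimming and renaming steps can also be scripted. The sketch below is only an illustration: the helper names trim_before_marker and ensure_html_ext are made up, and it assumes the marker text appears on a single line of the HTML source exactly as copied, which will not always hold, since text copied from a PDF can be split by tags in the source.

```shell
#!/bin/sh
# Hypothetical helper: print every line of an HTML file up to, but not
# including, the first line that contains the marker text.
# Usage: trim_before_marker <file.html> <marker text>
trim_before_marker() {
    # The marker is passed through the environment so awk does not
    # interpret backslashes in it as escape sequences.
    MARKER="$2" awk 'index($0, ENVIRON["MARKER"]) { exit } { print }' "$1"
}

# Hypothetical helper: print a file name that ends in ".html", renaming
# the file on disk if it does not already have that extension.
ensure_html_ext() {
    case "$1" in
        *.html) printf '%s\n' "$1" ;;
        *) mv -- "$1" "$1.html" && printf '%s\n' "$1.html" ;;
    esac
}

# Example use (commented out; needs a downloaded page and the AppImage):
# page=$(ensure_html_ext index)
# trim_before_marker "$page" "First line of the page to cut from" > trimmed.html
# ./chromium.AppImage --headless --disable-gpu --print-to-pdf="./result.pdf" trimmed.html
```

Cutting the file this way leaves some tags unclosed, just like the manual delete does; browsers are generally forgiving about that.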

Bbrent · Sep 9, 2020

barjo Wow, thanks for showing up with all this, it's really nice of you, and I appreciate it. No..don't want the Puppet.

The "less than ideal" and "not very elegant" are always attractive to me. In the course of trying your editor approach...a thought occurred to me as an old but effective workaround from my graphic arts days...in a heavy print deadline workflow...
...I just did this in 10 seconds. Took the 10 page PDF and printed "to file" as .ps (postscript) with pages 1-2 selected in the UI print dialogue box. The .ps file was lo-res, but quickly turning the .ps back into a .pdf preserved its resolution, and, like I said---seconds.
Thanks for coming back with the "not elegant" --that dusted off a memory.
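
For a command-line version of the same idea (keeping only pages 1-2 of an existing PDF), one option is Ghostscript. This is a sketch under assumptions: nobody in this thread used Ghostscript, it has to be installed separately, and the GS variable and pdf_page_range name are made up for illustration.

```shell
#!/bin/sh
# GS is an assumed variable name so the Ghostscript binary can be overridden.
GS="${GS:-gs}"

# Hypothetical helper: copy pages <first>..<last> of a PDF into a new PDF.
# Usage: pdf_page_range <in.pdf> <out.pdf> <first> <last>
pdf_page_range() {
    if [ "$#" -ne 4 ]; then
        echo "usage: pdf_page_range <in.pdf> <out.pdf> <first> <last>" >&2
        return 2
    fi
    "$GS" -sDEVICE=pdfwrite -dNOPAUSE -dBATCH \
        -dFirstPage="$3" -dLastPage="$4" \
        -sOutputFile="$2" "$1"
}

# Example use (commented out; requires Ghostscript and an existing PDF):
# pdf_page_range ./Galoshes.pdf ./Galoshes-1-2.pdf 1 2
```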