
Downloading Articles for my Ebook Reader

(April 8, 2021)

I've recently taken to reading blog posts and other internet articles on my ereader. And I don't mean using my tablet's browser and wifi connection to load up websites. Instead, I convert the articles I want to read to PDF and read them like I would any other ebook (I have a large-screen tablet on which reading PDFs is very comfortable; I would probably be playing around with EPUB conversion if I had a smaller screen).

The obvious way to get a PDF of a website would be to use my browser's built-in print-to-PDF feature. But this has some minor problems for me:

Articles from different websites will look very different. I can't anticipate how each website's CSS will affect readability (things like font, text size, etc.).
It's not super easy to automate. Maybe this is possible with headless browsers? But I haven't played around with those much and it feels silly to spin up a whole browser just to render some HTML as a PDF.

That second point, about automation and scripting, was particularly important to me. So the obvious tool for the job was the Swiss Army knife of document conversion, pandoc.

For a while I was wondering if I would have to write some clever script that downloads all of the article's HTML and other resources (like images) and then feeds them to pandoc. Fortunately, it turns out that pandoc <article url> -o <output file> does exactly what you think it does. Given an output file name ending in .pdf, the article ends up converted to PDF, with LaTeX used as an intermediate step, so everything is in the beautiful LaTeX font. pandoc also takes care of downloading and including images.

Hotkeys

I wrote a short script that calls pandoc and saves the PDF in a specific directory. With that script available and working, I added hotkeys to my browser and RSS reader that invoke it. These are the two programs in which I might find articles to read, and now I can easily generate PDFs from both.
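
A minimal sketch of what a script like this could look like, in Python. The output directory (~/articles), the file-naming scheme, and the function names are all assumptions made for illustration; only the pandoc invocation itself comes from the text above.

```python
import re
import subprocess
from pathlib import Path
from urllib.parse import urlparse

# Assumed location; the post only says "a specific directory".
OUTPUT_DIR = Path.home() / "articles"


def pdf_name(url):
    """Derive a filesystem-friendly PDF file name from the article URL."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    # Collapse every run of non-alphanumeric characters into a single dash.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return f"{slug or 'article'}.pdf"


def pandoc_command(url, out_dir=OUTPUT_DIR):
    """Build the `pandoc <article url> -o <output file>` invocation."""
    return ["pandoc", url, "-o", str(out_dir / pdf_name(url))]


def article2pdf(url, out_dir=OUTPUT_DIR):
    """Create the output directory if needed, run pandoc, return its exit code."""
    out_dir.mkdir(parents=True, exist_ok=True)
    return subprocess.run(pandoc_command(url, out_dir)).returncode
```

With these assumptions, article2pdf("https://example.com/blog/my-post/") would ask pandoc to write ~/articles/example-com-blog-my-post.pdf. A command-line version would pass sys.argv[1] as the URL.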

Here's what the newsboat config looks like:

macro p set browser "article2pdf %u" ; open-in-browser ; set browser "elinks %u"

And here's the qutebrowser binding:

config.bind(
        '"P',
        'spawn article2pdf {url}'
)

(article2pdf being the name of my script)

Caveats

This doesn't work perfectly.

There are some issues with certain Unicode characters (including emojis) that LaTeX apparently can't handle. Adding the --pdf-engine=xelatex flag when calling pandoc doesn't fully fix this, but it does produce reasonable output instead of failing completely.
Sometimes images aren't handled well. For example, they might not fit width-wise. LaTeX completely fails on images in the WebP format.
Similarly, sometimes code blocks might get cut off and not fit width-wise. This is admittedly a pretty big problem.
Headers and footers from many sites will not be rendered well. This doesn't bother me; all I care about is the main article contents.
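
One way to act on the Unicode caveat is to try pandoc's default PDF engine first and retry with --pdf-engine=xelatex only if that fails. The sketch below assumes this retry strategy, which is an illustration rather than something the script described above is known to do; the function names and the injectable runner parameter are hypothetical.

```python
import subprocess


def run(cmd):
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def convert_with_fallback(url, output, runner=run):
    """Try pandoc's default PDF engine, then retry with xelatex.

    Returns the name of the engine that succeeded and raises
    RuntimeError if both attempts fail.
    """
    base = ["pandoc", url, "-o", output]
    if runner(base) == 0:
        return "default"
    # Assumed fallback: xelatex copes better with non-ASCII input.
    if runner(base + ["--pdf-engine=xelatex"]) == 0:
        return "xelatex"
    raise RuntimeError(f"pandoc could not convert {url}")
```

The runner parameter exists only so the logic can be exercised without pandoc installed. This does nothing for WebP images or overly wide code blocks.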