Ask HN: Where are the good Markdown to PDF tools (that meet these requirements)?

37 points by SamCoding 20 hours ago

I'm trying to convert a very large Markdown file (a couple hundred pages) to PDF.

It contains lots of code in code blocks and has a table of contents at the start with internal links to later pages.

I've tried lots of different Markdown-to-PDF converters like md2pdf and Pandoc, and even tried converting through LaTeX first. However, none of them produce working internal PDF links, have effective syntax highlighting for HTML, CSS, JavaScript and Python, and wrap code to fit it on the page.

I have a very long regular expression (email validation of course) that doesn't fit on one line but no solutions I have found properly break the lines on page overflow.

What tools does everyone recommend?

tikhonj 18 hours ago

I worked on a 500+ page book[1] in Pandoc that included a bunch of code samples, math, a table of contents with working links and an index. (In hindsight, I wish we had thought about the index from the beginning rather than adding it after the fact.)

What worked well for me: Pandoc with a custom LaTeX template, and a decent amount of inline LaTeX to handle edge cases. We had a LaTeX theme to use from our publisher, but we also needed our own totally separate theme for the free version of the book.

For one-off things like really long code lines, I found it best to manually figure out how to handle them. Sometimes there was a bit of TeX magic but, more often, I just rewrote or reorganized the code. I see the presentation and structure of code and math snippets as an integral part of how I'm communicating the underlying ideas, so manually changing things around to read better was fundamentally no different from going back and editing prose.

Unfortunately, this also means that the process was relatively hands-on. If you need something ≈completely automated, I expect Pandoc → LaTeX is going to fall a bit short. Edge cases need manual intervention, and it's easy for formatting errors to sneak in: the free version of our book has some formatting mistakes like code bleeding into the margin because I ran out of energy to fix all of them!

[1]: https://github.com/TikhonJelvis/RL-book

solardev 18 hours ago

Have you explored the AST (abstract syntax tree) tools yet, like Mdast and the related remark and micromark?

https://github.com/syntax-tree/mdast-util-from-markdown

It might work better if you parse it into an intermediary Mdast format first, do whatever processing you need to implement "pages" (not a part of any Markdown dialect I'm familiar with, but it shouldn't be hard to write a custom parser for that in Mdast), output that to HTML (via https://github.com/syntax-tree/mdast-util-to-hast) and THEN convert the HTML to PDF.

The AST tools basically give you structured JSON that's much easier to work with programmatically than raw Markdown. Then you can render that semantic JSON into HTML or other outputs.

mncharity 17 hours ago

I wonder how the Mdast and pandoc ASTs compare?
I did a customized-MD pipeline which normalized to pandoc (extra features got encoded to pass through pandoc), obtained the pandoc JSON AST, and emitted HTML/LaTeX/etc. using Julia pattern matching. The code was small, and the yak shaving and husbandry was worth it to escape the struggle with a sea of crufty candidate tools, each occupying its own chosen point in a high-dimensional design space, with missing features, misfeatures, gotchas, and mazes ("maybe if I combine this unmaintained plugin with that one and add a postprocess massage step over there and then maybe..." - blech).
- solardev 15 hours ago
  
  I am not familiar with Pandoc, but it looks like a command-line tool that can do the same things? (Edit: I suspect this is probably one of those situations where different industries/domains end up developing similar tools in different ecosystems... Pandoc probably makes sense in academia, LaTeX workflows, etc.? Mdast is used for web apps. I can see both realms wanting to do Markdown conversions, so I'm not surprised to see similar tools available in both. I'm a web dev, so only familiar with Mdast.)
  My guess is that either toolchain could do the job... maybe just depends on personal preference whether someone prefers to pipe together command-line tools in a bash script, vs making use of the npm ecosystem (mdast is all in JS).
  Maybe the popularity of JS & npm means there are available mdast plugins & third party packages that can help with whatever niche transformation you might need, and custom node rendering is just a lambda away. It's all in JS for a seamless experience, and there is no separate DSL to learn (just some basic helper functions).
  That might be harder to do in Pandoc... (might need a custom Lua filter or another language like your Julia pattern matching?)
  As for effectiveness... it probably just depends on the particular implementer :) I'd trust a grizzled old *NIX sysadmin type over your typical bootcamp JS programmer any day, but also... the JS ecosystem is pretty mature and powerful now, and Mdast is pretty amazing. At work we use it to build one of the most important parts of our app, and its power and flexibility never cease to amaze me.
  - mncharity 10 hours ago
    
    Let's see. So there are parsers in various languages, parsing various MD dialects, with varied internal representations, and surrounding ecosystems. And there are attempts at more turnkey document processing systems, often with a more extended dialect, and some collection of feature plugins. Often you can write pipeline AST filters in the given language, and sometimes get out an AST as JSON, and sometimes reinject JSON AST (allowing writing a filter in any language). Which leaves questions like: what dialect is the parser; is that extensible; how robustly correct is it; how clean and easily used and fragile is the AST; how well do the plugins/ecosystem already support your needed features. That AST one, I think of as a big deal, and hard to get a handle on. Aside from manipulation pragmatics, the asts resulting from parsing can get richly creative in quirkiness, that you then may need to regularize.
    So I guess two main observations. On build-vs-buy for backend features, given the breadth of possible "we want it like this, and not that", if one can easily play with ASTs, I was surprised by how quickly reinventing the wheel became a plausible call. Possibly skimming existing backend code for insight and templates, but mostly not using it (aka struggling to configure it to give you "this and not that"). The other observation, is once you have ast and don't care about existing backends, your choice of parser and backend language/ecosystem decouple. One might use `pandoc --to=json` and then JS generic-ast tooling to emit HTML.
    For parsing, a glance suggests Mdast emphasizes CommonMark and GitHub-flavored dialects. Pandoc-flavored MD is a bit broader.[1] My fuzzy recollection is I chose a pandoc parse for that, and an expectation of robustness ("it's haskell, and popular"), despite the then less than wonderful docs. IIRC, the resulting ASTs were fine. For backend, I wanted simple and concise to minimize burden, thus pattern matching (IIRC, most node types ended up a line or two), and chose road-less-traveled Julia for off-topic reasons (was thinking of using Julia for a compiler backend).
    Thanks for your thoughts on Mdast - I'm tempted to play with it.
    [1] https://garrettgman.github.io/rmarkdown/authoring_pandoc_mar...

pronoiac 17 hours ago

I think Pandoc and Calibre could work for you.

I've worked on PAIP, Paradigms of Artificial Intelligence Programming, and I might be able to help you a bit. It's around 1k pages long. I used Pandoc to generate an epub file, and then Calibre to turn that into a PDF file. I just tried using Pandoc to generate the PDF file directly, and it/LaTeX choked on some Unicode characters.

For internal ebook links, there's a Lua script. You'll have to keep anchors unique across the book for this:

* good: "chapter1#section1_1" and "chapter2#section2_1"

* bad: a "chapter1#section1" and a "chapter2#section1"

WIP: https://github.com/norvig/paip-lisp/pull/195
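
The uniqueness requirement can be checked before building the book. Below is a minimal Python sketch, assuming anchors are written as raw HTML tags such as `<a id="section1_1"></a>` in the Markdown sources; the chapter names and contents are made up for illustration, and this is not the Lua script from the PR.

```python
import re
from collections import defaultdict

# Assumption: anchors are written as raw HTML, e.g. <a id="section1_1"></a>.
ANCHOR_RE = re.compile(r'<a\s+(?:id|name)="([^"]+)"')


def find_duplicate_anchors(chapters):
    """Map each anchor defined in more than one place to the chapters using it.

    `chapters` maps a chapter name to its Markdown text.
    """
    seen = defaultdict(list)
    for chapter, text in chapters.items():
        for anchor in ANCHOR_RE.findall(text):
            seen[anchor].append(chapter)
    return {anchor: names for anchor, names in seen.items() if len(names) > 1}


# Hypothetical chapters: "section1" is defined in both, so the two would
# collide once the chapters are merged into a single document.
chapters = {
    "chapter1.md": '<a id="section1"></a>\n# One\n<a id="section1_1"></a>',
    "chapter2.md": '<a id="section1"></a>\n# Two\n<a id="section2_1"></a>',
}
print(find_duplicate_anchors(chapters))
```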

For line wrapping of code, there's CSS. I first used it over on "Writing an Operating System in 1,000 Lines"; here's the PR: https://github.com/nuta/operating-system-in-1000-lines/pull/...
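
For a sense of what the CSS route can look like, here is a small Python sketch that injects a wrapping rule into an HTML file before it goes to a PDF converter. `white-space: pre-wrap` and `overflow-wrap: anywhere` are standard CSS properties, but whether the linked PR uses these exact rules is not verified here, and the function name is hypothetical.

```python
# Standard CSS that lets long lines inside <pre> wrap instead of overflowing.
# Assumption: the HTML-to-PDF converter in use honours these properties.
WRAP_CSS = """
pre, pre code {
  white-space: pre-wrap;    /* keep indentation, but allow soft wraps */
  overflow-wrap: anywhere;  /* break inside long tokens such as a regex */
}
"""


def inject_wrap_css(html_text):
    """Insert a <style> block just before </head>, or prepend it if absent."""
    style = f"<style>{WRAP_CSS}</style>"
    if "</head>" in html_text:
        return html_text.replace("</head>", style + "</head>", 1)
    return style + html_text


page = "<html><head><title>Book</title></head><body><pre>x</pre></body></html>"
print(inject_wrap_css(page))
```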

jedberg 16 hours ago

> I have a very long regular expression (email validation of course)

On a tangentially related note, I guarantee you that your regex is wrong. There is only one way to validate an email address:

Send an email to it and have them respond. Otherwise you will block some valid users.

Now of course you can make a regex that gets most email addresses, and if you're ok with that, then that's fine. But if you don't want to accidentally exclude someone, then sending email is the only way to validate it.

flowerthoughts 16 hours ago

It's very easy to make a regex that allows all, but catches simple errors: /.+@.+/. Maybe narrow down the domain name, but don't forget that trailing dots are valid in DNS names.
Why is everyone trying to check for things they don't have to? If you need a valid email address, of course you have to send an email for confirmation. anna@example.com is perfectly syntactically valid, but isn't useful to anyone for sending emails. If you optionally want your users to enter an email address, don't overcomplicate things.
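
In Python, that permissive check might look like the sketch below. The function name is hypothetical, and the check only catches gross typos; a confirmation email is still what proves the address works.

```python
import re

# Deliberately permissive: something, an "@", then something.
LOOSE_EMAIL = re.compile(r".+@.+")


def looks_like_email(value):
    """Return True if the value has at least one character on each side of an @."""
    return LOOSE_EMAIL.fullmatch(value) is not None


for candidate in ["anna@example.com", "anna", "anna@", "@example.com"]:
    print(candidate, looks_like_email(candidate))
```
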
- skydhash 15 hours ago
  
  > Why is everyone trying to check for things they don't have to?
  I forgot where I read it (maybe something about testing or DDD), but an idea I like a lot is to not validate stuff coming from an external system beyond your own internal constraints. You don't control an email account or how it was created, and the specification is messy, so if you want to check for its existence, you query the other system. Same for other identifiers.

geor9e 19 hours ago

I think you're overcomplicating it. I assume you created this markdown file and I assume you have a preview render that shows it the way you like it to be shown. So just hit the print button, and in the print dialog select save as PDF.

npodbielski 18 hours ago

You are right if OP wants to do this manually. If not, I guess it is much more complicated: you would need some tool that allows you to print rendered markdown. Add one more step, converting the print output to PDF, and you effectively have a tool that converts markdown to PDF.
- milch 13 hours ago
  
  I do this with Chrome for my resume. Write it in Markdown, convert to HTML using Pandoc and then print it to PDF using Chrome.
  Google\ Chrome --no-sandbox --headless --print-to-pdf-no-header --no-pdf-header-footer --enable-logging=stderr --log-level=2 --in-process-gpu --disable-gpu --print-to-pdf=resume.pdf "file://path/to/resume.html"
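
If this step needs to run from a script, the same invocation can be wrapped in Python. This is a minimal sketch that only reuses flags from the command above; the `google-chrome` binary name is an assumption (it differs per platform) and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def chrome_pdf_command(html_path, pdf_path, chrome="google-chrome"):
    """Build the argv for a headless Chrome print-to-PDF run.

    `chrome` is an assumption: the binary name and location vary by platform.
    """
    return [
        chrome,
        "--headless",
        "--no-pdf-header-footer",
        f"--print-to-pdf={pdf_path}",
        Path(html_path).resolve().as_uri(),  # absolute file:// URL
    ]


def render_pdf(html_path, pdf_path, chrome="google-chrome"):
    """Run Chrome and raise CalledProcessError if it exits with an error."""
    subprocess.run(chrome_pdf_command(html_path, pdf_path, chrome), check=True)


print(chrome_pdf_command("resume.html", "resume.pdf"))
```
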
  - npodbielski 3 hours ago
    
    Pretty neat. I did not think that was possible. Thanks.

gloxkiqcza 18 hours ago

I’m not OP but e.g. VS Code markdown preview (which works great) doesn’t offer printing.

martylamb 18 hours ago

It's not marketed as a markdown-to-pdf tool, but I've found that Obsidian (https://obsidian.md) does an excellent job. Just create a new "vault", paste your markdown into a new note, and export to PDF.

SamCoding 16 hours ago

I love Obsidian too, however I found that internal links didn't work when exporting it. Do you know what format works? My internal links work in the Obsidian preview but not in the PDF export.
- _diyar 2 hours ago
  
  How should the internal links work when converting into a PDF? They are obviously intended to enable a wiki-like structure in your notes, but I don't see a way they could work upon export.

w4rh4wk5 18 hours ago

It has been a while, but back then I cobbled together a pipeline using Pandoc [1] and wrote my master's thesis with it [2]. While the primary output is HTML, PDF is supported as well.

[1]: https://github.com/w4rh4wk/dogx

[2]: https://github.com/W4RH4WK/M.Sc.-Thesis/blob/master/output/t...

fforflo 19 hours ago

Does converting to HTML first and then to PDF help?

deanebarker 19 hours ago

This is what I have done for a couple of books I wrote in Markdown (https://deanebarker.net/books/).
Convert to HTML, then use Prince (https://www.princexml.com/) to style and convert to PDF.
- tnt128 18 hours ago
  
  Their licenses are pretty expensive. Any good free open source alternatives?
  - flowerthoughts 16 hours ago
    
    I don't know what the scope of this is, but https://pagedjs.org/ is Javascript that does pagination and page margin styling. It's essentially a polyfill for CSS Paged Media: https://www.w3.org/TR/css-page-3/
    Pretty nice to work with, if you can run JS. (The rest is just Puppeteer to print. Though I couldn't use their command line tool, because it force-injects paged.js, and it didn't play well with the Preact components for previewing I had made.)
  - deanebarker 15 hours ago
    
    I've only ever used a free version? I've never paid for it. I think it's free for personal projects? Or at least it was...
    Edit: I see it's $495 now. I don't think it was priced when I used it, but it's been 4-5 years.
  - jessekv 18 hours ago
    
    If it looks correct in a browser, then chromium + playwright.

Syzygies 16 hours ago

I've been saving Markdown transcripts of my more involved AI chats, and I was unhappy with how any tool rendered to PDF. In either Cursor or Windsurf, I had Claude 3.5 Sonnet code a Ruby script for me that converts Markdown to Typst, a LaTeX alternative that looks a lot like Markdown. Typst offers beautiful formatting control for the output PDFs.

SamCoding 15 hours ago

Hi, thanks for all the suggestions. Typst ultimately worked best: as I was generating my Markdown file with a script, I could modify it to generate a Typst file instead, and all of the links and highlighting worked beautifully.
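
For anyone curious what generating Typst from a script can look like, here is a minimal Python sketch. It is not the actual script; the function names and the `(title, label, code, lang)` input shape are hypothetical, and it assumes titles and labels contain no characters that are special in Typst markup. It relies on Typst's `= Heading <label>` label syntax, `#link(<label>)[...]` for internal links, and `#outline()` for the table of contents.

```python
FENCE = "`" * 3  # Typst raw blocks use the same triple-backtick fence as Markdown


def typst_heading(title, label, level=1):
    """A heading with a label that internal links can target."""
    return f"{'=' * level} {title} <{label}>"


def typst_link(label, text):
    """An internal link to a labelled element."""
    return f"#link(<{label}>)[{text}]"


def typst_code_block(code, lang):
    """A raw block tagged with a language for syntax highlighting."""
    return f"{FENCE}{lang}\n{code}\n{FENCE}"


def build_document(sections):
    """Build Typst source from (title, label, code, lang) tuples."""
    lines = ["#outline()", ""]  # table of contents generated by Typst
    for title, label, code, lang in sections:
        lines += [typst_heading(title, label), "", typst_code_block(code, lang), ""]
    return "\n".join(lines)


print(build_document([("Setup", "setup", "print('hi')", "python")]))
print(typst_link("setup", "see Setup"))
```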

agateau 16 hours ago

> I have a very long regular expression (email validation of course) that doesn't fit on one line but no solutions I have found properly break the lines on page overflow.

Have you considered manually splitting the regular expression into multiple lines in the source document, using something like the `VERBOSE` mode from Python re module [1]?

[1]: https://docs.python.org/3/howto/regex.html#using-re-verbose
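
A minimal sketch of what that looks like, using a deliberately simplified pattern (an illustration of the layout only, not a complete email validator):

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts a
# comment, so the expression can be split across short, annotated lines.
EMAIL_RE = re.compile(
    r"""
    ^
    [A-Za-z0-9._%+-]+      # local part
    @
    (?:[A-Za-z0-9-]+\.)+   # one or more domain labels, each followed by a dot
    [A-Za-z]{2,}           # top-level domain
    $
    """,
    re.VERBOSE,
)

print(bool(EMAIL_RE.match("anna@example.com")))
print(bool(EMAIL_RE.match("not-an-email")))
```

Each line of the pattern stays well within the page width, so the converter never has to break it.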

IshKebab 16 hours ago

I wouldn't use Markdown if you want all those features. Use Pandoc to convert your Markdown to Asciidoc, and then use asciidoctor-pdf.

Unfortunately Asciidoctor is written in Ruby which makes it an arse to work with if you need to write any plugins. And the HTML output uses Google Fonts by default, so I don't think much of the authors. But it's probably the best authoring system I've found for programming style content. For scientific content I would use LyX or maybe Typst.

countrymile 18 hours ago

Quarto is worth looking at. It might not be able to solve your regex issue, though.

marcrosoft 17 hours ago

Render to html and then use webkit2pdf which will give you a pdf that looks exactly like the html shown in chrome. This is a million times easier than working with PDF libraries

ludsan 19 hours ago

I'm surprised Pandoc didn't fit the bill. It's quite configurable with fenced attributes.

I switched from using MD-->(Pandoc-->(latex))--> PDF to using MD-->(Pandoc-->(typst))--> PDF.

arcanemachiner 18 hours ago

I read quite a few people gushing about Pandoc in a similar thread. I have to look into it, as I have similar needs as the OP.

WolfOliver 19 hours ago

I would love to read your feedback on how it works with MonsterWriter.

1. Download the app [01]

2. Create a new empty document

3. Insert a markdown section type

4. Paste your markdown code into the markdown section

5. Click on "Preview & Export"

6. Configure your PDF

I'm the creator of MonsterWriter. For complex markdown it probably has some shortcomings but I would love to hear what is missing for your use case.

[01] https://www.monsterwriter.com/

SamCoding 16 hours ago

Does it support code blocks for Python, JS, HTML and CSS? Also my Markdown is auto-generated by a script of mine. Can I paste Markdown directly into your platform?

Ancapistani 17 hours ago

I’ve not used it for very large documents, but I’ve been very happy with the fidelity of conversion using Marked 2 (https://marked2app.com)

I believe it’s Mac only. I use it sometimes when I’m creating PDFs from my personal documentation to share more publicly, which I keep in Markdown and deploy on Gitlab Pages as a static site.

amgreg 17 hours ago

If you’re on a Mac or iOS you could try creating a Shortcut where you input Markdown, convert to rich text, then output as a PDF. I use Shortcuts regularly. It’s pretty easy to set up. I haven’t tried it on something as large as 500 pages, though. YMMV

SamCoding 16 hours ago

I'm on Windows (with WSL) so unfortunately I can't.

bobek 16 hours ago

I had a decent success with pandoc and typst - https://www.bobek.cz/til/pandoc-markdown-typst/

Yoric 19 hours ago

Random idea: how hard would it be to convert your markdown to typst?

Oras 18 hours ago

weasyprint worked well for me. I'm using it in a service to export resumes.

Keep in mind that you'll need to install custom fonts if you're using languages other than English.

misterspaceman 17 hours ago

Have you already tried converting it in Google Docs?

batrat 19 hours ago

Stirling PDF? https://github.com/Stirling-Tools/Stirling-PDF

adolph 19 hours ago

It may be worthwhile to take a deeper look at Pandoc if other replies don’t respond with something easier.

In a recent Talk Python to Me podcast [0], the Quarto [1] developers talked about how they are using Pandoc’s Lua interpreter [2] to perform transformations that aren’t part of vanilla pandoc in.md -o out.pdf.

0. https://talkpython.fm/episodes/show/493/quarto-open-source-t...

1. https://quarto.org/

2. https://pandoc.org/custom-writers.html

froh 17 hours ago

re: regex can you choose a syntax that allows for manual line breaks and manual formatting or even comments? like re.X in python?

https://docs.python.org/3/library/re.html#re.X

i_am_proteus 17 hours ago

Quarto should "just work" for this. There's an option to wrap code blocks.

dominicdoty 16 hours ago

Whoa this is weird timing - just this weekend I did a little exploration of using Svelte to create documents and eventually PDFs.

It's really just a proof of concept at this point, but it might be of interest to you (and others).

Code: https://github.com/dominicdoty/sveltedoc

Rendered: https://sveltedoc.pages.dev/

Writeup: https://www.dominicdoty.com/2025/03/02/sveltedoc/

TLDR - I've been using Asciidoc a lot at work recently and was dissatisfied with it. This was an attempt at using Svelte to generate a document as a webpage that formats well when printed (or printed to PDF). All the power of HTML+CSS+JS when you want it, but the ease of use to just write markdown when you don't.

netbioserror 18 hours ago

Typora is the best I've used. It's a GUI, but it's pretty fantastic for a GUI Markdown editor (especially an Electron one), and its PDF export is consistent and customizable with styles. Includes a few good ones out of the box. Plus an automated TOC.

oulipo 18 hours ago

Have you tried https://typst.app/?

jppope 18 hours ago

typora.io is what I use

geor9e 6 hours ago

Another option is Google Docs via Tools > Preferences > Enable Markdown

Yoric 19 hours ago

Have you tried mdbooks?

contingencies 16 hours ago

IMHO electron based markdown editors are generally slow, bloated, short-lived, and often platform-limited.

Use this and add sed lines for any required non-breakyness per normal CSS, rules can be specific to @media print as required.

  $ cat ~/bin/mdview 
  #!/bin/bash
  # markdown viewer
  tmpfile=".mdview.tmp-$(uuidgen).html"
  # start html
  echo "<html><head><style>img{margin:20px;max-width:100%}@media print{img{max-height:90%;max-width:90%;page-break-after:always}}body{margin:6em;font-family:sans}pre,code{font-weight:bold;font-size:110%;font-family:Ubuntu Mono}</style></head><body>" >"${tmpfile}"
  # duplicate markdown for modification
  cp "${1}" "${1}.mdtmp"
  # add extra newline after trailing :
  sed -i -e 's/: \*$/:\r\r\n\n/' "${1}.mdtmp"
  # generate HTML from markdown
  #  note the --html-no-skiphtml --html-no-escapehtml allows the preservation
  #  of <a name="blah"></a> anchors within text to allow [link](#anchorname)
  lowdown --html-no-skiphtml --html-no-escapehtml -thtml "${1}.mdtmp" >>"${tmpfile}"
  # remove the temporary markdown file
  rm "${1}.mdtmp"
  # add newline before images
  sed -i -e 's/<img/<br><img/' "${tmpfile}"
  # view result
  firefox "${tmpfile}" &
  # sleep for a short moment so the browser can load the file
  sleep 1.25
  # remove the temporary file
  rm "${tmpfile}"

westurner 17 hours ago

MyST-MD transforms to LaTeX or HTML, which are transformable to (PostScript and then) PDF. With LaTeX it's possible to exactly typeset.

Sphinx and jupyter-book support MyST Markdown.

PDF tables of contents with links to headings or page numbers are possible with MyST and reStructuredText.