https://leancrew.com/all-this/2025/05/mathml-with-pandoc/
And now it's all this
MathML with Pandoc
May 3, 2025 at 11:25 AM by Dr. Drang
Since switching from MathJax to MathML to render equations here on
ANIAT, I've tried several approaches to generate the MathML. There
are many utilities and libraries that claim to do the conversion, but
I've found all of them to be limited in one way or another. For a
while, I was even writing MathML directly, albeit with the help of
some Typinator abbreviations, because I couldn't trust the converters
to generate the correct characters or even understand some LaTeX
commands I use regularly. Recently, I began using what I should have
started out with: Pandoc.
It's not that I wasn't aware of Pandoc. It's famous in the
Markdown/HTML/LaTeX world, and I probably first heard of it shortly after its
release. But I've always thought of it as a document converter, not
an equation converter. I was wrong. It's very easy to use with a
single equation.
pandoc --mathml <<<'$$T = \frac{1}{2} v_x^2$$'
produces
where I've added linebreaks and indentation to the output to make it
easier to read. Because it's delimited by double dollar signs, the
equation is rendered in block mode, like this:
$$ T = \frac{1}{2} v_x^2 $$
Single dollar signs would generate MathML with a display="inline"
attribute.
(If you look at the source code for this page, you'll see that I
usually delete some of the code Pandoc generates--we'll get to that
later.)
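If you'd rather drive this from Python than from the shell, here's a minimal sketch of the same call. The function names delimit and to_mathml are my own inventions for illustration, and the sketch assumes pandoc is on your PATH; the only Pandoc feature it relies on is the --mathml flag shown above.

```python
import subprocess


def delimit(latex: str, block: bool = True) -> str:
    """Wrap a bare LaTeX equation in $$...$$ (block) or $...$ (inline)."""
    mark = '$$' if block else '$'
    # Strip whitespace so the inline form has no space next to the dollar signs.
    return f'{mark}{latex.strip()}{mark}'


def to_mathml(latex: str, block: bool = True) -> str:
    """Pass one delimited equation to Pandoc and return whatever HTML it prints.

    Assumes the pandoc executable is installed and on the PATH.
    """
    result = subprocess.run(
        ['pandoc', '--mathml'],
        input=delimit(latex, block),
        text=True,
        capture_output=True,
        check=True,  # raise CalledProcessError if Pandoc reports a failure
    )
    return result.stdout


# Example call (needs Pandoc installed, so it's left commented out):
# print(to_mathml(r'T = \frac{1}{2} v_x^2'))
```

Keeping the delimiter choice in its own function means you can switch between block and inline output without touching the LaTeX itself.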
All the converters handled simple equations, like $v_0^2$, well, but more
complicated stuff can be troublesome. One of the problems other
converters have is dealing with multiline equations, something Pandoc
handles with ease. For example, this piecewise function definition,
$$ f(x) = \left\{ \begin{array} {rcl}
-1 & : & x<0 \\
0 & : & x=0 \\
+1 & : & x>0
\end{array} \right. $$
is rendered exactly as expected:
$$ f(x) = \left\{ \begin{array}{rcl}
-1 & : & x<0 \\
0 & : & x=0 \\
+1 & : & x>0
\end{array} \right. $$
Well, perhaps not exactly as expected. If you're reading this in
Chrome (and, presumably, other Chrome-based browsers), all the cells
in the array are aligned left, which puts the zero in the wrong spot,
not vertically aligned with the ones. But that's Chrome's fault, not
Pandoc's.
Since Pandoc understands the \begin{array} command, it can do
matrices, too:
$$ k = \frac{EA}{L} \left[ \begin{array}{rr}
1 & -1 \\
-1 & 1
\end{array} \right] $$
So far, I've found only one small bug in Pandoc's conversion from
LaTeX to MathML. Here's a simple formula that includes both a
summation and a limit:
$$ e^x
= \sum_{n=0}^\infty \frac{x^n}{n!}
= \lim_{n\to\infty} \left( 1+ \frac{x}{n} \right)^n $$
This is what it should look like, a screenshot of the equation as
rendered by LaTeX itself:
Screenshot of exponential expansion and limit
But here's how it comes out after passing the equation to Pandoc:
$$ e^x
= \sum_{n=0}^\infty \frac{x^n}{n!}
= \lim\nolimits_{n\to\infty} \left( 1+ \frac{x}{n} \right)^n $$
The summation is fine, but the limit is formatted incorrectly. The
$n\to\infty$ part should be under the lim, not off to the side like a
subscript. That subscript-like formatting is what you'd use for an
inline equation, not a block equation.
Let's see what happened. Here's the MathML produced by Pandoc:
xml:
1:
The problem with the rendering of the limit is in Line 17. There's an
empty <mo> element after the <mi>lim</mi> element. That's
what's messing up the formatting. If we remove that empty element,
the limit gets formatted the way it should:
$$ e^x
= \sum_{n=0}^\infty \frac{x^n}{n!}
= \lim_{n\to\infty} \left( 1+ \frac{x}{n} \right)^n $$
Obviously, I'm not going to try to fix Pandoc; I have no idea how to
program in Haskell. I'll send a note to John MacFarlane (can he really
still be the sole developer?) about the rendering bug, but in the
meantime I'll just remember to delete the empty <mo> whenever I
need a limit.
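Deleting that element by hand gets old, so here's a rough sketch of doing it with a regular expression. It rests on assumptions I haven't confirmed against every Pandoc version: that the stray element is an <mo> sitting directly after <mi>lim</mi>, and that it's either truly empty or holds only the invisible function-application character (U+2061), possibly written as an entity. The name drop_stray_mo is made up for this example.

```python
import re

# Assumption: the stray element is an <mo> directly after <mi>lim</mi> that is
# empty or contains only U+2061 (invisible function application), written
# either as the raw character or as a numeric entity.
STRAY_MO = re.compile(r'(<mi>lim</mi>)\s*<mo>(?:\u2061|&#8289;|&#x2061;)?</mo>')


def drop_stray_mo(mathml: str) -> str:
    """Remove the stray <mo> after <mi>lim</mi>; leave everything else alone."""
    return STRAY_MO.sub(r'\1', mathml)


# Hypothetical fragment, not copied from real Pandoc output:
fragment = '<mrow><mi>lim</mi><mo></mo></mrow>'
print(drop_stray_mo(fragment))
```

Because the pattern is anchored to <mi>lim</mi>, ordinary operators like <mo>=</mo> are left untouched.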
I think I've mentioned in the past that one of my favorite features
of Markdown is that it allows you to mix HTML with regular Markdown
text; it passes the HTML through unchanged. I'm using that here to
add MathML equations to my blog posts. I write the equation in LaTeX,
select it, and run a Keyboard Maestro macro that replaces the LaTeX with
its MathML equivalent. Because I'm still messing around with the
macro (and may change it to an automation that BBEdit runs directly)
I won't post it here, but I do want to include the Python script that
runs Pandoc to do the conversion and then cleans up Pandoc's output
to make it more compact.
Here's the script:
python:
1: #!/usr/bin/env python3
2:
3: import sys
4: import subprocess
5: from bs4 import BeautifulSoup
6:
7: # Get LaTeX from stdin, run it through Pandoc, and parse the HTML
8: latex = sys.stdin.read()
9: process = subprocess.run(['pandoc', '--mathml'], input=latex, text=True, capture_output=True)
10: html = process.stdout
11: soup = BeautifulSoup(html, 'lxml')
12:
13: # Extract the MathML
14: math = soup.math
15:
16: # Delete the annotation
17: math.annotation.decompose()
18:
19: # Delete the unnecessary wrapper
20: math.semantics.unwrap()
21:
22: # Delete the unnecessary top-level wrapper in block display
23: if math['display'] == 'block':
24:     math.mrow.unwrap()
25:
26: # Delete the unnecessary attribute for inline display
27: if math['display'] == 'inline':
28:     del math['display']
29:
30: # Print the cleaned-up MathML
31: print(math)
Lines 8-10 get the LaTeX equation from standard input, pass it
through Pandoc via the subprocess.run function, and save the standard
output to the html variable. Line 11 then parses html with Beautiful
Soup, putting it in a form that makes it very easy to change.
Because we don't need the <p> tags the MathML is wrapped in, we
pull out just the <math> part in Line 14. The rest of the code
removes elements and attributes that can be useful, but which don't
add to the rendering of the equations. You may disagree with my
removal of these pieces, but it's my blog.
First, I don't want to keep the original LaTeX code, so Line 17
deletes the <annotation> tag and everything inside it.
With that gone, <semantics> and </semantics> are no longer necessary,
so I got rid of them, too. Unlike the decompose function, which
removes tags and their contents, unwrap removes just the tags,
leaving behind what's between them.
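The difference between the two is easier to see on a small example. This is only a sketch: the sample string is something I typed by hand in the general shape of the output described above, not real Pandoc output, and I've used Beautiful Soup's built-in 'html.parser' instead of 'lxml' so it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Hand-written sample in the general shape described above (an assumption,
# not actual Pandoc output).
sample = (
    '<math display="block" xmlns="http://www.w3.org/1998/Math/MathML">'
    '<semantics><mrow><mi>T</mi><mo>=</mo><mn>1</mn></mrow>'
    '<annotation encoding="application/x-tex">T = 1</annotation>'
    '</semantics></math>'
)

math = BeautifulSoup(sample, 'html.parser').math

# decompose: the tag and everything inside it disappear.
math.annotation.decompose()

# unwrap: only the tag disappears; its children move up into <math>.
math.semantics.unwrap()

cleaned = str(math)
print(cleaned)
```

After both calls the <mrow> and its contents are still there, now sitting directly inside <math>, while the LaTeX annotation is gone entirely.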
I've noticed there's always an extra <mrow> wrapper around
block equations, so Lines 23-24 get rid of that. And because
display="inline" is the default, Lines 27-28 delete that attribute from the