https://tex.stackexchange.com/questions/4201/is-there-a-bnf-grammar-of-the-tex-language

Skip to main content
 
 

Stack Exchange Network

Stack Exchange network consists of 183 Q&A communities including
Stack Overflow, the largest, most trusted online community for
developers to learn, share their knowledge, and build their careers.

Visit Stack Exchange
[                    ]
Loading...
 1.  
      + Tour Start here for a quick overview of the site
      + Help Center Detailed answers to any questions you might have
      + Meta Discuss the workings and policies of this site
      + About Us Learn more about Stack Overflow the company, and our
        products
 2.  
 3. current community

      +  
        TeX - LaTeX
        help chat
      + 
         
        TeX - LaTeX Meta
   
    your communities

    Sign up or log in to customize your list.

    more stack exchange communities

    company blog
 4.
 5. Log in
 6. Sign up

TeX - LaTeX

 1.
     1.  
        Home
     2.  
        Questions
     3.  
        Tags
     4.
     5.  
        Users
     6.  
        Jobs
     7.  
        Companies
     8.  
        Unanswered
 2. Teams
    [teams-promo]

    Ask questions, find answers and collaborate at work with Stack
    Overflow for Teams.

    Explore Teams Create a free Team
 3.  
    Teams
 4. Ask questions, find answers and collaborate at work with Stack
    Overflow for Teams. Explore Teams

Teams

Q&A for work

Connect and share knowledge within a single location that is
structured and easy to search.

Learn more about Teams

Is there a BNF grammar of the TeX language?

Ask Question
Asked 13 years, 8 months ago
Modified 1 year, 8 months ago
Viewed 19k times
114
 

I'm looking for a BNF grammar of the TeX language, does it exist?

EDIT

For those of us who are not computer scientists, a BNF grammar is one
kind of formal description of a CFG: Backus Naur Form.

For those who don't know their Chomsky, a CFG is a context-free
grammar, which means (very roughly) that substitutions can be made
independently of their use context.

  * tex-core
  * parsing
  * grammar

Share
Improve this question
Follow 
edited Feb 6, 2017 at 9:32
 
Martin Schroder's user avatar
Martin Schroder
15.2k88 gold badges5555 silver badges140140 bronze badges
asked Oct 16, 2010 at 20:08
 
rystsov's user avatar
rystsovrystsov
1,24322 gold badges99 silver badges66 bronze badges
1

  * 
    If it were for LaTeX, the parser is included in
    tree-sitter.github.io/tree-sitter
    - sw.
    5 hours ago

Add a comment  |   

5 Answers 5

Sorted by: Reset to default
[Highest score (default)     ]
171
 

Parsing TeX is Turing complete

TeX can only be parsed by a complete Turing machine (modulo the
finite space available), which precludes it from having a BNF. This
comes from a combination of two features: first, TeX is Turing
complete (if you need proof, this Turing machine simulator should
suffice); and second, TeX can redefine macros (and their parsing
rules) at runtime. Since TeX can require that macros be followed by
specific characters, redefining a macro can mean redefining the
syntax of TeX. Combining these facts means that we can write TeX code
like the following, where \f is defined to be the algorithmic
representation in TeX of some arbitrary (computable) function f : Z -
Z:

\def\TuringCompleteness#1#2{%
  \edef\Output{#1{#2}}%
  \ifnum\Output=0
    \def\Required!{Bang!}%
  \else
    \def\Required?{Huh?}%
  \fi}
\TuringCompleteness{\f}{0}
\Required!

Will this TeX parse? That depends on the value of f(0) / \f{0}! And
since TeX is Turing-complete, computing this value may require a full
Turing machine, and therefore so will determining whether or not \
Required! is valid TeX syntax.


It gets worse: category codes

Perhaps, though, you think that you can still perform a rough check
for code that looks like a backslash, followed by letters, followed
by braced stuff. This will already get inadequate quickly, but that's
not all. If delimited macros weren't bad enough, things get even
worse when you realize that the reason that the characters \, {, },
$, and so on have the behavior they do is simply because TeX
specifies that they have particular "category codes". The ability to
redefine these means that writing TeX like

\catcode`\~=0
\catcode`\]=1
\catcode`\[=2
\catcode`\}=12
\catcode`\{=12
\catcode`\\=12

means that all subsequent (La)TeX code must be written like so:

~textbf]This is bold-faced text.[  This text contains a literal backslash
(~texttt]\[) and literal curly braces (~texttt]{}[).

to get output like so:

    This is bold-faced text. This text contains a literal backslash
    (\) and literal curly braces ({}).


Constructive Proof that TeX is (at least) Context-Sensitive

Additionally, inspired by TH.'s answer, here's a specific
constructive proof that TeX is not context-free. And note that it
doesn't have to rely on catcode trickery--just TeX's ability to define
arbitrarily-delimited macros. (Yes, this is implied by the first
section, but it's still nice to see a specific case in action.)

\newcount\nesting

\def\RecognizeABC#1{\begingroup
  \nesting=0
  \OneStep#1\relax
  \ifnum\nesting=1
    There was \the\nesting\ copy of each control sequence.\par
  \else
    There were \the\nesting\ copies of each control sequence.\par
  \fi
\endgroup}

\def\CheckFinished#1{%
  \ifx#1\relax
    \def\OneStep#1{\relax}%
  \fi
  \OneStep#1}

\def\OneStep\a#1\b#2\c#3\relax{%
  \advance\nesting by 1
  \CheckFinished#1#2#3\relax}

This code is set up to recognize what Wikipedia calls "the canonical
non-context free language" of n copies of \a followed by n copies of
\b followed by n copies of \c, where n is at least one. The first
line just allocates a counter; we'll use this to keep track of how
many copies of the control sequences we see, but it's just so we can
display some text at the end, not for parsing. To attempt recognition
of some text, call \RecognizeABC{...}. This simply sets \nesting to
zero (for, again, output purposes), and calls \OneStep. \OneStep is
set up to take three arguments delimited by \a, \b, and \c. The first
argument must come between \a and \b, the second between \b and \c,
and the third between \c and \relax. If there are the same number of
each of \a, \b, and \c, then this will eventually be \OneStep\a\b\c\
relax and all the arguments will be empty. With \OneStep, we just
record the fact that there's one more of each \a, \b, and \c (which
is, again, just for output), and then we see if we're done. \
CheckFinished looks to see what the next token in the input stream
is. If it's \relax, then the arguments were empty, and we're done;
redefine \OneStep not to do anything. Now, call \OneStep--whether it's
the parsing macro or nothing at all--on the remaining input stream,
recursing. Once we're done, we finally finish \RecognizeABC, where we
typeset how many copies of each control sequence we saw.

Phew! If you didn't follow all that, here's what a trace looks like,
if we only look at the important macros:

[START] \RecognizeABC{\a\a\a\b\b\b\c\c\c}
    --> \OneStep\a\a\a\b\b\b\c\c\c\relax
                  -#1-  -#2-  -#3-
    --> \CheckFinished\a\a\b\b\c\c\relax
    --> \OneStep\a\a\b\b\c\c\relax
                  #1  #2  #3
    --> \CheckFinished\a\b\c\relax
    --> \OneStep\a\b\c\relax
          (All the arguments are empty.)
    --> \CheckFinished\relax
    --> ``There were 3 copies of each control sequence.''

Thus, the following TeX

\RecognizeABC{\a\b\c}
\RecognizeABC{\a\a\b\b\c\c}
\RecognizeABC{\a\a\a\b\b\b\c\c\c}

Yields the following output:

    There was 1 copy of each control sequence.
    There were 2 copies of each control sequence.
    There were 3 copies of each control sequence.

But TeX such as

\RecognizeABC{}
\RecognizeABC{\a}
\RecognizeABC{\a\a\b\b\c\c\c}
\RecognizeABC{\a\b\c\garbage}

fails.

---------------------------------------------------------------------

Edit 1: In addition to minor bugfixes/cleanup and breaking the answer
into sections with headings, I expanded the rationale for TeX being
unparsable by anything short of a Turing machine to hopefully make it
slightly more clear. This involved removing the previous \weird
example, replacing it with something that's more of a proof. If you'd
rather have a concrete example (but non-proof) with a command which
redefines itself, consider the following TeX, which was there before:

\def\weird[#1][#2]{%
  \ifx#1\relax
    \def\weird!{I was previously called with a relaxing first argument.}%
  \else\ifx#2\relax
    \def\weird(##1){I can only be called with one argument; I was given (##1).}%
  \fi\fi
  I was called with [#1] and [#2].}

To start, \weird must be called as \weird[stuff][other stuff]. But if
its first argument is \relax, then it redefines itself to require
being called as \weird!; if its second argument is nothing, it
redefines itself to require being called as \weird(stuff); and it
always typesets text. This can't be captured by a BNF grammar (I
don't think). And of course, in the general case, we need a Turing
machine.

Share
Improve this answer
Follow 
edited Mar 28, 2018 at 22:01
 
Joel's user avatar
Joel
10333 bronze badges
answered Oct 16, 2010 at 23:36
 
Antal Spector-Zabusky's user avatar
Antal Spector-ZabuskyAntal Spector-Zabusky
13.2k77 gold badges4747 silver badges4646 bronze badges
11

  * 3
    Very nice answer! It might be easier to read if you removed the
    unnecessary comments at the ends of most of the lines. (Only six
    lines in your lists need comments, one of which is missing one.
    The others are unnecessary.)
    - TH.
    Oct 17, 2010 at 0:27
  * 3
    It's what makes TeX so much fun to program in. It's always a
    puzzle!
    - TH.
    Oct 17, 2010 at 2:15
  * 14
    Fantastic answer. This question comes up periodically, and I've
    never seen such a useful reply.
    - Will Robertson
    Oct 17, 2010 at 3:22
  * 4
    Thank you for this comprehensive answer, I knew nothing about
    catcode. It seems that parsing tex is not a simple task as I
    thought before. My final task is to extract all math formulas
    from a set of articles (in tex), I thought I could parse them to
    do it, but now I'm not sure about it. What approach could you
    recommend? I think casual articles do not use catcode, so to
    extract correct math formulas I may parse tex and process
    newcommand. The alternative is to try use existing engine to
    parse and extract the needed information from its internals.
    - rystsov
    Oct 17, 2010 at 7:42
  * 14
    @rystsov: I think the best approach is to make assumptions.
    Assume that your TeX files use only a certain set of standard
    packages and certain user-defined macros. Write your parser so
    that it only tries to understand this particular tiny subset of
    TeX. Be conservative: whenever your parser sees something that it
    does not understand, throw an error. The end result is likely
    something very specific to your own style of using TeX in your
    articles, but it might be enough for your needs. Most
    importantly, this allows you to extract "semantic" things (e.g.,
    the parameters of your own macros).
    - Jukka Suomela
    Oct 17, 2010 at 8:37

  |  Show 6 more comments
29
 

Just to add another angle to the excellent posts by TH. and Antal, is
that besides TeX being a Turing complete machine, it can also be
coerced to do lambda calculus, the basis of all functional languages.
On CTAN there is a package called lambda-lists developed by Alan
Jeffrey that implements, among other lambda concepts, unbounded
lists. Not only does TeX do all these, but impressively, the routines
were implemented in TeX's mouth, proving that TeX's mouth is as
powerful as any computer anywhere.

To understand the relationship between the lambda calculus and Turing
machines one needs to refer to the Entscheidungsproblem which is one
of the famous 23 problems proposed by David Hilbert.

In 1936 and 1937 Alonzo Church and Alan Turing respectively,
published independent papers showing that it is impossible to decide
algorithmically whether statements in arithmetic are true or false,
and thus a general solution to the Entscheidungsproblem is
impossible.

This was done by Alonzo Church in 1936 with the concept of "effective
calculability" based on his l calculus and by Alan Turing in the same
year with his concept of Turing machines. It was later recognized
that these are equivalent models of computation. The work of both
authors was heavily influenced by Kurt Godel's earlier work on his
incompleteness theorem - Wikipedia

So the lambda calculus, Turing machines and TeX are not just closely
related but they are equivalent models of computation and this is one
of many aspects that makes TeX and friends interesting!

Share
Improve this answer
Follow 
edited Aug 17, 2022 at 13:21
 
Cole Tobin's user avatar
Cole Tobin
54655 silver badges1212 bronze badges
answered Oct 17, 2010 at 17:31
 
yannisl's user avatar
yannislyannisl
118k3535 gold badges290290 silver badges563563 bronze badges
9

  * 31
    For those unfamiliar with the terminology, 'TeX's mouth' refers
    to the scanning stage before macros and registers can be set
    (although their contents can be retrieved); i.e, code involving
    only \ifx, \expandafter, \csname, \the, etc. etc., but not \def,
    \let, and others. When a macro is said to be 'expandable', this
    means it executes in TeX's mouth.
    - Will Robertson
    Oct 19, 2010 at 2:58
  * 1
    TeX's eyes look at lines of .tex-input-files and cause TeX to put
    characters in TeX's mouth.TeX's mouth creates tokens. Expansion
    of expandable tokens takes place in the gullet (and in the
    TeXbook is called a regurgitation-process). Assignments are done
    in the stomach. (See the begin of TeXbook Chapter 24: Summary of
    Vertical Mode: ...
    - Ulrich Diez
    Nov 28, 2021 at 2:10
  * 1
    ... TeXbook Chapter 24: Summary of Vertical Mode: We shall study
    TeX's digestive processes, i.e., what TeX does with the lists of
    tokens that arrive in its "stomach." Chapter 7 has described the
    process by which input files are converted to lists of tokens in
    TeX's "mouth," and Chapter 20 explained how expandable tokens are
    converted to unexpandable ones in TeX's "gullet" by a process
    similar to regurgitation. When unexpandable tokens finally reach
    TeX's gastro-intestinal tract, the real activity of typesetting
    begins, and that is what we are going to survey in these summary
    chapters.)
    - Ulrich Diez
    Nov 28, 2021 at 2:11
  * 
    Both the TUGboat-article "Lists in TeX's mouth" TUGboat, vol.11,
    no.2, pp.237-245 and the entry for the package lambda-lists at
    CTAN are inaccurate as there it is stated that expansion takes
    place in TeX's mouth although the TeXbook itself clearly says
    that expansion takes place in the gullet(, which is supposed to
    be the thing between mouth and stomach).
    - Ulrich Diez
    Nov 28, 2021 at 2:29
  * 
    @UlrichDiez To be honest, I never understood why Knuth used the
    body parts analogies to explain TeX; a more familiar from a CS
    point of view terminology would have been better. There is a
    mention of TeX in the Red Dragon book for compilers, which says .
    ( from memory) ... that the tokens swim in a sea of letters! I
    will have another look at your observations.
    - yannisl
    Nov 28, 2021 at 4:14

  |  Show 4 more comments
14
 

It's extremely unlikely. TeX can change how it parses input while
it's in the middle of reading the input. I'd have to think a bit
longer, but I think that precludes TeX from having a context-free
grammar and thus it could have no BNF.

Share
Improve this answer
Follow 
edited Oct 16, 2010 at 22:49
answered Oct 16, 2010 at 21:13
 
TH.'s user avatar
TH.TH.
62.9k1616 gold badges163163 silver badges221221 bronze badges

Add a comment  |   
5
 

I can't prove that there is a BNF grammar for Tex but I would like to
argue against couple of points made in Antal's answer:

 1. There are examples of Turing machine simulators in C/C++/Java on
    the same website mentioned above. That does not mean you can't
    write a BNF grammar for C/C++/Java.

 2. The fact that Tex lets you change the category code or meaning of
    command codes does not mean Tex is context sensitive from a
    parser's point of view. This is related to the semantics of Tex.
    I think this is analogous lexical scope in languages like Java.
    You can have two different variables with same name in the same
    java class where one is a class field and the other one is a
    local variable. And the type of these two variables can even be
    different. That does not mean Java parser is context sensitive.

Sorry that I have posted this as a separate answer but the site won't
let me make this long comment.

Share
Improve this answer
Follow 
answered Mar 14, 2013 at 7:56
 
codefx's user avatar
codefxcodefx
58744 silver badges88 bronze badges
10

  * 7
    The difference between TeX and most languages (one might say
    "sensible languages") is that these sorts of changes can affect
    the parsing of TeX. (I imagine Perl might have similar problems
    thanks to source filters and BEGIN blocks?) For instance, \ifx\f\
    g\catcode`\\{=12\fi{} is a parse error if and only if \f and \g
    are equal; and as discussed above, the Turing-completeness of TeX
    (and of TeX's mouth!) means that the equality of \f and \g is
    undecidable. Does that help clear things up?
    - Antal Spector-Zabusky
    Mar 14, 2013 at 15:34
  * 3
    @alexraasch: The difference - and this is subtle! - is that in
    TeX, parsing is Turing-complete, not simply evaluation. In C, we
    know that a + b * c is syntactically valid as long as a, b, and c
    are arbitrary syntactically valid expressions. In TeX, the
    expression \a + \b * \c may trigger a parse error depending on
    the definitions of \a, \b, and \c.
    - Antal Spector-Zabusky
    Jul 29, 2016 at 13:33
  * 2
    @alexraasch: It does show Turing-completeness, b/c it talks about
    both - parsing can do evaluation! My use of a \RecognizeABC{...}
    macro obscures that slightly, I know, b/c we're used to that
    being a guaranteed-to-parse form that just evaluates; however,
    that's not what's going on. Rather, it expands to include the
    string \OneStep...\relax, where \OneStep only parses when followed
    by an \a, a \b, & a \c (with some intervening junk, and followed
    by \relax). \OneStep keeps expanding recursively, and eventually
    either fails to parse or terminates when reaching the empty
    string.
    - Antal Spector-Zabusky
    Jul 29, 2016 at 14:09
  * 1
    @AntalSpector-Zabusky: Well, you can write a TeX program that
    contains an infinite loop but it may still be syntactically
    valid. So I think we need another proof. It will probably be
    rather complicated because of the parse-eval intermixing, as you
    said.
    - alexkelbo
    Jan 28, 2017 at 18:05
  * 1
    @alexraasch: But it's not obvious that such a program "may still
    be syntactically valid". Is the program \def\loop{\loop}\loop}\
    bye invalid? How about \def\loop{\loop}\loop\catcode`}=12 }\bye?
    Or how about \def\x{...}\def\loop{\loop}\loop\ifnum\x=0\catcode`}=
    12\fi}\bye? It's not clear what it means to say that a TeX
    program after an infinite loop even has a well-defined syntactic
    validity; you can't check such validity without running the
    program! So I'm leaning towards the conclusion that a TeX program
    can only be syntactically well-formed if parsing and evaluation
    terminates.
    - Antal Spector-Zabusky
    Jan 28, 2017 at 18:15

  |  Show 5 more comments
1
 

You can always examine the original source code for TeX (LaTeX is
built on an extension of TeX). It is highly readable (thanks to his
proprietary WEB language) and, in fact, compiles to a readable TeX
document using weave (built in to most TeX distributions).

With regards to parsing, here's a summary of the important bits (from
the most disgusting facts to the least disgusting):

  * While defining a command, anything can be put as parameter
    delimiters (comments work as is, but the spacing will be messed
    up). For example,

    \def\test |\a#1@$^&^TRDFS(#2


    |{5#1#25}

    \test|\a {f}@$^&^TRDFS(f


    | % This prints 5ff5

    TeX matches parameters based only on code points and categories
    (macro definitions are completely ignored). This is the reason
    why the top answer's "constructive proof" works out.

  * ^^@ is synonymous to an ASCII <null>...in fact, suppose we have ^
    ^n where n is an ASCII character. If the code point of n is less
    than 64, then add 64 and return. Else, subtract 64 and return.
    Moreover, ^^XX is synonymous to a character with code point 0xXX
    (in HEX).

      + It gets worse... \t^^65ste is equal to \tAste. You can even
        define \tAste using \def\t^^65ste.
      + Fortunately, this is ignored in parameter delimiters (no
        comment).
  * TeX has some context-free components with respect to a different
    alphabet. If we consider TeX's alphabet as pairs a:c where a is
    an ASCII character, and c is its (symbolic) category code, then,
    e.g., TeX numbers (see the scan_int procedure) can be written in
    EBNF as follows

    decimal_number = ("+:other_char" | "-:other_char")* digit+
    digit          = "0:other_char" | ... | "9:other_char"

  * Erasing semantics, TeX has (E)BNF using only category codes. For
    example,

    control_sequence = escape (letter)*

    This is precisely how a lexer would tokenize TeX. The context
    sensitivity comes from dynamically taking action after consuming
    each token.

Share
Improve this answer
Follow 
edited Oct 7, 2022 at 11:37
answered Dec 11, 2020 at 16:57
 
jsheaf's user avatar
jsheafjsheaf
17111 silver badge33 bronze badges
1

  * 
    Well, ^^A is ASCII code point 1, not null, null would be ^^@
    - user202729
    Jul 23, 2022 at 13:47

Add a comment  |   

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged

  * tex-core
  * parsing
  * grammar

.

Linked

 
13
What type of Chomsky hierarchy is TeX?
 
2
Is LaTeX a context-free grammar?
 
0
Can LaTeX-TeX be described with EBNF grammar?
 
77
Advantages and disadvantages of fully expandable macros
 
43
Static analysis of LaTeX documents?
 
21
What parsers for (La)TeX mathematics exist outside of the TeX
engines?
 
22
Are the TeX semantics and grammar defined somewhere in some official
documents?
 
5
Is there a formal grammar for *Equations* in TeX/LaTeX?
 
7
Automatic Way of Searching for Syntax Inconsistencies
 
11
The Universal Language Processor in TeX: Theoretical Question
Regarding the "Expressive" Power of Tex
See more linked questions

Related

 
112
For formal articles, should a displayed equation be followed by a
punctuation to conform to the language grammar ?
 
22
Are the TeX semantics and grammar defined somewhere in some official
documents?
 
1
Is there a BNF-grammar for simple constructs of the TeX language?
 
5
Is there a formal grammar for *Equations* in TeX/LaTeX?
 
6
Typesetting a calculus' grammar
 
4
Combinatory Categorial Grammar style
 
1
french language, correction, grammar
 
0
language french grammar in grammar

Hot Network Questions

  * 
    How to make Bash remove quotes after parameter expansion?
  * 
    Marginalisation with respect to arbitrary distribution
  * 
    Why do we say "he doesn't know him from Adam"?
  * 
    Python matrix class
  * 
    What percentage of light gets scattered by a mirror?
  * 
    Who owns the moon?
  * 
    Can LLMs have intention?
  * 
    How can I obtain a record of my fathers' medals from WW2?
  * 
    How do I tell which kit lens option is more all-purpose?
  * 
    How can I hang heavy bikes under a thick wooden shelf?
  * 
    Does it make sense for giants to use clubs or swords when
    fighting non-giants?
  * 
    Is the validity of induction in Coq axiomatic?
  * 
    Can I travel with my child to the UK if I am not the person named
    in their visitor's visa?
  * 
    A question about syntactic function of the clause
  * 
    Moving after copying in assignment of conditional operator result
  * 
    Application of Lie group analysis of PDE (beyond calculation of
    exact solutions)
  * 
    How do I snap the edges of hex tiles together?
  * 
    Are your memories part of you?
  * 
    Why don't professors seem to use learning strategies like spaced
    repetition and note-taking?
  * 
    Book recommendation introduction to model theory
  * 
    How can I tell whether an HDD uses CMR or SMR?
  * 
    What scientific evidence there is that keeping cooked meat at
    room temperature is unsafe past two hours?
  * 
    Could a 200m diameter asteroid be put into a graveyard orbit and
    not be noticed by people on the ground?
  * 
    Transformer with same size symbol meaning

more hot questions
Question feed

Subscribe to RSS

Question feed

To subscribe to this RSS feed, copy and paste this URL into your RSS
reader.

[https://tex.stackexc]
 
*
lang-tex

TeX - LaTeX

  * Tour
  * Help
  * Chat
  * Contact
  * Feedback

Company

  * Stack Overflow
  * Teams
  * Advertising
  * Collectives
  * Talent
  * About
  * Press
  * Legal
  * Privacy Policy
  * Terms of Service
  * Cookie Settings
  * Cookie Policy

Stack Exchange Network

  * Technology
  * Culture & recreation
  * Life & arts
  * Science
  * Professional
  * Business
  * API
  * Data

  * Blog
  * Facebook
  * Twitter
  * LinkedIn
  * Instagram

Site design / logo (c) 2024 Stack Exchange Inc; user contributions
licensed under CC BY-SA. rev 2024.6.4.10328