https://tex.stackexchange.com/questions/4201/is-there-a-bnf-grammar-of-the-tex-language Skip to main content Stack Exchange Network Stack Exchange network consists of 183 Q&A communities including Stack Overflow, the largest, most trusted online community for developers to learn, share their knowledge, and build their careers. Visit Stack Exchange [ ] Loading... 1. + Tour Start here for a quick overview of the site + Help Center Detailed answers to any questions you might have + Meta Discuss the workings and policies of this site + About Us Learn more about Stack Overflow the company, and our products 2. 3. current community + TeX - LaTeX help chat + TeX - LaTeX Meta your communities Sign up or log in to customize your list. more stack exchange communities company blog 4. 5. Log in 6. Sign up TeX - LaTeX 1. 1. Home 2. Questions 3. Tags 4. 5. Users 6. Jobs 7. Companies 8. Unanswered 2. Teams [teams-promo] Ask questions, find answers and collaborate at work with Stack Overflow for Teams. Explore Teams Create a free Team 3. Teams 4. Ask questions, find answers and collaborate at work with Stack Overflow for Teams. Explore Teams Teams Q&A for work Connect and share knowledge within a single location that is structured and easy to search. Learn more about Teams Is there a BNF grammar of the TeX language? Ask Question Asked 13 years, 8 months ago Modified 1 year, 8 months ago Viewed 19k times 114 I'm looking for a BNF grammar of the TeX language, does it exist? EDIT For those of us who are not computer scientists, a BNF grammar is one kind of formal description of a CFG: Backus Naur Form. For those who don't know their Chomsky, a CFG is a context-free grammar, which means (very roughly) that substitutions can be made independently of their use context. * tex-core * parsing * grammar Share Improve this question Follow edited Feb 6, 2017 at 9:32 Martin Schroder's user avatar Martin Schroder 15.2k88 gold badges5555 silver badges140140 bronze badges asked Oct 16, 2010 at 20:08 rystsov's user avatar rystsovrystsov 1,24322 gold badges99 silver badges66 bronze badges 1 * If it were for LaTeX, the parser is included in tree-sitter.github.io/tree-sitter - sw. 5 hours ago Add a comment | 5 Answers 5 Sorted by: Reset to default [Highest score (default) ] 171 Parsing TeX is Turing complete TeX can only be parsed by a complete Turing machine (modulo the finite space available), which precludes it from having a BNF. This comes from a combination of two features: first, TeX is Turing complete (if you need proof, this Turing machine simulator should suffice); and second, TeX can redefine macros (and their parsing rules) at runtime. Since TeX can require that macros be followed by specific characters, redefining a macro can mean redefining the syntax of TeX. Combining these facts means that we can write TeX code like the following, where \f is defined to be the algorithmic representation in TeX of some arbitrary (computable) function f : Z - Z: \def\TuringCompleteness#1#2{% \edef\Output{#1{#2}}% \ifnum\Output=0 \def\Required!{Bang!}% \else \def\Required?{Huh?}% \fi} \TuringCompleteness{\f}{0} \Required! Will this TeX parse? That depends on the value of f(0) / \f{0}! And since TeX is Turing-complete, computing this value may require a full Turing machine, and therefore so will determining whether or not \ Required! is valid TeX syntax. It gets worse: category codes Perhaps, though, you think that you can still perform a rough check for code that looks like a backslash, followed by letters, followed by braced stuff. This will already get inadequate quickly, but that's not all. If delimited macros weren't bad enough, things get even worse when you realize that the reason that the characters \, {, }, $, and so on have the behavior they do is simply because TeX specifies that they have particular "category codes". The ability to redefine these means that writing TeX like \catcode`\~=0 \catcode`\]=1 \catcode`\[=2 \catcode`\}=12 \catcode`\{=12 \catcode`\\=12 means that all subsequent (La)TeX code must be written like so: ~textbf]This is bold-faced text.[ This text contains a literal backslash (~texttt]\[) and literal curly braces (~texttt]{}[). to get output like so: This is bold-faced text. This text contains a literal backslash (\) and literal curly braces ({}). Constructive Proof that TeX is (at least) Context-Sensitive Additionally, inspired by TH.'s answer, here's a specific constructive proof that TeX is not context-free. And note that it doesn't have to rely on catcode trickery--just TeX's ability to define arbitrarily-delimited macros. (Yes, this is implied by the first section, but it's still nice to see a specific case in action.) \newcount\nesting \def\RecognizeABC#1{\begingroup \nesting=0 \OneStep#1\relax \ifnum\nesting=1 There was \the\nesting\ copy of each control sequence.\par \else There were \the\nesting\ copies of each control sequence.\par \fi \endgroup} \def\CheckFinished#1{% \ifx#1\relax \def\OneStep#1{\relax}% \fi \OneStep#1} \def\OneStep\a#1\b#2\c#3\relax{% \advance\nesting by 1 \CheckFinished#1#2#3\relax} This code is set up to recognize what Wikipedia calls "the canonical non-context free language" of n copies of \a followed by n copies of \b followed by n copies of \c, where n is at least one. The first line just allocates a counter; we'll use this to keep track of how many copies of the control sequences we see, but it's just so we can display some text at the end, not for parsing. To attempt recognition of some text, call \RecognizeABC{...}. This simply sets \nesting to zero (for, again, output purposes), and calls \OneStep. \OneStep is set up to take three arguments delimited by \a, \b, and \c. The first argument must come between \a and \b, the second between \b and \c, and the third between \c and \relax. If there are the same number of each of \a, \b, and \c, then this will eventually be \OneStep\a\b\c\ relax and all the arguments will be empty. With \OneStep, we just record the fact that there's one more of each \a, \b, and \c (which is, again, just for output), and then we see if we're done. \ CheckFinished looks to see what the next token in the input stream is. If it's \relax, then the arguments were empty, and we're done; redefine \OneStep not to do anything. Now, call \OneStep--whether it's the parsing macro or nothing at all--on the remaining input stream, recursing. Once we're done, we finally finish \RecognizeABC, where we typeset how many copies of each control sequence we saw. Phew! If you didn't follow all that, here's what a trace looks like, if we only look at the important macros: [START] \RecognizeABC{\a\a\a\b\b\b\c\c\c} --> \OneStep\a\a\a\b\b\b\c\c\c\relax -#1- -#2- -#3- --> \CheckFinished\a\a\b\b\c\c\relax --> \OneStep\a\a\b\b\c\c\relax #1 #2 #3 --> \CheckFinished\a\b\c\relax --> \OneStep\a\b\c\relax (All the arguments are empty.) --> \CheckFinished\relax --> ``There were 3 copies of each control sequence.'' Thus, the following TeX \RecognizeABC{\a\b\c} \RecognizeABC{\a\a\b\b\c\c} \RecognizeABC{\a\a\a\b\b\b\c\c\c} Yields the following output: There was 1 copy of each control sequence. There were 2 copies of each control sequence. There were 3 copies of each control sequence. But TeX such as \RecognizeABC{} \RecognizeABC{\a} \RecognizeABC{\a\a\b\b\c\c\c} \RecognizeABC{\a\b\c\garbage} fails. --------------------------------------------------------------------- Edit 1: In addition to minor bugfixes/cleanup and breaking the answer into sections with headings, I expanded the rationale for TeX being unparsable by anything short of a Turing machine to hopefully make it slightly more clear. This involved removing the previous \weird example, replacing it with something that's more of a proof. If you'd rather have a concrete example (but non-proof) with a command which redefines itself, consider the following TeX, which was there before: \def\weird[#1][#2]{% \ifx#1\relax \def\weird!{I was previously called with a relaxing first argument.}% \else\ifx#2\relax \def\weird(##1){I can only be called with one argument; I was given (##1).}% \fi\fi I was called with [#1] and [#2].} To start, \weird must be called as \weird[stuff][other stuff]. But if its first argument is \relax, then it redefines itself to require being called as \weird!; if its second argument is nothing, it redefines itself to require being called as \weird(stuff); and it always typesets text. This can't be captured by a BNF grammar (I don't think). And of course, in the general case, we need a Turing machine. Share Improve this answer Follow edited Mar 28, 2018 at 22:01 Joel's user avatar Joel 10333 bronze badges answered Oct 16, 2010 at 23:36 Antal Spector-Zabusky's user avatar Antal Spector-ZabuskyAntal Spector-Zabusky 13.2k77 gold badges4747 silver badges4646 bronze badges 11 * 3 Very nice answer! It might be easier to read if you removed the unnecessary comments at the ends of most of the lines. (Only six lines in your lists need comments, one of which is missing one. The others are unnecessary.) - TH. Oct 17, 2010 at 0:27 * 3 It's what makes TeX so much fun to program in. It's always a puzzle! - TH. Oct 17, 2010 at 2:15 * 14 Fantastic answer. This question comes up periodically, and I've never seen such a useful reply. - Will Robertson Oct 17, 2010 at 3:22 * 4 Thank you for this comprehensive answer, I knew nothing about catcode. It seems that parsing tex is not a simple task as I thought before. My final task is to extract all math formulas from a set of articles (in tex), I thought I could parse them to do it, but now I'm not sure about it. What approach could you recommend? I think casual articles do not use catcode, so to extract correct math formulas I may parse tex and process newcommand. The alternative is to try use existing engine to parse and extract the needed information from its internals. - rystsov Oct 17, 2010 at 7:42 * 14 @rystsov: I think the best approach is to make assumptions. Assume that your TeX files use only a certain set of standard packages and certain user-defined macros. Write your parser so that it only tries to understand this particular tiny subset of TeX. Be conservative: whenever your parser sees something that it does not understand, throw an error. The end result is likely something very specific to your own style of using TeX in your articles, but it might be enough for your needs. Most importantly, this allows you to extract "semantic" things (e.g., the parameters of your own macros). - Jukka Suomela Oct 17, 2010 at 8:37 | Show 6 more comments 29 Just to add another angle to the excellent posts by TH. and Antal, is that besides TeX being a Turing complete machine, it can also be coerced to do lambda calculus, the basis of all functional languages. On CTAN there is a package called lambda-lists developed by Alan Jeffrey that implements, among other lambda concepts, unbounded lists. Not only does TeX do all these, but impressively, the routines were implemented in TeX's mouth, proving that TeX's mouth is as powerful as any computer anywhere. To understand the relationship between the lambda calculus and Turing machines one needs to refer to the Entscheidungsproblem which is one of the famous 23 problems proposed by David Hilbert. In 1936 and 1937 Alonzo Church and Alan Turing respectively, published independent papers showing that it is impossible to decide algorithmically whether statements in arithmetic are true or false, and thus a general solution to the Entscheidungsproblem is impossible. This was done by Alonzo Church in 1936 with the concept of "effective calculability" based on his l calculus and by Alan Turing in the same year with his concept of Turing machines. It was later recognized that these are equivalent models of computation. The work of both authors was heavily influenced by Kurt Godel's earlier work on his incompleteness theorem - Wikipedia So the lambda calculus, Turing machines and TeX are not just closely related but they are equivalent models of computation and this is one of many aspects that makes TeX and friends interesting! Share Improve this answer Follow edited Aug 17, 2022 at 13:21 Cole Tobin's user avatar Cole Tobin 54655 silver badges1212 bronze badges answered Oct 17, 2010 at 17:31 yannisl's user avatar yannislyannisl 118k3535 gold badges290290 silver badges563563 bronze badges 9 * 31 For those unfamiliar with the terminology, 'TeX's mouth' refers to the scanning stage before macros and registers can be set (although their contents can be retrieved); i.e, code involving only \ifx, \expandafter, \csname, \the, etc. etc., but not \def, \let, and others. When a macro is said to be 'expandable', this means it executes in TeX's mouth. - Will Robertson Oct 19, 2010 at 2:58 * 1 TeX's eyes look at lines of .tex-input-files and cause TeX to put characters in TeX's mouth.TeX's mouth creates tokens. Expansion of expandable tokens takes place in the gullet (and in the TeXbook is called a regurgitation-process). Assignments are done in the stomach. (See the begin of TeXbook Chapter 24: Summary of Vertical Mode: ... - Ulrich Diez Nov 28, 2021 at 2:10 * 1 ... TeXbook Chapter 24: Summary of Vertical Mode: We shall study TeX's digestive processes, i.e., what TeX does with the lists of tokens that arrive in its "stomach." Chapter 7 has described the process by which input files are converted to lists of tokens in TeX's "mouth," and Chapter 20 explained how expandable tokens are converted to unexpandable ones in TeX's "gullet" by a process similar to regurgitation. When unexpandable tokens finally reach TeX's gastro-intestinal tract, the real activity of typesetting begins, and that is what we are going to survey in these summary chapters.) - Ulrich Diez Nov 28, 2021 at 2:11 * Both the TUGboat-article "Lists in TeX's mouth" TUGboat, vol.11, no.2, pp.237-245 and the entry for the package lambda-lists at CTAN are inaccurate as there it is stated that expansion takes place in TeX's mouth although the TeXbook itself clearly says that expansion takes place in the gullet(, which is supposed to be the thing between mouth and stomach). - Ulrich Diez Nov 28, 2021 at 2:29 * @UlrichDiez To be honest, I never understood why Knuth used the body parts analogies to explain TeX; a more familiar from a CS point of view terminology would have been better. There is a mention of TeX in the Red Dragon book for compilers, which says . ( from memory) ... that the tokens swim in a sea of letters! I will have another look at your observations. - yannisl Nov 28, 2021 at 4:14 | Show 4 more comments 14 It's extremely unlikely. TeX can change how it parses input while it's in the middle of reading the input. I'd have to think a bit longer, but I think that precludes TeX from having a context-free grammar and thus it could have no BNF. Share Improve this answer Follow edited Oct 16, 2010 at 22:49 answered Oct 16, 2010 at 21:13 TH.'s user avatar TH.TH. 62.9k1616 gold badges163163 silver badges221221 bronze badges Add a comment | 5 I can't prove that there is a BNF grammar for Tex but I would like to argue against couple of points made in Antal's answer: 1. There are examples of Turing machine simulators in C/C++/Java on the same website mentioned above. That does not mean you can't write a BNF grammar for C/C++/Java. 2. The fact that Tex lets you change the category code or meaning of command codes does not mean Tex is context sensitive from a parser's point of view. This is related to the semantics of Tex. I think this is analogous lexical scope in languages like Java. You can have two different variables with same name in the same java class where one is a class field and the other one is a local variable. And the type of these two variables can even be different. That does not mean Java parser is context sensitive. Sorry that I have posted this as a separate answer but the site won't let me make this long comment. Share Improve this answer Follow answered Mar 14, 2013 at 7:56 codefx's user avatar codefxcodefx 58744 silver badges88 bronze badges 10 * 7 The difference between TeX and most languages (one might say "sensible languages") is that these sorts of changes can affect the parsing of TeX. (I imagine Perl might have similar problems thanks to source filters and BEGIN blocks?) For instance, \ifx\f\ g\catcode`\\{=12\fi{} is a parse error if and only if \f and \g are equal; and as discussed above, the Turing-completeness of TeX (and of TeX's mouth!) means that the equality of \f and \g is undecidable. Does that help clear things up? - Antal Spector-Zabusky Mar 14, 2013 at 15:34 * 3 @alexraasch: The difference - and this is subtle! - is that in TeX, parsing is Turing-complete, not simply evaluation. In C, we know that a + b * c is syntactically valid as long as a, b, and c are arbitrary syntactically valid expressions. In TeX, the expression \a + \b * \c may trigger a parse error depending on the definitions of \a, \b, and \c. - Antal Spector-Zabusky Jul 29, 2016 at 13:33 * 2 @alexraasch: It does show Turing-completeness, b/c it talks about both - parsing can do evaluation! My use of a \RecognizeABC{...} macro obscures that slightly, I know, b/c we're used to that being a guaranteed-to-parse form that just evaluates; however, that's not what's going on. Rather, it expands to include the string \OneStep...\relax, where \OneStep only parses when followed by an \a, a \b, & a \c (with some intervening junk, and followed by \relax). \OneStep keeps expanding recursively, and eventually either fails to parse or terminates when reaching the empty string. - Antal Spector-Zabusky Jul 29, 2016 at 14:09 * 1 @AntalSpector-Zabusky: Well, you can write a TeX program that contains an infinite loop but it may still be syntactically valid. So I think we need another proof. It will probably be rather complicated because of the parse-eval intermixing, as you said. - alexkelbo Jan 28, 2017 at 18:05 * 1 @alexraasch: But it's not obvious that such a program "may still be syntactically valid". Is the program \def\loop{\loop}\loop}\ bye invalid? How about \def\loop{\loop}\loop\catcode`}=12 }\bye? Or how about \def\x{...}\def\loop{\loop}\loop\ifnum\x=0\catcode`}= 12\fi}\bye? It's not clear what it means to say that a TeX program after an infinite loop even has a well-defined syntactic validity; you can't check such validity without running the program! So I'm leaning towards the conclusion that a TeX program can only be syntactically well-formed if parsing and evaluation terminates. - Antal Spector-Zabusky Jan 28, 2017 at 18:15 | Show 5 more comments 1 You can always examine the original source code for TeX (LaTeX is built on an extension of TeX). It is highly readable (thanks to his proprietary WEB language) and, in fact, compiles to a readable TeX document using weave (built in to most TeX distributions). With regards to parsing, here's a summary of the important bits (from the most disgusting facts to the least disgusting): * While defining a command, anything can be put as parameter delimiters (comments work as is, but the spacing will be messed up). For example, \def\test |\a#1@$^&^TRDFS(#2 |{5#1#25} \test|\a {f}@$^&^TRDFS(f | % This prints 5ff5 TeX matches parameters based only on code points and categories (macro definitions are completely ignored). This is the reason why the top answer's "constructive proof" works out. * ^^@ is synonymous to an ASCII ...in fact, suppose we have ^ ^n where n is an ASCII character. If the code point of n is less than 64, then add 64 and return. Else, subtract 64 and return. Moreover, ^^XX is synonymous to a character with code point 0xXX (in HEX). + It gets worse... \t^^65ste is equal to \tAste. You can even define \tAste using \def\t^^65ste. + Fortunately, this is ignored in parameter delimiters (no comment). * TeX has some context-free components with respect to a different alphabet. If we consider TeX's alphabet as pairs a:c where a is an ASCII character, and c is its (symbolic) category code, then, e.g., TeX numbers (see the scan_int procedure) can be written in EBNF as follows decimal_number = ("+:other_char" | "-:other_char")* digit+ digit = "0:other_char" | ... | "9:other_char" * Erasing semantics, TeX has (E)BNF using only category codes. For example, control_sequence = escape (letter)* This is precisely how a lexer would tokenize TeX. The context sensitivity comes from dynamically taking action after consuming each token. Share Improve this answer Follow edited Oct 7, 2022 at 11:37 answered Dec 11, 2020 at 16:57 jsheaf's user avatar jsheafjsheaf 17111 silver badge33 bronze badges 1 * Well, ^^A is ASCII code point 1, not null, null would be ^^@ - user202729 Jul 23, 2022 at 13:47 Add a comment | You must log in to answer this question. Not the answer you're looking for? Browse other questions tagged * tex-core * parsing * grammar . Linked 13 What type of Chomsky hierarchy is TeX? 2 Is LaTeX a context-free grammar? 0 Can LaTeX-TeX be described with EBNF grammar? 77 Advantages and disadvantages of fully expandable macros 43 Static analysis of LaTeX documents? 21 What parsers for (La)TeX mathematics exist outside of the TeX engines? 22 Are the TeX semantics and grammar defined somewhere in some official documents? 5 Is there a formal grammar for *Equations* in TeX/LaTeX? 7 Automatic Way of Searching for Syntax Inconsistencies 11 The Universal Language Processor in TeX: Theoretical Question Regarding the "Expressive" Power of Tex See more linked questions Related 112 For formal articles, should a displayed equation be followed by a punctuation to conform to the language grammar ? 22 Are the TeX semantics and grammar defined somewhere in some official documents? 1 Is there a BNF-grammar for simple constructs of the TeX language? 5 Is there a formal grammar for *Equations* in TeX/LaTeX? 6 Typesetting a calculus' grammar 4 Combinatory Categorial Grammar style 1 french language, correction, grammar 0 language french grammar in grammar Hot Network Questions * How to make Bash remove quotes after parameter expansion? * Marginalisation with respect to arbitrary distribution * Why do we say "he doesn't know him from Adam"? * Python matrix class * What percentage of light gets scattered by a mirror? * Who owns the moon? * Can LLMs have intention? * How can I obtain a record of my fathers' medals from WW2? * How do I tell which kit lens option is more all-purpose? * How can I hang heavy bikes under a thick wooden shelf? * Does it make sense for giants to use clubs or swords when fighting non-giants? * Is the validity of induction in Coq axiomatic? * Can I travel with my child to the UK if I am not the person named in their visitor's visa? * A question about syntactic function of the clause * Moving after copying in assignment of conditional operator result * Application of Lie group analysis of PDE (beyond calculation of exact solutions) * How do I snap the edges of hex tiles together? * Are your memories part of you? * Why don't professors seem to use learning strategies like spaced repetition and note-taking? * Book recommendation introduction to model theory * How can I tell whether an HDD uses CMR or SMR? * What scientific evidence there is that keeping cooked meat at room temperature is unsafe past two hours? * Could a 200m diameter asteroid be put into a graveyard orbit and not be noticed by people on the ground? * Transformer with same size symbol meaning more hot questions Question feed Subscribe to RSS Question feed To subscribe to this RSS feed, copy and paste this URL into your RSS reader. [https://tex.stackexc] * lang-tex TeX - LaTeX * Tour * Help * Chat * Contact * Feedback Company * Stack Overflow * Teams * Advertising * Collectives * Talent * About * Press * Legal * Privacy Policy * Terms of Service * Cookie Settings * Cookie Policy Stack Exchange Network * Technology * Culture & recreation * Life & arts * Science * Professional * Business * API * Data * Blog * Facebook * Twitter * LinkedIn * Instagram Site design / logo (c) 2024 Stack Exchange Inc; user contributions licensed under CC BY-SA. rev 2024.6.4.10328