Don’t Use ISO/IEC 14977 Extended Backus-Naur Form (EBNF)

David A. Wheeler

2019-03-13 (original 2019-03-02)

If you need to define a language (such as a programming language or complex data structure), it’s often helpful to use some kind of Extended Backus-Naur Form (EBNF). People often do a Google search, find that there’s an ISO/IEC standard (ISO/IEC 14977:1996), and then just use it... without realizing that this very old ISO/IEC standard has a lot of practical problems.

In this essay I will briefly explain the problems of the ISO/IEC 14977:1996 specification, and why I think you should avoid using it. I will also compare it to a common alternative, the EBNF notation from the W3C Extensible Markup Language (XML) 1.0 (Fifth Edition). It is not a disaster to use 14977, but a lot of people use it without realizing that it has a number of problems and that there are alternatives. Obviously these are my personal opinions, which you may not share... but I hope this essay will help you understand why I hold them.

But first: why not obey whatever ISO and IEC say? It may surprise you, but ISO and IEC are not gods from on high. They are just two of many standards-setting bodies, and just because they wrote a specification does not mean you should always use it. After all, ISO developed and promulgated the so-called Open Systems Interconnection (OSI) standards in the 1980s as the one true way to connect networks, and OSI was solidly trounced by the TCP/IP suite standardized through the IETF. Anyone who committed to the ISO-developed OSI standards wasted a lot of money! Similarly, 14977 does not have universal adoption; even ISO does not use 14977 for all the language standards it publishes. I am certainly not anti-ISO or anti-standards, far from it; I am trying to convince you that you should not use something just because it comes from ISO. If you need something from an international standards body, then it’s worth noting that the W3C and IETF are also international standards bodies that have specified different EBNF notations. The W3C one, in particular, is a reasonable alternative, and that is the one I will focus on here (as a point of comparison).

One problem with ISO and IEC is that, unlike modern organizations, they often charge for the IT standards they publish instead of making them freely available (W3C, IETF, and many other organizations do make standards freely available). This is a problem in general because modern systems require tens of thousands of standards (in the broad sense). It is less of a problem in this case, though the situation is not great. Thankfully ISO/IEC 14977:1996 is a publicly-available specification (that means available at no charge). On the other hand, it’s not friction-free: it’s easy to overlook the free version, you have to agree to a license before you can download it, you get a zip file that you have to decompress instead of simply getting the actual specification, and it’s a PDF file that doesn’t properly scale to different screen sizes (instead of clean responsive HTML). Compare this complicated multi-step process with the experience of getting the W3C specification: follow a single link and start reading on any device. But getting the specification is not the primary problem in this case; the problem is using it.

The whole point of an EBNF is to make it possible to describe a grammar clearly, unambiguously, and succinctly. Here I examine the freely-available specification. Here are some of its key weaknesses (as I perceive them):

  1. It is unable to indicate International/Unicode characters, code points, or byte values. ISO/IEC 14977:1996 only supports ISO/IEC 646:1991 characters. It does have a “? ... ?” notation to informally describe a character, but that is not the same as having proper support. Thus, it cannot directly represent the full range of code points allowed by ISO/IEC 10646 / Unicode when processing text, and it’s also inadequate for describing binary formats. What’s worse, it has no way to indicate code points by value. You’d think that in the case of text formats you could quietly violate the standard by inserting Unicode characters surrounded by single or double quotes, but even in that case the specification isn’t adequate. For example, these are all different characters: "-", "‐", "‑", "‒", "–", "—", "―", "−", "﹣", and "－". Those are U+002D (‘HYPHEN-MINUS’), U+2010 (‘HYPHEN’), U+2011 (‘NON-BREAKING HYPHEN’), U+2012 (‘FIGURE DASH’), U+2013 (‘EN DASH’), U+2014 (‘EM DASH’), U+2015 (‘HORIZONTAL BAR’), U+2212 (‘MINUS SIGN’), U+FE63 (‘SMALL HYPHEN-MINUS’), and U+FF0D (‘FULLWIDTH HYPHEN-MINUS’). There is no substitute for the ability to specify code points! Because the standard has no way to specify code points unambiguously, a reader cannot reliably tell which of these look-alike characters a grammar means, and the same omission gets in the way when trying to represent binary formats. In contrast, W3C’s notation easily supports arbitrary code points; just write #xN where N is a hexadecimal number.
  2. It is unable to indicate character ranges. ISO/IEC 14977:1996 has no standard way to indicate character ranges, which are ubiquitous in grammars. In contrast, W3C’s notation easily supports an arbitrary range; just write “[range]”. An example should make it clear why this matters. It’s really common to say that “this character must be an ASCII uppercase letter, lowercase letter, or decimal digit”. In W3C’s notation this is expressed by [a-zA-Z0-9]. Here’s how you do that in ISO/IEC 14977:1996 (all but the last line are straight from section 8.1 of the specification, so this really is the expectation):
    letter
      = 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h'
      | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p'
      | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x'
      | 'y' | 'z'
      | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H'
      | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P'
      | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X'
      | 'Y' | 'Z';
    digit
      = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7'
      | '8' | '9';
    letter or digit = letter | digit;
    
    Clearly expressions like [a-zA-Z0-9] are shorter and clearer. Ranges also make exceptions clearer, e.g., if you omitted the letter O it would be obvious in a range but not obvious in a long list. Ranges also reduce the risk of bugs; if an option is accidentally omitted, the omission might not be noticed in a long collection of alternatives.
  3. It requires a sea of commas. One of the most common operations in a grammar, by far, is concatenation (aka sequencing). ISO/IEC 14977:1996 requires that a comma be used for every concatenation, so any sequence of N symbols will have N-1 commas. This, means, that, every, rule, throughout, the, entire, grammar, is, festooned, with, commas. This doesn’t impact the ability to represent a grammar, but it makes grammars remarkably hard to read, especially if the rules themselves involve commas. Since the whole point of an EBNF notation is to create an easy-to-read grammar definition, almost doubling the number of required syntactic symbols is a real weakness. W3C’s notation uses spaces, eliminating the problem entirely.
  4. It has a bizarre “one or more” notation. Another common operation is to identify that something occurs “one or more” times. Regular expressions (which are widely used and known in the computing community) use the + symbol to represent this, e.g., z+ in POSIX Extended Regular Expressions and Perl-Compatible Regular Expressions means “one or more zs”. In contrast, ISO/IEC 14977:1996 represents “one or more” as { symbol }- which means “0 or more symbols, and then subtract the empty set”. The empty set is represented by no symbols at all (!)... which makes it easy to miss what is going on. This construct is also defect-prone. If that expression is concatenated with something afterwards, you need a comma (if you forget it, the following expression will be subtracted from the former)... but since commas are everywhere, it is easy to not notice where a comma does not occur. Most systems with a null set represent it with a symbol, to make it easy to notice... but not 14977. This whole notation is so counterintuitive that many people aren’t even aware it exists, and they end up repeating themselves to represent the construct (e.g., “foo, bar, baz, {foo, bar, baz}”). It also requires re-explaining to everyone, since far more people know regular expressions than know the quirky ISO/IEC 14977:1996 notation. W3C’s notation supports “one or more” in the computing community’s standard way: just add a “+” suffix.
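
To make points 1 and 2 concrete, here is a minimal Python sketch (Python 3.9 or later assumed). It is my own illustration, not part of either specification, and the names w3c_code_point, expand_char_class, and to_iso_alternatives are hypothetical. It prints the look-alike dashes in the W3C #xN form, and it expands a simple W3C-style range into the list of quoted terminals that 14977 would need. Only plain x-y ranges and single characters are handled; negated classes and #xN escapes inside brackets are deliberately left out.

```python
# Illustrative sketch only; all function names here are hypothetical.

# The ten dash-like characters from point 1, written as escapes so they
# cannot be confused with one another in this listing.
DASHES = [
    "\u002D", "\u2010", "\u2011", "\u2012", "\u2013",
    "\u2014", "\u2015", "\u2212", "\uFE63", "\uFF0D",
]


def w3c_code_point(ch: str) -> str:
    """Return the W3C-style #xN spelling of a single character."""
    return f"#x{ord(ch):X}"


def expand_char_class(char_class: str) -> list[str]:
    """Expand a simple class such as "[a-zA-Z0-9]" into its characters."""
    if not (char_class.startswith("[") and char_class.endswith("]")):
        raise ValueError("expected a bracketed class such as [a-z]")
    body = char_class[1:-1]
    chars = []
    i = 0
    while i < len(body):
        # A range occupies three characters: start, '-', end.
        if i + 2 < len(body) and body[i + 1] == "-":
            start, end = ord(body[i]), ord(body[i + 2])
            if start > end:
                raise ValueError("range start is after range end")
            chars.extend(chr(cp) for cp in range(start, end + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars


def to_iso_alternatives(char_class: str) -> str:
    """Render the class as quoted terminals separated by '|', 14977 style."""
    def quote(c: str) -> str:
        # A single quote must be wrapped in double quotes instead.
        return f'"{c}"' if c == "'" else f"'{c}'"

    return " | ".join(quote(c) for c in expand_char_class(char_class))


if __name__ == "__main__":
    # Ten visually similar dashes, each with its own unambiguous spelling.
    print(" ".join(w3c_code_point(c) for c in DASHES))
    # The nine-character range body stands in for this many terminals.
    print(len(expand_char_class("[a-zA-Z0-9]")), "quoted terminals")
    print(to_iso_alternatives("[0-3]"))
```

The point of the comparison is the size difference: the short range [a-zA-Z0-9] stands in for 62 separately quoted terminals, and any one of them can be dropped by accident without anyone noticing.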

If you must use 14977, at least avoid the alternative representation characters. When the specification was written, one of the concerns was that there were computers and typewriters (!) that did not have some of the characters such as “{” and “}”, so alternatives such as “(:” and “:)” were defined. There is no reason today to use that nonsense.
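
If you inherit a grammar that uses those alternative characters, a small script can rewrite them. The following Python sketch is only an illustration under stated assumptions: the function name normalize_repeat_brackets is hypothetical, it handles only the “(:” and “:)” pair mentioned above, and it protects quoted terminals but does not try to recognize 14977 comments.

```python
import re

# Matches one single- or double-quoted terminal so its contents are left alone.
_TERMINAL = re.compile(r"'[^']*'|\"[^\"]*\"")


def _swap(text: str) -> str:
    """Replace the alternative repeat brackets outside of terminals."""
    return text.replace("(:", "{").replace(":)", "}")


def normalize_repeat_brackets(grammar: str) -> str:
    """Rewrite "(:" and ":)" as "{" and "}", skipping quoted terminals.

    Assumption: comments are not treated specially, so a quote character
    inside a comment could confuse this sketch.
    """
    pieces = []
    pos = 0
    for match in _TERMINAL.finditer(grammar):
        pieces.append(_swap(grammar[pos:match.start()]))  # grammar text
        pieces.append(match.group(0))                     # terminal, untouched
        pos = match.end()
    pieces.append(_swap(grammar[pos:]))
    return "".join(pieces)


if __name__ == "__main__":
    print(normalize_repeat_brackets("list = item, (: ',', item :);"))
    print(normalize_repeat_brackets("smile = ':)';"))
```

Skipping quoted terminals matters because a terminal such as ':)' is part of the language being described, not part of the notation.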

I am not the only person to notice problems with 14977, of course. The paper “BNF was Here: What Have We Done About the Unnecessary Diversity of Notation for Syntactic Definitions” by Vadim Zaytsev (also available from the ACM) has some interesting comments. He argues that one of the most significant problems with reusing grammar knowledge in specifications and manuals is the “diversity of syntactic notations: without loss of generality, we can state that every single language document uses its own notation, which is more often than not, a dialect of the (Extended) Backus-Naur Form.” The paper backs this up with an analysis of “a corpus of 38 programming language standards (ANSI, ISO, IEEE, W3C, etc), 23 grammar containing publications of other kinds (non-endorsed books, scientific papers, manuals) and 8 derivative grammar sources, exhibiting in total 42 syntactic notations while defining 77 grammars (from Algol and C++ to SQL and XPath).” He notes that, “There was an attempt in 1996 to standardize the notation at ISO, but it only ended up adding yet another three dialects to the chaos.” He notes some reasons for the failure of 14977 adoption, and pointedly notes that ISO/IEC 14977 is not even used in all ISO language standards.

In short, while there would be a big advantage to having a single notation, the community of those who write language specifications has generally rejected 14977 for a variety of reasons. You should be aware of that rejection before committing to using it. Yes, it is published by ISO/IEC, but that does not mean that everyone uses it, or even that they should use it.

I bear no ill will to those who developed ISO/IEC 14977:1996. However, I think 14977 has a lot of problems, and there are obvious EBNF alternatives (such as W3C’s) that should normally be used instead. You would be wise to look at alternatives.



(C) Copyright 2019 David A. Wheeler. Released under Creative Commons Attribution-ShareAlike version 3.0 or later (CC-BY-SA-3.0+).