Don’t Use ISO/IEC 14977:1996 Extended Backus-Naur Form (EBNF)

David A. Wheeler

2023-03-21 (original 2019-03-02)

If you need to define a language (such as a programming language or complex data structure), it’s often helpful to use some kind of Extended Backus-Naur Form (EBNF). Often people do a Google search, find out that there’s an ISO/IEC standard (ISO/IEC 14977:1996), and then just use it... without realizing that this very old ISO/IEC standard has a lot of problems and should not be used.

In this essay I will briefly explain the problems of the ISO/IEC 14977:1996 specification, and why I think you should avoid using it. I will first discuss the many technical failings of the specification itself, then follow that with why just mindlessly obeying ISO is inappropriate (since some people may do that). When I discuss its failings I will also compare the 14977 specification to a common alternative, the EBNF notation from the W3C Extensible Markup Language (XML) 1.0 (Fifth Edition). I'll also briefly mention IETF's RFC 5234, "Augmented BNF for Syntax Specifications: ABNF". IETF's specification isn't nearly as bad as 14977, though I don't recommend using RFC 5234 outside of RFC specifications. My focus in this essay is the problems of 14977. It is not necessarily a disaster to use 14977, but a lot of people use 14977 without realizing that it has a number of very serious problems and that there are much better alternatives available. Obviously these are my personal opinions, which you may not share... but I hope this essay will help you understand why I hold them, and perhaps convince you.

The specification itself has serious problems

The whole point of an EBNF is to make it possible to describe a grammar clearly, unambiguously, and succinctly. Here I examine the freely-available specification. Here are some of its key weaknesses (as I perceive them):

It is unable to indicate International/Unicode characters, code points, or byte values. ISO/IEC 14977:1996 only supports ISO/IEC 646:1991 characters. The 14977:1996 specification does have a “? ... ?” notation to informally describe a character, but that is not the same as having proper support. Thus, it cannot directly represent the full range of code points allowed by ISO/IEC 10646 / Unicode when processing text, and it’s also inadequate for describing binary formats. What’s worse, it has no way to indicate code points by value. You’d think that in the case of text formats you could quietly violate the standard by inserting Unicode characters surrounded by single or double quotes, but even in that case the specification isn’t adequate. There is no substitute for the ability to specify code points. Imagine trying to distinguish these values without code point values: "-", "‐", "‑", "‒", "–", "—", "―", "−", "﹣", and "－". Those are U+002D (‘HYPHEN-MINUS’), U+2010 (‘HYPHEN’), U+2011 (‘NON-BREAKING HYPHEN’), U+2012 (‘FIGURE DASH’), U+2013 (‘EN DASH’), U+2014 (‘EM DASH’), U+2015 (‘HORIZONTAL BAR’), U+2212 (‘MINUS SIGN’), U+FE63 (‘SMALL HYPHEN-MINUS’), U+FF0D (‘FULLWIDTH HYPHEN-MINUS’). Since there is no way in the standard to unambiguously specify code points, this is a problem. This omission is also a problem when trying to represent binary formats. In contrast, W3C’s notation easily supports arbitrary code points; just write #xN where N is a hexadecimal number.
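
To make the look-alike problem concrete, here is a minimal Python sketch (my own illustration, not part of either specification). It prints each of the ten characters above in the W3C #xN form next to its Unicode name. The helper names `describe` and `w3c_char` are hypothetical, chosen only for this example.

```python
import unicodedata

# The ten look-alike characters listed above, written as escapes so that
# the source code itself is unambiguous.
DASHES = "\u002D\u2010\u2011\u2012\u2013\u2014\u2015\u2212\uFE63\uFF0D"


def describe(ch: str) -> str:
    """Return a label such as 'U+2013 EN DASH' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def w3c_char(ch: str) -> str:
    """Render a character in the W3C '#xN' form, e.g. '#x2013'."""
    return f"#x{ord(ch):X}"


for ch in DASHES:
    print(w3c_char(ch), describe(ch))
```

On screen the ten glyphs are nearly indistinguishable, but their code point values are not, which is why a grammar notation needs a way to write those values down.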

Oh, and a quick aside. Technically ISO/IEC 10646 and Unicode are not exactly the same specification, since they come from different organizations. In most ways the distinctions are irrelevant; the character codes and encoding forms are (intentionally) synchronized between Unicode and ISO/IEC 10646, for which everyone is grateful. The ISO/IEC 10646 specification is publicly available, probably due to competition from the Unicode consortium. After all, the Unicode consortium is a modern standards organization that publicly releases its specification, and people would probably always ignore ISO/IEC 10646 if it wasn't publicly available. That said, you should normally use Unicode, not ISO/IEC 10646, as your specification, because the "Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications".
It is unable to indicate character ranges. ISO/IEC 14977:1996 has no standard way to indicate character ranges, which are ubiquitous in grammars. In contrast, W3C’s notation easily supports an arbitrary range; just write “[range]”. An example should make it clear why this matters. It’s really common to say that “this character must be an ASCII uppercase letter, lowercase letter, or decimal digit”. In W3C’s notation this is expressed by [a-zA-Z0-9]. Here’s how you do that in ISO/IEC 14977:1996 (all but the last line are straight from section 8.1 of the specification, so this really is the expectation):
```
letter
  = 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h'
  | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p'
  | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x'
  | 'y' | 'z'
  | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H'
  | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P'
  | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X'
  | 'Y' | 'Z';
digit
  = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7'
  | '8' | '9';
letter or digit = letter | digit;
```
Clearly expressions like [a-zA-Z0-9] are shorter and clearer. Ranges also make exceptions clearer, e.g., if you omitted the letter O it would be obvious in a range but not obvious in a long list. Ranges also reduce the risk of bugs; if an option is accidentally omitted, the omission might not be noticed in a long collection of alternatives.
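Here is a small Python sketch of the omission risk, under the assumption that a regex character class is a fair stand-in for a W3C range (the two share the [a-zA-Z0-9] spelling). The function `missing_from` is a hypothetical helper that reports which characters an enumerated list leaves out.

```python
import re
import string

# W3C-style range, which is also a valid regex character class.
RANGE = re.compile(r"[a-zA-Z0-9]")

# The same set spelled out one alternative at a time, as 14977 requires.
ENUMERATED = set(string.ascii_lowercase + string.ascii_uppercase + string.digits)


def missing_from(alternatives: set[str]) -> set[str]:
    """Return ASCII characters the range accepts but the list leaves out."""
    ascii_chars = (chr(i) for i in range(128))
    return {c for c in ascii_chars if RANGE.fullmatch(c) and c not in alternatives}


# Simulate the slip discussed above: the letter 'O' is dropped from the list.
buggy = ENUMERATED - {"O"}
print(missing_from(ENUMERATED))  # set()
print(missing_from(buggy))       # {'O'}
```

A program can find the missing alternative, but a human reviewer reading 62 quoted characters probably will not.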
It requires a sea of commas, so using it produces hard-to-read grammars. One of the most common operations in a grammar, by far, is concatenation (aka sequencing). ISO/IEC 14977:1996 requires that a comma be used for every concatenation, so any sequence of N symbols will have N-1 commas. This, means, that, every, rule, throughout, the, entire, grammar, is, festooned, with, commas. This doesn’t impact the ability to represent a grammar, but it makes grammars remarkably hard to read, especially if the rules themselves involve commas. Since the whole point of an EBNF notation is to create an easy-to-read grammar definition, almost doubling the number of required syntactic symbols is a serious mistake. W3C’s notation uses spaces, eliminating the problem entirely.
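As a rough illustration of the N-1 commas, this Python sketch renders the same sequence both ways. The rule content (an if/then/else statement) is made up for the example and does not come from either specification.

```python
# Hypothetical rule body: the symbol names are invented for illustration.
symbols = ["'if'", "condition", "'then'", "statement", "'else'", "statement"]


def iso_14977_sequence(parts: list[str]) -> str:
    """Join symbols the 14977 way: a comma between every adjacent pair."""
    return ", ".join(parts)


def w3c_sequence(parts: list[str]) -> str:
    """Join symbols the W3C way: whitespace alone means concatenation."""
    return " ".join(parts)


iso = iso_14977_sequence(symbols)
w3c = w3c_sequence(symbols)
print(iso)
print(w3c)
print(iso.count(","), "commas for", len(symbols), "symbols")
```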
It does not build on widely-used regex notation. The easiest language to learn and use is one that's very similar to one you already know. The vast majority of software developers today know regular expressions. Regular expressions (regexes) are built into the syntax of many programming languages including JavaScript, Ruby, and Perl. The Python programming language technically doesn't have regexes built into its syntax, but it has special string syntax designed for them, and its built-in library supports them. Regexes are widely used for input validation and many other purposes. POSIX standardizes extended regular expressions (EREs). In POSIX EREs and the regexes built into many programming languages, an atom can be followed by a count ("*" for 0 or more, "+" for 1 or more, and "?" for 0 or 1). Yet ISO 14977 and IETF's ABNF format (RFC 5234) use a different syntax that is unnecessarily incompatible with the syntax widely used by software developers. There's no good justification for using a syntax that's different from what developers use every day.
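For readers who want to see the three suffixes side by side, here is a minimal Python sketch. The token it recognizes (an optional minus sign, one or more digits, zero or more trailing underscores) is hypothetical and exists only to exercise "?", "+", and "*". In W3C notation the same rule could be written as token ::= '-'? [0-9]+ '_'*, which is nearly character-for-character the regex below.

```python
import re

# Hypothetical token: optional sign, one or more digits, then zero or more
# trailing underscores. It exists only to exercise "?", "+" and "*".
TOKEN = re.compile(r"-?[0-9]+_*")


def is_token(text: str) -> bool:
    """Return True when the whole string matches the token pattern."""
    return TOKEN.fullmatch(text) is not None


for sample in ["7", "-42", "42__", "-", "", "4-2"]:
    print(repr(sample), is_token(sample))
```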
It has a bizarre, difficult-to-understand, and easily-misunderstood “one or more” notation. Another common operation is to identify that something occurs “one or more” times. Regular expressions (which are widely used and known in the computing community) use the + symbol to represent this, e.g., z+ in POSIX Extended Regular Expressions and Perl-Compatible Regular Expressions means “one or more zs”. In contrast, ISO/IEC 14977:1996 represents “one or more” as { symbol }- which means “0 or more symbols, and then subtract the empty set”. The empty set is represented by no symbols at all (!)... which makes it easy to miss what is going on. This construct is also defect-prone. If that expression is concatenated with something afterwards, you need a comma (if you forget it then the following expression will be subtracted from the former)... but since commas are everywhere, it is easy to not notice where a comma does not occur. Most systems with a null set represent it with a symbol, to make it easy to notice... but not 14977. This whole notation is so counterintuitive that grammar authors often don't use it when they should (perhaps they are afraid it will be misunderstood, or aren’t even aware it exists). As a result they end up repeating themselves to represent this common construct (e.g., “foo, bar, baz, {foo, bar, baz}”). This bizarre construct also requires re-explaining to everyone, since far more people know regular expressions than know this quirky ISO/IEC 14977:1996 notation. W3C’s notation supports “one or more” in the computing community’s standard way that most software developers already know: just add a “+” suffix.
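The following Python sketch models the two spellings and checks that they accept the same strings. It is only a model: regexes stand in for grammar rules, and `iso_one_or_more` is my own hypothetical reading of { 'z' }- as “zero or more, except the empty sequence”.

```python
import re
from itertools import product

ZERO_OR_MORE = re.compile(r"z*")  # plays the role of 14977's { 'z' }
ONE_OR_MORE = re.compile(r"z+")   # the regex / W3C spelling


def iso_one_or_more(text: str) -> bool:
    """Model { 'z' }- : zero or more z's, except the empty sequence."""
    return ZERO_OR_MORE.fullmatch(text) is not None and text != ""


def regex_one_or_more(text: str) -> bool:
    """Model z+ : one or more z's."""
    return ONE_OR_MORE.fullmatch(text) is not None


# Every string over the alphabet {y, z} up to length 4, including "".
candidates = ["".join(p) for n in range(5) for p in product("yz", repeat=n)]
assert all(iso_one_or_more(s) == regex_one_or_more(s) for s in candidates)
print(sorted(s for s in candidates if regex_one_or_more(s)))
# ['z', 'zz', 'zzz', 'zzzz']
```

Both forms describe the same language; the difference is that the “+” version says so in one familiar character, while the 14977 version hides the key step in an invisible empty operand.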
It is challenging to understand and many key terms are undefined. I think many people find the specification challenging to understand, and that is not a good property of a specification. It is abstract, and that may be partly necessary given the subject matter. But it has a number of terms and definitions that seem unintuitive to me, and there are no definitions of key basic terms like character, sign, or symbol. Compare its text with the W3C specification, which is much easier to understand.

If you must use 14977, at least avoid the alternative representation characters. When the specification was written one of the concerns was that there were computers and typewriters (!) that did not have some of the characters such as “{” and “}”, so alternatives such as “(:” and “:)” were defined. There is no reason today to use that nonsense.

But shouldn't I mindlessly obey ISO and IEC?

For many the technical problems are enough. But others may think they should obey whatever ISO and IEC say. It may surprise you, but ISO and IEC are not gods from on high. They are just two of many standards-setting bodies. Just because they wrote a specification does not mean you should use it.

First, just because ISO or IEC wrote a document doesn't mean people use it. After all, ISO developed and promulgated the so-called Open Systems Interconnect (OSI) standards in the 1980s as the one true way to connect networks, and OSI was solidly trounced by the TCP/IP suite developed by IETF. Anyone who committed to the ISO-developed OSI standards wasted a lot of money!

In this specific case, even ISO does not use 14977 itself for all the language standards it publishes. Since even ISO doesn't always use 14977, there's no reason you need to. The 2011 paper “BNF was Here: What Have We Done About the Unnecessary Diversity of Notation for Syntactic Definitions” by Vadim Zaytsev expressly notes that many ISO specifications do not use 14977, and considers 14977 to be a failure. Unfortunately, most ISO standards are not publicly available (as I'll discuss in a moment), so a full survey would be too expensive. That said, here's a specific example you can check: the Ada programming language standard (published as International Standard ISO/IEC 8652:2012) defines its own BNF format in section 1.1.4. Note that it does not use 14977 notation (for example, it does not use commas for concatenation).

I am certainly not anti-ISO or anti-standards, far from it. I am trying to convince you that you should not use something just because it comes from ISO. If you need something from an international standards body, then it’s worth noting that the W3C and IETF are also international standards bodies that have specified different EBNF notations. The W3C one, in particular, is a reasonable alternative, and that is the one I will focus on here (as a point of comparison).

A broader problem with ISO and IEC is that unlike modern standards-setting organizations, they often charge for the IT standards they publish instead of making them publicly available. By "publicly available" I mean "at no cost"; even ISO uses this terminology. In contrast, the IETF, W3C, and other modern standards-setting organizations always make standards publicly available. These fees are not justifiable today. Distributing documents costs practically nothing, and ISO and IEC do not pay their authors (or the authors' employers), so all the money paid for these standards is exploitative. These fees also greatly impede the use of standards. Modern systems rely on at least tens of thousands of standards (in the broad sense), so even setting aside that charging for any one document is unjustified (the authors get none of the money), no one could afford to buy them all even if they wanted to. These fees are especially harmful to small businesses and hobbyists, and the world depends on both. In a historical context the fees made sense, because they were necessary to purchase and use printing presses. But today, no one wants that; they want the electronic document, instantly, at no charge. I don't object at all to profit; the profit motivation has done great things for society. I object to exploitation; in some cases ISO is charging for work, yet not paying the people who do the work nor making the work available for free.

Of course, many others have made the same observation. On April 4, 2018, user mycl observed that "ISO Prolog (ISO/IEC 13211) doesn't have a free standard and it has hurt the Prolog language immeasurably. In this case the last freely available draft is quite different from the final standard, which makes the situation worse because not everyone is aware of this. I have noticed a lot of Prolog programmers don't know what's in the standard and what's not - you routinely see answers given on SO that are implementation dependent when they could easily have been expressed in strictly conforming ISO Prolog."

I feel sad about this; I think that ISO is an important organization that has lost its way. ISO has done some good work! I will continue to use ISO specifications where they are good, and I will work with ISO where appropriate. In particular, I'm delighted to work with ISO when the result will be a publicly available standard. More generally, I do think that it's important to have international standards. I think ISO needs to develop and encourage the use of international standards, not focus on charging for work done by others. If you can find a way to encourage ISO to update its practices and join the modern world, that'd be great. I want to see a successful ISO in the long run, and I think its current policies are inappropriate for the modern world.

This problem of charging unjustified fees is less of a problem in this case, though the situation is not great. Thankfully ISO/IEC 14977:1996 is one of the few publicly available ISO specifications (that means available at no charge). Obviously, I find it bizarre that ISO thinks it's acceptable to have any standard it develops not be publicly available! On the other hand, getting it is not friction-free. When I last tried, it was easy to overlook the free version, you had to agree to a license before you could download it, you got a zip file that you had to decompress instead of simply getting the actual specification, and it was a PDF file that didn’t properly scale to different screen sizes (instead of clean responsive HTML or at least reflowing PDF). Compare this complicated multi-step process with the experience of getting the much-better W3C specification for the same job: follow one link and start reading on any device.

Hopefully I have convinced you that mindlessly obeying ISO is completely inappropriate. But getting the specification is not the primary problem in this case. The problem is using it. The specification is terrible, and far better options are available.

Conclusions

When the primary advantage of your specification is that it can be written using typewriters, perhaps that should not be your preferred spec. The weaknesses of this specification far outweigh its advantages. It is widely perceived as a failure, as it is often not used (even by the organization that created it), but because it still exists, people occasionally make the mistake of trying to use it.

I am not the only person to notice problems with 14977, of course. The paper “BNF was Here: What Have We Done About the Unnecessary Diversity of Notation for Syntactic Definitions” by Vadim Zaytsev (also available from the ACM) has some interesting comments. He argues that one of the most significant problems with reusing grammar knowledge in specifications and manuals is the “diversity of syntactic notations: without loss of generality, we can state that every single language document uses its own notation, which is more often than not, a dialect of the (Extended) Backus-Naur Form.” The paper backs this up with an analysis of “a corpus of 38 programming language standards (ANSI, ISO, IEEE, W3C, etc), 23 grammar containing publications of other kinds (non-endorsed books, scientific papers, manuals) and 8 derivative grammar sources, exhibiting in total 42 syntactic notations while defining 77 grammars (from Algol and C++ to SQL and XPath).” He notes that, “There was an attempt in 1996 to standardize the notation at ISO, but it only ended up adding yet another three dialects to the chaos.” He notes some reasons for the failure of 14977 adoption, and pointedly notes that ISO/IEC 14977 is not even used in all ISO language standards. ISO/IEC 14977 has unintentionally become a perfect demonstration of the XKCD cartoon "Standards".

In short, while there would be a big advantage to having a single notation, the community of those who write language specifications have generally rejected ISO 14977 for a variety of reasons. You should be aware of that rejection before committing to using it. Yes, it is published by ISO/IEC, but that does not mean that everyone uses it - or even that they should use it.

I bear no ill will to those who developed ISO/IEC 14977:1996. However, I think 14977 has a lot of problems, and there are obvious EBNF alternatives that should normally be used instead. One of those alternative specifications is in the W3C Extensible Markup Language (XML) 1.0 (Fifth Edition). The W3C specification is much more similar to typical regex syntax (making it much easier for today's software developers to understand), avoids the key problems of 14977:1996, and is already clearly described. More generally, you would be wise to avoid 14977:1996.

Feel free to see my home page at https://dwheeler.com. You may also want to look at my paper Why OSS/FS? Look at the Numbers! and my book on how to develop secure programs.

(C) Copyright David A. Wheeler. Released under Creative Commons Attribution-ShareAlike version 3.0 or later (CC-BY-SA-3.0+).