Don’t Use ISO/IEC 14977 Extended Backus-Naur Form (EBNF)

David A. Wheeler

2020-02-18 (original 2019-03-02)

If you need to define a language (such as a programming language or complex data structure) it’s often helpful to use some kind of Extended Backus-Naur form (EBNF). Often people do a Google search, find out that there’s an ISO/IEC standard (ISO/IEC 14977:1996), and then just use it... without realizing that this very old ISO/IEC standard has a lot of problems and should not be used.

In this essay I will briefly explain the problems of the ISO/IEC 14977:1996 specification, and why I think you should avoid using it. I will first explain why just mindlessly obeying ISO is inappropriate, and then discuss the many technical failings of the specification itself. When I discuss its failings I will also compare the 14977 specification to a common alternative, the EBNF notation from the W3C Extensible Markup Language (XML) 1.0 (Fifth Edition). It is not necessarily a disaster to use 14977, but a lot of people use it without realizing that it has a number of very serious problems and that there are much better alternatives available. Obviously these are my personal opinions, which you may not share... but I hope this essay will help you understand why I hold them.

But shouldn't I mindlessly obey ISO and IEC?

But first: why not obey whatever ISO and IEC say? It may surprise you, but ISO and IEC are not gods from on high. They are just two of many standards-setting bodies. Just because they wrote a specification does not mean you should use it.

First, just because ISO or IEC wrote a document doesn't mean people use it. After all, ISO developed and promulgated the so-called Open Systems Interconnect (OSI) standards in the 1980s as the one true way to connect networks, and OSI was solidly trounced by the TCP/IP suite developed by IETF. Anyone who committed to the ISO-developed OSI standards wasted a lot of money!

In this specific case, even ISO does not use 14977 itself for all the language standards it publishes. Since even ISO doesn't always use 14977, there's no reason you need to. The 2011 paper “BNF was Here: What Have We Done About the Unnecessary Diversity of Notation for Syntactic Definitions” by Vadim Zaytsev expressly notes that many ISO specifications do not use 14977, and considers 14977 to be a failure. Unfortunately, most ISO standards are not publicly available (as I'll discuss in a moment). That said, here's a specific example you can check: the Ada programming language standard (published as International Standard ISO/IEC 8652:2012) defines its own BNF format in section 1.1.4. Note that it does not use 14977 notation (for example, it does not use commas for concatenation).

I am certainly not anti-ISO or anti-standards, far from it. I am trying to convince you that you should not use something just because it comes from ISO. If you need something from an international standards body, then it’s worth noting that the W3C and IETF are also international standards bodies that have specified different EBNF notations. The W3C one, in particular, is a reasonable alternative, and that is the one I will focus on here (as a point of comparison).

A broader problem with ISO and IEC is that unlike modern standards-setting organizations, they often charge for the IT standards they publish instead of making them publicly available. By "publicly available" I mean "at no cost"; even ISO uses this terminology. In contrast, the IETF, W3C, and other modern standards-setting organizations always make standards publicly available. These fees are not justifiable today. Distributing documents costs practically nothing, and ISO and IEC do not pay their authors (or the authors' employers), so all the money paid for these standards is exploitative. These fees also greatly impede the use of standards; modern systems require at least tens of thousands of standards (in the broad sense), so while charging for one document (even though the authors get none of it) is unjustified, no one can afford to get them all even if they wanted to. These fees are especially harmful to small businesses and hobbyists, on whom the world depends. In a historical context the fees made sense, because they were necessary to purchase and use printing presses. But today, no one wants that; they want the electronic document, instantly, at no charge. ISO has been exploiting the world through these unjustified fees. I don't object at all to profit; the profit motivation has done great things for society. I object to exploitation; organizations like ISO are charging for work, yet not paying the people who do the work nor making the work available for free.

Of course, many others have made the same observation. On April 4, 2018, user mycl observed that "ISO Prolog (ISO/IEC 13211) doesn't have a free standard and it has hurt the Prolog language immeasurably. In this case the last freely available draft is quite different from the final standard, which makes the situation worse because not everyone is aware of this. I have noticed a lot of Prolog programmers don't know what's in the standard and what's not - you routinely see answers given on SO that are implementation dependent when they could easily have been expressed in strictly conforming ISO Prolog."

These fees exploit a feature of international law. In international law, certain organizations (such as the ISO and IEC) have historically been treated specially. To oversimplify, international law has recognized two kinds of standards (in French, « norme » and « standard »), and the bodies that make the first kind include ISO and IEC, while the bodies that make the second kind include IETF and W3C. This might suggest that ISO should be given more deference, but the reality is that ISO is exploiting this position to make unjustified profits. That reality undercuts any justification ISO might have for deference.

Since ISO has been exploiting its position, I see no reason to give ISO special deference. I feel sad about this; I think that ISO is an important organization that has lost its way. ISO has done some good work! I will continue to use ISO specifications where they are good, and I will even work with ISO where appropriate. More generally, I do think that it's important to have standards from organizations like ISO and IETF.

That said, I think it's important to realize that ISO must change its behavior if it wants to stay relevant in the long run. ISO needs to rediscover its actual purpose: to develop and encourage the use of international standards. If you can find a way to encourage ISO to update its practices and join the modern world, before it finally slips into irrelevance, that'd be great. I want to see a successful ISO in the long run, and I think its current policies are inappropriate for the modern world. But until ISO fixes its policies to focus on making standards instead of making profits, it is appropriate to be a little wary of ISO standards.

This problem of charging unjustified fees is less of a problem in this case, though the situation is not great. Thankfully ISO/IEC 14977:1996 is one of the few publicly available ISO specifications (that means available at no charge). I find it bizarre that ISO thinks it's acceptable to have any standard it develops not be publicly available, obviously! On the other hand, it’s not friction-free. When I last tried, it was easy to overlook the free version, you had to agree to a license before you could download it, you got a zip file that you had to decompress instead of simply getting the actual specification, and it was a PDF file that didn’t properly scale to different screen sizes (instead of clean responsive HTML or at least reflowing PDF). Compare this complicated multi-step process with the experience of getting the much-better W3C specification for the same job: open it and start reading on any device.

Hopefully I have convinced you that mindlessly obeying ISO is completely inappropriate. But getting the specification is not the primary problem in this case. The problem is using it. The specification is terrible, and far better options are available.

The specification itself has serious problems

The whole point of an EBNF is to make it possible to describe a grammar clearly, unambiguously, and succinctly. Here I examine the freely-available specification. Here are some of its key weaknesses (as I perceive them):

  1. It is unable to indicate International/Unicode characters, code points, or byte values. ISO/IEC 14977:1996 only supports ISO/IEC 646:1991 characters. It does have a “? ... ?” notation to informally describe a character, but that is not the same as having proper support. Thus, it cannot directly represent the full range of code points allowed by ISO/IEC 10646 / Unicode when processing text, and it’s also inadequate for describing binary formats. What’s worse, it has no way to indicate code points by value. You’d think that in the case of text formats you could quietly violate the standard by inserting Unicode characters surrounded by single or double quotes, but even in that case the specification isn’t adequate. For example, these are all different characters: "-", "‐", "‑", "‒", "–", "—", "―", "−", "﹣", and "－". Those are U+002D (‘HYPHEN-MINUS’), U+2010 (‘HYPHEN’), U+2011 (‘NON-BREAKING HYPHEN’), U+2012 (‘FIGURE DASH’), U+2013 (‘EN DASH’), U+2014 (‘EM DASH’), U+2015 (‘HORIZONTAL BAR’), U+2212 (‘MINUS SIGN’), U+FE63 (‘SMALL HYPHEN-MINUS’), U+FF0D (‘FULLWIDTH HYPHEN-MINUS’). There is no substitute for the ability to specify code points! Since there is no way in the standard to unambiguously specify code points, this is a problem. This omission is also a problem when trying to represent binary formats. In contrast, W3C’s notation easily supports arbitrary code points; just write #xN where N is a hexadecimal number.

    Oh, and a quick aside. Technically ISO/IEC 10646 and Unicode are not exactly the same specification, since they come from different organizations. In most ways the distinctions are irrelevant; the character codes and encoding forms are (intentionally) synchronized between Unicode and ISO/IEC 10646, for which everyone is grateful. The ISO/IEC 10646 specification is publicly available, probably due to competition from the Unicode consortium. After all, the Unicode consortium is a modern standards organization that publicly releases its specification, and people would probably always ignore ISO/IEC 10646 if it wasn't publicly available. That said, you should normally use Unicode, not ISO/IEC 10646, as your specification. The "Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications".
  2. It is unable to indicate character ranges. ISO/IEC 14977:1996 has no standard way to indicate character ranges, which are ubiquitous in grammars. In contrast, W3C’s notation easily supports an arbitrary range; just write “[range]”. An example should make it clear why this matters. It’s really common to say that “this character must be an ASCII uppercase letter, lowercase letter, or decimal digit”. In W3C’s notation this is expressed by [a-zA-Z0-9]. Here’s how you do that in ISO/IEC 14977:1996 (all but the last line are straight from section 8.1 of the specification, so this really is the expectation):
    letter
      = 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h'
      | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p'
      | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x'
      | 'y' | 'z'
      | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H'
      | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P'
      | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X'
      | 'Y' | 'Z';
    digit
      = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7'
      | '8' | '9';
    letter or digit = letter | digit;
    
    Clearly expressions like [a-zA-Z0-9] are shorter and clearer. Ranges also make exceptions clearer, e.g., if you omitted the letter O it would be obvious in a range but not obvious in a long list. Ranges also reduce the risk of bugs; if an option is accidentally omitted, the omission might not be noticed in a long collection of alternatives.
  3. It requires a sea of commas, so using it produces hard-to-read grammars. One of the most common operations in a grammar, by far, is concatenation (aka sequencing). ISO/IEC 14977:1996 requires that a comma be used for every concatenation, so any sequence of N symbols will have N-1 commas. This, means, that, every, rule, throughout, the, entire, grammar, is, festooned, with, commas. This doesn’t impact the ability to represent a grammar, but it makes grammars remarkably hard to read, especially if the rules themselves involve commas. Since the whole point of an EBNF notation is to create an easy-to-read grammar definition, almost doubling the number of required syntactic symbols is a serious mistake. W3C’s notation uses spaces, eliminating the problem entirely.
  4. It has a bizarre, difficult-to-understand, and easily-misunderstood “one or more” notation. Another common operation is to identify that something occurs “one or more” times. Regular expressions (which are widely used and known in the computing community) use the + symbol to represent this, e.g., z+ in POSIX Extended Regular Expressions and Perl-Compatible Regular Expressions means “one or more zs”. In contrast, ISO/IEC 14977:1996 represents “one or more” as { symbol }- which means “0 or more symbols, and then subtract the empty set”. The empty set is represented by no symbols at all (!)... which makes it easy to miss what is going on. This construct is also defect-prone. If that expression is concatenated with something afterwards, you need a comma (if you forget it then the following expression will be subtracted from the former)... but since commas are everywhere, it is easy to not notice where a comma does not occur. Most systems with a null set represent it with a symbol, to make it easy to notice... but not 14977. This whole notation is so counterintuitive that many people aren’t even aware it exists, and they end up repeating themselves to represent the construct (e.g., “foo, bar, baz {foo, bar, baz}”). It also requires re-explaining to everyone, since far more people know regular expressions than know the quirky ISO/IEC 14977:1996 notation. W3C’s notation supports “one or more” in the computing community’s standard way: just add a “+” suffix.
  5. It is challenging to understand and many key terms are undefined. I think many people find the specification challenging to understand, and that is not a good property of a specification. It is abstract, and that may be partly necessary given the subject matter. But it has a number of terms and definitions that seem unintuitive to me, and there are no definitions of key basic terms like character, sign, or symbol. Compare its text with the W3C specification, which is much easier to understand.
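As a rough illustration of the first two points, here is a minimal Python sketch (Python and its standard `re`, `string`, and `unicodedata` modules are my choice for illustration, not something either specification prescribes). It lists the ten dash-like characters from item 1 by code point and name, and checks that a range-based class such as [a-zA-Z0-9] accepts the same characters as a one-by-one enumeration like the 14977 letter and digit rules. The helper names `describe` and `same_language` are hypothetical.

```python
import re
import string
import unicodedata

# The ten dash-like characters from item 1, written as escapes so that no
# font or copy-and-paste step can silently swap one for another.
DASHES = "\u002d\u2010\u2011\u2012\u2013\u2014\u2015\u2212\ufe63\uff0d"


def describe(chars):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]


# Item 2: a range-based character class, in the spirit of [a-zA-Z0-9] ...
RANGE_CLASS = re.compile(r"[a-zA-Z0-9]")

# ... versus the same set spelled out one alternative at a time, which is
# roughly what the enumerated letter/digit rules amount to.
ENUMERATED = re.compile("|".join(string.ascii_letters + string.digits))


def same_language(limit=0x250):
    """Check that both patterns accept the same single characters below limit."""
    return all(
        bool(RANGE_CLASS.fullmatch(chr(cp))) == bool(ENUMERATED.fullmatch(chr(cp)))
        for cp in range(limit)
    )


if __name__ == "__main__":
    for code_point, name in describe(DASHES):
        print(code_point, name)
    print("range and enumeration agree:", same_language())
```

The two patterns describe the same set, but only the range form makes an omitted character easy to spot, and only the escaped code points make it clear which dash is meant.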

If you must use 14977, at least avoid the alternative representation characters. When the specification was written, one of the concerns was that there were computers and typewriters (!) that did not have some of the characters such as “{” and “}”, so alternatives such as “(:” and “:)” were defined. There is no reason today to use that nonsense.

But when the primary advantage of your specification is that it can handle old typewriters, perhaps that should not be your preferred spec. The weaknesses of this specification far outweigh its advantages. It is widely perceived as a failure, but because it still exists, people occasionally make the mistake of trying to use it.

Conclusions

I am not the only person to notice problems with 14977, of course. The paper “BNF was Here: What Have We Done About the Unnecessary Diversity of Notation for Syntactic Definitions” by Vadim Zaytsev (also available from the ACM) has some interesting comments. He argues that one of the most significant problems with reusing grammar knowledge in specifications and manuals is the “diversity of syntactic notations: without loss of generality, we can state that every single language document uses its own notation, which is more often than not, a dialect of the (Extended) Backus-Naur Form.” The paper backs this up with an analysis of “a corpus of 38 programming language standards (ANSI, ISO, IEEE, W3C, etc), 23 grammar containing publications of other kinds (non-endorsed books, scientific papers, manuals) and 8 derivative grammar sources, exhibiting in total 42 syntactic notations while defining 77 grammars (from Algol and C++ to SQL and XPath).” He notes that, “There was an attempt in 1996 to standardize the notation at ISO, but it only ended up adding yet another three dialects to the chaos.” He notes some reasons for the failure of 14977 adoption, and pointedly notes that ISO/IEC 14977 is not even used in all ISO language standards. ISO/IEC 14977 has unintentionally become a perfect demonstration of the XKCD cartoon "Standards".

In short, while there would be a big advantage to having a single notation, the community of those who write language specifications has generally rejected ISO/IEC 14977 for a variety of reasons. You should be aware of that rejection before committing to using it. Yes, it is published by ISO/IEC, but that does not mean that everyone uses it - or even that they should use it.

I bear no ill will to those who developed ISO/IEC 14977:1996. However, I think 14977 has a lot of problems, and there are obvious EBNF alternatives (such as W3C’s) that should normally be used instead. You would be wise to look at alternatives.


Feel free to see my home page at https://dwheeler.com. You may also want to look at my paper Why OSS/FS? Look at the Numbers! and my book on how to develop secure programs.

(C) Copyright 2019-2020 David A. Wheeler. Released under Creative Commons Attribution-ShareAlike version 3.0 or later (CC-BY-SA-3.0+).