If you need to define a language (such as a programming language or complex data structure) it’s often helpful to use some kind of Extended Backus-Naur Form (EBNF). Often people do a Google search, find out that there’s an ISO/IEC standard (ISO/IEC 14977:1996), and then just use it... without realizing that this very old ISO/IEC standard has a lot of problems and should not be used.
In this essay I will briefly explain the problems of the ISO/IEC 14977:1996 specification, and why I think you should avoid using it. I will first discuss the many technical failings of the specification itself, then follow that with why just mindlessly obeying ISO is inappropriate (since some people may do that). When I discuss its failings I will also compare the 14977 specification to a common alternative, the EBNF notation from the W3C Extensible Markup Language (XML) 1.0 (Fifth Edition). I'll also briefly mention IETF's RFC 5234, "Augmented BNF for Syntax Specifications: ABNF". IETF's specification isn't nearly as bad as 14977, though I don't recommend using RFC 5234 outside of RFC specifications. My focus in this essay is the problems of 14977. It is not necessarily a disaster to use 14977, but a lot of people use 14977 without realizing that it has a number of very serious problems and that there are much better alternatives available. Obviously these are my personal opinions, which you may not share... but I hope this essay will help you understand why I hold them, and perhaps convince you.
The whole point of an EBNF is to make it possible to describe a grammar clearly, unambiguously, and succinctly. Here I examine the freely-available specification and list some of its key weaknesses (as I perceive them):
For example, 14977 has no notation for character ranges, so a rule for letters and digits has to list every character:

letter = 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z'
       | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z';
digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';
letter or digit = letter | digit;

Clearly expressions like [a-zA-Z0-9] are shorter and clearer. Ranges also make exceptions clearer, e.g., if you omitted the letter O it would be obvious in a range but not obvious in a long list. Ranges also reduce the risk of bugs; if an option is accidentally omitted, the omission might not be noticed in a long collection of alternatives.
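To make the point about accidental omissions concrete, here is a minimal Python sketch. It is only an illustration: the names ENUMERATED, RANGE, accepted_ascii, and missing_from_enumeration are made up for this example, and the enumeration deliberately leaves out one capital letter to mimic a mistake in a long list of alternatives.

```python
import re

# Characters accepted by a long enumeration of alternatives (one
# alternative per character). One capital letter is deliberately left
# out to mimic an accidental omission; try to spot it by eye.
ENUMERATED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNPQRSTUVWXYZ"
    "0123456789"
)

# The same intent written as a range-based character class.
RANGE = re.compile(r"[a-zA-Z0-9]")


def accepted_ascii(pattern):
    """Return the set of single ASCII characters the pattern accepts."""
    return {chr(c) for c in range(128) if pattern.fullmatch(chr(c))}


def missing_from_enumeration(enumerated, pattern):
    """List characters the range accepts but the enumeration forgot."""
    return sorted(accepted_ascii(pattern) - enumerated)


print(missing_from_enumeration(ENUMERATED, RANGE))  # ['O']
```

The omission is hard to see in the enumerated strings, while an intentional exception in range form, such as [a-zA-NP-Z0-9], announces itself.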
If you must use 14977, at least avoid the alternative representation characters. When the specification was written one of the concerns was that there were computers and typewriters (!) that did not have some of the characters such as “{” and “}”, so alternatives such as “(:” and “:)” were defined. There is no reason today to use that nonsense.
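If you inherit a grammar that uses these alternative representations, a small script can rewrite them. The following Python sketch is an illustration under stated assumptions, not a full 14977 tokenizer: the function name replace_alt_brackets is hypothetical, it handles only the “(:” and “:)” pair mentioned above, it assumes terminals are quoted with ' or " and contain no escapes, and it does not treat comments specially.

```python
def replace_alt_brackets(grammar: str) -> str:
    """Replace '(:' with '{' and ':)' with '}' outside quoted terminals."""
    out = []
    quote = None  # quote character of the terminal we are inside, if any
    i = 0
    while i < len(grammar):
        ch = grammar[i]
        if quote is not None:
            # Inside a quoted terminal: copy verbatim until the closing quote.
            out.append(ch)
            if ch == quote:
                quote = None
            i += 1
        elif ch in ("'", '"'):
            # Start of a quoted terminal.
            quote = ch
            out.append(ch)
            i += 1
        elif grammar.startswith("(:", i):
            out.append("{")
            i += 2
        elif grammar.startswith(":)", i):
            out.append("}")
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(replace_alt_brackets("list = item, (: ',', item :);"))
# list = item, { ',', item };
```

Tracking quotes matters because a terminal such as '(:' is literal text in the language being described and must not be rewritten.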
For many the technical problems are enough. But others may think they should obey whatever ISO and IEC say. It may surprise you, but ISO and IEC are not gods from on high. They are just two of many standards-setting bodies. Just because they wrote a specification does not mean you should use it.
First, just because ISO or IEC wrote a document doesn't mean people use it. After all, ISO developed and promulgated the so-called Open Systems Interconnect (OSI) standards in the 1980s as the one true way to connect networks, and OSI was solidly trounced by the TCP/IP suite developed by IETF. Anyone who committed to the ISO-developed OSI standards wasted a lot of money!
In this specific case, even ISO does not use 14977 itself for all the language standards it publishes. Since even ISO doesn't always use 14977, there's no reason you need to. The 2011 paper “BNF was Here: What Have We Done About the Unnecessary Diversity of Notation for Syntactic Definitions” by Vadim Zaytsev expressly notes that many ISO specifications do not use 14977, and considers 14977 to be a failure. Unfortunately, most ISO standards are not publicly available (as I'll discuss in a moment), so a broad survey would be too expensive. That said, here's a specific example you can check: the Ada programming language standard (published as International Standard ISO/IEC 8652:2012) defines its own BNF format in section 1.1.4. Note that it does not use 14977 notation (for example, it does not use commas for concatenation).
I am certainly not anti-ISO or anti-standards, far from it. I am trying to convince you that you should not use something just because it comes from ISO. If you need something from an international standards body, then it’s worth noting that the W3C and IETF are also international standards bodies that have specified different EBNF notations. The W3C one, in particular, is a reasonable alternative, and that is the one I will focus on here (as a point of comparison).
A broader problem with ISO and IEC is that unlike modern standards-setting organizations, they often charge for the IT standards they publish instead of making them publicly available. By "publicly available" I mean "at no cost"; even ISO uses this terminology. In contrast, the IETF, W3C, and other modern standards-setting organizations always make standards publicly available. These fees are not justifiable today. Distributing documents costs practically nothing, and ISO and IEC do not pay their authors (or the authors' employers), so all the money paid for these standards is exploitative. These fees also greatly impede the use of standards; modern systems require at least tens of thousands of standards (in the broad sense), so while charging for one document (even though the authors get none of it) is unjustified, no one can afford to get them all even if they wanted to. These fees are especially harmful to small businesses and hobbyists, on whom the world depends. In a historical context the fees made sense, because they were necessary to purchase and use printing presses. But today, no one wants printed copies; they want the electronic document, instantly, at no charge. I don't object at all to profit; the profit motivation has done great things for society. I object to exploitation; in some cases ISO is charging for work, yet not paying the people who do the work nor making the work available for free.
Of course, many others have made the same observation. On April 4, 2018, user mycl observed that "ISO Prolog (ISO/IEC 13211) doesn't have a free standard and it has hurt the Prolog language immeasurably. In this case the last freely available draft is quite different from the final standard, which makes the situation worse because not everyone is aware of this. I have noticed a lot of Prolog programmers don't know what's in the standard and what's not - you routinely see answers given on SO that are implementation dependent when they could easily have been expressed in strictly conforming ISO Prolog."
I feel sad about this; I think that ISO is an important organization that has lost its way. ISO has done some good work! I will continue to use ISO specifications where they are good, and I will work with ISO where appropriate. In particular, I'm delighted to work with ISO when the result will be a publicly available standard. More generally, I do think that it's important to have international standards. I think ISO needs to develop and encourage the use of international standards, not focus on charging for work done by others. If you can find a way to encourage ISO to update its practices and join the modern world, that'd be great. I want to see a successful ISO in the long run, and I think its current policies are inappropriate for the modern world.
This problem of charging unjustified fees is less of a problem in this case, though the situation is not great. Thankfully ISO/IEC 14977:1996 is one of the few ISO specifications that are publicly available (that means available at no charge). I find it bizarre that ISO thinks it's acceptable to have any standard it develops not be publicly available, obviously! On the other hand, it’s not friction-free. When I last tried, it was easy to miss the free version, you had to agree to a license before you could download it, you got a zip file that you had to decompress instead of simply getting the actual specification, and the result was a PDF file that didn’t properly scale to different screen sizes (instead of clean responsive HTML or at least reflowing PDF). Compare this complicated multi-step process with the experience of getting the much-better W3C specification for the same job: follow one link and start reading on any device.
Hopefully I have convinced you that mindlessly obeying ISO is completely inappropriate. But getting the specification is not the primary problem in this case. The problem is using it. The specification is terrible, and far better options are available.
When the primary advantage of your specification is that it can be written using typewriters, perhaps that should not be your preferred spec. The weaknesses of this specification far outweigh its advantages. It is widely perceived as a failure, as it is often not used (even by the organization that created it), but because it still exists, people occasionally make the mistake of trying to use it.
I am not the only person to notice problems with 14977, of course. The paper “BNF was Here: What Have We Done About the Unnecessary Diversity of Notation for Syntactic Definitions” by Vadim Zaytsev (also available from the ACM) has some interesting comments. He argues that one of the most significant problems with reusing grammar knowledge in specifications and manuals is the “diversity of syntactic notations: without loss of generality, we can state that every single language document uses its own notation, which is more often than not, a dialect of the (Extended) Backus-Naur Form.” The paper backs this up with an analysis of “a corpus of 38 programming language standards (ANSI, ISO, IEEE, W3C, etc), 23 grammar containing publications of other kinds (non-endorsed books, scientific papers, manuals) and 8 derivative grammar sources, exhibiting in total 42 syntactic notations while defining 77 grammars (from Algol and C++ to SQL and XPath).” He notes that, “There was an attempt in 1996 to standardize the notation at ISO, but it only ended up adding yet another three dialects to the chaos.” He notes some reasons for the failure of 14977 adoption, and pointedly notes that ISO/IEC 14977 is not even used in all ISO language standards. ISO/IEC 14977 has unintentionally become a perfect demonstration of the XKCD cartoon "Standards".
In short, while there would be a big advantage to having a single notation, the community of those who write language specifications have generally rejected ISO 14977 for a variety of reasons. You should be aware of that rejection before committing to using it. Yes, it is published by ISO/IEC, but that does not mean that everyone uses it - or even that they should use it.
I bear no ill will to those who developed ISO/IEC 14977:1996. However, I think 14977 has a lot of problems, and there are obvious EBNF alternatives that should normally be used instead. One of those alternative specifications is in the W3C Extensible Markup Language (XML) 1.0 (Fifth Edition). The W3C specification is much more similar to typical regex syntax (making it much easier for today's software developers to understand), avoids the key problems of 14977:1996, and is already clearly described. More generally, you would be wise to avoid 14977:1996.
Feel free to see my home page at https://dwheeler.com. You may also want to look at my paper Why OSS/FS? Look at the Numbers! and my book on how to develop secure programs.
(C) Copyright David A. Wheeler. Released under Creative Commons Attribution-ShareAlike version 3.0 or later (CC-BY-SA-3.0+).