David A. Wheeler's Blog

Wed, 22 Oct 2008

Estimating the Total Development Cost of a Linux Distribution

There’s a new and interesting paper from the Linux Foundation that estimates the total development cost of a Linux distro. Before looking at it, some background would help…

In 2000 and 2001 I published the first estimates of a GNU/Linux distribution’s development costs. The second study (released in 2001, lightly revised in 2002) was titled More than a Gigabuck. That study analyzed Red Hat Linux 7.1 as a representative GNU/Linux distribution, and found that it would cost over $1 billion (over a Gigabuck) to develop this GNU/Linux distribution by conventional proprietary means in the U.S. (in year 2000 U.S. dollars). It included over 30 million physical source lines of code (SLOC), and had it been developed using conventional proprietary means, it would have taken 8,000 person-years of development time to create. My later paper Linux Kernel 2.6: It’s Worth More! focused on how to estimate the development costs for just the Linux kernel (this was picked up by Groklaw).

The Linux Foundation has just re-performed this analysis with Fedora 9, and released it as “Estimating the Total Development Cost of a Linux Distribution”. Here’s their press release. I’d like to thank the authors (Amanda McPherson, Brian Proffitt, and Ron Hale-Evans), because they’ve reported a lot of interesting information.

For example, they found that it would take approximately $10.8 billion to rebuild the Fedora 9 distribution in today’s dollars; it would take $1.4 billion to develop the Linux kernel alone. This isn’t the value of the distribution; typically people won’t write software unless the software has more value to them than what it costs them (in time and effort) to write it. They state that quite clearly in the paper; they note that these numbers estimate “how much it would cost to develop the software in a Linux distribution today, from scratch. It’s important to note that this estimates the cost but not the value to the greater ecosystem…”. To emphasize that point, the authors reference a 2008 IDC study (“The Role of Linux Commercial Servers and Workloads”) which claims that Linux represents a $25 billion ecosystem. I think IDC’s figure is (in fact) a gross underestimation of the ecosystem value, understandably so (ecosystem value is very hard to measure). Still, the cost to redevelop a system is a plausible lower bound for the value of something (as long as people keep using it). More importantly, it clearly proves that very large and sophisticated systems can be developed as free-libre / open source software (FLOSS).

They make a statement about me that I’d like to expand on: “[Wheeler] concluded—as we did—that Software Lines of Code is the most practical method to determine open source software value since it focuses on the end result and not on per-company or per-developer estimates.” That statement is quite true, but please let me explain why. Directly measuring the amount of time and money spent in development would be, by far, the best way of finding those numbers. But few developers would respond to a survey requesting that information, so direct measurement is completely impractical. Thus, applying well-known industry cost models to the measured code size is the best practical way to estimate development effort and cost, in spite of those models’ limitations.
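To make "well-known industry models" concrete, here is a minimal sketch of one such model: basic COCOMO in its "organic" mode, applied to a single component. The coefficients, salary, and overhead figure below are assumptions (they are the defaults commonly associated with SLOCCount, recalled from memory), and the 100,000-line component is hypothetical; treat the output as an illustration of the method, not as a figure from either study.

```python
# Sketch of a basic COCOMO ("organic" mode) estimate for ONE software component.
# ASSUMPTION: the coefficients, salary, and overhead below are defaults commonly
# associated with SLOCCount; verify them against the tool's documentation.

EFFORT_FACTOR = 2.4       # person-months per (KSLOC ** EFFORT_EXPONENT)
EFFORT_EXPONENT = 1.05    # slightly superlinear: bigger components cost more per line
ANNUAL_SALARY = 56_286    # assumed average programmer salary, U.S. dollars per year
OVERHEAD = 2.4            # assumed multiplier for benefits, facilities, management


def effort_person_months(sloc: int) -> float:
    """Estimated development effort for a component of `sloc` physical lines."""
    ksloc = sloc / 1000.0
    return EFFORT_FACTOR * ksloc ** EFFORT_EXPONENT


def development_cost(sloc: int) -> float:
    """Estimated cost in dollars: person-years times salary times overhead."""
    person_years = effort_person_months(sloc) / 12.0
    return person_years * ANNUAL_SALARY * OVERHEAD


if __name__ == "__main__":
    component_sloc = 100_000  # hypothetical component size
    print(f"effort: {effort_person_months(component_sloc) / 12:.1f} person-years")
    print(f"cost:   ${development_cost(component_sloc):,.0f}")
```

Because the exponent is greater than 1, the result depends on whether the model is applied to each component separately and summed, or to the total line count at once; applying it to the total gives a larger number.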

I was delighted with their section on the “Limitations and Advantages to this Study’s Approach”. All studies have limitations, and I think it’s much better to acknowledge them than hide them. They note several reasons why this approach grossly underestimates the real effort in developing a distribution, and I quite agree with them. In particular: (1) collaboration often takes additional time (though it often produces better results because you see all sides); (2) deletions are work yet they are not counted; (3) “bake-offs” to determine the best approach (where only the winner is included) produce great results but the additional efforts for the alternatives aren’t included in the estimates. (I noted the bake-off problem in my paper on the Linux kernel.) They note that some drivers aren’t often used, but I don’t see that as a problem; after all, it still took effort to develop them, so it’s valid to include them in an effort estimate. Besides, one challenge to creating an operating system is this very issue - to become useful to many, you must develop a large number of drivers - even though many of the drivers have a relatively small set of users.

This is not a study of “all FLOSS”; many FLOSS programs are not included in Fedora (as they note in their limitations). Others have examined Debian and the Perl CPAN library using my approach (see my page on SLOC), and hopefully someday someone will actually try to measure “all FLOSS” (good luck!!). However, since the Linux Foundation measured a descendant of what I used for my original analysis, it’s valid to examine what’s happened to the size of this single distribution over time. That’s really interesting, because that lets us examine overall trends. So let’s take advantage of that! In terms of physical source lines of code (SLOC) we have:

Distribution         Year   SLOC (million)
Red Hat Linux 6.2    2001    17
Red Hat Linux 7.1    2002    30
Fedora 9             2008   204

If Fedora were growing linearly, the first two points would give a rate of 13 MSLOC/year, and Fedora 9 would have 108 MSLOC (30 + 6*13). Fedora 9 is almost twice that size, which shows clearly that there’s exponential growth. Even if you factored in the month of release (which I haven’t done), I believe you’d still have clear evidence of exponential growth. This observation is consistent with “The Total Growth of Open Source” by Amit Deshpande and Dirk Riehle (2008), which found that “both the growth rate as well as the absolute amount of source code is best explained using an exponential model”.
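The arithmetic above can be checked with a few lines of Python. This is only a sketch using the three data points from the table; the compound rate it prints is what a constant-percentage growth model would imply between the last two points, not a statistical fit.

```python
# Data points taken from the table above: (year, physical SLOC in millions).
points = [(2001, 17), (2002, 30), (2008, 204)]

(y0, s0), (y1, s1), (y2, s2) = points

# Linear model: extend the slope between the first two points out to 2008.
linear_rate = (s1 - s0) / (y1 - y0)               # MSLOC per year
linear_projection = s1 + linear_rate * (y2 - y1)  # projected size in 2008

# Exponential model: the constant annual growth factor that would carry
# the second point to the third.
annual_factor = (s2 / s1) ** (1 / (y2 - y1))

print(f"linear rate: {linear_rate:.0f} MSLOC/year")
print(f"linear projection for {y2}: {linear_projection:.0f} MSLOC (actual: {s2})")
print(f"implied compound growth {y1}-{y2}: {(annual_factor - 1) * 100:.1f}% per year")
```

With only three points, and years rather than exact release dates, this is a rough check; it simply shows how far the measured size sits above the straight-line projection.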

Another interesting point: on Oct. 19, 2007, Charles Babcock predicted that the Linux kernel would be worth $1 billion in the first 100 days of 2009. He correctly predicted that it would pass $1 billion, but it happened somewhat earlier than he thought: it had already happened by Oct. 2008. I think the reason it happened slightly earlier is that Charles Babcock’s rough estimate was based on a linear approximation (“adding 2,000 lines of code a day”). But these studies all seem to indicate that mature FLOSS programs - including the Linux kernel - are currently growing exponentially, not linearly. Since the growth rate is itself increasing, the date of arrival at $1 billion was sooner than Babcock’s rough estimate. Babcock’s fundamental point - that the Linux kernel keeps adding value at a tremendous pace - is still absolutely correct.
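Why an exponential model reaches a target sooner than a linear one can be shown with a short derivation. The symbols here are assumptions introduced for illustration: S_0 is the size today, r is today's growth rate (lines per day), and S^* is the size at which the estimated cost reaches the target (the cost estimate increases with size, so a cost target corresponds to a size target).

```latex
% Linear growth at today's rate r:
S_{\mathrm{lin}}(t) = S_0 + r\,t
\quad\Longrightarrow\quad
t_{\mathrm{lin}} = \frac{S^* - S_0}{r}

% Exponential growth with the same rate today, i.e. k = r / S_0:
S_{\mathrm{exp}}(t) = S_0\, e^{k t}
\quad\Longrightarrow\quad
t_{\mathrm{exp}} = \frac{1}{k}\,\ln\frac{S^*}{S_0}
                 = \frac{S_0}{r}\,\ln\frac{S^*}{S_0}

% Since \ln x < x - 1 for every x > 1, setting x = S^*/S_0 gives
t_{\mathrm{exp}} = \frac{S_0}{r}\,\ln\frac{S^*}{S_0}
                 < \frac{S_0}{r}\left(\frac{S^*}{S_0} - 1\right)
                 = t_{\mathrm{lin}}
```

So if both models start from the same size and the same current rate, the exponential one always arrives first, and the gap widens the farther away the target is.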

I took a look at some of the detailed data, and some very interesting factors were revealed. By lines of code, here were the largest programs in Fedora 9 (biggest first):

  Gcc-4.3.0-20080428
  Enterprise Security Client 1.0.1

The Linux kernel is no surprise; as I noted in the past, it’s full of drivers, and there’s a continuous stream of new hardware that needs drivers. The Linux Foundation decided to count both gcc3 and gcc4; since there was a radical change in approach between gcc3 and gcc4, I think that’s fair in terms of effort estimation. (My tool ignores duplicate files, which helps counter double-counting of effort.) Firefox wasn’t included by name in the Gigabuck study, but Mozilla was, and Firefox is essentially its descendant. It’s unsurprising that Firefox is big; it does a lot of things, and trying to make things “look” simple often takes more code (and effort).
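Ignoring duplicate files, as mentioned above, can be illustrated with a content hash. This is a sketch of the general idea only, not the tool's actual implementation; the function name `unique_files`, the choice of SHA-256, and the sample file names are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable


def unique_files(files: Iterable[tuple[str, bytes]]) -> list[str]:
    """Return the names of files whose contents have not been seen before."""
    seen_digests: set[str] = set()
    kept: list[str] = []
    for name, content in files:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen_digests:
            continue  # byte-identical copy of an earlier file: do not count it again
        seen_digests.add(digest)
        kept.append(name)
    return kept


if __name__ == "__main__":
    # Hypothetical file names and contents, for illustration only.
    sample = [
        ("gcc3/libiberty/hashtab.c", b"int hash;\n"),
        ("gcc4/libiberty/hashtab.c", b"int hash;\n"),  # identical copy
        ("gcc4/tree-ssa.c", b"int ssa;\n"),
    ]
    print(unique_files(sample))
```

A check like this only catches byte-identical copies; a file that was modified between two versions is counted in both, which is reasonable for effort estimation since the changes took work.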

What’s remarkable is that many of the largest programs in Fedora 9 were not even included in the “Gigabuck” study - these are whole new applications that were added to Fedora since that time. These largest programs not in the Gigabuck study are: OpenOffice.org (an office suite, i.e., word processor, spreadsheet, presentation program, and so on), Enterprise Security Client, eclipse (a development environment), Mono (an implementation of the C# programming language and its underlying “.NET” environment), bigloo (an implementation of the Scheme programming language), and paraview (a data analysis and visualization application for large datasets). OpenOffice.org’s size is no surprise; it does a lot. I’m a little concerned that “Enterprise Security Client” is so huge - a security client should be small, not big, so that you can analyze it thoroughly for trustworthiness. Perhaps someone will analyze that program further to see why it is so large, and whether that’s a reason to be concerned.

Anyway, take a look at “Estimating the Total Development Cost of a Linux Distribution”. It conclusively shows that large and useful systems can be developed as FLOSS.

An interesting coincidence: Someone else (Heise) almost simultaneously released a study of just the Linux kernel, again using SLOCCount. Kernel Log: More than 10 million lines of Linux source files notes that the Linux kernel version 2.6.27 has 6,399,191 SLOC. “More than half of the lines are part of hardware drivers; the second largest chunk is the arch/ directory which contains the source code of the various architectures supported by Linux.” In that code, “96.4 per cent of the code is written in C and 3.3 percent in Assembler”. They didn’t apply the corrective factors specific to Linux kernels that I discussed in Linux Kernel 2.6: It’s Worth More!, but it’s still interesting to see. And their conclusion is inarguable: “There is no end in sight for kernel growth which has been ongoing in the Linux 2.6 series for several years - with every new version, the kernel hackers extend the Linux kernel further to include new functions and drivers, improving the hardware support or making it more flexible, better or faster.”
