Estimating Linux's Size
David A. Wheeler (dwheeler@dwheeler.com)
November 3, 2000
Version 1.02

This paper presents size estimates (and their implications) of the source code of a distribution of the Linux operating system (OS), a combination often called GNU/Linux. The distribution used in this paper is Red Hat Linux version 6.2, including the kernel, software development tools, graphics interfaces, client applications, and so on. Other distributions and versions will have different sizes.

In total, this distribution includes well over 17 million physical source lines of code (SLOC). Using the COCOMO cost model, this is estimated to have required over 4,500 person-years of development time. Had this Linux distribution been developed by conventional proprietary means, it's estimated that it would have cost over $600 million to develop in the U.S. (in year 2000 dollars).

Many other interesting statistics emerge. The largest components (in order) were the linux kernel (including device drivers), the X-windows server (for the graphical user interface), gcc (a compilation system), and emacs (a text editor and far more). The languages used, sorted by the most lines of code, were C, C++, LISP (including Emacs' LISP and Scheme), shell (including ksh), Perl, Tcl (including expect), assembly (all kinds), Python, yacc/bison, Java, lex/flex, awk, objective-C, C-shell, Ada, Pascal, sed, and Fortran. More information is available at http://www.dwheeler.com/sloc.

1. Introduction

The Linux operating system (also called GNU/Linux) has gone from an unknown to a powerful market force. One survey found that more Internet servers use Linux than any other operating system [Zoebelein 1999]. IDC found that 25% of all server operating systems purchased in 1999 were Linux, making it second only to Windows NT's 38% [Shankland 2000a].

There appear to be many reasons for this, not simply that Linux can be obtained at no or low cost. For example, experiments suggest that Linux is highly reliable. A 1995 study of a set of individual components found that the GNU and Linux components had a significantly higher reliability than their proprietary Unix competitors (6% to 9% failure rate with GNU and Linux, versus an average 23% failure rate with the proprietary software using their measurement technique) [Miller 1995]. A ten-month experiment in 1999 by ZDnet found that, while Microsoft's Windows NT crashed every six weeks under a ``typical'' intranet load, using the same load and request set the Linux systems (from two different distributors) never crashed [Vaughan-Nichols 1999].

However, possibly the most important reason for Linux's popularity among many developers and users is that its source code is generally ``open source software'' and/or ``free software'' (where the ``free'' here means ``freedom''). A program that is ``open source software'' or ``free software'' is essentially a program whose source code can be obtained, viewed, changed, and redistributed without royalties or other limitations on these actions. A more formal definition of ``open source software'' is available at OSI [1999], a more formal definition of ``free software'' is available at FSF [2000], and other general information about these topics is available at Wheeler [2000a]. Quantitative rationales for using open source / free software are given in Wheeler [2000b]. The Linux operating system is actually a suite of components, including the Linux kernel on which it is based, and it is packaged, sold, and supported by a variety of distributors. The Linux kernel is ``open source software''/``free software'', and this is also true for all (or nearly all) other components of a typical Linux distribution. Open source software/free software frees users from being captives of a particular vendor, since it permits users to fix any problems immediately, tailor their system, and analyze their software in arbitrary ways.

Surprisingly, although anyone can analyze Linux for arbitrary properties, I have found little published analysis of the number of source lines of code (SLOC) contained in a Linux distribution. The only published data I've found was developed by Microsoft in the documents usually called ``Halloween I'' and ``Halloween II''. Unfortunately, the meaning, derivation, and assumptions of their numbers are not explained, making the numbers hard to use and truly understand. Even worse, although the two documents were written by essentially the same people at the same time, the numbers in the documents appear (on their surface) to be contradictory. The so-called ``Halloween I'' document claimed that the Linux kernel (x86 only) was 500,000 lines of code, the Apache web server was 80,000 lines of code, the X-windows server was 1.5 million, and a full Linux distribution was about 10 million lines of code [Halloween I]. The ``Halloween II'' document seemed to contradict this, saying that ``Linux'' by 1998 included 1.5 million lines of code. Since ``version 2.1.110'' is identified as the version number, presumably this only measures the Linux kernel, and it does note that this measure includes all Linux ports to various architectures [Halloween II]. However, this raises as many questions as it answers - what exactly was being measured, and what assumptions were made? You could infer from these documents that the Linux kernel's support for other architectures took one million lines of code - but this appeared unlikely. Another study, [Dempsey 1999], did analyze open source programs, but it primarily focused on statistics about developers and used basic filesize numbers to report about the software.

This paper bridges this gap. In particular, it shows estimates of the size of Linux, and it estimates how much it would cost to rebuild a typical Linux distribution using traditional software development techniques. Various definitions and assumptions are included, so that others can understand exactly what these numbers mean.

For my purposes, I have selected as my ``representative'' Linux distribution Red Hat Linux version 6.2. I believe this distribution is reasonably representative for several reasons:

  1. Red Hat Linux is the most popular Linux distribution sold in 1999 according to IDC [Shankland 2000b]. Red Hat sold 48% of all copies in 1999; the next largest distribution in market share sales was SuSE at 15%. Not all Linux copies are ``sold'' in a way that this study would count, but the study at least shows that Red Hat's distribution is a popular one.
  2. Many distributions (such as Mandrake) are based on older versions of Red Hat Linux.
  3. All major general-purpose distributions support (at least) the kind of functionality supported by Red Hat Linux, if for no other reason than to compete with Red Hat.
  4. All distributors start with the same set of open source software projects from which to choose components to integrate. Therefore, other distributions are likely to choose the same components or similar kinds of components with often similar size.

Different distributions and versions would produce different size figures, but I hope that this paper will be enlightening even though it doesn't try to evaluate ``all'' distributions. Note that some distributions (such as SuSE) may decide to add many more applications, but also note this would only create larger (not smaller) sizes and estimated levels of effort. At the time that I began this project, version 6.2 was the latest version of Red Hat Linux available, so I selected that version for analysis.

Section 2 briefly describes the approach used to estimate the ``size'' of this distribution (most of the details are in Appendix A). Section 3 discusses some of the results (with the details in Appendix B). Section 4 presents conclusions, followed by the two appendices.

2. Approach

My basic approach was to:
  1. install the source code files,
  2. categorize the files, creating for each package a list of files for each programming language; each file in each list contains source code in that language (excluding duplicate file contents and automatically generated files),
  3. count the lines of code for each language for each component, and
  4. use the original COCOMO model to estimate the effort to develop each component, and then the cost to develop using traditional methods.

This was not as easy as it sounds; the steps and assumptions made are described in Appendix A.

A few summary points are worth mentioning here, however, for those who don't read appendix A. I included software for all architectures, not just the i386. I did not include ``old'' versions of software (with the one exception of bash, as discussed in appendix A). I used md5 checksums to identify and ignore duplicate files, so if the same file contents appeared in more than one file, it was only counted once. The code in makefiles and RPM package specifications was not included. Various heuristics were used to detect automatically generated code, and any such code was also excluded from the count. A number of other heuristics were used to determine if a file was a source program file, and if so, what its language was.
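The duplicate-elimination step can be illustrated with a minimal sketch in Python. This is not the tooling used for this paper; the function names md5_of_file and unique_files are hypothetical, and the sketch assumes that "duplicate" means byte-for-byte identical contents, which is what an md5 checksum comparison detects.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path) -> str:
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unique_files(paths):
    """Keep only the first file seen for each distinct content checksum."""
    seen = set()
    kept = []
    for path in paths:
        checksum = md5_of_file(path)
        if checksum in seen:
            continue  # same contents were already counted once
        seen.add(checksum)
        kept.append(path)
    return kept
```

Keeping the first file seen is an arbitrary choice made for this sketch; any rule that keeps exactly one file per checksum produces the same SLOC total.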

The ``physical source lines of code'' (physical SLOC) measure was used as the primary measure of SLOC in this paper. Less formally, a physical SLOC in this paper is a line with something other than comments and whitespace (tabs and spaces). More specifically, physical SLOC is defined as follows: ``a physical source line of code is a line ending in a newline or end-of-file marker, and which contains at least one non-whitespace non-comment character.'' Comment delimiters (characters other than newlines starting and ending a comment) were considered comment characters. Data lines only including whitespace (e.g., lines with only tabs and spaces in multiline strings) were not included.
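To make the definition concrete, here is a minimal sketch of a physical SLOC counter for C-like source, written in Python. It is an illustration under simplifying assumptions, not the counter used for this paper: the name count_physical_sloc_c is hypothetical, ``//'' line comments are accepted in addition to ``/* */'' comments, and string literals are assumed not to continue across lines.

```python
def count_physical_sloc_c(source: str) -> int:
    """Count lines holding at least one non-whitespace, non-comment character."""
    sloc = 0
    in_block_comment = False
    for line in source.splitlines():
        has_code = False
        quote = None  # the open string/char delimiter, if any (reset per line)
        i = 0
        while i < len(line):
            ch = line[i]
            pair = line[i:i + 2]
            if in_block_comment:
                if pair == "*/":
                    in_block_comment = False
                    i += 2
                else:
                    i += 1
            elif quote:
                if ch == "\\":
                    i += 2  # skip the escaped character
                else:
                    if ch == quote:
                        quote = None
                    i += 1
            elif pair == "/*":
                in_block_comment = True
                i += 2
            elif pair == "//":
                break  # the rest of the line is a comment
            else:
                if ch in "\"'":
                    quote = ch  # comment markers inside literals are code
                if not ch.isspace():
                    has_code = True
                i += 1
        if has_code:
            sloc += 1
    return sloc
```

Note that the comment delimiters themselves never set has_code, which matches the rule that delimiters are treated as comment characters.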

Note that the ``logical'' SLOC is not the primary measure used here; one example of a logical SLOC measure would be the ``count of all terminating semicolons in a C file.'' The ``physical'' SLOC was chosen instead of the ``logical'' SLOC because there were so many different languages that needed to be measured. I had trouble getting freely-available tools to work on this scale, and the non-free tools were too expensive for my budget (nor is it certain that they would have fared any better). Since I had to develop my own tools, I chose a measure that is much easier to implement. Park [1992] actually recommends the use of the physical SLOC measure (as a minimum), for this and other reasons. There are disadvantages to the ``physical'' SLOC measure. In particular, physical SLOC measures are sensitive to how the code is formatted. However, logical SLOC measures have problems too. First, as noted, implementing tools to measure logical SLOC is more difficult, requiring more sophisticated analysis of the code. Also, there are many different possible logical SLOC measures, requiring even more careful definition. Finally, a logical SLOC measure must be redefined for every language being measured, making inter-language comparisons more difficult. For more information on measuring software size, including the issues and decisions that must be made, see Kalb [1990], Kalb [1996], and Park [1992].
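For contrast, the example logical measure mentioned above (counting terminating semicolons in a C file) can be sketched as follows. This is only an illustration of that one example definition, not the CodeCount definition or any other standard logical SLOC measure; the names _C_NOISE and count_terminating_semicolons are hypothetical.

```python
import re

# Comments and literals are blanked out so that semicolons inside them
# are not counted.
_C_NOISE = re.compile(
    r"""
    /\*.*?\*/            # block comment
    | //[^\n]*           # line comment
    | "(?:\\.|[^"\\])*"  # string literal
    | '(?:\\.|[^'\\])*'  # character literal
    """,
    re.DOTALL | re.VERBOSE,
)


def count_terminating_semicolons(source: str) -> int:
    """Count semicolons that appear outside comments and literals."""
    stripped = _C_NOISE.sub(" ", source)
    return stripped.count(";")
```

Unlike a physical count, this measure gives the same result whether a statement is written on one line or spread over several. It also shows why a logical measure needs careful definition: a header such as ``for (i = 0; i < n; i++)'' contributes two semicolons under this naive rule.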

This decision to use physical SLOC also implied that for an effort estimator I needed to use the original COCOMO cost and effort estimation model (see Boehm [1981]), rather than the newer ``COCOMO II'' model. This is simply because COCOMO II requires logical SLOC as an input instead of physical SLOC.

For programmer salary averages, I used a salary survey from the September 4, 2000 issue of ComputerWorld; their survey claimed that this annual programmer salary averaged $56,286 in the United States. I was unable to find a publicly-backed average value for overhead, also called the ``wrap rate.'' This value is necessary to estimate the costs of office space, equipment, overhead staff, and so on. I talked to two cost analysts, who suggested that 2.4 would be a reasonable overhead (wrap) rate. Some Defense Systems Management College (DSMC) training material gives examples of 2.3 (125.95%+100%) not including general and administrative (G&A) overhead, and 2.8 when including G&A (125% engineering overhead, plus 25% on top of that amount for G&A) [DSMC]. This at least suggests that 2.4 is a plausible estimate. Clearly, these values vary widely by company and region; the information provided in this paper is enough to use different numbers if desired.
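The overhead arithmetic can be checked with a short sketch. The function name wrap_rate is hypothetical, and the sketch assumes (following the parenthetical descriptions above) that overhead is added to a base of 100% and that G&A is then applied on top of the result.

```python
def wrap_rate(overhead_pct: float, ga_pct: float = 0.0) -> float:
    """Multiplier applied to salary: (1 + overhead), then G&A on top."""
    return (1 + overhead_pct / 100) * (1 + ga_pct / 100)


without_ga = wrap_rate(125.95)       # 125.95% + 100%, roughly 2.26
with_ga = wrap_rate(125, ga_pct=25)  # 2.25 * 1.25 = 2.8125

# Annual cost of one programmer at the 2.4 rate used in this paper.
loaded_annual_cost = 56286 * 2.4
```

Rounded to one decimal place, the two DSMC examples give 2.3 and 2.8, and a rate of 2.4 turns the $56,286 average salary into roughly $135,000 per programmer per year.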

3. Results

Given this approach, here are some of the results. Section 3.1 presents the largest components (by SLOC), section 3.2 presents results specifically from the Linux kernel's SLOC, section 3.3 presents total counts by language, section 3.4 presents total counts of files (instead of SLOC), section 3.5 presents total SLOC counts, and section 3.6 presents effort and cost estimates.

3.1 Largest Components by SLOC

Here are the top 25 largest components (as measured by number of source lines of code):
SLOC    Directory       SLOC-by-Language (Sorted)
1526722 linux           ansic=1462165,asm=59574,sh=2860,perl=950,tcl=414,
                        yacc=324,lex=230,awk=133,sed=72
1291745 XFree86-3.3.6   ansic=1246420,asm=14913,sh=13433,tcl=8362,cpp=4358,
                        yacc=2710,perl=711,awk=393,lex=383,sed=57,csh=5
720112  egcs-1.1.2      ansic=598682,cpp=75206,sh=14307,asm=11462,yacc=7988,
                        lisp=7252,exp=2887,fortran=1515,objc=482,sed=313,perl=18
652087  gdb-19991004    ansic=587542,exp=37737,sh=9630,cpp=6735,asm=4139,
                        yacc=4117,lisp=1820,sed=220,awk=142,fortran=5
625073  emacs-20.5      lisp=453647,ansic=169624,perl=884,sh=652,asm=253,
                        csh=9,sed=4
467120  binutils-2.9.5.0.22 ansic=407352,asm=27575,exp=12265,sh=7398,yacc=5606,
                        cpp=4454,lex=1479,sed=557,lisp=394,awk=24,perl=16
415026  glibc-2.1.3     ansic=378753,asm=30644,sh=2520,cpp=1704,awk=910,
                        perl=464,sed=16,csh=15
327021  tcltk-8.0.5     ansic=240093,tcl=71947,sh=8531,exp=5150,yacc=762,
                        awk=273,perl=265
247026  postgresql-6.5.3 ansic=207735,yacc=10718,java=8835,tcl=7709,sh=7399,
                        lex=1642,perl=1206,python=959,cpp=746,asm=70,csh=5,sed=2
235702  gimp-1.0.4      ansic=225211,lisp=8497,sh=1994
231072  Mesa            ansic=195796,cpp=17717,asm=13467,sh=4092
222220  krb5-1.1.1      ansic=192822,exp=19364,sh=4829,yacc=2476,perl=1528,
                        awk=393,python=348,lex=190,csh=147,sed=123
206237  perl5.005_03    perl=94712,ansic=89366,sh=15654,lisp=5584,yacc=921
205082  qt-2.1.0-beta1  cpp=180866,ansic=20513,yacc=2284,sh=538,lex=464,
                        perl=417
200628  Python-1.5.2    python=100935,ansic=96323,lisp=2353,sh=673,perl=342,
                        sed=2
199982  gs5.50          ansic=195491,cpp=2266,asm=968,sh=751,lisp=405,perl=101
193916  teTeX-1.0       ansic=166041,sh=10263,cpp=9407,perl=3795,pascal=1546,
                        yacc=1507,awk=522,lex=323,sed=297,asm=139,csh=47,lisp=29
155035  bind-8.2.2_P5   ansic=131946,sh=10068,perl=7607,yacc=2231,cpp=1360,
                        csh=848,awk=753,lex=222
140130  AfterStep-APPS-20000124 ansic=135806,sh=3340,cpp=741,perl=243
138931  kdebase         cpp=113971,ansic=23016,perl=1326,sh=618
138118  gtk+-1.2.6      ansic=137006,perl=479,sh=352,awk=274,lisp=7
138024  gated-3-5-11    ansic=126846,yacc=7799,sh=1554,lex=877,awk=666,csh=235,
                        sed=35,lisp=12
133193  kaffe-1.0.5     java=65275,ansic=62125,cpp=3923,perl=972,sh=814,
                        asm=84
131372  jade-1.2.1      cpp=120611,ansic=8228,sh=2150,perl=378,sed=5
128672  gnome-libs-1.0.55 ansic=125373,sh=2178,perl=667,awk=277,lisp=177

Note that the operating system kernel (linux) is the largest single component, at over 1.5 million lines of code (mostly in C). See section 3.2 for a more detailed discussion of the linux kernel.

The next largest component is the X windows server, a critical part of the graphical user interface (GUI). Given the importance of GUIs, the long history of this program (giving it time to accrete functionality), and the many incompatible video displays it must support, this is perhaps not surprising.

Next is the gcc compilation system, including the C and C++ compilers, which is confusingly named ``egcs'' instead. The naming conventions of gcc can be confusing, so a little explanation is in order. Officially, the compilation system is called ``gcc''. Egcs was a project to experiment with a more open development model for gcc. Red Hat Linux 6.2 used one of the gcc releases from the egcs project, and called the release egcs-1.1.2 to avoid confusion with the official (at that time) gcc releases. The egcs experiment was a success; egcs as a separate project no longer exists, and current gcc development is based on the egcs code and development model. To sum it up, the compilation system is named ``gcc'', and the version of gcc used here is a version developed by ``egcs''.

Following this are the symbolic debugger and emacs. Emacs is probably not a real surprise; some users use nothing but emacs (e.g., reading their email via emacs), using emacs as a kind of virtual operating system. This is followed by the set of utilities for binary files, and the C library (which is actually used by most other language libraries as well). This is followed by TCL/Tk (a combined language and widget set), PostgreSQL (a relational DBMS), and the GIMP (an excellent client application for editing bitmapped drawings).

Note that language implementations tend to be written in themselves, particularly for their libraries. Thus there is more Perl than any other single language in the Perl implementation, more Python than any other single language in Python, and more Java than any other single language in Kaffe (an implementation of the Java Virtual Machine and library).

3.2 Examination of the Linux Kernel's SLOC

Since the largest single component was the linux kernel (at over 1.5 million SLOC), I examined it further, to learn why it was so large and determine its ramifications.

I found that over 870,000 lines of this code were in the ``drivers'' subdirectory; thus, the primary reason the kernel is so large is that it supports so many different kinds of hardware. The linux kernel's design is expressed in its source code directory structure, and no other directory comes close to this size - the second largest is the ``arch'' directory (at over 230,000 SLOC), which contains the architecture-unique code for each CPU architecture. Supporting many different filesystems also increases its size, but not as much as expected; the entire filesystem code is not quite 88,000 SLOC. See the appendix for more detail.

Richard Stallman and others have argued that the resulting system often called ``Linux'' should instead be called ``GNU/Linux'' [Stallman 2000]. In particular, by hiding GNU's contributions (through not including GNU's name), many people are kept unaware of the GNU project and its purpose, which is to encourage a transition to ``free software'' (free as in freedom). Certainly, the resulting system was the intentional goal and result of the GNU project's efforts. Another argument used to justify the term ``GNU/Linux'' is that it is confusing if both the entire operating system and the operating system kernel are both called ``Linux''. Using the term ``Linux'' is particularly bizarre for GNU/Hurd, which takes the Debian GNU/Linux distribution and swaps out one component: the Linux kernel.

The data here can be used to justify calling the system either ``Linux'' or ``GNU/Linux.'' It's clear that the largest single component in the operating system is the Linux kernel, so it's at least understandable how so many people have chosen to name the entire system after its largest single component (``Linux''). It's also clear that there are many contributors, not just the GNU project itself, and some of those contributors do not agree with the GNU project's philosophy. On the other hand, many of the largest components of the system are essentially GNU projects: gcc (packaged under the name ``egcs''), gdb, emacs, binutils (a set of commands for binary files), and glibc (the C library). Other GNU projects in the system include bash, gawk, make, textutils, sh-utils, gettext, readline, automake, tar, less, findutils, diffutils, and grep. This is not even counting GNOME, a GNU project. In short, the total of the GNU project's code is much larger than the Linux kernel's size. Thus, by comparing the total contributed effort, it's certainly justifiable to call the entire system ``GNU/Linux'' and not just ``Linux.''

These measurements at least debunk one possible explanation of the Halloween documents' measures. Since Halloween I claimed that the x86-only code for the Linux kernel measured 500,000 SLOC, while Halloween II claimed that the kernel (all architectures) was 1.5 million SLOC, one explanation of this difference would be that the code for non-x86 systems was 1 million SLOC. This isn't so; I computed a grand total of 267,320 physical SLOC of non-i86 code (including drivers and architecture-specific code). It seems unlikely that over 700,000 lines of code would have been removed (not added) in the intervening time.

However, other measures (and explanations) are more promising. I also ran the CodeCount tools on the linux operating system kernel. Using the CodeCount definition of C logical lines of code, CodeCount determined that this version of the linux kernel included 673,627 logical SLOC in C. This is obviously much smaller than the 1,462,165 of physical SLOC in C, or the 1,526,722 SLOC when all languages are combined for the Linux kernel. When I removed all non-i86 code and re-ran the CodeCount tool on just the C code, a logical SLOC of 570,039 of C code was revealed. Since the Halloween I document reported 500,000 SLOC (when only including x86 code), it appears very likely that the Halloween I paper counted logical SLOC (and only C code) when reporting measurements of the linux kernel. However, the other Halloween I measures appear to be physical SLOC measures: their estimate of 1.5 million SLOC for the X server is closer to the 1.2 million physical SLOC measured here, and their estimate of 80,000 SLOC for Apache is close to the 77,873 SLOC measured here (as shown in Appendix B). These variations in measurements should be expected, since the versions I am measuring are slightly different than the ones they measured, and it is likely that some assumptions are different as well. Meanwhile, Halloween II reported a measure of 1.5 million lines of code for the linux kernel, essentially the same value given here for physical SLOC.

In short, it appears that Halloween I used the ``logical SLOC'' measure when measuring the Linux kernel, while all other measures in Halloween I and II used physical SLOC as the measure. I have attempted to contact the Microsoft author to confirm this, but as of yet I have not received such confirmation. In any case, this example clearly demonstrates the need to carefully identify the units of measure and assumptions made in any measurement of SLOC.

3.3 Total Counts by Language

Here are the various programming languages, sorted by the total number of source lines of code:

ansic:    14218806 (80.55%)
cpp:       1326212 (7.51%)
lisp:       565861 (3.21%)
sh:         469950 (2.66%)
perl:       245860 (1.39%)
asm:        204634 (1.16%)
tcl:        152510 (0.86%)
python:     140725 (0.80%)
yacc:        97506 (0.55%)
java:        79656 (0.45%)
exp:         79605 (0.45%)
lex:         15334 (0.09%)
awk:         14705 (0.08%)
objc:        13619 (0.08%)
csh:         10803 (0.06%)
ada:          8217 (0.05%)
pascal:       4045 (0.02%)
sed:          2806 (0.02%)
fortran:      1707 (0.01%)

Here you can see that C is pre-eminent (with over 80% of the code), followed by C++, LISP, shell, and Perl. Note that the separation of Expect and TCL is somewhat artificial; if combined, they would be next (at 232115), followed by assembly. Following this in order are Python, yacc/bison, Java, lex/flex, awk, objective-C, C-shell, Ada, Pascal, sed, and Fortran. Some of the languages with smaller counts (such as objective-C and Ada) show up primarily as test cases or bindings to support users of those languages. Nevertheless, it's nice to see at least some support for a variety of languages, since each language has some strength for some type of application.
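The percentages and the combined Tcl/Expect ranking can be reproduced from the counts in the table above. The following Python sketch simply copies those counts into a dictionary; the variable and function names are hypothetical and are not part of the counting tools.

```python
SLOC_BY_LANGUAGE = {
    "ansic": 14218806, "cpp": 1326212, "lisp": 565861, "sh": 469950,
    "perl": 245860, "asm": 204634, "tcl": 152510, "python": 140725,
    "yacc": 97506, "java": 79656, "exp": 79605, "lex": 15334,
    "awk": 14705, "objc": 13619, "csh": 10803, "ada": 8217,
    "pascal": 4045, "sed": 2806, "fortran": 1707,
}

total_sloc = sum(SLOC_BY_LANGUAGE.values())


def share(language: str) -> float:
    """Percentage of all counted SLOC written in the given language."""
    return 100 * SLOC_BY_LANGUAGE[language] / total_sloc


# Treat Expect as part of Tcl and re-rank the languages by size.
merged = dict(SLOC_BY_LANGUAGE)
merged["tcl+exp"] = merged.pop("tcl") + merged.pop("exp")
ranking = sorted(merged, key=merged.get, reverse=True)
```

With the two entries merged, ``tcl+exp'' moves ahead of assembly and lands directly after Perl, which is the ordering described in the text.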

C++ has over a million lines of code, a very respectable showing, and yet at least in this distribution it is far less than C. One could ask why there's so much more C code, particularly against C++. One possible argument is that well-written C++ takes fewer lines of code than does C; while this is often true, that's unlikely to entirely explain this. Another important factor is that many of the larger programs were written before C++ became widely used, and no one wishes to rewrite their C programs into C++. Also, there are a significant number of software developers who prefer C over C++ (e.g., due to simplicity of understanding the entire language), which would certainly affect these numbers. There have been several efforts in the past to switch from C to C++ in the Linux kernel, and they have all failed (for a variety of reasons).

The fact that LISP places so highly (it's in third place) is a little surprising. LISP is used in many components, but its high placement is due to the widespread use of emacs. Emacs itself is written primarily in its own variant of LISP, and the emacs package itself accounts for 80% (453647/565861) of the total amount of LISP code. In addition, many languages include sophisticated (and large) emacs modes to support development in those languages: Perl includes 5584 lines of LISP, and Python includes another 2353 lines of LISP that are directly used to support elaborate Emacs modes for program editing. The ``psgml'' package is solely an emacs mode for editing SGML documents. The components with the second and third largest amounts of LISP are xlispstat-3-52-17 and scheme-3.2, which are implementations of LISP and Scheme (a LISP dialect) respectively. Other programs (such as the GIMP and Sawmill) also use LISP or one of its variants as a ``control'' language to control components built in other languages (in these cases C). LISP has a long history of use in the hacking (computer enthusiast) community, due to powerful influences such as MIT's old ITS community. For more information on the history of hackerdom, including the influence of ITS and LISP, see [Raymond 1999].

3.4 Total Counts of Files

Of course, instead of counting SLOC, you could count just the number of files in various categories, looking for other insights.

Lex/flex and yacc/bison are widely-used program generators. They make respectable showings when counting SLOC, but their widespread use is more obvious when examining the file counts. There are 57 different lex/flex files, and 110 yacc/bison files. Since some build directories use lex/flex or yacc/bison more than once, the count of build directories using these tools is smaller but still respectable: 38 different build directories use lex/flex, and 62 different build directories use yacc/bison.

Other insights can be gained from the file counts shown in appendix B. The number of source code files counted was 72,428. Not included in this count were 5,820 files which contained duplicate contents, and 817 files which were detected as being automatically generated.

These values can be used to compute average SLOC per file across the entire system. For example, for C, there were 14218806 SLOC contained in 52088 files, resulting in an ``average'' C file containing 273 (14218806/52088) physical source lines of code.

3.5 Total SLOC Counts

Given all of these assumptions, the counting programs compute a total of 17,652,561 physical source lines of code (SLOC); I will simplify this to ``over 17 million physical SLOC''. This is an astounding amount of code; compare this to reported sizes of other systems:
Product                             SLOC
NASA Space Shuttle flight control   420K (shuttle) + 1.4 million (ground)
Sun Solaris (1998-2000)             7-8 million
Microsoft Windows 3.1 (1992)        3 million
Microsoft Windows 95                15 million
Microsoft Windows 98                18 million
Microsoft Windows NT (1992)         4 million
Microsoft Windows NT 5.0 (1998)     20 million

These numbers come from Bruce Schneier's Crypto-Gram [Schneier 2000], except for the Space Shuttle numbers which come from a National Academy of Sciences study [NAS 1996]. Numbers for later versions of Microsoft products are not shown here because their values have great uncertainty in the published literature. The assumptions of these numbers are unclear (e.g., are these physical or logical lines of code?), but they are likely to be comparable physical SLOC counts.

Schneier also reports that ``Linux, even with the addition of X Windows and Apache, is still under 5 million lines of code''. At first, this seems to be contradictory, since this paper counts over 17 million SLOC, but Schneier appears to be literally correct in the context of his statement. The phrasing of his sentence suggests that Schneier is considering some sort of ``minimal'' system, since he considers ``even the addition of X Windows'' as a significant addition. As shown in appendix section B.4, taking the minimal ``base'' set of components in Red Hat Linux, and then adding the minimal set of components for graphical interaction (the X Windows's graphical server, library, configuration tool, and a graphics toolkit) and the Apache web server, the total is about 4.4 million physical SLOC - which is less than 5 million. This minimal system doesn't include some useful (but not strictly necessary) components, but a number of useful components could be added while still staying under a total of 5 million SLOC.

However, note the contrast. Many Linux distributions include with their operating systems many applications (e.g., bitmap editors) and development tools (for many different languages). As a result, the entire delivered system for such distributions (including Red Hat Linux 6.2) is much larger than the 5 million SLOC stated by Schneier. In short, this distribution's size appears similar to the size of Windows 98 and Windows NT 5.0 in 1998.

Microsoft's recent legal battles with the U.S. Department of Justice (DoJ) also involve the bundling of applications with the operating system. However, it's worth noting some differences. First, and most important legally, a judge has ruled that Microsoft is a monopoly, and under U.S. law monopolies aren't allowed to perform certain actions that other organizations may perform. Second, anyone can take Linux, bundle it with an application, and redistribute the resulting product. There is no barrier such as ``secret interfaces'' or relicensing costs that prevents anyone from making an application work on or integrate with Linux. Third, many Linux distributions include alternatives; users can choose between a number of options, all on the CD-ROM. Thus, while Linux distributions also appear to be going in the direction of adding applications to their system, they do not do so in a way that significantly interferes with a user's ability to select between alternatives.

It's worth noting that SLOC counts do not necessarily measure user functionality very well. For example, smart developers often find creative ways to simplify problems, so programs with small SLOC counts can often provide greater functionality than programs with large SLOC counts. However, there is evidence that SLOC counts correlate to effort (and thus development time), so using SLOC to estimate effort is still valid.

Creating reliable code can require much more effort than creating unreliable code. For example, it's known that the Space Shuttle code underwent rigorous testing and analysis, far more than typical commercial software undergoes, driving up its development costs. However, it cannot be reasonably argued that reliability differences between Linux and either Solaris or Windows NT would necessarily cause Linux to take less effort to develop for a similar size. To see this, let's pretend that Linux had been developed using traditional proprietary means and a similar process to these other products. As noted earlier, experiments suggest that Linux, or at least certain portions of it, is more reliable than either. This would either cost more money (due to increased testing) or require a substantive change in development process (e.g., through increased peer review). Therefore, Linux's reliability suggests that developing Linux traditionally (at the same level of reliability) would have taken at least the same amount of effort if similar development processes were used as compared to similarly-sized systems.

3.6 Effort and Cost Estimates

Finally, given all the assumptions shown, here are the effort values:
Total Physical Source Lines of Code (SLOC) = 17652561
Total Estimated Person-Years of Development = 4548.36
Average Programmer Annual Salary = 56286
Overhead Multiplier = 2.4
Total Estimated Cost to Develop = $ 614421924.71

See appendix A for more data on how these effort values were calculated; you can retrieve more information from http://www.dwheeler.com/sloc.
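
As a quick consistency check on the figures above, the following Python sketch recomputes the total cost. It assumes that cost is simply person-years times salary times overhead; the small residual is presumably because the person-years figure shown is rounded to two decimal places.

```python
# Figures copied from section 3.6.
person_years = 4548.36        # total estimated person-years (rounded as shown)
annual_salary = 56286         # average programmer annual salary, in dollars
overhead = 2.4                # overhead multiplier
reported_cost = 614421924.71  # total estimated cost as reported

# Assumed relation: cost = effort x salary x overhead.
estimated_cost = person_years * annual_salary * overhead
relative_difference = abs(estimated_cost - reported_cost) / reported_cost

print(f"recomputed: ${estimated_cost:,.2f}")
print(f"reported:   ${reported_cost:,.2f}")
print(f"relative difference: {relative_difference:.2e}")
```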

4. Conclusions

Red Hat Linux version 6.2 includes well over 17 million physical source lines of code (SLOC). Using the COCOMO cost model, this is estimated to have required over 4,500 person-years of development time. Had this Linux distribution been developed by conventional proprietary means, it's estimated that it would have cost over $600 million to develop in the U.S. (in year 2000 dollars). Clearly, this demonstrates that it is possible to build large-scale systems using open source approaches.

Many other interesting statistics emerge. The largest components (in order) were the linux kernel (including device drivers), the X-windows server (for the graphical user interface), gcc (a compilation system, with the package name of ``egcs''), and emacs (a text editor and far more). The languages used, sorted by the most lines of code, were C, C++, LISP (including Emacs' LISP and Scheme), shell (including ksh), Perl, Tcl (including expect), assembly (all kinds), Python, yacc/bison, Java, lex/flex, awk, objective-C, C-shell, Ada, Pascal, sed, and Fortran. Here you can see that C is pre-eminent (with over 80% of the code). More information is available in the appendices and at http://www.dwheeler.com/sloc.

It would be interesting to re-run these values on other Linux distributions (such as SuSE and Debian), other open source systems (such as FreeBSD), and other versions of Red Hat (such as Red Hat 7). SuSE and Debian, for example, by policy include many more packages, and would probably produce significantly larger estimates of effort and development cost. It's known that Red Hat 7 includes more source code; Red Hat 7 has had to add another CD-ROM to contain the binary programs, and adds such capabilities as a word processor (abiword) and secure shell (openssh).

Some actions by developers could simplify further similar analyses. The most important would be for programmers to always mark, at the top, any generated files (e.g., with a phrase like ``Automatically generated''). This would do much more than aid counting tools - programmers are likely to accidentally manually edit such files unless the files are clearly marked as files that should not be edited. It would be useful if developers would use file extensions consistently and not ``reuse'' extension names for other meanings; the suffixes(7) manual page lists a number of already-claimed extensions. This is more difficult for less-used languages; many developers have no idea that ``.m'' is a standard extension for objective-C. It would also be nice to have high-quality open source tools for performing logical SLOC counting on all of the languages represented here.

It should be re-emphasized that these are estimates; it is very difficult to precisely categorize all files, and some files might confuse the size estimators. Some assumptions had to be made (such as not including makefiles) which, if made differently, would produce different results. Identifying automatically-generated files is very difficult, and it's quite possible that some were miscategorized.

Nevertheless, there are many insights to be gained from the analysis of entire open source systems, and hopefully this paper has provided some of those insights. It is my hope that, since open source systems make it possible for anyone to analyze them, others will pursue many other lines of analysis to gain further insight into these systems.


Appendix A. Details of Approach

My basic approach was to:
  1. install the source code files,
  2. categorize the files, creating for each package a list of files for each programming language; each file in each list contains source code in that language (excluding duplicate file contents and automatically generated files),
  3. count the lines of code for each language for each component, and
  4. use the original COCOMO model to estimate the effort to develop each component, and then the cost to develop using traditional methods.

This was not as easy as it sounds; each step is described below. Some steps I describe in some detail, because it's sometimes hard to find the necessary information even when the actual steps are easy. Hopefully, this detail will make it easier for others to do similar activities or to repeat the experiment.
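
Step 4 can be illustrated with a small Python sketch. It assumes the Basic COCOMO organic-mode equation, effort = 2.4 * KSLOC^1.05 person-months; this appendix excerpt does not state which mode or coefficients were used, so treat them as an assumption. The function names are hypothetical. The salary and overhead values come from section 3.6, and the bash-1.14.7 size is the one quoted in section A.1.

```python
def cocomo_person_months(sloc, a=2.4, b=1.05):
    """Basic COCOMO effort in person-months.

    a and b are ASSUMED organic-mode coefficients, not taken from this text.
    """
    return a * (sloc / 1000.0) ** b


def development_cost(sloc, annual_salary=56286, overhead=2.4):
    """Cost in dollars, using the salary and overhead figures from section 3.6."""
    person_years = cocomo_person_months(sloc) / 12.0
    return person_years * annual_salary * overhead


bash_sloc = 47067  # bash-1.14.7, as reported in section A.1
print(f"{cocomo_person_months(bash_sloc):.1f} person-months")
print(f"${development_cost(bash_sloc):,.0f}")
```

Because the exponent is greater than 1, estimating each component separately and summing the results gives a smaller total than applying the same equation once to the combined line count.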

A.1 Installing Source Code

Installing the source code files turned out to be nontrivial. First, I inserted the CD-ROM containing all of the source files (in ``.src.rpm'' format) and installed the packages (files) using:

  mount /mnt/cdrom
  cd /mnt/cdrom/SRPMS
  rpm -ivh *.src.rpm

This installs ``spec'' files and compressed source files; another rpm command (``rpm -bp'') uses the spec files to uncompress the source files into ``build directories'' (as well as apply any necessary patches). Unfortunately, the rpm tool does not enforce any naming consistency between the package names, the spec names, and the build directory names; for consistency this paper will use the names of the build directories, since all later tools based themselves on the build directories.

I decided to (in general) not count ``old'' versions of software (usually placed there for compatibility reasons), since that would be counting the same software more than once. Thus, the following components were not included: ``compat-binutils'', ``compat-egcs'', ``compat-glib'', ``compat-libs'', ``gtk+10'', ``libc-5.3.12'' (an old C library), ``libxml10'', ``ncurses3'', and ``qt1x''. I also didn't include egcs64-19980921 and netscape-sparc, which simply repeated something on another architecture that was available on the i386 in a different package. I did make one exception. I kept both bash-1.14.7 and bash2, two versions of the shell command processor, instead of only counting bash2. While bash2 is the later version of the shell available in the package, the main shell actually used by the Red Hat distribution was the older version of bash. The rationale for this decision appears to be backwards compatibility for older shell scripts; this is suggested by the Red Hat package documentation in both bash-1.14.7 and bash2. It seemed wrong to not include one of the most fundamental pieces of the system in the count, so I included it. At 47067 lines of code (ignoring duplicates), bash-1.14.7 is one of the smaller components anyway. Not including this older component would not substantively change the results presented here.

There are two directories, krb4-1.0 and krb5-1.1.1, which appear to violate this rule - but don't. krb5-1.1.1 is the build directory created by krb5.spec, which is in turn installed by the source RPM package krb5-1.1.1-9.src.rpm. This build directory contains Kerberos V5, a trusted-third-party authentication system. The source RPM package krb5-1.1.1-9.src.rpm eventually generates the binary RPM files krb5-configs-1.1.1-9, krb5-libs-1.1.1-9, and krb5-devel-1.1.1-9. You might guess that ``krb4-1.0'' is just the older version of Kerberos, but this build directory is created by the spec file krbafs.spec and not just an old version of the code. To quote its description, ``This is the Kerberos to AFS bridging library, built against Kerberos 5. krbafs is a shared library that allows programs to obtain AFS tokens using Kerberos IV credentials, without having to link with official AFS libraries which may not be available for a given platform.'' For this situation, I simply counted both packages, since their purposes are different.

I was then confronted with a fundamental question: should I count software that only works for another architecture? I was using an i86-type system, but some components are only for Alpha or Sparc systems. I decided that I should count them; even if I didn't use the code today, the ability to use these other architectures in the future was of value and certainly required effort to develop.

This caused complications for creating the build directories. If all installed packages fit the architecture, you can install the uncompressed software by typing:

  cd /usr/src/redhat/SPECS
  rpm -bp *.spec

Unfortunately, the rpm tool notes that you're trying to load code for the ``wrong'' architecture, and (at least at the time) there was no simple ``override'' flag. Instead, I had to identify each package as belonging to SPARC or ALPHA, and then use the rpm option --target to forcibly load them. For example, I renamed all SPARC-specific spec files to end in ``.sparc'' and could then load them with:

  rpm -bp --target sparc-redhat-linux *.spec.sparc

The following spec files were non-i86: (sparc) audioctl, elftoaout, ethtool, prtconf, silo, solemul, sparc32; (alpha) aboot, minlabel, quickstrip. In general, these were tools to aid in supporting some part of the boot process or for using system-specific hardware.

Note that not all packages create build directories. For example, ``anonftp'' is a package that, when installed, sets up an anonymous ftp system. This package doesn't actually install any software; it merely installs a specific configuration of another piece of software (and unsets the configuration when uninstalled). Such packages are not counted at all in this sizing estimate.

Simply loading all the source code requires a fair amount of disk space. Using ``du'' to measure the disk space requirements (with 1024 byte disk blocks), I obtained the following results:

  $ du -s /usr/src/redhat/BUILD /usr/src/redhat/SOURCES /usr/src/redhat/SPECS
  2375928	/usr/src/redhat/BUILD
  592404	/usr/src/redhat/SOURCES
  4592	/usr/src/redhat/SPECS

Thus, these three directories required 2972924 1K blocks - approximately 3 gigabytes of space. Much more space would be required to compile it all.

A.2 Categorizing Source Code

My next task was to identify all files containing source code (not including any automatically generated source code). This is a non-trivial problem; there are 181,679 ordinary files in the build directory, and I had no interest in doing this identification by hand.

In theory, one could just look at the file extensions (.c for C, .py for python), but this is not enough in practice. Some packages reuse extensions if the package doesn't use that kind of file (e.g., the ``.exp'' extension of expect was used by some packages as ``export'' files, and the ``.m'' of objective-C was used by some packages for module information extracted from C code). Some files don't have extensions, particularly scripts. And finally, files automatically generated by another program should not be counted, since I wished to use the results to estimate effort.

I ended up writing a program of over 600 lines of Perl to perform this identification, which used a number of heuristics to place each file into a category. There is a category for each language, plus the categories non-programs, unknown (useful for scanning for problems), automatically generated program files, duplicate files (whose file contents duplicated other files), and zero-length files.

The program first checked for well-known extensions (such as .gif) that cannot be program files, and for a number of common generated filenames. It then peeked at the first line for "#!" followed by a legal script name. If that didn't work, it used the extension to try to determine the category. For a number of languages, the extension was not reliable, so for those languages the categorization program examined the file contents and used a set of heuristics to determine if the file actually belonged to that category. If all else failed, the file was placed in the ``unknown'' category for later analysis. I later looked at the ``unknown'' items, checking the common extensions to ensure I had not missed any common types of code.
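
The order of these checks can be sketched in Python as follows. This is only an illustration, not the Perl program itself: the tables are hypothetical and heavily abbreviated (only ``.gif'' comes from the text), the handling of ``env'' is an assumption, and the generated-filename and file-content checks are omitted.

```python
import os

# Hypothetical, abbreviated lookup tables; the real program covered far more cases.
NON_PROGRAM_EXTENSIONS = {".gif", ".png", ".html", ".txt"}
SCRIPT_INTERPRETERS = {"sh": "sh", "bash": "sh", "ksh": "sh", "csh": "csh",
                       "tcsh": "csh", "perl": "perl", "python": "python",
                       "tclsh": "tcl", "awk": "awk"}
EXTENSION_TO_CATEGORY = {".c": "ansic", ".cpp": "cpp", ".py": "python",
                         ".pl": "perl", ".sh": "sh", ".tcl": "tcl"}


def category_from_shebang(first_line):
    """Return a category for a '#!' line, or None if it names no known interpreter."""
    if not first_line.startswith("#!"):
        return None
    words = first_line[2:].split()
    if not words:
        return None
    interpreter = os.path.basename(words[0])
    if interpreter == "env" and len(words) > 1:  # e.g. '#!/usr/bin/env python'
        interpreter = os.path.basename(words[1])
    return SCRIPT_INTERPRETERS.get(interpreter)


def categorize(filename, first_line):
    """Apply the checks in order: non-program extension, '#!' line, extension."""
    extension = os.path.splitext(filename)[1]
    if extension in NON_PROGRAM_EXTENSIONS:
        return "not"
    category = category_from_shebang(first_line)
    if category:
        return category
    return EXTENSION_TO_CATEGORY.get(extension, "unknown")


print(categorize("install", "#!/bin/sh"))
print(categorize("data.xyz", ""))
```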

One complicating factor was that I wished to separate C, C++, and objective-C code, but a header file ending with ``.h'' or ``.hpp'' could be any of them. I developed a number of heuristics to determine, for each file, what language it belonged to. For example, if a build directory has exactly one of these languages, determining the correct category for header files is easy. Similarly, if there is exactly one of these in the directory with the header file, it is presumed to be that kind. Finally, a header file with the keyword ``class'' is almost certainly not a C header file, but a C++ header file.
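
A minimal Python sketch of the three header heuristics just described. The function name and the final fallback to C are assumptions (the text does not say what happens when none of the rules apply), and the naive search would also match the word ``class'' inside a comment.

```python
import re

C_FAMILY = {"ansic", "cpp", "objc"}


def classify_header(header_text, languages_in_directory, languages_in_package):
    """Guess the language of a .h or .hpp file.

    The two set arguments hold the category names already found among the
    non-header files of the header's directory and of its build directory.
    """
    # Rules 1 and 2: exactly one C-family language in the package, or failing
    # that, in the header's own directory.
    for scope in (languages_in_package, languages_in_directory):
        candidates = scope & C_FAMILY
        if len(candidates) == 1:
            return next(iter(candidates))
    # Rule 3: the keyword 'class' strongly suggests C++.
    if re.search(r"\bclass\b", header_text):
        return "cpp"
    return "ansic"  # ASSUMED fallback, not stated in the text


print(classify_header("class Foo {};", {"ansic", "cpp"}, {"ansic", "cpp"}))
```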

Detecting automatically generated files was not easy, and it's quite conceivable I missed a number of them. The first 15 lines were examined, to determine if any of them included at the beginning of the line (after spaces and possible comment markers) one of the following phrases: ``generated automatically'', ``automatically generated'', ``this is a generated file'', ``generated with the (something) utility'', or ``do not edit''. A number of filename conventions were used, too. For example, any ``configure'' file is presumed to be automatically generated if there's a ``configure.in'' file in the same directory.
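
The phrase check can be approximated in Python as below. The set of leading comment markers, the case-insensitive matching, and the treatment of ``(something)'' as any run of characters are all assumptions made for illustration; the function names are hypothetical.

```python
import re

# The phrases are those listed above; everything else in the pattern is assumed.
GENERATED_PHRASES = re.compile(
    r"^[\s#/*;!%\-]*"                 # assumed leading spaces and comment markers
    r"(generated automatically"
    r"|automatically generated"
    r"|this is a generated file"
    r"|generated with the .+ utility"
    r"|do not edit)",
    re.IGNORECASE,
)


def looks_generated(lines, max_lines=15):
    """Return True if one of the first max_lines lines starts with a marker phrase."""
    return any(GENERATED_PHRASES.match(line) for line in lines[:max_lines])


def configure_is_generated(filenames_in_directory):
    """A 'configure' file is presumed generated when 'configure.in' sits beside it."""
    return ("configure" in filenames_in_directory
            and "configure.in" in filenames_in_directory)


print(looks_generated(["/* Automatically generated by foo */", "int x;"]))
```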

To eliminate duplicates, the program kept md5 checksums of each program file. Any given md5 checksum would only be counted once. Build directories were processed alphabetically, so this meant that if the same file content was in both directories ``a'' and ``b'', it would be counted only once as being part of ``a''. Thus, some packages with names later in the alphabet may appear smaller than would make sense at first glance. It is very difficult to eliminate ``almost identical'' files (e.g., an older and newer version of the same code, included in two separate packages), because it is difficult to determine when two ``similar'' files are essentially the ``same'' file. Changes such as the use of pretty-printers and massive renaming of variables could make small changes seem large, while the many small files in the system could easily make different files seem the ``same.'' Thus, I did not try to make such a determination, and just considered files with different contents as different.
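
The duplicate rule is simple enough to sketch in Python. The in-memory data layout and the function name are hypothetical, and files are held in memory purely to keep the example self-contained.

```python
import hashlib


def unique_files(build_dirs):
    """Return {directory: [filenames counted]} after removing duplicates.

    build_dirs maps a build directory name to {filename: file contents as bytes}.
    Directories are visited alphabetically, so content already seen in an
    earlier directory is skipped in every later one.
    """
    seen_digests = set()
    counted = {}
    for directory in sorted(build_dirs):
        counted[directory] = []
        for filename in sorted(build_dirs[directory]):
            digest = hashlib.md5(build_dirs[directory][filename]).hexdigest()
            if digest in seen_digests:
                continue  # identical content was already counted elsewhere
            seen_digests.add(digest)
            counted[directory].append(filename)
    return counted


dirs = {
    "b": {"x.c": b"int x;\n"},
    "a": {"y.c": b"int x;\n", "z.c": b"int z;\n"},
}
print(unique_files(dirs))
```

In this example ``b'' ends up with no counted files, because its only file duplicates one already counted under ``a''.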

It's important to note that different rules could be used to ``count'' lines of code. Some kinds of code were intentionally excluded from the count. Many RPM packages include a number of shell commands used to install and uninstall software; the estimate in this paper does not include the code in RPM packages. This estimate also does not include the code in Makefiles (which can be substantive). In both cases, the code is often cut and pasted from other similar files, so counting such code would probably overstate the actual development effort. In addition, Makefiles are often automatically generated.

On the other hand, this estimate does include some code that others might not count. This estimate includes test code included with the package, which isn't visible directly to users (other than hopefully higher quality of the executable program). It also includes code not used in this particular system, such as code for other architectures and OS's, bindings for languages not compiled into the binaries, and compilation-time options not chosen. I decided to include such code for two reasons. First, this code validly represents the effort to build each component. Second, it does represent indirect value to the user, because the user can later use those components in other circumstances even if the user doesn't choose to do so by default.

So, after the work of categorizing the files, the following categories of files were created for each build directory (common extensions are shown in parentheses, and the names used in the data tables below are shown in brackets):

  1. C (.c) [ansic]
  2. C++ (.C, .cpp, .cxx, .cc) [cpp]
  3. LISP (.el, .scm, .lsp, .jl) [lisp]
  4. shell (.sh) [sh]
  5. Perl (.pl, .pm, .perl) [perl]
  6. Assembly (.s, .S, .asm) [asm]
  7. TCL (.tcl, .tk, .itk) [tcl]
  8. Python (.py) [python]
  9. Yacc (.y) [yacc]
  10. Java (.java) [java]
  11. Expect (.exp) [exp]
  12. lex (.l) [lex]
  13. awk (.awk) [awk]
  14. Objective-C (.m) [objc]
  15. C shell (.csh) [csh]
  16. Ada (.ada, .ads, .adb) [ada]
  17. Pascal (.p) [pascal]
  18. sed (.sed) [sed]
  19. Fortran (.f) [fortran]

Note that we're counting Scheme as a dialect of LISP, and Expect is being counted separately from TCL. The command line shells Bourne shell, the Bourne-again shell (bash), and the K shell are all counted together as ``shell'', but the C shell (csh and tcsh) is counted separately.

A.3 Counting Lines of Code

Every language required its own counting scheme. This was more complex than I realized; there were a number of languages involved.

I originally tried to use USC's ``CodeCount'' tools to count the code. Unfortunately, this turned out to be buggy and did not handle most of the languages used in the system, so I eventually abandoned it for this task and wrote my own tools. Those who wish to use this tool are welcome to do so; you can learn more from its web site at http://sunset.usc.edu/research/CODECOUNT.

I did manage to use CodeCount to compute the logical source lines of code for the C portions of the linux kernel. This came out to be 673,627 logical source lines of code, compared to the 1,462,165 lines of physical code (again, this ignores files with duplicate contents).

Since there were a large number of languages to count, I used the ``physical lines of code'' definition. In this definition, a line of code is a line (ending with newline or end-of-file) with at least one non-comment non-whitespace character. These are known as ``non-comment non-blank'' lines. If a line only had whitespace (tabs and spaces) it was not counted, even if it was in the middle of a data value (e.g., a multiline string). It is much easier to write programs to measure this value than to measure the ``logical'' lines of code, and this measure can be easily applied to widely different languages. Since I had to process a large number of different languages, it made sense to choose the measure that is easier to obtain.
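
For a language whose comments run from a marker to the end of the line, this definition reduces to a few lines of Python. The sketch below is deliberately naive and is not one of the counting tools described here: it treats the marker as a comment start anywhere on the line, so a ``#'' inside a string literal is miscounted.

```python
def count_physical_sloc(text, comment_prefix="#"):
    """Count lines with at least one non-comment, non-whitespace character.

    Naive: comment_prefix is assumed to start a comment wherever it appears,
    even inside a string literal.
    """
    count = 0
    for line in text.splitlines():
        code = line.split(comment_prefix, 1)[0]  # drop any trailing comment
        if code.strip():                         # anything left besides whitespace?
            count += 1
    return count


sample = "#!/bin/sh\n\n# say hello\necho hello   # greet\n   \nexit 0"
print(count_physical_sloc(sample))
```

Languages with block comments or multi-line strings (such as C) need a small state machine rather than a per-line split.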

Park [1992] presents a framework of issues to be decided when trying to count code. Using Park's framework, here is how code was counted in this paper:

  1. Statement Type: I used a physical line-of-code as my basis. I included executable statements, declarations (e.g., data structure definitions), and compiler directives (e.g., preprocessor commands such as #define). I excluded all comments and blank lines.
  2. How Produced: I included all programmed code, including any files that had been modified. I excluded code generated with source code generators, converted with automatic translators, and those copied or reused without change. If a file was in the source package, I included it; if the file had been removed from a source package (including via a patch), I did not include it.
  3. Origin: I included all code included in the package.
  4. Usage: I included code in or part of the primary product; I did not include code external to the product (i.e., additional applications able to run on the system but not included with the system).
  5. Delivery: I counted code delivered as source; not surprisingly, I didn't count code not delivered as source. I also didn't count undelivered code.
  6. Functionality: I included both operative and inoperative code. An example of intentionally ``inoperative'' code is code turned off by #ifdef commands; since it could be turned on for special purposes, it made sense to count it. An example of unintentionally ``inoperative'' code is dead or unused code.
  7. Replications: I included master (original) source statements. I also included ``physical replicates of master statements stored in the master code''. This is simply code cut and pasted from one place to another to reuse code; it's hard to tell where this happens, and since it has to be maintained separately, it's fair to include this in the measure. I excluded copies inserted, instantiated, or expanded when compiling or linking, and I excluded postproduction replicates (e.g., reparameterized systems).
  8. Development Status: Since I only measured code included in the packages used to build the delivered system, I declared that all software I was measuring had (by definition) passed whatever ``system tests'' were required by that component's developers.
  9. Languages: I included all languages, as identified earlier in section A.2.
  10. Clarifications: I included all statement types. This included nulls, continues, no-ops, lone semicolons, statements that instantiate generics, lone curly braces ({ and }), and labels by themselves.

Park includes in his paper a ``basic definition'' of physical lines of code, defined using his framework. I adhered to Park's definition unless (1) it was impossible in my technique to do so, or (2) it would appear to make the result inappropriate for use in cost estimation (using COCOMO). COCOMO states that source code:

``includes all program instructions created by project personnel and processed into machine code by some combination of preprocessors, compilers, and assemblers. It excludes comment cards and unmodified utility software. It includes job control language, format statements, and data declarations. Instructions are defined as lines of code.''

In summary, though in general I followed Park's definition, I didn't follow Park's ``basic definition'' in the following ways:

  1. How Produced: I excluded code generated with source code generators, converted with automatic translators, and those copied or reused without change. After all, COCOMO states that the only code that should be counted is code ``created by project personnel'', whereas these kinds of files are instead the output of ``preprocessors and compilers.'' If code is always maintained as the input to a code generator, and then the code generator is re-run, it's only the code generator input's size that validly measures the size of what is maintained. Note that while I attempted to exclude generated code, this exclusion is based on heuristics which may have missed some cases.
  2. Origin: Normally physical SLOC doesn't include an unmodified ``vendor-supplied language support library'' nor a ``vendor-supplied system or utility''. However, in this case this was exactly what I was measuring, so I naturally included these as well.
  3. Delivery: I didn't count code not delivered as source. After all, since I didn't have it, I couldn't count it.
  4. Functionality: I included unintentionally inoperative code (e.g., dead or unused code). There might be such code, but it is very difficult to automatically detect in general for many languages. For example, a program not directly invoked by anything else nor installed by the installer is much more likely to be a test program, which I'm including in the count. Clearly, discerning human ``intent'' is hard to automate. Hopefully, unintentionally inoperative code is a small amount of the total delivered code.

Otherwise, I followed Park's ``basic definition'' of a physical line of code, even down to Park's language-specific definitions where Park defined them for a language.

One annoying problem was that one file wasn't syntactically correct and it affected the count. File /usr/src/redhat/BUILD/cdrecord-1.8/mkiso had an #ifdef not taken, and the road not taken had a missing double-quote mark before the word ``cannot'':

 #ifdef  USE_LIBSCHILY
         comerr(Cannot open '%s'.\n", filename);
 #endif
       perror ("fopen");
       exit (1);
 #endif

I solved this by hand-patching the source code (for purposes of counting). There were also some files with intentionally erroneous code (e.g., compiler error tests), but these did not impact the SLOC count.

Several languages turn out to be non-trivial to count: