Counting Source Lines of Code (SLOC)
My latest size-estimation paper is
``More than a Gigabuck: Estimating GNU/Linux's Size''
(June 2001),
which presents my latest GNU/Linux
size estimates, approach, and analysis.
Here are a few interesting facts quoted from the paper
(which measures Red Hat Linux 7.1):
-
It would cost over $1 billion (a Gigabuck)
to develop this Linux distribution by conventional proprietary means
in the U.S. (in year 2000 U.S. dollars).
- It includes over 30 million physical source lines of code (SLOC).
- It would have required about 8,000 person-years of
development time, as determined using the widely-used basic COCOMO model.
- Red Hat Linux 7.1 represents over a 60% increase in
size, effort, and traditional development costs over Red Hat Linux 6.2
(which was released about one year earlier).
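To make the effort figure above more concrete, here is a minimal sketch of a basic COCOMO effort calculation. It assumes the textbook organic-mode coefficients (2.4 and 1.05); the paper's exact parameters and grouping are not restated on this page, and the component sizes below are hypothetical, so do not expect this sketch to reproduce the paper's numbers.

```python
# Basic COCOMO effort sketch. The coefficients below are the textbook
# organic-mode values; they are an assumption here, not taken from the paper.
ORGANIC_A = 2.4   # person-months per KSLOC**b
ORGANIC_B = 1.05  # exponent; > 1 means effort grows faster than size


def basic_cocomo_effort(sloc, a=ORGANIC_A, b=ORGANIC_B):
    """Estimated effort in person-months for `sloc` physical source lines."""
    ksloc = sloc / 1000.0
    return a * ksloc ** b


def effort_by_component(component_sloc, a=ORGANIC_A, b=ORGANIC_B):
    """Sum of separate estimates, one per component."""
    return sum(basic_cocomo_effort(s, a, b) for s in component_sloc)


def person_years(person_months):
    """Convert person-months to person-years."""
    return person_months / 12.0


# Hypothetical component sizes, for illustration only.
components = [2_000_000, 500_000, 40_000]
whole = person_years(basic_cocomo_effort(sum(components)))
parts = person_years(effort_by_component(components))
print(f"as one system: {whole:,.0f} person-years")
print(f"per component: {parts:,.0f} person-years")
```

Because the exponent is greater than 1, estimating everything as one system gives a larger figure than summing separate per-component estimates, so the way the code is partitioned affects the result.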
Many other interesting statistics emerge; here are a few:
-
The largest components (in order) were the
Linux kernel (including device drivers), Mozilla
(Netscape's open source web system including a web browser,
email client, and HTML editor),
the X window system (the infrastructure for the graphical user interface),
gcc (a compilation system),
gdb (for debugging),
basic binary tools,
emacs (a text editor and far more),
LAPACK (a large Fortran library for numerical linear algebra),
the Gimp (a bitmapped graphics editor), and
MySQL (a relational database system).
Note that some projects (in particular KDE and GNOME) are in aggregate
large enough to be one of the largest components, but because they are
developed and distributed as a large number of smaller components,
their totals don't appear in the list of largest components.
-
The languages used, sorted by the most lines of code, were
C (71% - was 81%), C++ (15% - was 8%),
shell (including ksh),
Lisp, assembly, Perl, Fortran, Python, tcl, Java,
yacc/bison, expect, lex/flex, awk, Objective-C, Ada, C shell,
Pascal, and sed.
-
The predominant software license is the GNU GPL.
Slightly over half of the software is simply licensed using the GPL,
and the software packages using the copylefting licenses (the GPL and LGPL),
at least in part or as an alternative, accounted for 63% of the code.
In all ways, the copylefting licenses (GPL and LGPL) are the dominant licenses
in this Linux distribution.
In contrast, only 0.2% of the software is public domain.
You can get:
-
``More than a Gigabuck: Estimating GNU/Linux's Size'',
my latest SLOC analysis paper which analyzes Red Hat Linux 7.1.
You can also get some of the supporting information
(intended for those who want to do further analysis), such as the
complete summary,
summary SLOC analysis of
the Linux 2.4 kernel,
map of build directories
to RPM spec files,
spec summaries,
counts of files, and
detailed file-by-file SLOC counts.
You can also get version 1.0,
version 1.01,
version 1.02,
version 1.03,
version 1.04
or
version 1.05
of the paper.
-
``Estimating Linux's Size,''
the previous paper which analyzes Red Hat Linux 6.2.
Various background files and previous editions are also available.
You can see the
ChangeLog, along with older
versions of the paper
(original paper (version 1.0),
version 1.01,
version 1.02,
version 1.03, and
version 1.04).
You can also see some of the summary data:
SLOC sorted by size,
filecounts,
unsorted SLOC counts,
unsorted SLOC counts with long lines,
and
SLOC counts formatted for computer processing (tab-separated data).
For license information, you can see
the licenses allocated
to each build directory.
If you want to know what a particular package does, you can find out
briefly by looking at the
package (specification
file) descriptions.
- ``Linux Kernel 2.6: It's Worth More!'' does a deeper analysis of the development effort of just the Linux kernel.
When referring to this information, please refer to the URL
http://www.dwheeler.com/sloc.
This is not a legal requirement; of course you are always allowed
to deep link to anything you want to!
This is just a friendly recommendation, since
some of the other URLs may change, and I may add more measurements later.
If you want to get the tools I used, they're available.
I call the set SLOCCount, and you can get SLOCCount at
http://www.dwheeler.com/sloccount.
Here are some testimonials:
- "This is a remarkable piece of work. I'm impressed, and expect to
get good use out of some of the statistics." - Eric S. Raymond
- "I have just read your paper on estimating GNU/Linux size.
BEAUTIFUL PAPER. WONDERFUL. My highest praise for your efforts.
This is really great work.
I enjoyed reading it." - Wesley Strawn
Others have been inspired by my paper
More than a Gigabuck: Estimating GNU/Linux's Size to
do more analysis, which is great:
-
One group did an analysis of the Debian GNU/Linux distribution, using my tool
sloccount.
You can see their very interesting paper
Counting Potatoes: The size of Debian 2.2 at
http://people.debian.org/~jgb/debian-counting,
or you can see an older version of it in
Upgrade.
They found that Debian 2.2 includes more than 55 million physical SLOC, and
would have cost nearly $1.9 billion USD and required over 14,000 person-years
to develop using traditional proprietary techniques.
-
In 2005 they measured Debian again, and reported results in
Measuring Libre Software Using Debian 3.1 (Sarge)
as A Case Study: Preliminary Results.
Debian 3.1 ("Sarge") had grown to about 230 million source lines of code,
with an estimated 60,000 person-years and $8 billion USD redevelopment cost.
This was contained in 8,600 source packages, generating about
15,300 binary packages.
Top languages were C (57%), C++ (16.8%), Shell (9%), LISP (3%), Perl (2.8%),
Python (1.8%), Java (1.6%), FORTRAN (1.2%), PHP (0.93%), Pascal (0.62%),
and Ada (0.61%).
The largest programs (in order of size) were
OpenOffice.org (1.1.3, mostly C++),
the Linux kernel (2.6.8, mostly C),
the web authoring system NVU (0.80, mostly C),
internet suite Mozilla (1.7.7, mostly C++),
compiler suite GCC (3.4.3, mostly C but significant amounts of Ada and C++),
truetype font server XFS-XTT (1.4.1, mostly C),
and XFree86 (4.3.0, mostly C).
-
Another person
analyzed Perl's CPAN library and determined it would have
cost $677 million to develop;
this CPAN analysis was a Slashdot article on July 30, 2004.
-
The Linux Foundation re-performed the analysis in 2008 with Fedora 9, releasing
"Estimating the Total Development Cost of a Linux Distribution".
Here's their press release.
-
Debian developer James Bromberger posted
"Debian Wheezy: US$19 Billion. Your price... FREE!"
in February 2012, where he determined that the newest Debian distribution
("Wheezy") would have cost US$19 billion to develop as
proprietary software.
This was picked up in the news article
"Perth coder finds new Debian 'worth' $18 billion" by Liam Tung,
IT News, February 14, 2012.
Comparative numbers are hard to find.
Gary McGraw (of Cigital) has searched public information to find
Windows SLOC size.
According to his sources, Windows NT 5.0 (in 2000) was 20M SLOC,
Windows 2000 (in 2001) was 35M SLOC, and Windows XP (in 2002) was 40M SLOC.
(This information is from his briefing
Building Secure Software: How to avoid security problems
the right way).
Another source claims that
Windows NT's original release (in 1992) contained 4 million lines,
while NT 4.0 (released in 1996) expanded to 16.5 million lines.
("Crash-Proof Computing" by Tom R. Halfhill,
Byte, April 1998).
"This Car Runs on Code" by Robert N. Charette (IEEE Spectrum, 2009-02-01)
stated that "It takes dozens of microprocessors running 100 million lines
of code to get a premium car out of the driveway, and this software is
only going to get more complex".
"Codebases" at Information is Beautiful creates
an interesting visualization of various lines-of-code numbers.
"Lines of code" is a Google Docs
spreadsheet of various program sizes, with URLs to the information sources.
Palle Pedersen
has done a rough-order-of-magnitude analysis of all
Free-libre / open source software,
starting with some extremely simplifying assumptions.
"Assuming an average open source project is 35,000 lines
of code and the average cost of a software developer is
$30/hour (~$60,000/year), a simple COCOMO II calculator tells us
that the average open source project costs $630,000 to develop.
This cost translates into $18 per line of code.
Extrapolating that to 1.7 billion lines of code gives
us an estimated value of $30.6 billion/year...
if the open source community was a country with a GDP of $30.6 billion,
it would rank 77 right between Bulgaria and Lithuania...
putting the open source community ahead of most countries in the world...
Such an economic force should not be underestimated, and this is
yet another indication that open source has become a significant part
[of] the technology world."
The specific number may be significantly off (no one knows),
but I think the conclusion (OSS has become a significant part) is spot-on.
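The arithmetic in the quote above can be checked directly. This is a minimal sketch that only reproduces the quoted figures; the constant names are mine, and the 2,000 working hours per year is simply what the quoted $30/hour and ~$60,000/year imply.

```python
# Figures taken from the quoted passage; the names are illustrative.
PROJECT_SLOC = 35_000          # assumed average project size
PROJECT_COST_USD = 630_000     # quoted COCOMO II result per project
TOTAL_SLOC = 1_700_000_000     # quoted total lines of open source code
HOURLY_RATE_USD = 30
YEARLY_SALARY_USD = 60_000

# $630,000 / 35,000 lines = $18 per line
cost_per_line = PROJECT_COST_USD // PROJECT_SLOC

# $18 per line * 1.7 billion lines = $30.6 billion
total_value_usd = cost_per_line * TOTAL_SLOC

# $60,000 / $30 per hour = 2,000 paid hours per year
hours_per_year = YEARLY_SALARY_USD // HOURLY_RATE_USD

print(cost_per_line, total_value_usd, hours_per_year)
```

The result scales linearly with both the assumed cost per line and the assumed total line count, which is why the specific number is so sensitive to those starting assumptions.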
A post by agenaille on reddit claims that the
web application healthcare.gov
is roughly 3.7 million lines of code
(including HTML, CSS, and XML that is arguably not code).
I have not found a way to independently verify this.
"The Total Growth of Open Source"
by Amit Deshpande and Dirk Riehle (2008)
analyzed a set of over 5000 FLOSS projects, and found that they were
growing at an exponential rate.
Indeed, their 2008 results were that the
"total amount of source code and the total number of projects double
about every 14 months".
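To get a feel for what doubling about every 14 months means, here is a minimal sketch that converts a doubling time into a growth factor over any period. It assumes smooth exponential growth at exactly the quoted rate; the function name is mine.

```python
def growth_factor(months, doubling_months=14.0):
    """Multiplicative growth over `months`, given a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


# Growth over one year and over five years at the quoted doubling time.
print(f"1 year:  x{growth_factor(12):.2f}")
print(f"5 years: x{growth_factor(60):.1f}")
```

Under that assumption, one year of growth is a factor of roughly 1.8, that is, about 80% more code and projects per year.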
There are lots of related statistics.
For example, the
TIOBE Programming Community Index (TPCI)
tracks the popularity of programming languages.
Wikipedia: Size in volumes estimates the size of Wikipedia in volumes
(hint: it's gigantic).
Remember,
there's more to a program than how many lines of code it has, as the August 26, 2003 Dilbert strip shows.
You can also view
my home page
(http://www.dwheeler.com), or related pages such as my pages on
"Why
open source software / free software (OSS/FS)? Look at the Numbers!", my
open source
software / free software references, and
how to write
secure programs.
This site is hosted by Webframe.org.