8.3. Handle Metacharacters

Many systems, such as SQL interpreters and the command line shell, have metacharacters, that is, characters in their input that are not interpreted as data. Such characters might be commands, or might delimit data from commands or other data. If there’s a language specification for that system’s interface that you’re using, then it certainly has metacharacters. If your program invokes those other systems and allows attackers to insert such metacharacters, the usual result is that an attacker can completely control your program.

8.3.1. SQL injection

Most database systems include a language that can let you create arbitrary queries, and typically many other functions too. The SQL language is especially common, and many other languages for databases are similar to the SQL language.

SQL and its related languages, unsurprisingly, include metacharacters. When an attacker supplies input containing metacharacters that the SQL interpreter then acts on, it’s often called SQL injection. Even if the language is technically not SQL, an attack on a language for a database system is typically still called a SQL injection attack. There are many ways to trigger SQL injection attacks; attackers can insert single or double quotes, semicolons (which act as command separators), "--" which is a comment token, and so on. See SPI Dynamics’ paper “SQL Injection: Are your Web Applications Vulnerable?” for further discussion on this.

Perhaps the best single approach for countering SQL injection is prepared statements. Prepared statements allow programmers to identify placeholders; a pre-existing library then escapes the data supplied for each placeholder properly for that specific implementation. This approach has many advantages. First, since the library does the escaping for you, the code is simpler and more likely to be right. Second, it tends to produce easier-to-maintain code, since the code tends to be easier to read. Prepared statements are especially important when dealing with SQL, because different SQL engines have different syntax rules.
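As a minimal sketch of the placeholder idea, the following uses Python's standard sqlite3 module. The table, the column names, and the find_user helper are hypothetical, and the "?" placeholder marker is the one sqlite3 uses; other database libraries may use a different marker.

```python
import sqlite3

def find_user(conn, username):
    # "?" marks a placeholder. The library supplies the value separately,
    # so quotes or semicolons in username are treated as data, not as SQL.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (username,))
    return cur.fetchall()

# Hypothetical in-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

hostile = "alice' OR '1'='1"
print(find_user(conn, "alice"))   # [(1, 'alice')]
print(find_user(conn, hostile))   # []
```

Had the query been built by string concatenation instead, the hostile value would have changed the WHERE clause and matched every row.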

There are other approaches, of course. You can write your own escape code, but this is difficult to get correct, and typically a waste of time since there are usually existing libraries to do the job. You can also use stored procedures, which can help prevent SQL injection.

There are other solutions that limit inputs. Different SQL implementations have different metacharacters, so blacklisting is an even worse idea than usual for countering SQL injection. As discussed in Chapter 5, define a very limited pattern and only allow data matching that pattern to enter; if you limit your pattern to ^[0-9]$ or ^[0-9A-Za-z]*$ then you won’t have a problem. If you must handle data that may include SQL metacharacters, a good approach is to convert it (as early as possible) to some other encoding before storage, e.g., HTML encoding (in which case you’ll need to encode any ampersand characters too). Also, prepend and append a quote to all user input, even if the data is numeric; that way, insertions of white space and other kinds of data won’t be as dangerous.
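The limited-pattern approach might be sketched as follows in Python. The is_acceptable name and the 1 to 32 character length limit are assumptions made for illustration; the character class is the alphanumeric one mentioned above.

```python
import re

# Hypothetical policy: ASCII letters and digits only, 1 to 32 characters.
ALLOWED = re.compile(r"[0-9A-Za-z]{1,32}")

def is_acceptable(value):
    # fullmatch requires the entire string to match the pattern. In Python,
    # "$" used with re.match would also accept a trailing newline.
    return ALLOWED.fullmatch(value) is not None

print(is_acceptable("abc123"))     # True
print(is_acceptable("abc'; --"))   # False
print(is_acceptable("abc\n"))      # False
```

Anchoring details differ between regular expression libraries, so it is worth testing a pattern against inputs that end in a newline before relying on it.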

8.3.2. Shell injection

Many metacharacter problems involve shell metacharacters. An attack that tries to exploit a vulnerability in shell metacharacter processing is called a shell injection attack. For example, the standard Unix-like command shell (typically stored in /bin/sh) interprets a number of characters specially. If these characters are sent to the shell, then their special interpretation will be used unless escaped; this fact can be used to break programs. According to the WWW Security FAQ [Stein 1999, Q37], these metacharacters are:

& ; ` ' \ " | * ? ~ < > ^ ( ) [ ] { } $ \n \r

The # character is a comment character, and thus is also a metacharacter. The separator values can be changed by setting the IFS environment variable, but if you can’t trust the source of this variable you should have thrown it out or reset it anyway as part of your environment variable processing.

Unfortunately, in real life this isn’t a complete list; other characters can be problematic too.

Forgetting one of these characters can be disastrous; for example, many programs omit backslash as a shell metacharacter [rfp 1999]. As discussed in Chapter 5, some recommend immediately escaping at least all of these characters when they are input.

So simply creating a list of characters that are forbidden is a bad idea (because that is a blacklist). Instead, identify the characters that are acceptable, and then forbid or correctly escape all others (a whitelist).

What makes the shell metacharacters particularly pervasive is that several important library calls, such as popen(3) and system(3), are implemented by calling the command shell, meaning that they will be affected by shell metacharacters too. Similarly, execlp(3) and execvp(3) may cause the shell to be called. Many guidelines suggest avoiding popen(3), system(3), execlp(3), and execvp(3) entirely and using execve(2) directly in C when trying to spawn a process [Galvin 1998b]. At the least, avoid using system(3) when you can use execve(2); since system(3) uses the shell to expand characters, there is more opportunity for mischief in system(3). In a similar manner the Perl and shell backtick (`) also call a command shell; for more information on Perl see Section 10.2.
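The same principle applies outside C. As an illustrative sketch, Python's subprocess.run does not invoke a shell when given an argument list (and shell=True is not set), so metacharacters in an argument are passed through as data. The echo_argument helper is hypothetical; the child process here is simply the Python interpreter writing back its first argument.

```python
import subprocess
import sys

def echo_argument(untrusted):
    # The command is a list, so no shell parses it; characters such as
    # ";", "|", "`" and "$" reach the child process unchanged.
    result = subprocess.run(
        [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.argv[1])", untrusted],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

hostile = "x; echo injected `id` $(id) | cat"
print(echo_argument(hostile) == hostile)   # True
```

Passing the same string inside a single command line with shell=True would instead let the shell interpret the semicolon, backticks, and pipe.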

8.3.3. Problematic pathnames and filenames

A "pathname" is a sequence of bytes that describes how to find a file system object. On Unix-like systems, a pathname is a sequence of one or more filenames separated by one or more "/". On Windows systems a pathname is more complicated but the idea is the same. In practice, many people use the term "filename" to refer to pathnames.

Unfortunately, pathnames are often at least partly controlled by an untrusted user. For example, it is often useful to use file/directory names as a key to identify relevant data, but this can lead to untrusted users controlling filenames. Another example is the monitoring or managing of shared systems (e.g., virtual machines or containerized filesystems); in this case an untrusted monitoree controls filenames. Even when an attacker should not be able to gain this kind of control, it is often important to counter this kind of problem as a defense-in-depth measure, to counter attackers who gain a small amount of control.

An obvious case is that systems are not supposed to allow redirection outside of some directory (e.g., a "document root" of a web server). If a web application allowed ".", "/", and/or "\", it might be easy to foil that rule. For example, if a program tries to access a path that is a concatenation of "trusted_root_path" and "username", the attacker might be able to create a username "../../../mysecrets" and foil the limitations. As always, use a very limited whitelist for information that will be used to create filenames.
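As a defense-in-depth complement to a whitelist, a program can also resolve the combined path and confirm it still lies under the trusted root. The following Python sketch assumes a hypothetical helper name, resolve_under_root; it is a check made at one moment in time and does not by itself prevent the directory contents from changing afterward.

```python
import os

def resolve_under_root(root, untrusted_name):
    # Resolve symlinks and ".." components first, then confirm that the
    # result is still inside the trusted root directory.
    root_real = os.path.realpath(root)
    candidate = os.path.realpath(os.path.join(root_real, untrusted_name))
    if os.path.commonpath([root_real, candidate]) != root_real:
        raise ValueError("path escapes the permitted directory")
    return candidate
```

With this check, a name such as "../../../mysecrets" raises ValueError, while a plain name such as "alice" resolves to a path beneath the root.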

Microsoft Windows pathnames can be difficult to deal with securely. Windows pathname interpretations vary depending on the version of Windows and the API used (many calls use CreateFile, which supports \\.\, and these interpret pathnames differently than the calls that do not). Perhaps most obviously, "letter:" and "\\server\share..." have a special meaning in Windows. A nastier issue is that there are reserved filenames, whose form depends on the API used and the local configuration. The built-in reserved device names are as follows: CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9. Even worse, drivers can create more reserved names - so you actually cannot know ahead-of-time what names are reserved. You should avoid creating filenames with reserved names, both with and without an extension; if an attacker can trick the program into reading/writing the name (e.g., com1.txt), it may (depending on the API) cause a read or write to a device instead of a file. In this case, even simple alphanumerics can cause disaster and be interpreted as metacharacters - this is a rare situation, since usually alphanumerics are safe. Windows supports "/" as a directory separator, but it conventionally uses "\" as the directory separator (which is annoying because \ is widely used as an escape character). In Windows, don't end a file or directory name with a space or period; the underlying file system may support it, but the Windows shell and user interface generally do not. More info is available at http://msdn.microsoft.com/en-us/library/aa365247.aspx.
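A check for the built-in reserved names listed above might look like the following Python sketch. The function name is hypothetical, and because drivers can add further reserved names, a check like this can only cover the built-in set.

```python
# Built-in reserved device names from the list above.
RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)

def is_risky_windows_name(filename):
    # Names ending in a space or period are discouraged on Windows.
    if filename.endswith((" ", ".")):
        return True
    # Compare the part before the first "." case-insensitively, so that
    # "com1.txt" is caught as well as "COM1".
    stem = filename.split(".", 1)[0].rstrip(" ").upper()
    return stem in RESERVED

print(is_risky_windows_name("com1.txt"))     # True
print(is_risky_windows_name("console.txt"))  # False
```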

Filenames and pathnames on Unix-like systems are not always easy to deal with either. On most Unix-like systems, a filename can be any sequence of bytes that does not include \0 (the terminator) or slash. One common misconception is that Unix filenames are a string of characters. Unix filenames are not a string of one or more characters; they are merely a sequence of bytes, so a filename does not need to be a legal sequence of characters. For example, while it's a common convention to interpret filenames as a UTF-8 encoding of characters, most systems do not actually enforce this. Indeed, they tend to enforce nothing, so many problematic filenames can be created, including filenames with spaces (or only spaces), control characters (including newline, tab, escape, etc.), bytes that are not legal UTF-8, or a leading "-" (the marker for command options). These problematic filenames can cause trouble later. Some potential problems with filenames are specific to the shell, but filename problems are not limited to the shell.

A common problem is that "-" is the option flag on many commands, but it is a legal beginning of a filename. A simple solution is to prefix all globs or filenames where needed with "./" so that they cannot begin with "-". So for example, never use "*.pdf" to refer to a set of PDFs; use "./*.pdf".
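The "./" prefix can be applied programmatically as well. This Python sketch wraps the standard glob module; safe_glob is a hypothetical helper name, and the prefix check assumes "/" as the separator, as on Unix-like systems.

```python
import glob
import os

def safe_glob(pattern):
    # Prefix relative patterns with "./" so that no returned name can
    # begin with "-" and be mistaken for a command option.
    if not os.path.isabs(pattern) and not pattern.startswith("./"):
        pattern = "./" + pattern
    return glob.glob(pattern)
```

If the current directory contains a file named "-rf.pdf", safe_glob("*.pdf") returns it with the "./" prefix attached, so a command receiving the name will treat it as a file rather than as options.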

Be careful about displaying or storing pathnames, since they can include newlines, tabs, escape (which can begin terminal controls), or sequences that are not legal strings. On some systems, merely displaying filenames can invoke terminal controls, which can then run commands with the privilege of the one displaying.

For more detailed information, see “Filenames and Pathnames in Shell: How to do it correctly”.

8.3.4. Other injection issues

A number of programs, especially those designed for human interaction, have “escape” codes that perform “extra” activities. One of the more common (and dangerous) escape codes is one that brings up a command line. Make sure that these “escape” commands can’t be included (unless you’re sure that the specific command is safe). For example, many line-oriented mail programs (such as mail or mailx) use tilde (~) as an escape character, which can then be used to send a number of commands. As a result, apparently-innocent commands such as “mail admin < file-from-user” can be used to execute arbitrary programs. Interactive programs such as vi, emacs, and ed have “escape” mechanisms that allow users to run arbitrary shell commands from their session. Always examine the documentation of programs you call to search for escape mechanisms. It’s best if you call only programs intended for use by other programs; see Section 8.4.

The issue of avoiding escape codes even goes down to low-level hardware components and emulators of them. Most modems implement the so-called “Hayes” command set. Unless the command set is disabled, inducing a delay, the phrase “+++”, and then another delay forces the modem to interpret any following text as commands to the modem instead. This can be used to implement denial-of-service attacks (by sending “ATH0”, a hang-up command) or even to force a user to connect to someone else (a sophisticated attacker could re-route a user’s connection through a machine under the attacker’s control). For the specific case of modems, this is easy to counter (e.g., add "ATS2=255" in the modem initialization string), but the general issue still holds: if you’re controlling a lower-level component, or an emulation of one, make sure that you disable or otherwise handle any escape codes built into them.

Many “terminal” interfaces implement the escape codes of ancient, long-gone physical terminals like the VT100. These codes can be useful, for example, for bolding characters, changing font color, or moving to a particular location in a terminal interface. However, do not allow arbitrary untrusted data to be sent directly to a terminal screen, because some of those codes can cause serious problems. On some systems you can remap keys (e.g., so when a user presses "Enter" or a function key it sends the command you want them to run). On some you can even send codes to clear the screen, display a set of commands you’d like the victim to run, and then send that set “back”, forcing the victim to run the commands of the attacker’s choosing without even waiting for a keystroke. This is typically implemented using “page-mode buffering”. This security problem is why emulated tty’s (represented as device files, usually in /dev/) should only be writeable by their owners and never anyone else - they should never have “other write” permission set, and unless only the user is a member of the group (i.e., the “user-private group” scheme), the “group write” permission should not be set either for the terminal [Filipski 1986]. If you’re displaying data to the user at a (simulated) terminal, you probably need to filter out all control characters (characters with values less than 32) from data sent back to the user unless they’re identified by you as safe. If worse comes to worst, you can identify tab and newline (and maybe carriage return) as safe, removing all the rest. Characters with their high bits set (i.e., values greater than 127) are in some ways trickier to handle; some old systems implement them as if they weren’t set, but simply filtering them inhibits much international use. In this case, you need to look at the specifics of your situation.
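The control-character filter described above could be sketched in Python as follows. The function name and the default safe set (tab and newline) are illustrative choices; dropping DEL (value 127) as well is an added assumption, and characters above 127 are deliberately left alone since their handling depends on the situation.

```python
def sanitize_for_terminal(text, safe=("\t", "\n")):
    # Drop control characters (values below 32), plus DEL (127) as an
    # extra assumption, unless the caller has listed them as safe.
    return "".join(
        ch for ch in text
        if ch in safe or (ord(ch) >= 32 and ord(ch) != 127)
    )

# The escape character is removed, so "[2J" is shown as plain text
# instead of being interpreted as a clear-screen sequence.
print(repr(sanitize_for_terminal("ok\x1b[2Jdone\n")))   # 'ok[2Jdone\n'
```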

A related problem is that the NIL character (character 0) can have surprising effects. Most C and C++ functions assume that this character marks the end of a string, but string-handling routines in other languages (such as Perl and Ada95) can handle strings containing NIL. Since many libraries and kernel calls use the C convention, the result is that what is checked is not what is actually used [rfp 1999].
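To illustrate the mismatch, the following Python sketch shows a check that passes on the full string while a routine using the C convention would see something different. The filename and the reject_embedded_nul helper are hypothetical, and the split merely simulates what a C routine would treat as the string.

```python
def reject_embedded_nul(value):
    # Refuse any string containing character 0 before it reaches an API
    # that uses the C string convention.
    if "\0" in value:
        raise ValueError("embedded NIL (character 0) in input")
    return value

name = "report.txt\0.jpg"
# A check written in Python sees the whole string...
looks_like_jpg = name.endswith(".jpg")
# ...but a routine using the C convention would stop at character 0.
c_view = name.split("\0", 1)[0]
print(looks_like_jpg, c_view)   # True report.txt
```

Rejecting such input outright, as reject_embedded_nul does, is usually simpler than trying to reconcile the two interpretations.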

When calling another program or referring to a file it may be wise to specify its full path (e.g., /usr/bin/sort). For program calls, this will eliminate possible errors in calling the “wrong” command, even if the PATH value is incorrectly set. For other file referents, this reduces problems from “bad” starting directories.