Encodef
This is the main web site for encodef, a suite of programs that
aims to make it easier to process filenames on Unix/Linux/POSIX systems.
You can get code, etc., from the
encodef project page.
Historically, Unix/Linux/POSIX systems have allowed almost any byte in a filename,
but this flexibility is the source of many problems.
I describe the problem in
Fixing Unix/Linux/POSIX Filenames:
Control Characters (such as Newline), Leading Dashes, and Other Problems.
I discuss ways of writing shell programs to work around this,
using existing tools, in
Filenames and Pathnames in Shell: How to do it correctly.
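The core problem is easy to reproduce with nothing but standard tools;
here a single file whose name contains a newline defeats line-oriented
processing (nothing in this sketch is specific to encodef):

```shell
# One file whose name contains a newline; line-oriented tools miscount it.
dir=$(mktemp -d)
printf 'data' > "$dir/bad
name"                      # creates ONE file; its name embeds a newline

ls "$dir" | wc -l          # prints 2: ls output splits the name across lines

# find -exec passes the real name as a single, uncorrupted argument:
find "$dir" -type f -exec sh -c 'printf "%d argument\n" "$#"' sh {} \;
rm -r "$dir"
```

The find -exec invocation reports exactly one argument, because find hands
the program the actual filename rather than a line of text.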
The “encodef” program takes filenames (which may include newlines,
tabs, ESC, leading dash, space, and other nastiness), and encodes
them into a format that’s easier to process.
The “decodef” program reverses the process.
The “xargsf” program is a stub prototype that shows how these
capabilities could be integrated into the standard xargs program.
The Encodef man page has more details.
At this point, it's "usable", and more than adequate for prototyping
and testing ideas about encoding filenames.
Feel free to
download the encodef source code
in tarball format (Free-libre/open source software, MIT license).
It includes a self-test suite, so you can get more confidence that it works.
Also,
it follows common compilation and
installation processes, which let you easily control how to
install it (e.g., by setting DESTDIR and --prefix).
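A staged install might look like the following. This is only a sketch
that assumes a conventional configure/make build; the exact options and
targets are assumptions, and the tarball's own instructions are authoritative.

```shell
# Hypothetical sketch of a conventional build/install; the exact
# configure options and make targets are assumptions, not documented facts.
tar xzf encodef-*.tar.gz
cd encodef-*/
./configure --prefix=/usr/local      # choose the runtime install prefix
make                                 # build
make check                           # run the self-test suite (if provided)
make DESTDIR=/tmp/stage install      # stage the install under /tmp/stage
```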
Here are a few thoughts, based in part on my experimentation with them:
- If POSIX systems always forbade or escaped
bad filenames (such as those containing control characters), many problems would disappear.
- If bad filenames are possible,
then there must be a way to easily deal with them.
Forbidding the creation of bad filenames helps somewhat
(because then in certain cases they won't happen), but
you still have to be able to process bad filenames.
- The conventional way to do this is to use the null byte \0
to terminate/separate filenames.
This is widely supported by find and xargs; you can
store such lists in files, and some programs can process them.
This approach is very efficient.
I believe that the POSIX standard should be modified so that the POSIX shell's
read command could also easily process null-terminated data
(I suggest using -0), and there's a good argument for grep as well.
- You could also escape the filenames, and that's
what encodef does.
In the long term, if encoding is to be supported,
I believe that at least xargs and printf(1)
should be modified to directly support decoding,
and find should be modified to directly support encoding.
The advantage of encoding is that then any text processing tool
can process (encoded) filenames.
But encoding/decoding has higher overhead, creates new issues
(which characters are encoded? which encoding system?), and it's more
work for utilities to implement encoding than to use
null byte separators.
- If bad filenames must exist, I think that it'd be best if POSIX added
support for both null byte termination and encoding.
Null byte termination could be used for simple common cases
(using find(1), shell read, grep, storing them in a file, using xargs).
For more complicated cases, encoding/decoding could be used so that
the full suite of POSIX tools could be used.
If the encoder/decoder could also process null byte termination, then it
could fill the gap when more complex tools are needed.
A tool like pax could be trivially modified to output filenames with
null byte terminators; the encoding tool could then transform that
into nicely encoded filenames with newline terminators.
- If you're going to encode, it's best to encode a large number of
characters. This reduces the risk of improperly handling metacharacters,
and also increases the likelihood that testing will detect when you've
forgotten to decode an encoded filename.
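The null-byte approach above can be sketched entirely with standard tools;
this uses find -print0 and xargs -0, which are long-standing GNU/BSD
extensions (encodef itself is not needed for this part):

```shell
# Sketch: null-byte separators let arbitrary filenames survive a pipe.
dir=$(mktemp -d)
touch "$dir/with
newline" "$dir/-leading-dash" "$dir/a b c"   # three files, one with a newline

# Newline-delimited output miscounts; the embedded newline splits one name:
find "$dir" -type f | wc -l                  # prints 4 (for 3 files)

# NUL-delimited names arrive intact; xargs -0 sees exactly 3 arguments:
find "$dir" -type f -print0 | xargs -0 sh -c 'printf "%d files\n" "$#"' sh
rm -r "$dir"
```

In bash, a loop of the form while IFS= read -r -d '' name; do ...; done
can consume the same NUL-delimited stream one filename at a time.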
Some systems, like FreeBSD, have the tools
vis(1) and
unvis(1), but vis and unvis
are terrible tools for this problem:
- vis(1) expects filenames on its command line, and then reads and
encodes the contents of those files (not the filenames themselves).
That means it doesn't work easily with find(1); you end up with very
complicated expressions that have to create multiple processes for
each filename.
It doesn't even slightly compete with the simpler
find . -exec encodef {} \+
- unvis(1)'s decoder doesn't consider the complications of decoding in
shell. The shell's command substitution removes all trailing newlines;
a filename decoder should optionally append some static character so that
the shell can get the data without corruption.
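The trailing-newline hazard described in the last bullet, and the
append-a-sentinel workaround that a decoder could automate, look like
this in plain shell:

```shell
# Command substitution strips ALL trailing newlines, corrupting filenames
# that end in them:
name=$(printf 'ends-in-newlines\n\n')
# ${#name} is now 16: both trailing newlines are gone.

# Workaround: have the producer append a sentinel character ('x' here),
# then strip it off after the substitution:
name=$(printf 'ends-in-newlines\n\n'; printf 'x')
name=${name%x}
# ${#name} is now 18: the two trailing newlines survived.
```

This is exactly the kind of bookkeeping a shell-aware decoder could do
for the user by optionally appending a fixed character to its output.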
You might want to look at my
Secure Programming HOWTO
web page, or some of my other writings such as
Open Standards and Security,
Open Source Software and Software Assurance (Security),
and
High Assurance (for Security or Safety) and Free-Libre / Open Source Software (FLOSS).
You can also view
my home page.