Everything is a text file

One of the things which makes UNIX systems so powerful is the ease with which one can move data around. What makes this possible is the fact that, with few exceptions, everything is a text stream:

  • Configuration files are plain text.
  • Data are usually stored as flat text files.
  • Even executables, which most Windows users consider to be synonymous with "binaries", are frequently text files: shell scripts, Perl scripts, Python scripts, and the like.
  • Most of the UNIX core programs produce output as text streams or text files.
(On Windows, all of the above, except configurations, are usually represented as binary data, and configuration data, as stored in the registry, is not really amenable to editing by hand.)

What this means is that any text tool you learn, from less to emacs, can be put to use in almost any situation; you don't have to learn specialized tools for every new task you want to perform. Moreover, it means that applications which understand text automatically get an interface they can use to talk to each other. Text is the universal language of computing.

This model is so useful that Linux even creates text interfaces for many system internals which are not not naturally represented as text files or streams. The /proc filesystem, inspired by Plan 9, is one such "virtual filesystem" which exposes certain system vital signs. For example, /proc/cpuinfo provides information about the CPU(s):

prompt$ head /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 9
model name      : Intel(R) Pentium(R) M processor 1400MHz
stepping        : 5
cpu MHz         : 1400.000
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no

The /proc filesystem contains a wealth of system information (current processes, memory usage, hardware and software configuration) all in the guise of text files that you can read. You can write to some of these files to change configurations as well. For example, when executed as root,

echo www.example.com > /proc/sys/kernel/hostname
changes the host name, and
echo 1 > /proc/sys/vm/drop_caches
drops the system page cache.

So what? It means that when you want to write a program to read or manipulate some aspect of the system, you don't have to rely on bindings which are fragile, or require special headers, or are unavailable in your language of choice. All your program needs to do is read from or write to a file, which is (usually) a piece of cake.

Sometimes flat text isn't enough. What if you need structured or hierarchical data? As it turns out, a filesystem provides a fine hierarchical storage mechanism in the form of directories. For example, the /proc filesystem stores information about the process with PID N inside files in /proc/N/. But when structure is stored in directories instead of, say, nested XML elements, or in the keys of a Windows registry file, you can bring to bear all the tools you already know that operate on files. If you're deploying an application, it's trivial to copy or extract a configuration directory to ~/.appname/. It's not quite as easy to unzip an XML configuration fragment into a larger XML configuration file.

The idea of universal interfaces has found traction even as flat text files have lost ground. In the OpenDocument format, each document (extensions .odt, .odp, etc.) is really a ZIP file containing an XML description of the file and any extra media. (On the other hand, prior to Microsoft Office 2007, essentially all knowledge of the Office file format was obtained through reverse-engineering.)

The other week, I found myself in a situation where I had to save all the (numerous) images in a Word document to individual files. The fastest way was in fact to use OpenOffice to convert to an OpenDocument; when I unzipped that, one of the directories inside contained all the pictures which were in the original document. Common interfaces and tools help you to break free from the limitations of specialized tools when necessary.

Further reading: Unix philosophy, proc filesystem. The /sys filesystem is worth a look too.

No comments:

Post a Comment