blah

THE WRITER MUST EAT -> patreon.com/trn1ty <-

| \    |   | blah!
|\ | `\|\  | the rantings and ravings
|/ |(_|| | * of a depraved lunatic

$!NAVIGATION

2024-01-20

: why mm(1)

I started working on mm(1) probably around 2020-2021, when I was first
acquainting myself with the inner workings of UNIX-like operating systems which
I had been using for a couple years by then. I can't remember how I noticed it
but it bothered me that there was this cat(1p) utility which took multiple
input files and streamed them successively to standard output:

 [ input ] [ input ] [ input ]...
     |_______  |  _______|
            _|_|_|_
           |       |
           |cat(1p)|
           |_______|
               |
               V
        standard output

And then this tee(1p) utility which took from standard input and streamed its
bytes to multiple outputs:

        standard input
               V
            ___|___ 
           |       |
           |tee(1p)|
           |_______|
       ______| | |__________
      |        |            |
 [ output ] [ output ] [ output ]...

And they were separate utilities despite both doing the job of writing input(s)
to output(s). I imagined a hypothetical utility mm(1) that does it all:

 [ input ] [ input ] [ input ]...
     |_______  |  _______|
            _|_|_|_
           |       |
           | mm(1) |
           |_______|
       ______| | |__________
      |        |            |
 [ output ] [ output ] [ output ]...

I attempted to write this magical "mm" (as in "middleman") utility, which
would act as a middleman for streams, before giving up (due to lack of C or
POSIX API experience) and spending a couple of years practicing on easier
programs in UNIX environments.
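
As a rough illustration of the last diagram, the core of such a middleman can
be sketched as a loop that reads each input in turn and writes every chunk to
every output. This is not mm(1)'s actual source; the name fan_copy and the
4096-byte buffer are assumptions made for the example.

```c
#include <stdio.h>

/*
 * Copy every input, in order, to every output.
 * Returns 0 on success, -1 on a read or write error.
 * fan_copy is a hypothetical name, not taken from mm(1).
 */
int fan_copy(FILE **ins, size_t nin, FILE **outs, size_t nout){
	char buf[4096];
	size_t i, o, n;

	for(i = 0; i < nin; ++i){
		/* read a chunk, then hand the same chunk to each output */
		while((n = fread(buf, 1, sizeof buf, ins[i])) > 0)
			for(o = 0; o < nout; ++o)
				if(fwrite(buf, 1, n, outs[o]) != n)
					return -1;
		if(ferror(ins[i]))
			return -1;
	}
	return 0;
}
```

With one output this behaves like the cat(1p) diagram, and with one input it
behaves like the tee(1p) diagram.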

There are a couple reasons to implement cat(1p) and tee(1p) as separate
utilities:

1) Ease of implementation

	Differentiating input arguments from output arguments would require
	either having a separator mark (which would be inelegant and exclude
	that mark from being a usable file name) or option parsing.

	Imagine a separator mark in the context of a hypothetical utility
	insouts(1):

	$ PS1='\n$ '

	$ insouts -h
	Usage: insouts (input...) "][" (output...)

	$ printf %s\\n hello\ world
	hello world

	$ printf %s\\n hello\ world >in1

	$ insouts ][

	$ insouts ][ ][ /dev/stdout
	Usage: insouts (input...) "][" (output...)

	$ insouts ./][ ][ /dev/stdout
	hello world

	What a mess! The file ][ can no longer easily be used with insouts(1),
	which may be acceptable (it's not a sensible file name anyway), but
	it's sacrificed for horrendously ugly syntax featuring stressfully
	unmatched square brackets.

	I've written programs that use separator marks for arguments, namely
	pscat(1), psrelay(1), and psroute(1) so far. Their particular flavor
	of marker comes with a number of additional caveats, and I've been
	hesitant about the syntax since I came up with it half a year ago.
	Best not to make more things about which to fret.

	Now imagine option parsing:

	$ PS1='\n$ '

	$ insouts
	Usage: insouts (-i [input])... (-o [output])...

	$ insouts -i in1
	hello world

	$ insouts -i in1 -i ][ -i out1
	hello world
	hello world
	hello world

	This works for everything and is how mm(1) works. The issue is with
	the code itself. Imagine a very basic cat(1) implementation in C:

	#include <stdio.h>
	int main(int argc, char *argv[]){
		int c;
		FILE *f;
		int i;

		for(i = 1; i < argc; ++i){
			if((f = fopen(argv[i], "r")) == NULL){
				perror(argv[i]);
				return 1;
			}
			while((c = getc(f)) != EOF)
				putchar(c);
			fclose(f);
		}
	}

	This doesn't conform to POSIX (which requires 'cat -u' to be supported)
	but illustrates the ease of using cat(1)'s arguments: For each
	argument, open it as a file, write it out, close it, and that's it.

	mm(1)'s option parsing for '-i' and '-o' alone is, as of writing, 24
	lines, excluding the functions it calls. The above program is 16
	lines of code. Some of this weight comes from supporting "-" as
	shorthand for /dev/stdin or /dev/stdout (depending on whether it was
	used for '-i' or '-o') and from trying to create an output file if it
	doesn't exist. Without these two features, which the above program
	doesn't support, the code for '-i' and '-o' would be considerably
	lighter, but the point is that option parsing adds complexity that can
	be avoided by simply having two utilities.

	Furthermore, options have drawbacks for users.
	
2) Ease of use

	One relatively common use of cat(1p) is to catenate all files matching
	a glob pattern. Imagine:

	$ PS1='\n$ '

	$ ls
	in1
	in2
	in3

	$ cat "$f"; done

	$ mm

While '-i' and '-o' are 24 lines in total, the rest of the options logic is
necessary for cat(1p) and tee(1p) anyway, is unavoidable, and outweighs the
'-i' and '-o' options. Much of the '-i' and '-o' logic is also still necessary
in both cat(1p) and tee(1p) (supporting "-" and, in tee(1p)'s case, creating an
output if it doesn't exist). Though there is additional memory juggling due to
supporting arbitrary inputs and outputs, in most uses actual memory use isn't
noticeably affected (10 extra bytes for 5 file arguments, or one tenth of the
data used by this parenthetical statement).

It is possible to write implementations of cat(1p) and tee(1p) in POSIX shell
script as wrappers on mm(1) and I have done so, so users who want to use globs
can simply call cat or tee as usual.
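
The wrappers mentioned here are shell scripts, and their contents aren't shown.
The sketch below only illustrates the argument translation a cat-style wrapper
has to perform, turning each operand into a '-i' option, and is written in C to
match the earlier example. The function name mm_argv_from_cat is hypothetical,
and what a real wrapper does when given no operands at all is not covered.

```c
#include <stdlib.h>

/*
 * Build a NULL-terminated argument vector "mm -i f1 -i f2 ..." from
 * cat-style operands. The caller frees the returned array (not the
 * strings in it). Returns NULL if memory runs out.
 */
char **mm_argv_from_cat(int argc, char *argv[]){
	/* "mm", then two slots per operand, then the NULL terminator */
	size_t n = 1 + 2 * (size_t)(argc > 1 ? argc - 1 : 0) + 1;
	char **v = malloc(n * sizeof *v);
	size_t j = 0;
	int i;

	if(v == NULL)
		return NULL;
	v[j++] = "mm";
	for(i = 1; i < argc; ++i){
		v[j++] = "-i";    /* every operand becomes an input */
		v[j++] = argv[i];
	}
	v[j] = NULL;
	return v;
}
```

The resulting vector could then be handed to execvp(3), so that a glob such as
"cat *" expands as usual before the translation happens.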

mm -i input -o output tends to be intuitive for existing shell users once they
learn the name "middleman".

No rights reserved, all rights exercised, rights turned to lefts, left in this
corner of the web.