** TODO Magic Pipes :PROJECT:MEDIUM:
:PROPERTIES:
:ID: aa0e602d-0513-4b9f-a80c-d9e480e1341b
:END:

See http://www.snell-pym.org.uk/archives/2009/06/25/magic-pipes/

Rather than taking expressions and evaluating them with INPUT etc. bound to a value, perhaps accept an expression that must evaluate to a procedure. This makes it easier to use existing procedures, lets people choose their own bound names, and is more like Scheme HOFs.

User expressions are read using the full Chicken reader, but s-expressions read from standard input are always read with a limited read-table that disables #, #+, #<#, and any other non-standard read syntax that might open up security vulnerabilities.

*** Global arguments

These are accepted by any of the tools below:

+ -u Use the supplied unit.
+ -d Evaluate the supplied expression.
+ -i Evaluate the contents of the supplied file.

(-u), (-d) and (-i) arguments are processed in the order supplied, before evaluating any other user-supplied expressions.

The expressions evaluated by (-d) and (-i) have access to the current error port, but not input or output. The intent of these is to set up utility procedures/macros to be used by other expressions later on.

*** mpfilter ...

Reads s-expressions from stdin, and writes each one to stdout only if every procedure listed on the command line returns true for it.

current-input-port and current-output-port are banned, but current-error-port is accessible.

*** mpmap

Reads s-expressions from stdin, applies the procedure on the command line to them, and then writes the results to stdout.

current-input-port and current-output-port are banned, but current-error-port is accessible.

If the procedure returns multiple values, they are output as separate s-expressions; thus "mpmap '(lambda (x) (values x x))'" will duplicate each s-expression in the input.
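
The multiple-values behaviour of mpmap described above can be sketched in plain Scheme. This is only an illustration under stated assumptions: the name mp-map-port is hypothetical, it takes explicit ports instead of stdin/stdout, and it uses the ordinary reader rather than the limited read-table and port restrictions the real tool would need.

#+BEGIN_SRC scheme
;; Hypothetical sketch of the mpmap core loop (not the actual implementation).
;; Reads s-expressions from IN, applies PROC to each one, and writes every
;; value PROC returns to OUT as a separate s-expression, one per line.
(define (mp-map-port proc in out)
  (let loop ((x (read in)))
    (if (not (eof-object? x))
        (begin
          ;; call-with-values collects however many values PROC returns
          (call-with-values
              (lambda () (proc x))
            (lambda results
              (for-each (lambda (r) (write r out) (newline out))
                        results)))
          (loop (read in))))))

;; Usage: duplicate every input s-expression, using string ports.
(define out (open-output-string))
(mp-map-port (lambda (x) (values x x))
             (open-input-string "1 (a b)")
             out)
(display (get-output-string out))
;; prints:
;; 1
;; 1
;; (a b)
;; (a b)
#+END_SRC

Collecting the results with call-with-values means a procedure that returns zero values simply drops the s-expression, so the same loop also covers filtering-style uses.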
*** mpfold [-o ] []

The first expression evaluates to a two-argument procedure, the second (the initial accumulator) to any value; #f is the default if none is specified. (-o) specifies a single-argument output procedure; the identity function is the default.

Applies the procedure to each s-expression from the input in turn, with the current accumulator as the second argument. At the end, outputs the result of applying the output procedure to the final accumulator.

current-input-port is banned, but current-error-port is accessible. current-output-port is usable, for convenience in writing pipelines that summarise each line of some input then finally write a "totals" line.

*** mpsort [-c] [-r] [-p ] [ []]

The first expression must produce a two-argument comparison procedure, and defaults to "smart<" if none is present. The second expression must produce a single-argument key extraction procedure, which defaults to the identity.

Reads in all the expressions from the input, sorts them by applying the comparison procedure to the results of applying the extraction procedure to the expressions, then outputs the result.

If (-c) is specified, then the extraction procedure is assumed to be expensive, and its result is computed once per expression and cached at load time.

If (-r) is specified, then the sort order is reversed.

Provide smart< and smart> procedures, which compare things in a type-agnostic way: < for numbers, string<? for strings, recursive testing for pairs and vectors.

As usual, the procedures have no access to current input or output ports, but can write to the error port.

If (-p) is specified, then rather than sorting in-memory, we instead start the specified number of threads, each of which reads s-expressions from a bounded FIFO and sends them to a child mpsort process. A master thread then reads s-expressions from standard input and round-robins them to the FIFOs, skipping any FIFOs that are "full" and blocking if they all are.
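
The master thread's dispatch rule (round-robin, skipping full FIFOs) can be sketched as a pure function. The name next-fifo and the representation of the FIFOs as a vector of current lengths with one shared capacity are assumptions made for illustration; the real design would work on actual queues and block on a condition when #f is returned.

#+BEGIN_SRC scheme
;; Hypothetical sketch: choose the FIFO that receives the next s-expression.
;; LENGTHS is a vector of current queue lengths, CAPACITY the bound shared by
;; all FIFOs, START the index to try first. Returns the index of the first
;; FIFO at or after START (wrapping around) that is not full, or #f if every
;; FIFO is full, in which case the caller would block.
(define (next-fifo lengths capacity start)
  (let ((n (vector-length lengths)))
    (let loop ((tried 0))
      (if (= tried n)
          #f
          (let ((i (modulo (+ start tried) n)))
            (if (< (vector-ref lengths i) capacity)
                i
                (loop (+ tried 1))))))))

;; Usage: FIFO 0 is full, so the element goes to FIFO 1.
(next-fifo (vector 2 0 1) 2 0)
#+END_SRC

After a successful dispatch to index i, the master would call next-fifo again with start set to i + 1 so that the load keeps rotating.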
Each child process also has a reader thread that reads its sorted output and loads it into another FIFO, and a final output thread merges the sorted FIFO outputs into a single sorted stream on standard output. #!eof is used as a marker in the FIFOs to record the actual end of the file, to distinguish EOF from an empty FIFO due to the source not having produced anything yet.

Is it worth having an option to go multi-machine by running mpsort from inetd (perhaps in parallel mode to use multiple cores) on remote machines and parallelising via TCP rather than running a child process? That would be kind of cool and not too hard.

Or for huge sorts (where there's not enough memory available), we could have a flag that splits the input into temporary files of up to a certain size, sorts them individually one by one, then merges the results together.

*** mplookup [-m] [-f|-F] {lookup |revlookup }

It would be convenient to have a simple command-line tool to handle look-up tables, mapping one s-expression to one or more other s-expressions.

By default, each output s-expression is a list of results from the corresponding input s-expression, which is empty if there is no mapping. If (-f) is specified, then only the first result is returned, not wrapped in a list, and #f is used if there is none. If (-F) is used, then the result is returned only if there is exactly one; #f is used if there is none or more than one.

File type detection is performed on the map file. Supported formats:

+ sqlite databases in a special format (ending .mbm; magic binary map)
+ plain text files with a sequence of ( . ) pairs (ending .msm; magic sexpr map)
+ /etc/aliases format files (the default), which are treated as string->string mappings

If (-m) is specified, then the map file is not a file name, but the name of a meta-map from a list: uid<->name, gid<->name, ip->list of hostnames, hostname->list of ips, hostname->list of arbitrary DNS records, port<->service, ...
But more heavyweight things like a PostgreSQL/MySQL lookup tool would be best handled by using mpmap with a suitable interface egg.

**** mplookup-set [ ]

In the given map file (which, if nonexistent, is created), set expr1 to map to expr2. If the exprs are omitted, then s-expressions are read from standard input, and must be pairs, the first element of which is treated as expr1 and the second as expr2, and are all set into the map in order.

**** mplookup-delete []

Deletes the given mapping from the given map file. If the expression is omitted, then expressions are read from stdin and removed from the map file. If the map file does not exist, an error is raised.

**** mplookup-dump

Spits out the contents of the map file as a sequence of pairs, with the car being the key and the cdr the value. This can be piped into mplookup-set to effect map file format conversions.

*** FIXME: mprandom ???

Take random samples of the input: either pick any s-expression with a given chance, or read all the s-expressions into RAM and pick N at random.

*** FIXME: mpshuffle ???

Read input s-expressions into a list, shuffle, and output the result.

*** mpflatten

Reads input s-expressions, and if they are lists, writes the elements of the list as separate s-expressions, otherwise writes them as-is.

*** mpgroup [-a] [-t] [-f|-l]

The expression must be a single-argument procedure. It is applied to each input s-expression to obtain a "key" for that s-expression.

As usual, the procedure has no access to current input or output ports, but can write to the error port.

If (-a) is specified, then the s-expressions are accumulated in memory by their keys, into a hash table. If (-f) is specified, then only the first s-expression for each key is kept; if (-l) is specified, then only the last is kept. At the end, the hash table is written out; if (-t) is specified, it is written as one list per key, the first element being the key value and the rest being the s-expressions with that key.
If (-t) is not specified, then it is just one list per key, but without the key as the first element. The order in which the keys are listed is undefined, but if neither (-f) nor (-l) is specified, the s-expressions within a key are in the order they were read.

If (-a) is not specified, then the s-expressions are not accumulated and spat out in a single batch; instead, they are output in the same order that they were read in, but grouped into lists of s-expressions having the same key in a contiguous run. If (-t) is specified, the key value is prepended to the list. If (-f) is specified, then only the first s-expression in each run of the same key value is listed (and if (-t) is not specified, then it is output as-is rather than as a single-element list). Likewise, if (-l) is specified, then only the last s-expression in each run of the same key value is listed, and unless (-t) is specified, it is written as-is without a single-element list enclosing it.

*** mpforeach ...

Runs the supplied Scheme procedure(s) on each s-expression from the input. Anything returned is ignored. The procedure can access stdout/stderr if required, but has no access to stdin.

*** mpparse [|-p ] [-o ]

The argument, if present, must be a valid SRE; or, if (-p) is used, a POSIX regexp. If not present, it defaults to "(seq bos (* any) eos)".

Reads in lines of text from stdin and converts them to s-expressions by applying the regular expression. Lines that do not match the regexp are ignored.

If (-o) is specified, then the expression must be a single-argument procedure which is applied to each irregex match object to generate the output s-expression. If not, then a default is used which has the following behaviour: If the regexp has no captures, then the entire matching string is returned. If it has only numbered submatches, then a list of the submatches is returned.
If it has named (and maybe also numbered) submatches, then an alist of them is returned, with names used where available and numbers where not.

*** mpprintf [-n] ...

Calls "printf" on each input s-expression, with the arguments (concatenated with spaces) as the format string. Appends a newline unless (-n) is specified.

*** mpls [-r] [-x|-l|-a|-o ] [-f ]... []...

Write an "ls"-equivalent tool that outputs s-expressions. The output is either one of a choice of formats (see list below) or, with (-o), the result of an arbitrary function applied to each filename (with access to all the posix unit functions, such as file-stat and friends). Optional filter expression(s) (-f) are ANDed together. By default, the filter accepts all files, and the output is just the filenames as strings.

Give it the option (-r) to recurse, in which case the filenames passed to the function are relative paths with multiple components.

Takes an optional list of files on the command line to just list those, a la "ls".

Standard formats:

+ -x - a pair with the filename as the first element and the result of file-stat (a vector) as the second
+ -l - a list with the filename as the first element, a single-letter type code as the second (d=directory, r=regular, etc.), mode, uid, gid, size, mtime, and for symlinks, the link target as an extra element.
+ -a - all the data from -x, but as a conveniently accessible alist.

Add some utility functions to provide advanced "find" functionality, such as (older (file-creation-time f) (days 5)).

In the expressions, current-input-port and current-output-port are banned, but current-error-port is accessible.

mpls -r -f 'regular-file?' -f '(lambda (file) (older (file-creation-time file) (days 5)))' | mpforeach 'delete-file'
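
The older and days helpers used in the example above are not defined anywhere yet; one possible shape for them is sketched below. It assumes timestamps are seconds since the epoch (as file-stat reports them) and introduces a hypothetical older-than? that takes the current time explicitly, so the logic can be checked without a clock. The actual older would supply the current time itself, for example from Chicken's current-seconds.

#+BEGIN_SRC scheme
;; Hypothetical sketch of the "find"-style helpers (names beyond `days`
;; and `older` are illustrative only).

;; days: convert a number of days to seconds.
(define (days n) (* n 24 60 60))

;; older-than?: true if TIMESTAMP lies more than AGE seconds before NOW.
;; All three arguments are in seconds.
(define (older-than? timestamp age now)
  (< timestamp (- now age)))

;; Usage: a file stamped at time 0 is older than five days once the
;; clock has passed 432000 seconds.
(older-than? 0 (days 5) 432001)
#+END_SRC

Keeping the clock out of the core predicate makes it easy to test; older then becomes a thin wrapper that passes in the current time.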