OUE(1)                                                                  OUE(1)



NAME
       oue - only unique element filter, dicer version

SYNOPSIS
       oue  [-cdilNsSvz]  [-span]  [-a  c]  [-b  length] [-D state] [-e every]
       [-f first] [-k key] [-p pad] [-r memory] [-R report] [files]
       oue -I prev  [-cdilNsSvz]  [-span]  [-a  c]  [-b  length]  [-B  replay]
       [-D  state]  [-e  every]  [-f  first]  [-k  key]  [-p  pad] [-r memory]
       [-R report] [-x extract] [files]
       oue -h
       oue -H
       oue -V

DESCRIPTION
       The common shell idiom to get a unique list of elements in  a  pipeline
       is:
              sort -u
       which waits for the entirety of stdin to be processed before any output
       is delivered to stdout.  That delay is easily avoided with  the  common
       perl idiom to "touch" an element in a hash at the top of a loop to pro-
       tect the body of the loop from duplicate elements (via a guarded next):
              perl -e 'while (<>) { next if $s{$_}; $s{$_}=1; print $_;}'

       Oue  provides  an approximation of the perl idiom for pipelines.  Input
       elements (lines or groups of lines) are only output the first time they
       are  parsed.   Optionally,  the record of unique elements may be shared
       with other processes (in sequence, or parallel) via a GDBM state  file.
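
       The streaming behavior described above can be illustrated without oue
       at all.  The sketch below is only an analogy: it uses the well-known
       awk form of the same "touch a hash" idiom, and the function name
       first_seen is made up for this example.  Each line is printed the
       first time it is read, with no wait for the end of input.

```shell
# Sketch only: awk stands in for the perl idiom; oue itself is not used.
# Print each input line the first time it is seen, without waiting for EOF.
first_seen() {
    awk '!seen[$0]++'
}

# Usage: repeated lines are dropped, first-seen order is kept.
printf 'b\na\nb\nc\na\n' | first_seen
```

       Unlike sort -u, the output order follows the input order rather than
       the collating sequence.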

       Version  1.x  of this program was, in fact, a perl program.  Later ver-
       sions are not, because they use the dicer to process  groups  of  lines
       into  keys  and memories.  The later versions are incompatible with the
       original, which was never widely released.

DATA MODEL
       Oue expects a stream of lines to process.  The -span  option  sets  the
       number  of  lines  it  takes to form an element.  The last element in a
       file might be short, in which case the pad string is repeated  to  fill
       in  any  missing  lines.   Once a complete element is read oue builds a
       name for the record with the key dicer expression.   If  this  name  is
       already  in  the  state GDBM the record is discarded.  Otherwise a prev
       GDBM is consulted, when specified: any key from that GDBM with the same
       name  also  excludes processing (this restriction may be inverted by -v
       below).

       When the element has been allowed by the previous checks oue  builds  a
       memory for the element from the memory dicer expression.  The pair con-
       structed from (key, memory) is then stored in the state GDBM to prevent
       any future repetition of the item.
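
       The element, key, and state steps above can be modelled roughly with
       awk.  This is a sketch under stated assumptions, not how oue is
       implemented: the helper name span_unique is hypothetical, the key is
       taken to be the whole element joined with spaces (as %* would give),
       an in-memory array replaces the state GDBM, and no prev file is
       consulted.

```shell
# Hypothetical model of the data flow: $1 is the span, $2 is the pad.
span_unique() {
    awk -v span="$1" -v pad="$2" '
        # Pad a short element, then report it only if its key is new.
        function emit(    sep) {
            while (n < span) {
                sep = (n > 0) ? " " : ""
                elem = elem sep pad
                n++
            }
            if (!(elem in state)) {
                state[elem] = "."      # the default memory is "."
                print elem
            }
            elem = ""; n = 0
        }
        {
            elem = (n > 0) ? elem " " $0 : $0
            n++
            if (n == span) emit()
        }
        END { if (n > 0) emit() }      # a short last element gets padded
    '
}

# Usage: five lines taken two at a time; the last element is short.
printf 'a\nb\na\nb\nc\n' | span_unique 2 PAD
```

       The second "a b" element is discarded because its name is already
       recorded, and the final element is completed with the pad string.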

       In both of the above dicer expressions the span lines are available,
       as markup, starting at 1, so %1 is the first, %2 is the second, and
       %{13} would be the thirteenth.  Some of the same dicer expressions
       that xapply uses are allowed on those elements: brackets ([...]) for
       the dicer, parentheses ((...)) for the character mixer, and curly
       braces ({number}) to group numbers.  In addition to the numbered
       lines there are a few others (see -H output for a complete list):

       %f
              The file which provided the element.  The expander %i represents
              the position of that file on the  command-line  (1st,  2nd,  ...
              Nth).  Under -i elements from the prev file are from file number
              zero.

       %n
              The line number of the first line of the  current  element  from
              the above file.  Replay elements are from line zero.

       %u
              The count of the unique elements discovered so far from the cur-
              rent process, not counting this one (so 0, 1, 2,  ...  ).   Note
              this  spans  input  files,  and  could  be aligned with xapply's
              expander with the same name.

       %*
              The element's lines joined with spaces.

       %@
              The element's lines joined with newlines (viz. '\n').

       %$
              The last line in the element, as in xapply.

       %0
              The empty string, for compatibility with xapply(1l).

       %%
              A literal percent character (which works  for  any  c  specified
              under -a).

REPORTING
       A report of each of these unique elements is produced on stdout via the
       report dicer expression.  In this context there are at least two  addi-
       tional dicer data sources:

       %k
              The key for the element built from the key dicer

       %m (also spelled %r)
              The memory built from the memory dicer

       %v
              Under -v the previous memory (from the prev GDBM)

       %c
              The  count  of  the  number  of occurrences of this key from the
              input files.  Allowed via -c, but most useful  under  -l  and/or
              -e.

       %o
              The old count of the occurrences from any prev record, see -x.

       %t
              The  total  of  all  occurrences  (so  far).  This may include a
              recovered counter from a prev GDBM file; if one were  specified,
              then %t is the current sum or %c and %o.

       %e
              The  element  accumulator  bound  to  the current key (described
              below).

       %p
              The previous value of %e for the current key (also below).

       %Uabove (or %Labove)
              The results of the expander are folded to upper (lower) case.

       The default report outputs only the name of  the  record  (%k),  except
       under  -c  when  the  default  is  "%t %k".  Any prev GDBM has the same
       report generated when -i is specified, unless -B specifies a  different
       template.   Any  data  in  state  is not reported on, unless it is also
       specified as the prev GDBM.  In this context the numbered lines of the
       element are only available for new elements; any element from the prev
       GDBM sees the pad value for every line (as the only part of the
       original lines stored in the GDBM is the part recorded by the memory
       dicer).  A single reminder of this fact, displayed to stderr, helps
       debug this (a little).

       The space allocated for the construction of names, memories, and
       reports is limited by the length value specified under -b.  The default
       of "10k" is usually enough for a name; length is multiplied by span for
       the construction of the memory and report strings.

       Three additional modes are available to produce other  useful  reports:
       count  mode  (under -c), duplicate mode (under -d), and last occurrence
       mode (under -l).  Each of these may require oue to  buffer  all  output
       until  the  end of the input.  This causes the output to be shuffled by
       the GDBM code used to track the status of each element.  To output the
       results in the input order specify -s (below), but be aware that this
       slows the output considerably for large files.  Under any of these
       modes the switch -S silently compresses sequential duplicate keys into
       only the first occurrence.
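
       Why count mode must buffer until the end of input, and what stable
       order means, can be seen in a small awk model.  This is a sketch, not
       oue's code: the name count_stable is hypothetical, each whole line is
       the key, and an extra array remembers first-seen order, which is
       roughly the job the text assigns to -s.

```shell
# Hypothetical model of -c with -s: totals are only known at end of input.
count_stable() {
    awk '
        {
            if (!($0 in count)) order[++n] = $0   # remember first-seen order
            count[$0]++
        }
        END {
            for (i = 1; i <= n; i++)
                print count[order[i]], order[i]   # shaped like "%t %k"
        }
    '
}

# Usage: keys need not be sorted, unlike uniq -c.
printf 'b\na\nb\nb\n' | count_stable
```

       Without the order array the END loop would have to walk the count
       table in whatever order the table yields, which is the shuffling the
       paragraph above warns about.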

ACCUMULATORS
       Some reports would be impossible without  an  "accumulator"  to  gather
       information  about the lines as they are processed.  Two specifications
       are used to control a per-key buffer (markup as %e)  that  may  contain
       facts gathered while processing each element that maps to the same key.

       The first instance of each  key  initializes  the  accumulator  to  the
       expansion of first, by default the empty string.  For every instance of
       each key the accumulator is copied to %p "the previous value"  and  the
       every specification is expanded to fill the accumulator again.

       The value of %e is most useful under -l, as in other cases it will con-
       tinue to update as the lines are processed, but there is no way to out-
       put  the  final value (since the first instance of each key outputs the
       only notification).
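
       The accumulator cycle can be modelled in awk as follows.  The sketch
       makes several assumptions that are not oue's implementation: the
       function name lines_per_key is hypothetical, the key is the first
       word of each line, first is the empty string, every behaves like
       "%P,%n", and the report is produced once at end of input in
       first-seen order.

```shell
# Hypothetical model of the per-key accumulator (%e) and its previous
# value (%p), collecting the line numbers on which each key appears.
lines_per_key() {
    awk '
        {
            key = $1
            if (!(key in acc)) {           # first instance of this key
                order[++count] = key
                acc[key] = ""              # default first: the empty string
            }
            prev = acc[key]                # plays the role of %p
            # Like "%P,%n": skip the comma while the previous value is empty.
            acc[key] = (prev == "") ? NR : prev "," NR
        }
        END {
            for (i = 1; i <= count; i++)
                print order[i] ":" acc[order[i]]   # shaped like "%k:%e"
        }
    '
}

# Usage: key "a" appears on lines 1 and 3, key "b" on line 2.
printf 'a x\nb y\na z\n' | lines_per_key
```

       Reporting only at the end is what makes the final accumulator value
       visible, which is the reason the text recommends -l for %e.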

OPTIONS
       If the program is called as oue then no options are forced.

       -span
              Specify the number of input lines read to form an element.   The
              default is 1 line per element.

       -a c
              As  in xapply, change the escape character to c from the default
              percent (%).

       -b length
              Specify a bigger dicer buffer size.   This  value  is  a  scaled
              integer  using  the  common  'k'  for kilobytes (specify '?' for
              help).  The default is "10k" (10240 bytes).   Basically  if  you
              are  keying  on more than 132 characters you might want to think
              about this solution a little.

       -B replay
              Rather than using the report expression to replay  the  elements
              from  prev  use  this  template.  All the markup described above
              works as specified.  The line number (%n)  is  always  zero  for
              pairs  from  prev,  and  %f  is prev as specified on the command
              line.

       -c
              Process keys for their total count, rather than just uniqueness.
              Similar  to  uniq's -c option, but the input keys do not have to
              be sorted.  This mode combines with -l and/or -d as needed.  The
              prev  GDBM  may  specify  a starting count for each element (the
              recorded value is taken as an integer count if possible).   When
              the  count is not the first item in the memory use -x to specify
              a dicer expression to extract the count from  prev.   Note  that
              the output is not in stable order unless -s is also specified.

       -d
              Accept  only  keys that are not unique (duplicated) in the input
              stream.  This is also similar to uniq's -d option.  The starting
              count  is  gathered from prev as under -c.  Note that the output
              is not in stable order in combination with -l or -c,  unless  -s
              is also specified.

       -D state
              Record  the  elements seen in the GDBM file state, which is usu-
              ally spelled with a .db on the end.  Subsequent runs provisioned
              with  the  same  file will not repeat any elements from previous
              runs (see -i below).

              The default state is a file created  under  $TMPDIR,  see  envi-
              ron(7), which is removed on exit.  If $TMPDIR specifies a nonex-
              istent directory oue tries /tmp and /var/tmp as a fallback.

       -e every
              Set the update dicer expanded into the per-key accumulator for
              each occurrence of every key.  A good example value would be
              "%p,%n", which adds the current line number to the previous
              value, appending each line number to the end of the
              accumulator.  There are several ways to remove the leading comma
              that results from this markup: replace the "%p" with "%P" (which
              asks oue to consume any markup from the dicer expression to the
              next escape (c) or the end of string when the value expanded is
              the empty string); or, when you present the "%e", remove the
              first character with the mixer "%(e,2-$)"; or use the dicer to
              remove the first field on commas "%[e,-1]".  The first way works
              better for multi-character separators.

       -f first
              Set the dicer markup which generates the new per-key accumulator
              when a key is first discovered in the input  elements.   A  good
              example would be "%f", which sets the accumulator to the name of
              the file that created the key, or "%f:%n" to remember  the  file
              and line number.

       -h
              Print only a brief help message.

       -H
              Print only a brief reminder of the markup escapes.

       -i
              Report  on  the elements drawn from the prev GDBM (below).  This
              allows the prev GDBM to act as a 'replay device' to form a union
              operation  on the state set.  The keys from prev are always fil-
              tered from the state GDBM processing (viz. never added to  state
              itself).

       -I prev
              Repeat all the elements recorded in the tied GDBM prev.  Often
              used after many runs in a summary report.  New lines are
              accepted as usual; to provide just the list, the common
              convention is to explicitly list /dev/null as the only member
              of files.

       -k key
              Build  the  element name from the lines via the dicer expression
              key.  The default key includes the whole element, which is  fine
              for  single  lines most of the time.  Setting the key to a fixed
              string yields exactly one unique element, of  course.   This  is
              often  done  by  (not) changing c while the specification of key
              uses a different (the old) value.

       -l
              Rarely a process needs to select the last instance of an element
              rather  than the first.  This is an expensive operation for long
              element lists with repeated keys, but  it  is  better  than  the
              alternatives.   Works  in combination with both -c and -d.  Note
              that the output is not in stable order unless -s is also  speci-
              fied.

       -N
              All  shared  accesses  to  the state and prev database files are
              protected with GDBM's locking, unless this flag is set.

       -p pad
              Complete short records with this  token.   The  default  is  the
              empty  string.   There  is  no  way to drop incomplete elements,
              which might be a bug.  Elements are not allowed to span files:
              if you want that, concatenate the files with cat(1) first.

       -r memory
              Rather  than  recording the whole element in the state GDBM this
              dicer expression  creates  the  memory  for  the  element.   The
              default  memory  is  the  string ".", because, for most applica-
              tions, the name is all that is required.  In this context the %k
              data source is also available.

       -R report
              Report  on  name/memory  pairs as they are recovered or created.
              To suppress the report for elements recovered from  prev  either
              do  not  specify  -i  or  specify  -B as the empty string (which
              defeats -i).

              The empty report string suppresses all output, acting as grep's
              -s option.  This builds the state file slightly faster than
              using >/dev/null.

       -s
              Output in stable order.  Produce the state GDBM and report  out-
              put  after  reading  all  input,  in the order the elements were
              first encountered.  This creates another temporary  file,  which
              may slow performance.

       -S
              Compress  sequential  duplicate  keys  into a single occurrence.
              This is useful to remove noise from an otherwise  clear  signal.
              Reset for each input file (that is to say sequential keys across
              file boundaries  are  unique  occurrences).   This  really  only
              impacts the counts under -c.

       -v
              Invert the sense of the prev GDBM.  Any key which doesn't exist
              in prev is discarded without consulting state.  This allows an
              intersection operation between element lists.  The option is
              named for grep's inversion option.  This also works in
              combination with -d to select non-duplicated elements from the
              input stream.  (Under -c and -d each selected element should
              have a count of 1.)

       -V
              Show only ksb-style version information.

       -x extract
              When counting under either -c or -d use extract to parse the
              previous memory value (or key) to find the last count.  The
              default value is "%v", which draws the count from a leading
              integer in the previous value.  If the integer were the last
              word (separated by white-space) the value %[v $] (quoted from
              the shell) would extract it.  After the extraction any leading
              white-space is removed before strtoul(3) converts the digits
              with a base set to 0 (numbers in hex, octal, or decimal are
              converted correctly).

       -z
              Expect  find's  -print0  output as input files.  All input lines
              are terminated with a NUL character rather than a NL.  Any  out-
              put  is  sent  with the same encoding as the input, which is not
              always what you'd want -- but might be what xapply  wants.   See
              ascii(7) and find(1).

DETAILS
       The state and prev specifications may indicate the same GDBM file.  In
       that case the -i flag replays the elements from the common file, then
       additional elements are processed into that same file.  If no -i flag
       is presented, the specification of a prev file which is the same as the
       state file is a no-op.

LOGICAL OPERATIONS
       Like comm(1), oue is often used to perform set operations on key lists:

       To union two key lists use the same state file for both.

       To intersect two key lists build a state file from the first list  with
       output to /dev/null, then use that as prev under -v for the other.

       To form the disjunction of two key lists build the intersection in a
       state file, then use that as prev for both lists.  This is the long way
       around, but it works.

       The  intersection  operation  may be done in a single pass, if it is an
       invariant that each list has only unique elements:  use  -d  with  both
       files as input.
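
       For comparison, the union and intersection described above can be
       expressed with plain awk on two key files.  These helpers are
       hypothetical stand-ins (key_union and key_intersect are names chosen
       for this sketch) and use in-memory arrays where oue would use its
       state and prev GDBM files.

```shell
# Union: one shared "seen" table across both lists, like a shared state file.
key_union() {
    awk '!seen[$0]++' "$1" "$2"
}

# Intersection: remember the first list silently, then pass only keys from
# the second list that were recorded (the role prev plays under -v).
key_intersect() {
    awk 'FILENAME == ARGV[1] { first[$0] = 1; next }
         ($0 in first) && !seen[$0]++' "$1" "$2"
}
```

       Both helpers emit keys in the order they are first accepted, which
       matches the streaming behavior of the default report.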

EXAMPLES
       spell /etc/motd | oue | fmt
              Check the message of the day file for unique misspellings.

       jot -r 10 1 100 | oue | fmt
              Sometimes outputs fewer than 10 elements (about 37.2% of the
              time).

       jot -r 10 1 100 | oue -D memory.db | fmt
              As this command is repeated it outputs fewer and fewer  numbers,
              until at last all 100 integers have been selected.

       oue -iI memory.db /dev/null | wc -l ; rm memory.db
              See how many of the possible integers we hit after some updates
              to memory.db, then zero the score board.

       generate-host-names | oue | xapply -f -P4 ... -
              Visit each host in the list generated only once, visit  four  of
              them in parallel.

        ... | xapply -mf -P4 'expose %1 | oue -D dupes.db' -
              Eliminate duplicates from each peer process before they are out-
              put to the common stdout.

       oue -k '%[1:7]' /etc/passwd
              Output each unique shell from /etc/passwd.

       oue -c -k '%[1:7]' /etc/passwd
              Output the count of the unique shells from /etc/passwd.

       oue -k '%[1:7]' -r '%[1:1]' -R "%m uses %k" /etc/passwd
              Output the first login  from  /etc/passwd  that  uses  a  unique
              shell.

       oue -ck '%[1:7]' -r '%[1:1]' -R "%t use %k (first %m)" /etc/passwd
              Report  the  first  login  from  /etc/passwd  that uses a unique
              shell, and how many others use the same one.

       oue -l -ck '%[1:7]' -r '%[1:1]' -R "%t use %k (last %m on %f:%n)" /etc/passwd
              Same as the above, but report the last use  of  the  shell,  and
              which line specified it.

       oue -lk '%[1:7]' /etc/passwd |
       oue -lck '%[1/$]' -e '%P %1' -R '%c paths to %e'
              Output   only   the   shells   with  multiple  full  paths  from
              /etc/passwd.

       oue -d -k '%[1:1]' /etc/group
              Report duplicate group names (change field 1 to 3 to catch
              duplicate gids).

       oue -dc -k '%[1:3]' -R "%t %1" /etc/group
              Report the duplicate group gids with the count of offending
              lines.  Without the -c switch the output always reports only a
              count of two; the rest are ignored as the key has already met
              the duplicate criteria.

       oue -dl -k '%[1:3]' -R "%t %1" /etc/group
              The same output as above, for a different reason.  We asked  for
              the last offending element (so we get the larger counts).

       oue -k "%[1 1]" -r "%[1 -1]" -R "%k %m"
              Compute  uniqueness  based  on  the first word in each line, but
              report the whole line.  When a line has no spaces the first word
              is  repeated,  which might be a bug or a feature. There are many
              ways to filter the incoming stream for format before oue  parses
              it.

       oue -k "%[1 1]" -r "%1" -R "%m"
              Same as above but don't duplicate the first word when it is
              alone.  This makes the state GDBM a bit bigger as it saves two
              copies of the first word for each unique line.

       jot 98 2 |xapply -f factor - |sed -n -e 's/^\([0-9]*\): \1$/\1/p' |
       oue -D prime.db >/dev/null
              Build  a  GDBM file of the primes below 100, which is referenced
              below.

       jot -r 10 1 100 | oue -I prime.db | fmt
              Same as the first example, but never  include  any  prime  below
              100.

       jot -r 10 1 100 | oue -I prime.db -D memory.db | fmt
              Same as the second example, but never include a prime below 100.
              The primes are not included in the memory.db GDBM either.

       factor $N | tr ' ' '\n' | grep -v : | oue -cl -R '%1^%c' | xargs
              Output the factorization of $N as a product of primes raised to
              a power.

       yes dup | head -100 | oue -clS
              Reports 1 unique occurrence of the word "dup"; all 100 are adja-
              cent so the -S compresses them into a single match.

       last | oue -k '%[1 1]' -R %1 | ${PAGER:-less}
              Report just the last login time for  each  account.   This  uses
              oue's  percent  markup  to select just the login from each line,
              but report the whole line.

       who | oue -k '%[1 1]' -r %1 -R %m
              A similar filter to compress the output of who(1).  We remember
              the whole line to show the first login record for each unique
              current login.

       oue -ld -e '%P,%[1:1]' -k '%[1:3]' -R '%k:%e' /etc/group
              Report any duplicate group ids from /etc/group, and the list  of
              groups that share each.

       oue -ld -e '%P,%[1:1]' -k '%[1:4]' -R '%k:%e' /etc/passwd
              Report logins from /etc/passwd that share a common primary login
              group.

       grep . *.report | oue -k '%[1:-1]' -R '%[1.1]:%[1:-1]'
              Search each report file for unique  notifications.   Report  the
              name  of  the reporting host and the unique message.  Add a prev
              file which includes all the noise lines and you've got something
              to filter nightly reports.

       oue -V
              Output the standard version information.

       oue -b '?' /dev/null
              Output the scalar table for the length specification.

       find ... -print0 | oue -z ... | tr '\000' '|'
              Change  the NUL character separator used by find(1) and oue into
              a pipe (|) with tr(1).  (This leaves an extra pipe on the end of
              the output, sadly.)

       oue -k "%[1 1]" -r "%f:%n" -R "%k from %m" ...
              A  more  useful  record  of where oue found each unique element.
              Note that stdin is reported as the file named "-".

       find . -name RCS -prune -o -type f -name '*,v' -print 2>/dev/null | \
                   oue -k '%[1/-$]' -R '%1'
              Report the first RCS delta  file  from  each  non-RCS  directory
              below the current.  See rcs(1).

       rm -f /tmp/my.db ; oue -D /tmp/my.db /dev/null ; \
                   xapply 'oue -D /tmp/my.db <%1 >%u && mv %u %1' *.cl
              Replace each file in the current directory that matches the *.cl
              glob with only the lines not repeated in any other file.

       rm -f /tmp/my.db ; oue -D /tmp/my.db known.ok >/dev/null ; \
                   xapply 'oue -D /tmp/my.db <%1 >%u && mv %u %1' *.cl
              To make the previous spell more useful, include a file of common
              lines  to suppress in every file (and redirect the output to the
              null device).

       oue -I /tmp/my.db ../old/*.cl
              As a follow-up to the last two examples: show unique lines  from
              the  sibling  old directory's *.cl files that are not present in
              any file matched in the current directory.

       TDB=$(mktemp -t twentyone)
       find */ -type f -mtime -21 -print |oue -SD $TDB -k '%[1/-$]' >/dev/null
       find * -type d -print |oue -I $TDB
       rm $TDB
              Find all the directories under the current that have no files
              updated in the last 21 days.  Faster and clearer than using
              comm(1).
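
       The 37.2% figure quoted for the first jot(1) example can be checked
       with a little arithmetic.  The sketch assumes that jot -r draws each
       of the 10 integers uniformly and independently from 1 to 100; the
       function name repeat_chance is made up for this example.  The chance
       that all 10 draws differ is the product of (100 - k)/100 for k from
       0 to 9, and the chance of at least one repeat is one minus that.

```shell
# Chance, in percent, that 10 uniform draws from 1..100 contain a repeat.
repeat_chance() {
    awk 'BEGIN {
        p = 1                        # probability that all draws so far differ
        for (k = 0; k < 10; k++)
            p *= (100 - k) / 100
        printf "%.1f%%\n", (1 - p) * 100
    }'
}

repeat_chance    # prints 37.2%
```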

BUGS
       The overload of -v to both invert the prev selection and invert  dupli-
       cate selection may force some filters to be split into 2 processes.

       The  use  of  rm(1) to remove state files (viz. prev or state) is quite
       likely to race with parallel instances of oue.  Some  protocol-specific
       invariant  should  be  used to assure that any targeted state files are
       not (soon to be) in use.

       This is not compatible with the perl version, but it is far  more  use-
       ful.

       There  is  no  easy  way  to merge state GDBM files, but a C or perl(1)
       program to do that is trivial.  I've also never needed to do  a  merge.
       One may use cp(1) to copy GDBM files, as long as they are not presently
       open.

AUTHOR
       KS Braunsdorf
       NonPlayer Character Guild
       oue swirl spam dot ksb dot npcguild.org remove spam dot.

SEE ALSO
       sort(1), uniq(1), xapply(1l), rm(1), jot(1) or seq(1), wc(1),  gdbm(3),
       comm(1), rcs(1), apply(1), perl(1), factor(6), xargs(1), grep(1)



                                     LOCAL                              OUE(1)
