~minshall/csvedepli

grep- and sed-like programs for .csv files
~minshall/csvedepli

New ticket tracker added

1 year, 1 day ago
~minshall/csvedepli

New git repository added

1 year, 1 day ago

#Table of Contents

#building and installing

first, clone the repository

git clone https://git.sr.ht/~minshall/csvedepli

then, you will need the following, which is a prerequisite for building the package:

gengetopt is probably available as a package on your system.

additionally, the package uses the ctlc template library – which is packaged as a git submodule. The makefile includes a mechanism (see the initialized target) to automatically set this up when building the default (all) target. but, once you have cloned the repository, you can do this manually by

make submodules

to initialize the submodule.

once the prerequisites are installed, the following supposedly builds and installs the binaries and man pages:

make
sudo make install

#other make targets, prerequisites

  • manpages:, manhtmls:: building the man pages (and the .html versions thereof) requires:

    • emacs (probably a package on your system)
  • readme:: building README.md (from README.org) also requires

  • check:: to run the unit tests, requires

    • cram (possibly a package on your system)

#introduction, motivation

the goal is to allow grep-style searching and reporting, and sed-style searching, reporting, and modifying, operations to be performed on .csv files, in a way that allows some independence from such things as quoting and/or the order of columns (relying, instead, on column names, rather than numbers – this is an *option*, not the default).

in this introduction, i assume you are familiar with sed(1), grep(1), and the notion of regular expressions. apologies if you are not – but, there are many resources on the web to find out about them.

csvedepli is also an experiment to see if writing the patterns and actions in the same "form" – i.e., a table – as the data to which those are to be applied might help simplify the task. (i think the jury is still out.)

#motivation

cleaning up data is a regular task for people working with data sets, especially data sets imported from various sources. sedcsv and grepcsv offer some modest help with this work.

as a running example, i will present some data based loosely on some the early .csv files that have been provided daily by Johns Hopkins University since the beginning, in 2020, of the Covid-19 pandemic. In the early days, the naming of geographic entities – countries, states, etc., – was somewhat fluid, which is what motivates these examples.

let us say that we had received data like the following.1

Province_State Country_Region Last Update Confirmed Deaths Recovered Latitude Longitude FIPS Admin2
Berkeley, CA US 2020-03-03T23:53:03 1 0 0 37.8715 -122.273    
Taiwan Taipei and environs 2020-03-10T19:13:15 47 1 17 23.7 121.0    
Guam US 2020-03-15T18:20:19 3 0 0 13.4443 144.7937    
  Guam 2020-03-16T02:13:16 3 0 0 13.4443 144.7937    
  Taiwan* 2020-03-23 06:45:13 195 2 28 23.7 121.0    
  The Gambia 2020-03-17T23:33:02 1 0 0 13.4667 -16.6    
  Gambia, The 2020-03-18T14:13:56 1 0 0 13.4432 -15.3101    
  Gambia 3/22/20 23:45 1 0 0 13.4432 -15.3101    

as already mentioned, in the early days the naming for countries, states, etc., was not yet standardized.

let's say our standard was to use "United States" for the United States; to list Guam as a province/state within the United States; to list Taiwan as "Taiwan, Province of China"; and "Gambia, The" for the Gambia.

to standardize our input, we might write a sedcsv script in some data application (spreadsheet, R data frame, emacs org-mode table) something like this:1

FIPS Admin2 Country_Region Province_State
    /^US$/;c United States  
c 66000   /^Guam$/;c United States c Guam
  c Alameda County   /^Berkeley, CA$/;c California
    /^Taipei and environs$/;c Taiwan, Province of China /^Taiwan$/;c
    /^Taiwan$/;c Taiwan, Province of China /^Taiwan$/;c
    /^Taiwan[*]$/;c Taiwan, Province of China  
    /^The Gambia$/;c Gambia, The  
    /^Gambia$/;c Gambia, The  

and then export this table and the original data to .csv files, to be used as inputs to sedcsv.

in the above "script", the second row, for example, says that if a row, *in the input data file*, has a Country_Region cell with the exact value2 Guam, then *replace*

  • the FIPS field with the value "66000",
  • the Country_Region field with "United States", and
  • the Province_State field with "Guam".

with similar instructions for data that claim to have come from the state of Berkeley, or various variant expressions of Taiwan and the Gambia.

notice that the order of columns in the data and script tables is different. running either sedcsv or grepcsv with the --labels argument says to match columns in the script file and those in the data file(s) using the provided column names; otherwise, column numbers are used.

#representation

one issue to keep in mind is the difference between what you see in a table, spreadsheet, etc., within some program, and you see in a .csv file exported from that table. for example, if the table looks like this:

one cell   after another

(notice the empty, second, column.) then the .csv may well look like this:

one cell,,after another

but, it doesn't *need* to look like that. it could, depending on your data analysis software, and how you have it configured, look like

"one cell","","after another"

this, when imported back into a table, gives the original table:

one cell   after another

i.e., without double quotes surrounding each cell.

the above data table, with "Berkeley", "Guam", etc., when converted to a .csv file, might look like this:

Province_State,Country_Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude,FIPS,Admin2
"Berkeley, CA",US,2020-03-03T23:53:03,1,0,0,37.8715,-122.273,,
Taiwan,Taipei and environs,2020-03-10T19:13:15,47,1,17,23.7,121.0,,
Guam,US,2020-03-15T18:20:19,3,0,0,13.4443,144.7937,,
,Guam,2020-03-16T02:13:16,3,0,0,13.4443,144.7937,,
,Taiwan*,2020-03-23 06:45:13,195,2,28,23.7,121.0,,

and, here is how we might use the script (also from above) acting on this data:

sedcsv --labels -f fix-cv19-script.csv cv19-data.csv

to produce this .csv file:

Province_State,Country_Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude,FIPS,Admin2
California,United States,2020-03-03T23:53:03,1,0,0,37.8715,-122.273,,Alameda County
,"Taiwan, Province of China",2020-03-10T19:13:15,47,1,17,23.7,121.0,,
Guam,United States,2020-03-15T18:20:19,3,0,0,13.4443,144.7937,,
Guam,United States,2020-03-16T02:13:16,3,0,0,13.4443,144.7937,66000,
,"Taiwan, Province of China",2020-03-23 06:45:13,195,2,28,23.7,121.0,,
,"Gambia, The",2020-03-17T23:33:02,1,0,0,13.4667,-16.6,,
,"Gambia, The",2020-03-18T14:13:56,1,0,0,13.4432,-15.3101,,
,"Gambia, The",3/22/20 23:45,1,0,0,13.4432,-15.3101,,

there are certain rules for how a .csv file should be formatted. 3

  • any cell that contains a double quotation mark, a newline, or a comma, *must* be encapsulated in a pair of double quotation marks.
    • any double quotes inside a cell *must* be /escaped/ by placing a second double quote next to it.
  • any other cell *may* be quoted

in the above .csv representation, you see that the string that appeared as plain Berkeley, CA, in the table, appears surrounded by quotes in the .csv file, e.g., "Berkeley, CA".

if we had happened to have a table that included cells with "internal" double quotes, we would see a pair of quotes in an output .csv file for each single quote in the input table. for example, this table

He said, "she will be next."
"Good," she replied.

when converted to a .csv table, looks at *least* like:

"He said,","""she will be next."""
"""Good,""",she replied.

(there might also be quotes – quite unnecessary, but quite RFC4180-legal – around the second cell of the second row "she replied.".)

this is an issue because both sedcsv and grepcsv allow you to specify a representation against which to compare and, in the case of sedcsv, modify. the default is neutral, but if you are looking exclusively at one or more .csv files, you might find it more natural to write your patterns and commands using --representation csv.

as an example, say you wanted to change the gender of the pronoun "she" in the second column, but only in the "Good," line. you would write a script like this:1

/^"Good,"$/ s/she/he/

and then might execute it like this, specifying a representation of csv:

sedcsv --representation csv -f dquotes-script.csv dquotes.csv
He said, "she will be next."
"Good," she replied.

but, the gender of the pronoun "she" has not changed in the second row. if we look at the script file as a .csv file:

"/^""Good,""$/",s/she/he/

and compare it to the data table, converted to a .csv file:

"He said,","""she will be next."""
"""Good,""",she replied.

we see the problem. the .csv representation has more double quotation marks that we might have (naively) expected.

on the other hand, if we use the same script and data files, but invoke sedcsv with --representation neutral (the default):

sedcsv --representation neutral -f dquotes-script.csv dquotes.csv
He said, "she will be next."
"Good," he replied.

it's again a source of possible confusion that the representation we are talking about here is that of the *data* file, but that it is in the *script* file that you may need to compensate for whatever representation you are using.

#guards versus commands

this topic mostly affects sedcsv, but is also relevant when running grepcsv with the -S- (--sed-format) option.

a traditional sed(1) script consists of a series of "commands", separated by colon (';'). a given command can have an address part, which restricts the command to certain portions of the file. the address can be a regular expression, one or more line numbers.

sedcsv adds an extra concept4, called a "guard", currently a regular expression (though, really, any sed address could probably be implemented). each column in a script file may have one or more guards.

*all* the guards in a given script row must succeed for a given data row before *any* of the commands in that script row will be attempted on that data row. we might invent some terminology here

  • "script row M is *compatible with* data row N": if and only if all the guards in M pass when applied to N.

so, if script row M is compatible with data row N, then each command in script row M will be attempted for data row N. any of these commands may *itself* contain an *address*; assuming script row M is compatible with data row N, then only those commands in script row M with an address that matches the equivalent column in data row N (or, that have no address) will have its command applied. note that, for a given script row and data row, this means that an *address* for a command in one column does not affect the processing of a command in the same, or any other, column.

sorry if this is a source of confusion.

  • guards: sed-style addresses (currently, only a regular expression) separated from other addresses and/or commands by a semi-colon (';')
  • addresses: sed-style addresses, immediately preceding its associated single-letter sed command. (i.e., *not* suffixed with a semi-colon.

take this script table as an example:

declarative interrogative
/this/s/t/T/ /what/a ?
/that/;s/t/T/ /how/;a ?

the first row says that

  • any row with a string "this" in the "declarative" column will have it capitalized; and, independently,
  • any row with a string "what" in the "interrogative" column will have a question mark appended (to the end of the column, not necessarily immediately after "what") 5

this first row has *no* guards, just addresses for each of its commands. this script row will be compatible with *all* rows of any input data file.

the second row, which looks so similar, says that:

  • any data row that has *both* "that" in the declarative column and "how" in the interrogative is subject to the following changes:
    • capitalize the initial 't' in "that" in the declarative column
    • append a question mark to the end of the contents of the interrogative column
declarative interrogative
this is what
that is what
this is how
that is how
sedcsv --labels -f s1.csv d1.csv
declarative interrogative
This is what?
that is what?
This is how
That is how?

for grepcsv, a row in the data file is said to match the patterns in the script file if at least one row of the script file is compatible with that data row.

#sedcsv: apply sed(1)-like editing to a .csv file

this program is supposed to allow easy-enough "stream" editing of .csv files.

the program is fed a script and data file. both are .csv files.

the contents of a (row,column) we call a cell.

the script contains one or more rows. the editing/filtering operations in a cell in the script applies to a column in each row of the data file.

each cell in the script file can have one or more guards, as well as one or more commands.

guards are like sed addresses, but they end at the end of the cell, or are separated by other elements of the cell (other guards and/or commands) by a semi-colon (";").

when processing one row in the *data* file, for a row in the *script* file to be considered "active", all the guards in all the columns must succeed; i.e., the guards in a row in a script file are considered to be ANDed together. if the row in the script file is active, then its commands are processed.

now, this is maybe a bit confusing: a command can, itself, have an address, i.e., like a guard for that command. these guards are *not* separated from the command by a semi-colon – like in "normal" sed(1), they prefix and abut the command.

if a row of the script file is active, then its commands are processed. (sound familiar? :) if a command has its own guard, it is executed iff its guard is successful; otherwise, the command is ignored.

#guards (addresses):

sed(1) has many forms of indicating which lines of an input file should be processed. sedcsv only supports a limited number of these forms.

these addressing forms *should* be allowed

  • [X] N: line N
  • [ ] first~step: every step'th line after first
  • [X] /regexp/: lines that match regexp
  • [X] \cregexpc: lines that match regexp
  • [X] address1,address2: from a line that matches address1 to one that matches address2, inclusive

we should support using a

  • [ ] backslash ("\") to escape a backslash in regexp in /regexp/. etc.

this form likely will *not* be allowed

  • [ ] $: the last line in the file. too hard.

additionally, we may allow the

  • [ ] ! flag: negates the sense of the address, i.e., lines that do *not* match the address are matched. :)
  • [ ] //: match most recent regexp

a problem with ! is that after parsing the script file, we plan to build, from all its row for a given column, *one* regexp containing all the expressions. if that's possible. (if may not, if matching an earlier regexp sub-expression means *not* matching a subsequent sub-expression.)

#commands

likewise, sed(1) contains many commands. here are commands we should support

  • [X] c[text]: replace the contents of the cell with text
  • [X] d: delete the current row
  • [X] p, P: print the current row. can be used in conjunction with --quiet to only print selected lines. (also, if --invert is also specified, will only print "non-selected" lines.)
  • [X] q[exit-code]: quit with an exit code (after printing the current row)
  • [X] Q[exit-code]: quit with an exit code (do not print the current row)
  • [X] s/regexp/replacement/: replace the contents of that part of the cell contents that matches regexp. (note: see in addresses above, "\c…c": same applies here to, e.g., "sXregexpXreplacement".) there are hacks, though: accept "s.*", then look at the first character, build a regexp for that character, etc. sounds messy, but maybe works.
  • [X] y/source/dest/: transliterate a character in the contents of the cell that appears in source with the corresponding character in dest

#flags for 's' command

Sed's "substitute" ('s') command has a number of flags defined. most are *not* implemented; here is the roster of what is and what isn't implemented in sedcsv.

first, flags that *should* be implemented:

  • [X] 'g': replace *all* matches (else, just first)
  • [X] NUMBER: only replace the NUMBERth match
  • [X] p: print the "pattern space" if a substitution is made
  • [X] i, I: do a case-insensitive match

second, flags that aren't likely to be implemented:

  • w: if a substitution is made, write out to a specified file
  • e: if a substitution is made, pipe results (i guess) through this command, and replace contents of the cell with the output
  • m, M: "multi-line mode" (one would like to believe that .csv files are *not* multi-line entities, but, of course, quoted columns can span lines)

#'c' command syntax

The proposed syntax for the 'c' ("change") command is slightly different from "standard" sed:

  • c TEXT: the "standard GNU extension" is supported. TEXT extends to the end of the contents of the column in the script file.
  • c\\\nTEXT...: (this is an attempt to represent the standard POSIX variant of 'c' – a c, followed by a backslash, [possibly?] followed by a newline). this is *not* supported.

our "additions":

  • c: deletes the text in the column. this is a sop to the difficulty a user might encounter trying to get a spreadsheet to output a .csv file with a trailing space. (since, by the standard GNU extension, =c = would delete text in the column.)
  • c/.../: replace the column text with ...; forward slashes included in ... can be escaped with a back slash. (how difficult it might be to move a backslash through the spread sheet, out to the .csv file, is unknown.) (un-backslashed) back slashes before a forward slash are removed; all other back slashes are left in place.

note that the 'd' command deletes the entire *row*, not the contents of a single column.

for more information, please see the man page.

#differences from sed(1) behavior

the main difference is in scope; sed is much richer than sedcsv. but, there are some other differences in features implemented.

commands

  • 'a', 'i': in sedcsv, these commands both affect text in the current column. in sed, they add text after or before the current line of text.
  • 'c': in sed, these must be followed by a character. in sedcsv, if the column ends right after the 'c', it effectively deletes the contents of that cell. (for implementation reasons, 'a' and 'i' inherit this same behavior, though the utility of this is dubious.)

#address compatibility guide

this is an attempt to document the features sed(1) provides for addresses, along with what gnu adds, and which ones we support and, for those not supported, what ones are "planned" (in the sense of, seem tractable, likely to be implemented) and how the sedcsv feature differs from the sed one:

address feature supported "planned" modified
/…/ (by regexp) x    
\%…% (by regexp) x    
by line number x   rowno
ranges (',') x    
$ (last line)      
NUM~NUM (gnu)      
inverting (/…/!)   x  
case independent (/…/I) x    
multi-line (/…/M)      
0,/…/ (gnu)      
ADDR,+N (gnu)      
ADDR,~N (gnu)      

#grepcsv: filter rows in a .csv file by column contents

Let's say you have a .csv file named "horror.csv", with (some version of) the shortest English-language horror story ever written:

setup event response
The last man in the world sat in a room. He heard a knock on the door. "Come in", he said

And, what if you wanted to find and print out this particular line (maybe horror.csv contains various stories, inexplicably laid out in columns!)

setup event response
  knock on the door  
grepcsv --labels -f horror-script.csv horror1.csv
setup event response
The last man in the world sat in a room. He heard a knock on the door. """Come in"", he said"

--labels says to use the first line of the /script/ file (the argument of the -f option) and of the /data/ file (horror.csv, in this case) as column labels, and associate patterns in the script file with the contents of the similarly-named columns in the data file. (the default is to use the column position – column number – of the pattern in the script file to select a column to search in the data file.)

But, what if someone thinks of the columns differently. Say

event setup response
He heard a knock on the door. The last man in the world had been sitting in a room. "Come in", he said
grepcsv --labels -f horror-script.csv horror2.csv
event setup response
He heard a knock on the door. The last man in the world had been sitting in a room. """Come in"", he said"

and, what if you wanted to search *both* files?

grepcsv --labels --no-print-labels --no-filename -f <(echo -e 'event\nknock on the door') horror?.csv
The last man in the world sat in a room. He heard a knock on the door. """Come in"", he said"
He heard a knock on the door. The last man in the world had been sitting in a room. """Come in"", he said"

For more information, please see the man page.

#grep flags that are, to be, that should be, supported

basically, this is list of things that are probably easy to implement, and make sense to me

"general"

  • [X] –help
  • [X] –V, –version

"pattern syntax"

  • [X] -E, –extended-regexp
  • [ ] -F, –fixed-strings
  • [ ] -G, –basic-regexp (see sed info page "Basic (BRE) and extended (ERE) regular expression")
  • [ ] -P, –perl-regexp – probably, though i don't know enough to know how hard this might be

"matching control"

  • [X] -i, –ignore-case
  • [X] -v, –invert-match
  • [X] -w, –word-regexp
  • [X] -x, –line-regexp – as cell-regexp; no -x contents (for whichever column)

"general output control"

  • [X] -c, –count
  • [X] -L, –files-without-match
  • [X] -l, –files-with-matches
  • [X] -m NUM, –max-count=NUM
  • [X] -q, –quiet, –silent
  • [ ] -s, –no-messages (disables error messages)

"output line prefix control"

  • [X] -H, –with-filename
  • [X] -h, –no-filename
  • [ ] –label=LABEL
  • [X] -n, –line-number – probably rather "row number" (there are cases where this is different: labels; or quoted columns with an embedded newline character)

#jsonifycsv: convert a .csv file to a json file

this is a simple program that converts a .csv file to JSON format.

there are switches for

  • column labels (that would be taken from the first row);
  • whether to represent missing values in the .csv file a the null value, or as an empty string, in the output;
  • and, whether number-looking strings in the .csv file should be output as numbers (the default is to output these as strings).

#how to use

jsonifycsv reads its input from stdin, and writes to stdout (error messages are written to stderr).

there are a few flags to help tailor the transformation.

Usage: jsonifycsv [OPTION]...
convert a .csv file into a flatish json file

  -h, --help                    Print help and exit
      --detailed-help           Print help, including all details and hidden
                                  options, and exit
  -V, --version                 Print version and exit

debugging:
  -d, --debug                   Print random debugging messages
  -l, --labels                  The first row of the .csv file contains labels
                                  for subsequent rows  (default=off)
  -m, --missing-value=STRING    Value to take the place of missing values
                                  (default=`null')
  -M, --empty-string-as-missing Treat empty strings ("") with
                                  "--missing-value"  (default=off)
  -n, --numeric                 Print unquoted strings that look like numbers
                                  as numbers on output  (default=off)
  -N, --Numeric                 Like `-n`, but for both quoted and unquoted
                                  string  (default=off)

for more information, see the man page.

#limitations

there are lots of .csv files in the wild. here are some "features" of some .csv files currently not handled by jsonifycsv. possibly some of these can be fixed by using awk(1), or some similar utility, before passing the data through jsonifycsv.

  • if there is a whitespace after the terminating quote ('"') of a double-quoted string, an error will be generated (rfc4180 would say this is an error with the .csv file: "Spaces are considered part of a field and should not be ignored. ")
  • blank lines embedded in the file (especially at the end?) generate errors
  • some .csv files with column labels in the first row also include "row labels", which adds an extra column (as column one) that is not "labeled" in the first row. (rfc4180 does not appear to allow this.)
  • some .csv files leave off some columns on some rows. jsonifycsv checks for consistent number of rows, so will report an error (this is also an error according to rfc4180: "Each line should contain the same number of fields throughout the file.")

some of the above should be fixed by options to jsonifycsv. for example:

  • –ignore-whitespace-only-lines: lines comprised entirely of whitespace should not be processed
  • –row-labels: after the first line, each line has a "row label" as the first column
  • –inconsistent-rows: don't worry about missing columns. but, if -l, then more columns than labels would still generate an error (other than the one extra implied by --row-labels, if specified).

#other tools, alternatives

as an alternative to the current approach, i thought of parsing the script and data files (to understand column locations, in case of --labels), then building humongous regexp's to search or for sed addresses. aesthetically, i don't find that appealing. and, i worry about debugging. but, at least, at the end, the user would *really* be using sed(1) and grep(1). which would be nice (in terms of speed, bugs in my code, and features lacking).

sort of the main added features of these tools is:

  • describing the patterns/scripts via a .csv file
  • using labels (if "–labels") to match columns between the script and data files
  • for sedcsv, having a "guard" associated with a row in the script file that protects an entire row in a data file from being acted on by that row of the script file if the guard doesn't pass

in terms of other tools, among others there are the following:

  • csvqoute – replaces [",], etc., with "invisible" code points, so then awk, sed, grep etc., can work on them. then, at the end, puts things back.
  • csvs-to-sqlite – converts .csv files into sqllite datasets.
  • csvtk – "csv toolkit", i guess. lots of features.
  • csvtools – csvgrep, csvawk. the user documentation appears sparse. csvgrep seems to work with column labels; i don't see that in csvawk. the focus is on speed, which seems very good. one could use the script processing and column-label mapping from my code to construct a sed script, overlaying the guards in front of every script term. sounds kind of messy, but probably not as bad as doing it with raw sed.
  • csvtotable – converts a .csv file to an .html table, for "searching and sorting"
  • trdsql – execute SQL queries on .csv files
  • tsv-utils – from ebay.
  • visidata – vd command; nice curses-oriented spreadsheet viewer
  • xsv – various operations. a bit more like dplyr than grep or sed. `xsv table` has nice formatting for display.

one thing unique in grepcsv, sedcsv, is the idea of passing the patterns/commands in a .csv file, though that's really an encoding issue, not really fundamental. possibly, one could use this sort of file as a front-end to, e.g., sed, by reading the script file, reading the .csv file, building up a massive regexp/script that would do what one wants to do. (modulo sed's limitation on 9 back references.)

compared to some tools, maybe, patterns by column name is a feature. (most, though, seem to support column names.)

#issues, comments, etc.

one obvious issue is performance: the programs are very slow, as the emphasis has been on functionality and (an attempt at) correctness.

but, other than that (or with suggestions for that), please report any issues to the tracker.

#Footnotes

1 note: here, i am showing the data in some tabular form, as it might appear in your data analysis application (i.e., a spreadsheet like gnumeric, a data analysis programming environment like R, or some other application, like emacs org-mode). when running tools like sedcsv or grepcsv, the actual data is assumed to be in, or to have been exported to,a .csv file, as will be seen elsewhere.

2 Because of the "^" and "$" regular expression "assertions".

3 See, for example, RFC4180.

4 Or, more correctly, compensates for its "columnar" nature.

5 This interpretation of the 'a' command, and also of the 'i' command, is *not* like that of sed(1), which appends or inserts the desired text to the next or previous line, not to the contents of the current line.