# csvedepli: sed(1)-like editing of .csv files

first, clone the repository:
git clone https://git.sr.ht/~minshall/csvedepli
then, you will need gengetopt, which is a prerequisite for building the package; it is probably available as a package on your system.
additionally, the package uses the ctl – c template library – which is packaged as a git submodule. the makefile includes a mechanism (see the initialized target) to automatically set this up when building the default (all) target. but, once you have cloned the repository, you can also do this manually by running

make submodules

to initialize the submodule.
once the prerequisites are installed, the following supposedly builds and installs the binaries and man pages:
make
sudo make install
some make targets have additional prerequisites:

- manpages, manhtmls: building the man pages (and the .html versions thereof) requires:
- readme: building README.md (from README.org) also requires:
- check: to run the unit tests, requires:
the goal is to allow grep-style searching and reporting, and sed-style searching, reporting, and modifying operations, to be performed on .csv files, in a way that allows some independence from such things as quoting and/or the order of columns (relying, instead, on column names rather than numbers – this is an *option*, not the default).
in this introduction, i assume you are familiar with sed(1), grep(1), and the notion of regular expressions. apologies if you are not – but, there are many resources on the web to find out about them.
csvedepli
is also an experiment to see if writing the patterns and
actions in the same "form" – i.e., a table – as the data to which
those are to be applied might help simplify the task. (i think the
jury is still out.)
cleaning up data is a regular task for people working with data sets,
especially data sets imported from various sources. sedcsv and grepcsv offer some modest help with this work.
as a running example, i will present some data based loosely on some of the early .csv files that have been provided daily by Johns Hopkins University since the beginning, in 2020, of the Covid-19 pandemic. In the early days, the naming of geographic entities – countries, states, etc. – was somewhat fluid, which is what motivates these examples.
let us say that we had received data like the following:[1]
| Province_State | Country_Region | Last Update | Confirmed | Deaths | Recovered | Latitude | Longitude | FIPS | Admin2 |
|---|---|---|---|---|---|---|---|---|---|
| Berkeley, CA | US | 2020-03-03T23:53:03 | 1 | 0 | 0 | 37.8715 | -122.273 | | |
| Taiwan | Taipei and environs | 2020-03-10T19:13:15 | 47 | 1 | 17 | 23.7 | 121.0 | | |
| Guam | US | 2020-03-15T18:20:19 | 3 | 0 | 0 | 13.4443 | 144.7937 | | |
| | Guam | 2020-03-16T02:13:16 | 3 | 0 | 0 | 13.4443 | 144.7937 | | |
| | Taiwan* | 2020-03-23 06:45:13 | 195 | 2 | 28 | 23.7 | 121.0 | | |
| | The Gambia | 2020-03-17T23:33:02 | 1 | 0 | 0 | 13.4667 | -16.6 | | |
| | Gambia, The | 2020-03-18T14:13:56 | 1 | 0 | 0 | 13.4432 | -15.3101 | | |
| | Gambia | 3/22/20 23:45 | 1 | 0 | 0 | 13.4432 | -15.3101 | | |
as already mentioned, in the early days the naming for countries, states, etc., was not yet standardized.
let's say our standard was to use "United States" for the United States; to list Guam as a province/state within the United States; to list Taiwan as "Taiwan, Province of China"; and "Gambia, The" for the Gambia.
to standardize our input, we might write a sedcsv
script in some data
application (spreadsheet, R data frame, emacs org-mode table)
something like this:[1]
| FIPS | Admin2 | Country_Region | Province_State |
|---|---|---|---|
| | | /^US$/;c United States | |
| c 66000 | | /^Guam$/;c United States | c Guam |
| | c Alameda County | | /^Berkeley, CA$/;c California |
| | | /^Taipei and environs$/;c Taiwan, Province of China | /^Taiwan$/;c |
| | | /^Taiwan$/;c Taiwan, Province of China | /^Taiwan$/;c |
| | | /^Taiwan[*]$/;c Taiwan, Province of China | |
| | | /^The Gambia$/;c Gambia, The | |
| | | /^Gambia$/;c Gambia, The | |
and then export this table and the original data to .csv files, to be used as inputs to sedcsv.
in the above "script", the second row, for example, says that if a row, *in the input data file*, has a Country_Region cell with the exact value[2] Guam, then *replace* the FIPS field with the value "66000", the Country_Region field with "United States", and the Province_State field with "Guam". similar instructions handle data that claims to have come from the state of Berkeley, or from the various variant expressions of Taiwan and the Gambia.
notice that the order of columns in the data and script tables is
different. running either sedcsv
or grepcsv
with the --labels
argument says to match columns in the script file and those in the
data file(s) using the provided column names; otherwise, column
numbers are used.
one issue to keep in mind is the difference between what you see in a table, spreadsheet, etc., within some program, and what you see in a .csv file exported from that table. for example, if the table looks like
this:
| one cell | | after another |
(notice the empty second column.) then the .csv
may well look
like this:
one cell,,after another
but, it doesn't *need* to look like that. it could, depending on your data analysis software, and how you have it configured, look like
"one cell","","after another"
this, when imported back into a table, gives the original table:
| one cell | | after another |
i.e., without double quotes surrounding each cell.
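both export styles can be reproduced with python's csv module; a small illustration (the cell contents are just the example above, not project code):

```python
import csv
import io

# the table row from above, with its empty middle cell
row = ["one cell", "", "after another"]

# minimal quoting: only cells that need quotes get them
minimal = io.StringIO()
csv.writer(minimal, quoting=csv.QUOTE_MINIMAL).writerow(row)
print(minimal.getvalue().strip())   # one cell,,after another

# quote-everything, as some applications are configured to do
full = io.StringIO()
csv.writer(full, quoting=csv.QUOTE_ALL).writerow(row)
print(full.getvalue().strip())      # "one cell","","after another"

# both forms parse back to the same three cells
assert next(csv.reader(io.StringIO(minimal.getvalue()))) == row
assert next(csv.reader(io.StringIO(full.getvalue()))) == row
```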
the above data table, with "Berkeley", "Guam", etc., when converted to
a .csv
file, might look like this:
Province_State,Country_Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude,FIPS,Admin2
"Berkeley, CA",US,2020-03-03T23:53:03,1,0,0,37.8715,-122.273,,
Taiwan,Taipei and environs,2020-03-10T19:13:15,47,1,17,23.7,121.0,,
Guam,US,2020-03-15T18:20:19,3,0,0,13.4443,144.7937,,
,Guam,2020-03-16T02:13:16,3,0,0,13.4443,144.7937,,
,Taiwan*,2020-03-23 06:45:13,195,2,28,23.7,121.0,,
,The Gambia,2020-03-17T23:33:02,1,0,0,13.4667,-16.6,,
,"Gambia, The",2020-03-18T14:13:56,1,0,0,13.4432,-15.3101,,
,Gambia,3/22/20 23:45,1,0,0,13.4432,-15.3101,,
and, here is how we might use the script (also from above) acting on this data:
sedcsv --labels -f fix-cv19-script.csv cv19-data.csv
to produce this .csv file:
Province_State,Country_Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude,FIPS,Admin2
California,United States,2020-03-03T23:53:03,1,0,0,37.8715,-122.273,,Alameda County
,"Taiwan, Province of China",2020-03-10T19:13:15,47,1,17,23.7,121.0,,
Guam,United States,2020-03-15T18:20:19,3,0,0,13.4443,144.7937,,
Guam,United States,2020-03-16T02:13:16,3,0,0,13.4443,144.7937,66000,
,"Taiwan, Province of China",2020-03-23 06:45:13,195,2,28,23.7,121.0,,
,"Gambia, The",2020-03-17T23:33:02,1,0,0,13.4667,-16.6,,
,"Gambia, The",2020-03-18T14:13:56,1,0,0,13.4432,-15.3101,,
,"Gambia, The",3/22/20 23:45,1,0,0,13.4432,-15.3101,,
there are certain rules for how a .csv file should be formatted.[3]
in the above .csv representation, you see that the string that appeared as plain Berkeley, CA in the table appears surrounded by quotes in the .csv file, i.e., "Berkeley, CA".
if we had happened to have a table that included cells with "internal" double quotes, we would see a pair of double quotes in the output .csv file for each such quote in the input table. for example, this table
He said, | "she will be next." |
"Good," | she replied. |
when converted to a .csv file, looks at *least* like:
"He said,","""she will be next."""
"""Good,""",she replied.
(there might also be quotes – quite unnecessary, but quite RFC4180-legal – around the second cell of the second row, "she replied.".)
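the doubling rule is easy to check with python's csv module (an illustration, not part of csvedepli):

```python
import csv
import io

# the two-row table from above
rows = [["He said,", '"she will be next."'],
        ['"Good,"', "she replied."]]

out = io.StringIO()
csv.writer(out).writerows(rows)

# cells containing commas or double quotes get quoted, and each
# internal double quote is doubled; plain cells are left bare
print(out.getvalue())
# "He said,","""she will be next."""
# """Good,""",she replied.
```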
this is an issue because both sedcsv and grepcsv allow you to specify
a representation against which to compare and, in the case of sedcsv,
modify. the default is neutral, but if you are looking exclusively at one or more .csv files, you might find it more natural to write your patterns and commands using --representation csv.
as an example, say you wanted to change the gender of the pronoun
"she" in the second column, but only in the "Good,"
line. you would
write a script like this:[1]
/^"Good,"$/ | s/she/he/ |
and then might execute it like this, specifying a representation of csv:
sedcsv --representation csv -f dquotes-script.csv dquotes.csv
He said, | "she will be next." |
"Good," | she replied. |
but, the gender of the pronoun "she" has not changed in the second
row. if we look at the script file as a .csv
file:
"/^""Good,""$/",s/she/he/
and compare it to the data table, converted to a .csv
file:
"He said,","""she will be next."""
"""Good,""",she replied.
we see the problem. the .csv
representation has more double
quotation marks than we might have (naively) expected.
on the other hand, if we use the same script and data files, but
invoke sedcsv
with --representation neutral
(the default):
sedcsv --representation neutral -f dquotes-script.csv dquotes.csv
He said, | "she will be next." |
"Good," | he replied. |
it's again a source of possible confusion that the representation we are talking about here is that of the *data* file, yet it is in the *script* file that you may need to compensate for whichever representation you are using.
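the mismatch between the two representations can be illustrated in python (my own sketch; this is not how sedcsv is implemented):

```python
import csv
import io
import re

# the parsed ("neutral") value of the first cell of the second data row
neutral = next(csv.reader(io.StringIO('"""Good,""",she replied.')))[0]

# the same cell as it appears in the .csv file (its "csv" representation)
def csv_repr(value):
    buf = io.StringIO()
    csv.writer(buf).writerow([value])
    return buf.getvalue().strip()

# the guard as written in the script, expecting the table's appearance
guard = re.compile(r'^"Good,"$')

print(neutral)            # "Good,"
print(csv_repr(neutral))  # """Good,"""

# the guard matches the neutral representation, but not the csv one
assert guard.search(neutral)
assert not guard.search(csv_repr(neutral))
```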
this topic mostly affects sedcsv
, but is also relevant when running
grepcsv
with the -S (--sed-format) option.
a traditional sed(1) script consists of a series of "commands", separated by semi-colons (';'). a given command can have an address part, which restricts the command to certain portions of the file. the address can be a regular expression or one or more line numbers.
sedcsv
adds an extra concept,[4] called a "guard": currently a regular
expression (though, really, any sed address could probably be
implemented). each column in a script file may have one or more
guards.
*all* the guards in a given script row must succeed for a given data row before *any* of the commands in that script row will be attempted on that data row. we might invent some terminology here: script row M is *compatible with* data row N if and only if all the guards in M pass when applied to N.

so, if script row M is compatible with data row N, then each command in script row M will be attempted for data row N. any of these commands may *itself* contain an *address*; assuming script row M is compatible with data row N, only those commands in script row M whose address matches the equivalent column in data row N (or that have no address) will be applied.

note that, for a given script row and data row, this means that an *address* for a command in one column does not affect the processing of a command in the same, or any other, column. sorry if this is a source of confusion.
a cell in a script file can thus contain:

- *guards*: addresses (currently, only a regular expression), each separated from other addresses and/or commands by a semi-colon (';')
- *addresses*, each immediately preceding its associated single-letter sed command (i.e., *not* suffixed with a semi-colon)

take this script table as an example:
| declarative | interrogative |
|---|---|
| /this/s/t/T/ | /what/a ? |
| /that/;s/t/T/ | /how/;a ? |
the first row has *no* guards, just an address for each of its commands, so it is compatible with *all* rows of any input data file: in each data row, if the declarative cell matches /this/, s/t/T/ is applied to it, and, independently, if the interrogative cell matches /what/, a '?' is appended.

the second row, which looks so similar, says something quite different: /that/ and /how/ are *guards*, so a data row's declarative cell must match /that/ *and* its interrogative cell must match /how/ before either of the (unaddressed) commands s/t/T/ and a ? is applied.

given this data file:
| declarative | interrogative |
|---|---|
| this is | what |
| that is | what |
| this is | how |
| that is | how |
running

sedcsv --labels -f s1.csv d1.csv

produces:
| declarative | interrogative |
|---|---|
| This is | what? |
| that is | what? |
| This is | how |
| That is | how? |
for grepcsv, a row in the data file is said to match the patterns
in the script file if at least one row of the script file is
compatible with that data row.
## sedcsv: sed(1)-like editing of a .csv file

this program is supposed to allow easy-enough "stream" editing of .csv files.
the program is fed a script and a data file; both are .csv files. the contents of a (row, column) pair we call a cell.
the script contains one or more rows. the editing/filtering operations in a cell of the script apply to one column of each row of the data file.
each cell in the script file can have one or more guards, as well as one or more commands. guards are like sed addresses, but they end at the end of the cell, or are separated from other elements of the cell (other guards and/or commands) by a semi-colon (";").
when processing one row in the *data* file, for a row in the *script*
file to be considered "active", all the guards in all the columns must
succeed; i.e., the guards in a row in a script file are considered to
be ANDed
together. if the row in the script file is active, then
its commands are processed.
now, this is maybe a bit confusing: a command
can, itself, have an
address
, i.e., like a guard
for that command. these guards are
*not* separated from the command by a semi-colon – like in "normal"
sed(1)
, they prefix and abut the command.
if a row of the script file is active, then its commands are processed. (sound familiar? :) if a command has its own guard, it is executed iff its guard is successful; otherwise, the command is ignored.
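to make the model concrete, here is a hypothetical python sketch of evaluating one script row against one data row (the data structures and names are mine, not sedcsv's):

```python
import re

def row_is_compatible(guards, data_row):
    """guards maps column -> list of regexps; all are ANDed together."""
    return all(re.search(g, data_row[col])
               for col, col_guards in guards.items()
               for g in col_guards)

def apply_script_row(guards, commands, data_row):
    """commands maps column -> list of (address, action) pairs; an
    address, if present, gates only its own command."""
    if not row_is_compatible(guards, data_row):
        return data_row
    out = list(data_row)
    for col, col_commands in commands.items():
        for address, action in col_commands:
            if address is None or re.search(address, out[col]):
                out[col] = action(out[col])
    return out

# the script row  /that/;s/t/T/ | /how/;a ?  from the example earlier
guards = {0: [r"that"], 1: [r"how"]}
commands = {0: [(None, lambda cell: cell.replace("t", "T", 1))],
            1: [(None, lambda cell: cell + "?")]}

print(apply_script_row(guards, commands, ["that is", "how"]))  # ['That is', 'how?']
print(apply_script_row(guards, commands, ["this is", "how"]))  # unchanged: a guard fails
```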
sed(1)
has many forms of indicating which lines of an input file
should be processed. sedcsv
only supports a limited number of these
forms.
these addressing forms *should* be allowed:

- a line (row) number
- /regexp/: rows whose cell matches the regular expression regexp
- address1,address2: from a row that matches address1 to one that matches address2, inclusive

we should support using an arbitrary delimiter, as in \cregexpc, wherever we support a regexp in /regexp/, etc.

this form likely will *not* be allowed:

- first~step: every step'th line after first

additionally, we may allow negating the regexp with '!'.
a problem with ! is that, after parsing the script file, we plan to build, from all its rows for a given column, *one* regexp containing all the expressions. if that's possible. (it may not be, if matching an earlier regexp sub-expression means *not* matching a subsequent sub-expression.)
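the concern can be seen with python's re module (an illustration of the idea, not the planned implementation):

```python
import re

# combine several per-row patterns for one column into one alternation,
# using groups to discover which row's pattern matched
patterns = [r"^Taiwan$", r"^Taiwan[*]$", r"^The Gambia$"]
combined = re.compile("|".join("(%s)" % p for p in patterns))

m = combined.search("Taiwan*")
print(m.lastindex)  # 2: the second sub-expression matched

# the worry: alternation is ordered, so an earlier sub-expression can
# shadow a later, longer one
overlapping = re.compile("(Gambia)|(Gambia, The)")
print(overlapping.search("Gambia, The").lastindex)  # 1, not 2
```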
likewise, sed(1) contains many commands. here are commands we should support:

- a, i, c: append, insert, or change text
- p: print the row; useful with --quiet, to only print selected lines. (also, if --invert is also specified, will only print "non-selected" lines.)
- s: substitute a replacement for a regexp. (note: see in addresses above, "\c…c": the same applies here to, e.g., "sXregexpXreplacement". there are hacks, though: accept "s.*", then look at the first character, build a regexp for that character, etc. sounds messy, but maybe works.)
- y: transliterate each character in source with the corresponding character in dest
### the 's' command

sed's "substitute" ('s') command has a number of flags defined. most are *not* implemented; here is the roster of what is and what isn't implemented in sedcsv.
first, flags that *should* be implemented:

- 'g': replace *all* matches (else, just the first)
- NUMBER: only replace the NUMBERth match
- 'i', 'I': do a case-insensitive match

second, flags that aren't likely to be implemented:

- 'w': if a substitution is made, write out to a specified file
- 'e': if a substitution is made, pipe the results (i guess) through this command, and replace the contents of the cell with the output
- 'm', 'M': "multi-line mode" (one would like to believe that .csv files are *not* multi-line entities, but, of course, quoted columns can span lines)
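in python terms (an illustration only; sedcsv is a C program), the flags to be implemented behave like:

```python
import re

cell = "she sells sea shells"

# sed's default: replace only the first match
print(re.sub("s", "S", cell, count=1))  # She sells sea shells

# 'g': replace *all* matches
print(re.sub("s", "S", cell))           # She SellS Sea ShellS

# NUMBER: replace only the NUMBERth match (not built into re.sub)
def sub_nth(pattern, repl, string, n):
    seen = 0
    def maybe(match):
        nonlocal seen
        seen += 1
        return repl if seen == n else match.group(0)
    return re.sub(pattern, maybe, string)

print(sub_nth("s", "S", cell, 3))       # she sellS sea shells

# 'i'/'I': case-insensitive matching
print(re.sub("SHE", "they", cell, count=1, flags=re.IGNORECASE))  # they sells sea shells
```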
### 'c' command syntax

the proposed syntax for the 'c' ("change") command is slightly different from "standard" sed:

- c TEXT: the "standard GNU extension" is supported. TEXT extends to the end of the contents of the column in the script file.
- c\ followed by TEXT...: (this is an attempt to represent the standard POSIX variant of 'c' – a 'c', followed by a backslash, [possibly?] followed by a newline.) this is *not* supported.

our "additions":
- c: deletes the text in the column. this is a sop to the difficulty a user might encounter trying to get a spreadsheet to output a .csv file with a trailing space. (since, by the standard GNU extension, "c " – with a trailing space – would delete the text in the column.)
- c/.../: replace the column text with ...; forward slashes included in ... can be escaped with a backslash. (how difficult it might be to move a backslash through the spreadsheet, out to the .csv file, is unknown.) (un-backslashed) backslashes before a forward slash are removed; all other backslashes are left in place.

note that the 'd' command deletes the entire *row*, not the contents of a single column.
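here is a hypothetical python sketch of the proposed c/.../ unescaping rule (my reading of the rule above, not project code):

```python
import re

def parse_c_slash(cmd):
    """parse 'c/TEXT/': TEXT runs between the slashes; a backslash
    before a forward slash is removed, all other backslashes stay."""
    m = re.match(r"c/((?:\\.|[^/\\])*)/$", cmd)
    if m is None:
        raise ValueError("not a c/.../ command: %r" % cmd)
    return m.group(1).replace("\\/", "/")

print(parse_c_slash("c/Gambia, The/"))   # Gambia, The
print(parse_c_slash(r"c/a\/b/"))         # a/b: the escaped slash is unescaped
print(parse_c_slash(r"c/a\\b/"))         # other backslashes are left in place
```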
for more information, please see the man page.
the main difference is in scope; sed is much richer than sedcsv. but there are some other differences in the addresses and commands implemented.

this is an attempt to document the features sed(1) provides for addresses, along with what gnu adds, which ones we support, which of those not supported are "planned" (in the sense of: they seem tractable, and are likely to be implemented), and how the sedcsv feature differs from the sed one:
| address feature | supported | "planned" | modified |
|---|---|---|---|
| /…/ (by regexp) | x | | |
| \%…% (by regexp) | x | | |
| by line number | x | | rowno |
| ranges (',') | x | | |
| $ (last line) | | | |
| NUM~NUM (gnu) | | | |
| inverting (/…/!) | | x | |
| case independent (/…/I) | x | | |
| multi-line (/…/M) | | | |
| 0,/…/ (gnu) | | | |
| ADDR,+N (gnu) | | | |
| ADDR,~N (gnu) | | | |
## grepcsv: grep(1)-like searching of a .csv file

Let's say you have a .csv file named "horror.csv", with (some version of) the shortest English-language horror story ever written:
| setup | event | response |
|---|---|---|
| The last man in the world sat in a room. | He heard a knock on the door. | "Come in", he said |
And, what if you wanted to find and print out this particular line (maybe horror.csv contains various stories, inexplicably laid out in columns!)?
| setup | event | response |
|---|---|---|
| | knock on the door | |
grepcsv --labels -f horror-script.csv horror1.csv
| setup | event | response |
|---|---|---|
| The last man in the world sat in a room. | He heard a knock on the door. | """Come in"", he said" |
--labels says to use the first line of the *script* file (the argument of the -f option) and of the *data* file (horror1.csv, in this case) as column labels, and to associate patterns in the script file with the contents of the similarly-named columns in the data file. (the default is to use the column position – column number – of the pattern in the script file to select a column to search in the data file.)
But, what if someone thinks of the columns differently? Say:
| event | setup | response |
|---|---|---|
| He heard a knock on the door. | The last man in the world had been sitting in a room. | "Come in", he said |
grepcsv --labels -f horror-script.csv horror2.csv
| event | setup | response |
|---|---|---|
| He heard a knock on the door. | The last man in the world had been sitting in a room. | """Come in"", he said" |
and, what if you wanted to search *both* files?
grepcsv --labels --no-print-labels --no-filename -f <(echo -e 'event\nknock on the door') horror?.csv
The last man in the world sat in a room. | He heard a knock on the door. | """Come in"", he said" |
He heard a knock on the door. | The last man in the world had been sitting in a room. | """Come in"", he said" |
For more information, please see the man page.
basically, this is a list of things that are probably easy to implement, and that make sense to me:

- "general"
- "pattern syntax"
- "matching control"
- "general output control"
- "output line prefix control"
## jsonifycsv: convert a .csv file to JSON

this is a simple program that converts a .csv file to JSON format.
there are switches for:

- whether a missing value in the .csv file should appear as the null value, or as an empty string, in the output;
- whether strings in the .csv file that look like numbers should be output as numbers (the default is to output these as strings).

jsonifycsv reads its input from stdin, and writes to stdout (error messages are written to stderr).
there are a few flags to help tailor the transformation.
Usage: jsonifycsv [OPTION]...
convert a .csv file into a flatish json file
-h, --help Print help and exit
--detailed-help Print help, including all details and hidden
options, and exit
-V, --version Print version and exit
debugging:
-d, --debug Print random debugging messages
-l, --labels The first row of the .csv file contains labels
for subsequent rows (default=off)
-m, --missing-value=STRING Value to take the place of missing values
(default=`null')
-M, --empty-string-as-missing Treat empty strings ("") with
"--missing-value" (default=off)
-n, --numeric Print unquoted strings that look like numbers
as numbers on output (default=off)
-N, --Numeric Like `-n`, but for both quoted and unquoted
string (default=off)
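the behavior the flags describe can be sketched in python (a hypothetical illustration; jsonifycsv itself is a compiled program, and this sketch covers only the --labels, null-for-missing, and --numeric behaviors):

```python
import csv
import io
import json

def jsonify(text, numeric=False):
    """first row is labels; empty cells become null; with numeric=True,
    cells that look like numbers are emitted as numbers."""
    rows = list(csv.reader(io.StringIO(text)))
    labels, out = rows[0], []
    for row in rows[1:]:
        obj = {}
        for label, cell in zip(labels, row):
            if cell == "":
                obj[label] = None          # the default --missing-value
            elif numeric:
                try:
                    obj[label] = float(cell) if "." in cell else int(cell)
                except ValueError:
                    obj[label] = cell      # not a number after all
            else:
                obj[label] = cell
        out.append(obj)
    return json.dumps(out)

print(jsonify("a,b,c\n1,,x\n", numeric=True))
# [{"a": 1, "b": null, "c": "x"}]
```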
for more information, see the man page.
there are lots of .csv files in the wild. here are some "features" of
some .csv files currently not handled by jsonifycsv
. possibly some
of these can be fixed by using awk(1)
, or some similar utility,
before passing the data through jsonifycsv
.
jsonifycsv checks for a consistent number of fields per row, so will report an error otherwise (this is also an error according to rfc4180: "Each line should contain the same number of fields throughout the file.").

some of the above should be fixed by options to jsonifycsv. for example:

- with -l, more columns than labels would still generate an error (other than the one extra implied by --row-labels
, if specified).

as an alternative to the current approach, i thought of parsing the script and data files (to understand column locations, in the case of
--labels
), then building humongous regexps to use for searching, or as sed addresses. aesthetically, i don't find that appealing; and, i worry about debugging. but, at least, in the end, the user would *really* be using sed(1) and grep(1), which would be nice (in terms of speed, bugs in my code, and features lacking).
sort of the main added feature of these tools is:

in terms of other tools, there are, among others, the following:
one thing unique to grepcsv and sedcsv is the idea of passing the patterns/commands in a .csv file, though that's really an encoding issue, not anything fundamental. possibly, one could use this sort of file as a front-end to, e.g., sed, by reading the script file, reading the .csv file, and building up a massive regexp/script that would do what one wants to do. (modulo sed's limit of 9 back references.)
compared to some tools, maybe, patterns by column name is a feature. (most, though, seem to support column names.)
one obvious issue is performance: the programs are very slow, as the emphasis has been on functionality and (an attempt at) correctness.
but, other than that (or with suggestions for that), please report any issues to the tracker.
[1] note: here, i am showing the data in some tabular form, as it might
appear in your data analysis application (i.e., a spreadsheet like
gnumeric, a data analysis programming environment like R, or some
other application, like emacs org-mode). when running tools like
sedcsv
or grepcsv
, the actual data is assumed to be in, or to have
been exported to, a .csv
file, as will be seen elsewhere.
[2] Because of the "^" and "$" regular expression "assertions".
[4] Or, more correctly, compensates for its "columnar" nature.
[5] This interpretation of the 'a'
command, and also of the 'i' command, is *not* like that of
sed(1)
, which appends or inserts the desired text as the next or previous line, not into the contents of the current line.