|
The Kermit Project |
Columbia
University
612 West 115th Street, New York NY 10025 USA • kermit@columbia.edu
| ||||||||
| |||||||||
Frank da Cruz
The Kermit Project
Columbia University
Last update: Fri Aug 20 14:02:34 2010
Comma-Separated Value (CSV) format is commonly output by spreadsheets and databases when exporting data into plain-text files for import into other applications. It is important to understand the formal definition of a CSV file.
The first two are regular fields. The second is a field that has an embedded space but in whichy any leading or trailing spaces are to be ignored. The fourth is an empty field, but still a field. The fifth is a field that contains embedded commas. The sixth has leading and trailing spaces. The last field has embedded quotation marks.aaa, bbb, has spaces,,"ddd,eee,fff", " has spaces ","Muhammad ""The Greatest"" Ali"
Prior to C-Kermit 9.0 Alpha.06, C-Kermit did not handle CSV files according to the specification above. Most seriously, there was no provision for a separator to be surrounded by whitespace that was to be considered part of the separator. Also there was no provision for quoting doublequotes inside of a quoted string.
def xx {
echo [\fcontents(\%1)]
.\%9 := \fsplit(\fcontents(\%1), &a, \44, CSV)
for \%i 1 \%9 1 { echo "\flpad(\%i,3). [\&a[\%i]]" }
echo "-----------"
}
xx {a,b,c}
xx { a , b , c }
xx { aaa,,ccc," with spaces ",zzz }
xx { "1","2","3","","5" }
xx { this is a single field }
xx { this is one field, " and this is another " }
xx { name,"Mohammad ""The Greatest"" Ali", age, 67 }
exit
gives the following results:
The separator \44 (comma) must still be specified as the break set (3rd \fsplit() parameter). When “CSV” is specified as the include set:[a,b,c] 1. [a] 2. [b] 3. [c] ----------- [ a , b , c ] 1. [a] 2. [b] 3. [c] ----------- [ aaa,,ccc," with spaces ",zzz ] 1. [aaa] 2. [] 3. [ccc] 4. [ with spaces ] 5. [zzz] ----------- [ "1","2","3","","5" ] 1. [1] 2. [2] 3. [3] 4. [] 5. [5] 6. [] ← Oops, this was fixed in Alpha.07 ----------- [ this is a single field ] 1. [this is a single field] ----------- [ this is one field, " and this is another " ] 1. [this is one field] 2. [ and this is another ] 3. [] ← Ditto ----------- [ name,"Mohammad ""The Greatest"" Ali", age, 67 ] 1. [name] 2. [Mohammad "The Greatest" Ali] 3. [age] 4. [67] -----------
Of course you can specify any separator(s) you want with either the CSV, TSV, or ALL symbolic include sets. For example, if you have a TSV file in which you want the spaces around each Tab to be discarded, you can use:
\9 is Tab.\fsplit(variable, &a, \9, CSV)
The new symbolic include sets can also be used with \fword(), which is just like \fsplit() except that it retrieves the nth word from the argument string, rather than an array of all the words. In C-Kermit you can get information about these or any other functions with the HELP FUNCTION command, e.g.:
C-Kermit> help func word
\fword(s1,n1,s2,s3,n2,n3) - Extract word from string.
s1 = source string
n1 = word number (1-based) counting from left; if negative, from right.
s2 = optional break set.
s3 = optional include set.
n2 = optional grouping mask.
n3 = optional separator flag:
0 = collapse adjacent separators
1 = don't collapse adjacent separators.
Default break set is all characters except ASCII letters and digits.
ASCII (C0) control characters are treated as break characters by default.
Default include set is null. Three special symbolic include sets are also
allowed: ALL (bytes not in the break set), CSV (special treatment
for Comma-Separated-Value records), and TSV (special treatment for Tab-
Separated-Value records).
If grouping mask given and nonzero, words can be grouped by quotes or
brackets selected by the sum of the following:
1 = doublequotes: "a b c"
2 = braces: {a b c}
4 = apostrophes: 'a b c'
8 = parentheses: (a b c)
16 = square brackets: [a b c]
32 = angle brackets: <a b c>
Nesting is possible with {}()[]<> but not with quotes or apostrophes.
Returns string:
Word number n, if there is one, otherwise an empty string.
C-Kermit>
Previous calling conventions for \fjoin() are undisturbed, including the ability to specify a portion of an array, rather than the whole array:
declare \&a[] = 1 2 3 4 5 6 7 8 9 echo \fjoin(&a[3:7],CSV) 3,4,5,6,7
Using \fsplit() and \fjoin() is now possible to convert a comma-separated value list into a tab-separated value list, and vice versa (which is not a simple matter of changing commas to tabs or vice versa).
Here is a simple example in which we purge all records of customers who have two or more unpaid bills. The file is sorted in serial-number order.
#!/usr/local/bin/kermit
.filename = somefile.csv # Input file in CSV format
fopen /read \%c \m(filename) # Open it
if fail exit # Don't go on if open failed
copy \m(filename) ./new # Make a copy of the file
.oldserial = 00000000000 # Multiple records for each serial number
.zeros = 0 # Unpaid bill counter
while true { # Loop
fread /line \%c line # Get a record
if fail exit # End of file
.n := \fsplit(\m(line),&a,\44,CSV) # Split the fields into an array
if not equ "\m(oldserial)" "\&a[6]" { # Have new serial number?
# Remove all records for previous serial number
# if two or more bills were not paid...
if > \m(zeros) 1 {
grep /nomatch \m(oldserial) /output:./new2 ./new
rename ./new2 ./new
}
.oldserial := \&a[6] # To detect next time serial number changes
.zeros = 0 # Reset unpaid bill counter
}
if equ "\&a[5]" "$0.00" { # Element 5 is amount paid
increment zeros # If it's zero, count it.
}
}
fclose \%c
Rewriting the file multiple times is inelegant, but this is a quick and dirty use-once-and-discard script, so elegance doesn't count. The example is interesting in that it purges certain records based on the contents of other records. Maybe there is a way to do this directly with SQL, but why use SQL when you can use Kermit?
Kermit has several built-in data types, but you can invent your own data types as needed using Kermit's macro feature:
For example:define variablename value
This defines a macro named alphabet and gives it the value abcdefghijklmnopqrstuvwxyz. A more convenient notation (added in C-Kermit 7.0) for this is:define alphabet abcdefghijklmnopqrstuvwxyz
The two are exactly equivalent: they make a literal copy the "right hand side" as the value of the macro. Then you can refer to the macro anywhere in a Kermit command as "\m(macroname)":.alphabet = abcdefghijklmnopqrstuvwxyz
There is a second way to define a macro, which is like the first except that the right-hand side is evaluated first; that is, any variable references or function calls in the right-hand side are replaced by their values before the result is assigned to the macro. The command for this is ASSIGN rather than DEFINE:echo "Alphabet = \m(alphabet)"
which prints:define alphabet abcdefghijklmnopqrstuvwxyz assign backwards \freverse(\m(alphabet)) echo "Alphabet backwards = \m(backwards)"
This kind of assignment can also be done like this:Alphabet backwards = zyxwvutsrqponmlkjihgfedcba
Any command starting with a period is an assignment, and the operator (= or :=) tells what to do with the right-hand side before making the assignment..alphabet = abcdefghijklmnopqrstuvwxyz .backwards := \freverse(\m(alphabet))
In both the DEFINE and ASSIGN commands, the variable name itself is taken literally. It is also possible, however, to have Kermit compute the variable name. This is done (as described in Using C-Kermit, 2nd Ed., p.457), using parallel commands that start with underscore: _DEFINE and _ASSIGN (alias _ASG). These are just like DEFINE and ASSIGN except they evaluate the variable name before making the assigment. For example:
would create a macro named ONEONEONE with a value of 111, and:define \%a one _define \%a\%a\%a 111
would create the same macro with the same value, but:define \%a one define number 111 _assign \%a\%a\%a \m(number)
would give the macro a value of "\m(number)".define \%a one define number 111 _define \%a\%a\%a \m(number)
You can use the _ASSIGN command to create any kind of data structure you want; you can find some examples in the Object-Oriented Programming section of the Kermit Script Library. In the following program we use this capability to create a two-dimensional array, or matrix, to hold the all the elements of the CSV file, and then to display the matrix:
fopen /read \%c data.csv # Open CSV file
if fail exit 1
.\%r = 0 # Row
.\%m = 0 # Maximum columns
while true {
fread /line \%c line # Read a record
if fail break # End of file
.\%n := \fsplit(\m(line),&a,\44,CSV) # Split record into items
incr \%r # Count this row
for \%i 1 \%n 1 { # Assign items to this row of matrix
_asg a[\%r][\%i] \&a[\%i]
}
if > \%i \%m { .\%m := \%i } # Remember width of widest row
}
fclose \%c # Close CSV file
decrement \%m # (because of how FOR loop works)
echo MATRIX A ROWS: \%r COLUMNS: \%m # Show the matrix
for \%i 1 \%r 1 { # Loop through rows
for \%j 1 \%m 1 { # Loop through columns of each row
xecho "\flpad(\m(a[\%i][\%j]),6)"
}
echo
}
exit 0
The matrix is called a and its elements are a[1][1],
a[1][2], a[1][3], ... a[2][1], etc, and you can
treat this data structure exactly like a two-dimensional array, in which you
can refer to any element by its "X and Y coordinates". For example, if the
CSV file contained numeric data you could compute row and column sums using
simple FOR loops and Kermit's built-in one-dimensional array data type:
declare \&r[\%r] # Make an array for the row sums
declare \&c[\%m] # Make an array for the column sums
for \%i 1 \%r 1 { # Loop through rows
for \%j 1 \%m 1 { # Loop through columns of each row
increment \&r[\%i] \m(a[\%i][\%j]) # Accumulate row sum
increment \&c[\%j] \m(a[\%i][\%j]) # Accumulate column sum
}
}
Note that the sum arrays don't have to be initialized to zero because
Kermit's INCREMENT command treats empty definitions as zero.