Some useful text manipulation commands
In one of our works we have to import several csv files into Postgresql. Unfortunately, the files can't be imported without tweaking.
For example we have the following csv (but with dozens of columns):
Col1-bool;Col1-string;Col3-string;Col4-integer;Col5-float;... False;Hello;;NULL;5,6;...
Several things can be noted about this small csv snippet:
- As you can see, the csv file doesn't contain an id column. Actually, there might be no primary key columns at all.
- Second, the strings are unescaped. Thankfully, there are no line breaks!
- Third Col3 string column can't contain a NULL value
- Fourth, the NULL string doesn't mean a NULL value, but a NULL string, this it will cause an error
- Fifth, the float separator is , instead of.
So, how to correct all these problems easily? It's still a tedious task, but works out relatively easy with vim/sed and awk. Here comes a collection of the commands used.
:%s/^/\=strpart(line('.').";",0,5)
it's important to note the 0,5 part of strpart, this means to input the first 5 characters. This is enough for 9999 rows, but might be necessary to set to something higher otherwise.
How to deal with line breaks? Thankfully, they are often right at the beginning of a string. In these cases, this will do the trick:
sed '/;$/{N;s/\n//;}' file
strips a n off if line ends with a semicolon.
To get rid of not-NULL NULL strings we use awk:
awk -F';' 'BEGIN {OFS=";"} { if ($12=="") $12="###"; print }' infile.csv > outfile.csv
This means to treat the file as separated by semicolons (-F';'), and to change column 12 if it's empty to something strange, then print the whole line. Finally, we can use vim to change the strange characters to an empty string. As this was a rather common operation with varying column numbers, it would be nice to move it into a script where we can simply give all the column numbers to be checked, like getRidOfNull 12,24 infile.csv > outfile.csv.
Changing the NULL values is easy, just put in vim:
:%s/;NULL/;/g
Changing the floats:
:%s/\(\d\),\(\d\)/\1.\2/g
And finally, it's easy to convert binary csv files to pure ascii with enconv:
$ enca myfile.csv Universal character set 2 bytes; UCS-2; BMP
CRLF line terminators Byte order reversed in pairs (1,2 -> 2,1)
$ enconv myfile.csv
Tedious, but easy.
blog comments powered by Disqus