Background
I have a little project to analyze 10 years of Seattle crime incident data. It’s just a personal itch I need to scratch, but it does give me an excuse to look at some data sets in R and see if I can answer some questions. The incident data comes as a 363 MB CSV file that contains 1.4 million records of each incident. Each record looks like this:
new row:
> CAD CDW ID: 15738
> CAD Event Number: 10000246255
> General Offense Number: 2010246255
> Event Clearance Code: 250
> Event Clearance Description: MISCHIEF, NUISANCE COMPLAINTS
> Event Clearance SubGroup: NUISANCE, MISCHIEF COMPLAINTS
> Event Clearance Group: NUISANCE, MISCHIEF
> Event Clearance Date: 07/17/2010 08:55:00 PM
> Hundred Block Location: 21XX BLOCK OF 3RD AVE
> District/Sector: M
> Zone/Beat: M2
> Census Tract: 7200.2025
> Longitude: -122.342843234
> Latitude: 47.613551471
> Incident Location: (47.613551471, -122.342843234)
> Initial Type Description:
> Initial Type Subgroup:
> Initial Type Group:
> At Scene Time:
Now this data looks pretty normal to me. It’s a mix of numeric and string data. Some of the fields aren’t filled in and the field names, while useful, are in a format that make them a PITA to use in a database. Which made me think that a useful little tool would be a standalone utility to reformat CSV files. Something that would do the following:
- Drop columns
- Rename columns
- List a subset of records
Now granted, this can all be done in Google Sheets, or Excel, or in R, but I think having something written in C would be nice. It would be fast and wouldn’t have to rely on anything else to work. So I’m going to write it in Nim and document the process here.
Project Breakdown
In order to be sure that I don’t waste my time, I’m going to reiterate my requirements. I want to be able to do the following:
- Specify the name of a CSV file and have the program open it and perform the following tasks:
- List the 1st few records in horizontal order (like the R
dplyr::glimpse
function) - Specify a subset of existing fields to keep
- Specify new names for the remaining fields
- Print the modified version of the file to the screen
- Optionally, write the modified version out to a new file
- List the 1st few records in horizontal order (like the R
So, what are the things I need to figure out? These are basically the tasks that I need to implement to make this work.
Tasks
1. How to pass in the name of a file and have Nim recognize it?
2. How to pass in arguments and have the program do different things based on what they are?
3. How to open a CSV file, potentially with different delimiters?
4. How to recognize the row headers in the CSV?
5. How to only keep specific row headers?
6. How to rename the row headers that are left?
7. How to list records horizontally?
8. How to only print data that is for columns I’ve chosen to keep?
9. How to save the modified data to a new file?
Function signature
Before I get ahead of myself, what’s this thing going to look like to a user when they try to do something with it? Here’s a guess at what that might look like (NOTE: I’ve added line breaks for clarity that won’t be there when used for real.)
<filename.csv>
--keep "Event Clearance Code|
Event Clearance SubGroup|
Event Clearance Group|
Event Clearance Date|
District/Sector|
Zone/Beat|
Census Tract|
Longitude|
Latitude"
--rename "Event Clearance Code=clearance_code|
Event Clearance SubGroup=subgroup|
Event Clearance Group=group|
Event Clearance Date=date|
District/Sector=district|
Zone/Beat=beat|
Census Tract=census_tract|
Longitude=lon|
Latitude=lat"
--list
Which should produce something that looks like this.
> clearance_code: 250
> subgroup: NUISANCE, MISCHIEF COMPLAINTS
> group: NUISANCE, MISCHIEF
> date: 07/17/2010 08:55:00 PM
> district: M
> beat: M2
> census_tract: 7200.2025
> lon: -122.342843234
> lat: 47.613551471
Plenty of stuff to figure out, so might as well get on with it.
How to pass a filename as an argument to Nim?
I have 2 options here. Option 1 is to try and re-use the mechanism I’ve used before in the anagram_redux.nim code that I wrote. It would look something like this:
if os.paramCount() < 1:
let filename = os.paramStr(1)
That would basically check to see if arguments had been passed in along with the program name and if one (or more) had been, then the 1st argument would be treated as the filename. The use of positional arguments like this (1st argument = file, 2nd argument = something_else) is ok, but I don’t like it for something like this where there are many other arguments that can be passed in as well.
So that leaves option 2, which is to use Nim’s parseopt module. parseopt
is nice because I can still have an argument that isn’t named anything to specify the filename, and I can have named options for everything else. Another benefit of using parseopt
is that I can combine items 1 and 2 from my task list and solve them both at once.
How to pass in arguments by name into the program?
I have 4 arguments that I want to pass into the program.
<filename>
: mandatory, name of the CSV file to process--keep
: optional, original names of columns to keep--rename
: optional, original column name followed by new column name--list
: optional, show 1 record in horizontal format (may want to consider showing “n” records)
According to the doc, I should be able to do the following:
import parseopt
var
filename: string
keep_fields: string
rename_fields: string
proc parseOptions() =
var options = initOptParser()
for kind, key, val in getopt(options):
if kind == cmdArgument: filename = key # Remember that indentation is not required
elif kind == cmdLongOption:
if key == "keep": keep_fields = val
elif key == "rename": rename_fields = val
elif key == "list": parseCSV()
elif kind == cmdEnd:
assert(false) # cannot happen
# NOTE: Should use the 'case' syntax that's in the docs, but that's not yet familiar to me.
So that takes care of the bare minimum code needed to parse my options. It’s brittle and will need to be fleshed out, but it works.
Next: Figure out CSV parsing.
Written with StackEdit.