Thursday, February 20, 2020

Nim CSV Tool

Background

I have a little project to analyze 10 years of Seattle crime incident data. It’s just a personal itch I need to scratch, but it does give me an excuse to look at some data sets in R and see if I can answer some questions. The incident data comes as a 363 MB CSV file that contains 1.4 million records, one per incident. Each record looks like this:

new row: 
> CAD CDW ID:  15738  
> CAD Event Number:  10000246255 
> General Offense Number:  2010246255
> Event Clearance Code:  250
> Event Clearance Description:  MISCHIEF, NUISANCE COMPLAINTS
> Event Clearance SubGroup:  NUISANCE, MISCHIEF COMPLAINTS
> Event Clearance Group:  NUISANCE, MISCHIEF 
> Event Clearance Date:  07/17/2010 08:55:00 PM
> Hundred Block Location:  21XX BLOCK OF 3RD AVE
> District/Sector:  M
> Zone/Beat:  M2
> Census Tract:  7200.2025
> Longitude:  -122.342843234
> Latitude:  47.613551471
> Incident Location:  (47.613551471, -122.342843234)
> Initial Type Description:  
> Initial Type Subgroup:  
> Initial Type Group:  
> At Scene Time:  

Now this data looks pretty normal to me. It’s a mix of numeric and string data. Some of the fields aren’t filled in, and the field names, while useful, are in a format that makes them a PITA to use in a database. Which made me think that a useful little tool would be a standalone utility to reformat CSV files. Something that would do the following:

  1. Drop columns
  2. Rename columns
  3. List a subset of records

Now granted, this can all be done in Google Sheets, or Excel, or in R, but I think having a standalone compiled tool, like something written in C, would be nice. It would be fast and wouldn’t have to rely on anything else to work. So I’m going to write it in Nim and document the process here.

Project Breakdown

In order to be sure that I don’t waste my time, I’m going to reiterate my requirements. I want to be able to do the following:

  • Specify the name of a CSV file and have the program open it and perform the following tasks:
    1. List the 1st few records in horizontal order (like the R dplyr::glimpse function)
    2. Specify a subset of existing fields to keep
    3. Specify new names for the remaining fields
    4. Print the modified version of the file to the screen
    5. Optionally, write the modified version out to a new file

So, what are the things I need to figure out? These are basically the tasks that I need to implement to make this work.

Tasks

1. How to pass in the name of a file and have Nim recognize it?
2. How to pass in arguments and have the program do different things based on what they are?
3. How to open a CSV file, potentially with different delimiters?
4. How to recognize the column headers in the CSV?
5. How to only keep specific columns?
6. How to rename the columns that are left?
7. How to list records horizontally?
8. How to only print data that is for columns I’ve chosen to keep?
9. How to save the modified data to a new file?
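Before any Nim gets written, the whole keep/rename/list pipeline is small enough to sketch in Python as a sanity check (just a prototype for my own reference; the column subset and the sample row are pulled from the Seattle data above):

```python
import csv
import io

# A throwaway sketch of the keep/rename/list pipeline using Python's csv module.
# The sample data is an abbreviated version of one Seattle incident record.
raw = io.StringIO(
    "CAD CDW ID,Event Clearance Code,Zone/Beat\n"
    "15738,250,M2\n"
)

keep = ["Event Clearance Code", "Zone/Beat"]
rename = {"Event Clearance Code": "clearance_code", "Zone/Beat": "beat"}

rows = []
for row in csv.DictReader(raw):
    # Drop unwanted columns, then apply the new names.
    kept = {rename.get(col, col): row[col] for col in keep}
    rows.append(kept)

# "List" one record in glimpse-style horizontal format.
for field, value in rows[0].items():
    print(f"{field}:  {value}")
```

That covers tasks 4 through 8 in about a dozen lines, which is encouraging; the Nim version mostly has to solve the same problems with different library calls.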

Function signature

Before I get ahead of myself, what’s this thing going to look like to a user when they try to do something with it? Here’s a guess at what that might look like (NOTE: I’ve added line breaks for clarity that won’t be there in real use).

  <filename.csv> 
  --keep "Event Clearance Code|
          Event Clearance SubGroup|
          Event Clearance Group|
          Event Clearance Date|
          District/Sector|
          Zone/Beat|
          Census Tract|
          Longitude|
          Latitude"  
  --rename "Event Clearance Code=clearance_code|
            Event Clearance SubGroup=subgroup|
            Event Clearance Group=group|
            Event Clearance Date=date|
            District/Sector=district|
            Zone/Beat=beat|
            Census Tract=census_tract|
            Longitude=lon|
            Latitude=lat"
  --list         

Which should produce something that looks like this.

> clearance_code:  250
> subgroup:  NUISANCE, MISCHIEF COMPLAINTS
> group:  NUISANCE, MISCHIEF 
> date:  07/17/2010 08:55:00 PM
> district:  M
> beat:  M2
> census_tract:  7200.2025
> lon:  -122.342843234
> lat:  47.613551471

Plenty of stuff to figure out, so might as well get on with it.

How to pass a filename as an argument to Nim?

I have 2 options here. Option 1 is to try and re-use the mechanism I’ve used before in the anagram_redux.nim code that I wrote. It would look something like this:

if os.paramCount() < 1:
  quit("Usage: csv_tool <filename.csv>")
let filename = os.paramStr(1)

That would basically check whether arguments had been passed in along with the program name and, if one (or more) had been, treat the 1st argument as the filename. The use of positional arguments like this (1st argument = file, 2nd argument = something_else) is ok, but I don’t like it for something like this where there are many other arguments that can be passed in as well.

So that leaves option 2, which is to use Nim’s parseopt module. parseopt is nice because I can still have an argument that isn’t named anything to specify the filename, and I can have named options for everything else. Another benefit of using parseopt is that I can combine items 1 and 2 from my task list and solve them both at once.

How to pass in arguments by name into the program?

I have 4 arguments that I want to pass into the program.

  • <filename>: mandatory, name of the CSV file to process
  • --keep: optional, original names of columns to keep
  • --rename: optional, original column name followed by new column name
  • --list: optional, show 1 record in horizontal format (may want to consider showing “n” records)

According to the doc, I should be able to do the following:

import parseopt

var
  filename: string
  keep_fields: string
  rename_fields: string
  
proc parseOptions() =
  var options = initOptParser()
  for kind, key, val in getopt(options):
    if kind == cmdArgument: filename = key # single-statement branches can share a line
    elif kind == cmdLongOption:
      if key == "keep": keep_fields = val
      elif key == "rename": rename_fields = val
      elif key == "list": parseCSV() # parseCSV doesn't exist yet
    elif kind == cmdEnd:
      assert(false) # cannot happen
# NOTE: Should use the 'case' syntax that's in the docs, but that's not yet familiar to me.     

So that takes care of the bare minimum code needed to parse my options. It’s brittle and will need to be fleshed out, but it works.
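For comparison, here’s roughly what the same interface looks like with Python’s argparse (the program name and file name below are made up for the example):

```python
import argparse

# The same CLI sketched with Python's argparse, for comparison with parseopt.
parser = argparse.ArgumentParser(prog="csv_tool")
parser.add_argument("filename", help="CSV file to process")
parser.add_argument("--keep", help="pipe-separated columns to keep")
parser.add_argument("--rename", help="pipe-separated old=new column pairs")
parser.add_argument("--list", action="store_true",
                    help="show one record in horizontal format")

# Parse a hypothetical command line.
args = parser.parse_args(["crime.csv", "--keep", "Zone/Beat|Longitude", "--list"])
keep_fields = args.keep.split("|")
```

The positional `filename` plus named long options maps one-to-one onto the cmdArgument / cmdLongOption split in parseopt, which is reassuring.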

Next: Figure out CSV parsing.

Written with StackEdit.

Monday, February 10, 2020

Nim Test Drive

I recently read about the Nim programming language in an article that posited it might take the place of Python. Despite the article being both badly written and patently inaccurate (Nim does NOT rely on an interpreter of any kind), I was intrigued and decided to give the language a try.

What is Nim

Nim is a statically-typed, compiled programming language originally designed and developed by Andreas Rumpf in 2005. The project was made public in 2008 and is now a Free and Open Source Software project worked on by a group of volunteers along with Andreas. Version 1.0 of the language was released on September 23, 2019, which guarantees a stable base to build future releases on. Nim’s goal is to “be as fast as C, as expressive as Python, and as extensible as Lisp.”

One of the unique features of Nim is that it compiles your code into optimized C code by default, which is then compiled by gcc or Clang on your system.

Anagram Redux

Having just spent some time working in both Python and R on an anagram implementation, I decided to try writing one in Nim. The goal was to write a program that takes a word as input from the user, reads a local file of dictionary words, and returns all of the words in the dictionary that are anagrams of the user’s original word. This would allow me to explore the following:

  1. Argument parsing
  2. File IO
  3. String processing
  4. Basic data structures
  5. Code organization using procs (functions)

First Impressions

I spent the better part of a weekend putting the program together. By far the most time-consuming aspect was determining the right way to declare the various types while instantiating variables. Frankly, it felt more difficult than it needed to be, but that can probably be chalked up to my lack of experience working with statically typed languages.

Type madness!

I needed to store multiple sorted words in the same data structure. In Python this would be a list. In Nim there are 2 types of containers I could use: an array or a sequence. The array is a fixed-size container (apparently a C array under the hood) that has very fast lookup times. A sequence is a dynamically allocated container that can grow in size. Nim allows you to instantiate either of these in one of 2 ways: you can define the empty object you want to use first and then write to it later, OR you can just create one on-the-fly with data.

Coming up with the right combination of types to hold my data took a while, but I finally ended up with this to define an empty hash table that would store sequences.

var anagram_table = tables.initTable[string, seq[string]]()

In English, this means:
var anagram_table – create a mutable variable named ‘anagram_table’
tables.initTable – using the ‘tables’ module ‘initTable’ proc to create a table…
[string, seq[string]] – with “string” as the key type and a “sequence of strings” as the value type

Then I could write to anagram_table in the following manner.

anagram_table[key] = @[value]

The @ is extremely important on the initial value declaration: it makes the bracketed literal a growable sequence rather than a fixed-size array, so I can add to it later. Afterwards, new entries can be added to the same sequence like this.

anagram_table[key].add(value)
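For comparison, this build-it-or-append-to-it dance is exactly what Python’s collections.defaultdict handles automatically:

```python
from collections import defaultdict

# Python equivalent of Nim's Table[string, seq[string]]: a dict of lists.
# defaultdict creates the empty list on first access, so there's no need
# for the "@[value] first, .add(value) afterwards" distinction.
anagram_table = defaultdict(list)
anagram_table["ilos"].append("Lois")
anagram_table["ilos"].append("silo")
```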

The rest of the implementation was pretty easy. Control flow and procedures were easy to implement. File IO made sense and it was easy to use an iterator to walk through the file. Apparently there is a CSV parsing module and several modules to wrap access to various database flavors, so it looks like there are plenty of options available for data processing.

Amazing community!

As I was finishing this post, I posted a question on the Nim forum about how to fix an issue I was having with Unicode string parsing. I got several useful answers back within the space of a few hours. That, in combination with the online documentation means that it’s possible to get unstuck pretty fast.

Performance

My initial run of the program did not yield impressive performance. However, looking into compiler optimizations yielded some pretty amazing improvements.

# Initial timing with just "nim compile"
real    0m2.543s
user    0m2.527s
sys     0m0.008s

Using --opt:speed did speed things up - a lot!

real    0m0.650s
user    0m0.625s
sys     0m0.009s

Finally, using the -d:release option yielded the best results.

nim compile --opt:speed -d:release anagram_redux.nim 
real    0m0.286s
user    0m0.269s
sys     0m0.012s

That’s a clear improvement over my Python implementation, which gets these numbers:

real    0m0.350s
user    0m0.329s
sys     0m0.016s

So the moral of the story is that, yes, Nim can be fast, but that speed is highly dependent on compiler settings.

Final Thoughts

My other forays into compiled languages (C and Go) were confined to rewriting simple examples from books and some basic numerical conversion programs. I quickly got to the point where I hated the syntax, or the complexity needed to manage simple things. Nim seems much more legible to me and appears to have many available modules to help address specific domains more easily.

I expected Nim to blow Python away, and with the right compiler flags, it did. The fact that I was able to copy my compiled executable onto a machine that lacked any sort of development environment is also attractive for certain situations.

It seems like Nim might be a good candidate as a language to use for data pre-processing and wrangling, where speed is needed to do the initial heavy cleanup in a data set. I will continue to look into this use.

Code

Here is the code I came up with for this challenge.

# anagram_redux.nim

# ---- Imports ----------------------------------------------------------------#
import strutils
import algorithm
import os
import strformat
import tables
import unidecode

# ---- Converts a word into its lower-case, sorted "signature" ----------------#
proc makeSignature(word: string): string =
  let ascii = unidecode(word)
  let lower = strutils.toLowerAscii(ascii)
  let sorted_word = algorithm.sorted(lower, system.cmp)
  let signature = sorted_word.join()
  result = signature

# ---- Main body of program ---------------------------------------------------#
proc main =
  # -- Parse the lookup word from the argument passed in
  if os.paramCount() < 1:
      quit("Usage: anagram_redux <word>")
  let lookup_word = os.paramStr(1)
  let lookup_signature = makeSignature(lookup_word)
  echo strformat.fmt("Looking up '{lookup_word}'")
  
  # -- Create empty table to store results in
  var anagram_table = tables.initTable[string, seq[string]]()

  # -- Open the text file and process it
  let file = open("dict.txt")
  
  for word in file.lines():
    let signature = makeSignature(word)
    if anagram_table.hasKey(signature):
      anagram_table[signature].add(word)
    else:
      anagram_table[signature] = @[word]
      
  file.close()

  # -- Lookup user's word via its signature in the anagram_table
  if not anagram_table.hasKey(lookup_signature) or
      anagram_table[lookup_signature].len == 1:
    echo strformat.fmt("'{lookup_word}' does not have any anagrams")
  else:
    echo anagram_table[lookup_signature]
    
main()


Friday, February 7, 2020

In Search of a Fast R Anagram Solution

The Problem

A couple years ago, Frank Warmerdam challenged me to solve a little logic problem. Since then, I’ve found it to be a useful little experiment to implement in different ways to see whether it’s possible to improve on the solution I came up with then. Here is essentially what he wanted.

Find all of the anagrams in a dictionary text file in which there are at least 4 letters in the word and at least as many anagrams as there are letters in the word. The dictionary should be a file on disk with one line per word.

Output should be in this format:

emit, item, mite, time  
merit, miter, mitre, remit, timer  
reins, resin, rinse, risen, serin, siren  
inert, inter, niter, retin, trine  
inset, neist, snite, stein, stine, tsine

The Key

What is an anagram? It’s defined as being, “a word, phrase, or name formed by rearranging the letters of another word.” The key, literally, is to convert every word into a set of ordered letters and then look to see if any other words convert into the same set of ordered letters. The ordered letters become the index value and all of the words that share it are anagrams of each other. In Python, you have a data representation that looks like this:

words = ["Lois", "silo", "oils", "soil"]
index = {'ilos': ["Lois", "silo", "oils", "soil"]}

Then when you want to see what anagrams exist for “soil”, you simply convert it into its key, “ilos”, and then look it up in the index.

word_index['ilos']
['Lois', 'silo', 'oils', 'soil']

A simple Python solution for the indexing problem can be written like this.

word_file = open("dict.txt", 'r')
word_data = word_file.readlines()
word_file.close()

signature_index = {}

for word in word_data:
  word = word.strip()
  word = word.lower()
  signature = ''.join(sorted(word))
  if signature not in signature_index:
    signature_index[signature] = [word]
  else:
    signature_index[signature].append(word)

What about in R?

The Python solution above generates the indexed values for a text file with 99,179 words in 0.222s. It does this consistently and without breaking a sweat. At its core, it relies on a loop in which every word in the file is converted into its “signature”. Now if you read about R for more than 5 minutes, you come across this advice, “Avoid loops as much as possible.” You also see a great deal of justification like, “Computing time doesn’t matter as much as thinking time”. This translates to, “If the code is easy to write and understand, it’s worth a few extra seconds of computing time.” Basically, “Yes, R is slow but it makes up for that with other advantages.”

The advice to avoid loops is good, because R has a concept of “vectorization” that is extremely fast. What that means is that R can do certain types of operations much more quickly, namely ones that are cross-column, than others. So for example, if you want to compute the square root of every value in a column, rather than loop through each row, you would just apply the sqrt function to the entire column at once, like this.

column_1 <- as.integer(c(4,9,16,25,36,49,64,81,100,144))
column_2 <- sqrt(column_1)

# yields something like...
   column_1 column_2
        4        2
        9        3
       16        4
       25        5
       36        6
       49        7
       64        8
       81        9
      100       10
      144       12

This is a neat concept and apparently is possible because, “Vectorized functions usually involve a behind-the-scenes loop in a low-level language (C or Fortran), which runs way faster than a pure R loop.” As long as your problem can be approached using this concept, you’re golden. So what about this anagram problem?

Here are the things we need to do:

  1. Read the data into R
  2. Convert the words into lower case
  3. Sort the lower case letters
  4. Reassemble the letters into a string

Ideally, we’d like to end up with something like this, and we’d like to get there using only vectorized functions.

  words signature
1  Lois      ilos
2  soil      ilos
3  oils      ilos
4  silo      ilos

The problem is, we can’t. If we walk through the code, we get to a point where we can’t use a vectorized function.

words <- c("Lois", "soil", "silo", "oils")
lower_case <- tolower(words)
strsplit(lower_case, "")  <======= BOOOM!!!
[[1]]
[1] "l" "o" "i" "s"

[[2]]
[1] "s" "o" "i" "l"

[[3]]
[1] "s" "i" "l" "o"

[[4]]
[1] "o" "i" "l" "s"

The R strsplit function works similarly to the Python sorted function in that it returns a list of elements. In Python, sorted takes the string as input and spits out a sorted list of elements that you then have to join together. In R, you split the string using strsplit, then you sort the elements with sort, then you join them back together. Unfortunately, there is no way (that I can figure out) to assign that list output as a new column that I can continue to process.

So I have to loop. I mean there’s no way around it - “for each word, split it into its elements, sort the elements, then join them back together”. This is where the apply family of functions comes into play in R. Using them, I end up with this.

# R Implementation
  word_df <- read.csv("dict.txt", 
                      col.names = c("word"),
                      header = FALSE,
                      stringsAsFactors = FALSE, 
                      encoding = "UTF-8")

  word_df$signature <- lapply(word_df$word,
                              function(x)
                              {paste0(sort(unlist(strsplit(tolower(x), ""))), collapse = "")}) 

The cost?

user  system elapsed 
3.16    0.01    3.24 

So 3.16s in R versus 0.222s in Python… That’s significant, and I haven’t even gotten as far in the process after that time has elapsed.

stringi To the Rescue!

It’s clear that one line of code is doing all of the heavy lifting in this script.

{paste0(sort(unlist(strsplit(tolower(x), ""))), collapse = "")}

Doing some Web research revealed that the stringi package seemed to have some potential for improving the situation. By testing all of the possible string functions in stringi, it seems that the best performance comes from replacing the Base R sort with stri_sort from stringi. All of the other base functions are faster than the equivalent ones in stringi.

{paste0(stri_sort(unlist(strsplit(tolower(x), ""))), collapse = "")}

Speed benefit?

   user  system elapsed 
   0.92    0.00    0.97

That’s not bad at all!

Next Steps

Given that the Tidyverse is built on top of stringr, which itself is built on top of stringi, it probably would be worth rewriting this entirely using Tidyverse functions to see if I can eke out even more performance. I have to say that these results are unexpected and definitely encouraging.

Thursday, February 6, 2020

R Basics

Tips and tricks in R captured over the last 6 months of work at Mazama Science. Most of this is not rocket science, but some of it took research to figure out, and I’d like to keep it fresh in my mind.

Basic Data Structures

Variables

foo <- "Hello World!"

Note the strange assignment operator “<-”. The equal sign works as well, but by convention is only used with function parameters.

Vectors

Basic unit of data in R. Pretty much the same as a Python list or an array in Perl (without the semantic stupidity). Vectors are ordered and must contain the same type of data. Elements in a vector can be accessed by index location, but using a negative index means “everything but that element”. Oh yeah, and R is not zero-indexed. Trying to access the [0] element yields no joy.

> foo <- c(1,2,3,4)

> foo[3] == 3
# TRUE

> foo[-1] == c(2,3,4)
# TRUE

> foo[-4] == c(1,2,3)
# TRUE
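That negative-index behavior is a classic gotcha coming from Python, where a negative index means “count from the end” rather than “everything but that element”:

```python
# Python's negative indexing, for contrast with R's "everything but" semantics.
foo = [1, 2, 3, 4]

last = foo[-1]           # the last element, NOT "everything except the first"
all_but_first = foo[1:]  # Python's way to drop the first element (R: foo[-1])
```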

Lists

R lists are like Python dictionaries, or hashes in Perl, but with the important distinction that they retain their ordering. Lists can also store multiple data types.

1. Create a list using the list() function:
> contact_info <- list("first_name" = "Roger", 
                     "middle_initiual" = "T", 
                     "last_name" = "Shrubber", 
                     "address" = "123 Jump St.",
                     "age" = as.integer(100))

2. Access 1st element by index position (NOTE that this returns a list):
> contact_info[1]

$first_name = 'Roger'

3. Access list element by name:
> contact_info["first_name"]

$first_name = 'Roger'

4. Access first 3 elements in the list:
contact_info[1:3]

$first_name
'Roger'
$middle_initiual
'T'
$last_name
'Shrubber'

5. Access just the values, without their names (keys)
unlist(contact_info[1:3], use.names = FALSE)

'Roger' 'T' 'Shrubber'

6. Access just the value of an element using the $<name> syntax
contact_info$address

'123 Jump St.'

Lists are very useful as lookup tables and as catch-all containers for random crap. Lists are slower than vectors though, so should be used with discretion.
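For what it’s worth, plain Python dicts (3.7 and later) also retain insertion order, so the analogy to R lists is closer than it used to be:

```python
# Python dicts (3.7+) keep insertion order, much like R lists.
contact_info = {"first_name": "Roger", "last_name": "Shrubber", "age": 100}

names = list(contact_info)             # keys, in insertion order
values = list(contact_info.values())   # like unlist(..., use.names = FALSE)
by_name = contact_info["first_name"]   # like contact_info$first_name in R
```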

Matrices

R implements a 2-dimensional array using the matrix() function. Higher-dimensional arrays are possible as well, but since I haven’t used them yet, I’ll stick with what I know.

1. Create a 3 x 3 matrix using a sequence from [1,9]:
> threesome <- matrix(seq(1,9),
                    nrow = 3,
                    ncol = 3,
                    byrow = TRUE)
# yields
1	2	3
4	5	6
7	8	9

2. Access row 1 of matrix:
> threesome[1,]

# yields
1 2 3

3. Access column 1 of matrix:
threesome[,1]

# yields
1 4 7

4. Access a "window" of values. [row_start:row_end, col_start:col_end]
threesome[2:3, 1:2]

# yields
5	6
8	9

# NOTE: Setting "byrow = FALSE" when you create the matrix yields this:
1	4	7
2	5	8
3	6	9

Dataframes

Dataframes are probably the reason you started using R in the first place. They are the type of data structure that you seem to encounter the most when working with packages.

first_names <- c("Fred", "George", "Ronald", "Harry")
last_names <- c("Weasely", "Weasely", "Weasely", "Potter")
addresses <- c("27 Foobar Way", "27 Foobar Way", "27 Foobar Way", "4 Privet Dr")
counties <- c("Devon", "Devon", "Devon", "Surrey")
ages <- as.integer(c(17, 17, 15, 15))

potterTrivia <- data.frame("firstName" = first_names,
                           "lastName" = last_names,
                           "address" = addresses,
                           "county" = counties,
                           "age" = ages, 
                           stringsAsFactors = FALSE)

# yields
	firstName	lastName	address	county	age
    <chr>	<chr>	<chr>	        <chr>	<int>
1	Fred	Weasely	27 Foobar Way	Devon	17
2	George	Weasely	27 Foobar Way	Devon	17
3	Ronald	Weasely	27 Foobar Way	Devon	15
4	Harry	Potter	4 Privet Dr	Surrey	15

Like everything else, R dataframes are indexed and data inside them can be accessed by their index position.

  • By column:
potterTrivia[1]

# yields
firstName
<chr>
Fred
George
Ronald
Harry
  • By row:
potterTrivia[1,]

# yields
firstName	lastName	address	    county	age
    <chr>	<chr>	<chr>	        <chr>	<int>
1	Fred	Weasely	27 Foobar Way	Devon	17
  • By range of rows:
potterTrivia[1:3,]

# yields 
firstName	lastName	address	    county	age
    <chr>	<chr>	<chr>	        <chr>	<int>
1	Fred	Weasely	27 Foobar Way	Devon	17
2	George	Weasely	27 Foobar Way	Devon	17
3	Ronald	Weasely	27 Foobar Way	Devon	15

  • By range of rows and columns
potterTrivia[1:3, 1:2]

# yields
  firstName	lastName
    <chr>	<chr>
1	Fred	Weasely
2	George	Weasely
3	Ronald	Weasely

Column names can be viewed from a dataframe using the names() function, and $column_name can be used to extract a single column from the dataframe.

names(potterTrivia)

# yields
'firstName' 'lastName' 'address' 'county' 'age'

potterTrivia$firstName

# yields
'Fred' 'George' 'Ronald' 'Harry'

In many ways, a dataframe acts and feels like a database table. So it should come as no surprise that Pandas, which was developed to provide R-like functionality in Python, is described by some as, “an in-memory nosql database that has sql-like constructs”. I think similar things could be said about the R dataframe.
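Sticking with plain Python for a moment (no pandas required), the dict-of-columns version of that comparison looks like this sketch:

```python
# A dataframe-as-dict-of-columns sketch in plain Python,
# mirroring the potterTrivia examples above.
potter_trivia = {
    "firstName": ["Fred", "George", "Ronald", "Harry"],
    "lastName": ["Weasely", "Weasely", "Weasely", "Potter"],
    "age": [17, 17, 15, 15],
}

column_names = list(potter_trivia)        # like names(potterTrivia)
first_names = potter_trivia["firstName"]  # like potterTrivia$firstName
row_1 = {col: vals[0] for col, vals in potter_trivia.items()}  # like potterTrivia[1,]
```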
