Monday, February 10, 2020

Nim Test Drive

I recently read about the Nim programming language in an article that posited it might take the place of Python. Despite the article being both badly written and patently inaccurate (Nim does NOT rely on an interpreter of any kind), I was intrigued and decided to give the language a try.

What is Nim

Nim is a statically-typed compiled programming language originally designed and developed by Andreas Rumpf in 2005. The project was made public in 2008 and is now a Free and Open Source Software project being worked on by a group of volunteers along with Andreas. Version 1.0 of the language was released on September 23, 2019, which guarantees a stable base to build future releases on. Nim’s goals are to, “be as fast as C, as expressive as Python, and as extensible as Lisp.”

One of the unique features of Nim is that it compiles your code into optimized C code by default, which is then compiled by gcc or Clang on your system.

Anagram Redux

Having just spent some time working in both Python and R on a anagram implementation, I decided to try writing one in Nim. The goal was to write a program that takes a word as input from the user, reads a local file of dictionary words, and returns all of the words in the dictionary that are anagrams of the user’s original word. This would allow me to explore the following:

  1. Argument parsing
  2. File IO
  3. String processing
  4. Basic data structures
  5. Code organization using procs (functions)

First Impressions

I spent the better part of a weekend putting the program together. By far the most time-consuming aspect was determining the right way to declare the various types while instantiating variables. Frankly, it felt more difficult than it needed to be, but probably can be chalked up to my lack of experience working with statically typed languages.

Type madness!

I needed to store multiple sorted words in the same data structure. In Python this would be a list. In Nim there are 2 types of containers I could use, an array or a sequence. The array is a fixed size container that apparently is a C array that has very fast lookup times. A sequence is a dynamically allocated container that can grow in size. Nim allows you to instantiate either of these in one of 2 ways. You can define the empty object you want to use first and then write to it later, OR you can just create one on-the-fly with data.

Coming up with the right combination of types to hold my data took awhile, but I finally ended up with this to define an empty hash table that would store sequences.

var anagram_table = tables.initTable[string, seq[string]]()

In English, this means:
var anagram_table – create a mutable variable named ‘anagram_table’
tables.initTable – using the ‘tables’ module ‘initTable’ proc to create a table…
[string, seq[string]]– with “strings” as key type and a “sequence of strings” as values

Then I could write to anagram_table in the following manner.

anagram_table[key] = @[value]

The @ is extremely important on the initial value declaration, as it tells the program that I need to be able to add to that sequence later. Afterwards, new entries can be added to the same sequence like this.

anagram_table[key].add(value)

The rest of the implementation was pretty easy. Control flow and procedures were easy to implement. File IO made sense and it was easy to use an interator to walk through the file. Apparently there is a CSV parsing module and several modules to wrap access to various database flavors, so it looks like there are plenty of options available for data processing.

Amazing community!

As I was finishing this post, I posted a question on the Nim forum about how to fix an issue I was having with Unicode string parsing. I got several useful answers back within the space of a few hours. That, in combination with the online documentation means that it’s possible to get unstuck pretty fast.

Performance

My initial run of the program did not yield impressive performance. However, looking into compiler optimizations yielded some pretty amazing improvements.

# Inititial timing with just "nim compile"
real    0m2.543s
user    0m2.527s
sys     0m0.008s

Using --opt:speed did speed things up - alot!

real    0m0.650s
user    0m0.625s
sys     0m0.009s

Finally, using the -d:release option yielded the best results.

nim compile --opt:speed -d:release anagram_redux.nim 
real    0m0.286s
user    0m0.269s
sys     0m0.012s

That’s a clear improvement over my Python implementation, which gets these numbers:

real    0m0.350s
user    0m0.329s
sys     0m0.016s

So the moral of the story there is that “Yes” Nim can be fast, but it is highly dependent on compiler settings.

Final Thoughts

My other forays into compiled languages (C and Go) were confined to rewriting simple examples from books and some basic numerical conversion programs. I quickly got to the point where I hated the syntax, or the complexity needed to manage simple things. Nim seems much more legible to me and appears to have many available modules to help address specific domains more easily.

I would have expected Nim to blow Python away and that was the case. The fact that I was able to copy my compiled executable onto a machine that lacked any sort of development environment is also attractive for certain situations.

It seems like Nim might be a good candidate as a language to use for data pre-processing and wrangling, where speed is needed to do the initial heavy cleanup in a data set. I will continue to look into this use.

Code

Here is the code I came up with for this challenge.

# anagram_redux.nim

# ---- Imports ----------------------------------------------------------------#
import strutils
import algorithm
import os
import strformat
import tables
import unidecode

# ---- Converts a word into it's lower-case, sorted "signature" ---------------#
proc makeSignature(word: string): string =
  let ascii = unidecode(word)
  let lower = strutils.toLowerAscii(ascii)
  let sorted_word = algorithm.sorted(lower, system.cmp)
  let signature = sorted_word.join()
  result = signature

# ---- Main body of program ---------------------------------------------------#
proc main =
  # -- Parse the lookup word from the argument passed in
  if os.paramCount() < 1:
      quit("Usage: anagram_redux <word>")
  let lookup_word = os.paramStr(1)
  let lookup_signature = makeSignature(lookup_word)
  echo strformat.fmt("Looking up '{lookup_word}'")
  
  # -- Create empty table to store results in
  var anagram_table = tables.initTable[string, seq[string]]()

  # -- Open the text file and process it
  let file = open("dict.txt")
  
  for word in file.lines():
    let signature = makeSignature(word)
    if anagram_table.hasKey(signature):
      anagram_table[signature].add(word)
    else:
      anagram_table[signature] = @[word]
      
  # -- Lookup user's word via its signature in the anagram_table
  if anagram_table[lookup_signature].len == 1:
    echo strformat.fmt("'{lookup_word}' does not have any anagrams")
  else:
    echo anagram_table[lookup_signature]
    
main()

Written with StackEdit.

No comments: