Sunday, May 3, 2020

Blazer Reassembly List

5/3/2020:

Got the transfer case and transmission crossmember bolted back in today. Pretty sure I don’t want to do that again anytime soon. Also reinstalled the transmission dipstick tube, which is held in place with a bellhousing bolt. Not sure what I think about GM engineers and their love of bellhousing bolts to hold wiring looms and other junk in place.

Things I’ve Forgotten or Need to be Addressed

  • Top off front diff
  • Torque front hubs (174 ft lbs)
  • Adjust camber via upper control arms
  • Check front brake caliper bolts
  • Coil wiring
  • Body mount washers
  • Engine oil filter
  • Rear brakes
  • Bleed brakes
  • Auto shift linkage
  • Transfer case shifter bracket
  • Figure out accelerator return spring

Reassembly ToDo

  • Trans cooler lines
  • Auto shift linkage
  • Transfer case shifter bracket
  • Fill transfer case (Check fluid, should be Dexron)
  • Transfer case shift linkage and shifter pivot bolt
  • Speedometer cable
  • Reinstall Driveshafts
  • Exhaust pipe
  • Replace body mounts (need to order more)
  • Reinstall front bumper
  • Fill Transmission
  • Fill Engine oil

Sunday, April 26, 2020

Trump is going to win again, and I don't blame Republicans

So let’s start this with a bang. I think Mr. Trump will be reelected this Fall. I think he will lose the popular vote, as he did in 2016 (and as George W. Bush did in 2000) and that despite that he will still be reelected. If so, it will be the first time a 2-term president has lost the popular vote in each term, but these are heady days and nothing is as it once was.

Why do I think Trump will win? Because Republican voters are more unified than Democratic voters. I don’t believe for one second that we have swing voters anymore, voters who identify with one party and are willing to vote for someone in another party. What we have now are people who decide they can’t vote for the nominee in their party and either don’t vote, or throw away their vote to an Independent or write-in candidate. Because Democrats are less unified in ideology than Republicans, and unwilling to support someone who doesn’t cater to their specific interests, they will lose by defaulting on the election as they did in 2016.

I feel that Congress holds much of the blame for the political climate we have today. Since the mid-70s, Congress has become ever more polarized and less willing to compromise and work in a bipartisan way. Whether this is the result of the voting populace becoming more polarized, or a function of DC politics and the influence of special interest groups is open for debate. I tend to believe that there is a higher percentage of moderate voters in each party than there are moderate legislators. Regardless of the reason, Congress now views the world through a filter that allows no room for compromise and expects voters to support them - no matter how polarizing the legislation is or how inflammatory the rhetoric. Republican voters seem to be more willing to go along with this than Democrats. I think this is because the current Republican platform is narrower than that of Democrats.

Based on the state primary results, there are obviously still a large number of Democrats who are Centrists. Biden was the least progressive of all the Democratic candidates and won by large margins. But if we learned anything in 2016, it was that Democrats are split by race, class and age into 2 large and very different camps, and that the more progressive, Left-leaning side views the old-school moderates as being as hateful as the current crop of right-wing Republicans. This is throwing the baby out with the bath water; the truth is that neither side can win an election by itself. Republican voters, on the other hand, seem willing to accept the span of ideology that currently represents their Party. (What this says about Republicans is scary, but that’s for another time.) You certainly don’t hear large numbers of Republicans disagreeing vocally in public with the Party’s majority opinions, like you do from within the Democratic party.

So I won’t be surprised if Trump wins again and I won’t even hold it against Republicans if he does. I don’t expect them to change their minds; it would be tantamount to acknowledging they made a mistake in 2016. But to Democrats who are planning to “vote their conscience” this Fall I say, “you get what you deserve when we get another 4 years of Trump.”

Wednesday, April 22, 2020

The Death of Social Media During Self Isolation

Slack, you are dead to me. Twitter, I killed you off last year. Facebook, my passing fling with you was over before it even began. Instagram, I managed to wrest my account back from the asshat who claimed my email address, but I will likely never use it. I think all I have left are a couple group chats in Google Hangouts and some shared photo albums in Google Photos. I’ll probably keep the shared albums. My inadvertent quest to kill all of my social media interactions is almost complete, but how did it even begin?

I got an email, out of the blue, from my cousin in Belgium the other day. It was great! He told me how things were going over there during the COVID crisis and asked how we were all doing over here. I enjoyed it very much and it prompted a reply from me and several additional exchanges. These emails had personal engagement, interest in the recipients and a genuine desire to communicate - things I find lacking in the vast majority of social media “interactions”. Why? Because the vast majority of Tweets, Posts, etc. are not attempts at true communication. Communication implies an exchange of information or ideas, but social media isn’t about exchanging information. Social media, for the most part, is all about providing ways to broadcast information. It was this realization a couple years ago that made me reconsider my use of Twitter and start to examine the ways in which I interacted with people via “push” communication.

Think about it. You Tweet, “Today I washed my cat!” and post a picture of the poor, wet animal. You don’t really think anyone is going to respond, but you really hope someone will. It will make you feel like you’re connected to People if someone tells you the picture was funny, or that you’re a good Cat Parent for enhancing its personal hygiene. It doesn’t really matter who the people are, just that you’re getting a response, but you hope that your inner circle of friends and family appreciate the humor and that you took the time to post something. You get a few responses and you’re hooked and off to the races. But sometimes, often in fact, you post about mundane shit because, let’s face it, most of us are not living “Lives of Adventure” and are in fact occupied with mundane shit the better part of the time. Soon you don’t get responses to much of the inane crap you’re posting, but you keep doing it anyway because… well, because. Then one day you realize that you never hear back from anyone anymore, including your friends and family, and you also realize that it’s probably ok. You get in touch with a friend and have lunch together and catch up on all the mundane shit in a much more enjoyable way. Or you pick up the phone and call, or send them an email - either way, you choose something that guarantees an exchange of ideas.

Now granted, not all of these platforms are exactly the same. Slack, for example, is a glorified IRC channel that is intended for real-time communication and used by many workplaces as a way for people to ask questions or connect with remote coworkers. It works well for that. It’s also a handy way for people in clubs to stay connected between meetings and for like-minded people to hang out and shoot the breeze, or ask for technical help. But when the channel looks like the digital equivalent of crickets chirping, after a while you forget about it and then eventually you stop turning it on because… well, CPU cycles, and you have a cat to wash.

Until now, I’ve completely discounted the distasteful echo-chamber effect that social media has had on our lives as well. The disgraceful way in which propaganda and hateful speech are disseminated on these platforms (I assume there’s probably a Fox and Friends Slack Channel) makes them even less appealing to me now. You have to wonder what proportion of posts are complete bullshit. Mundane or even esoteric trivia is one thing; outright lies and hate mongering are something else.

One could argue that a Blog post is exactly the same thing. It’s after all just a much longer form of Tweet that’s being pushed out into the World. And that’s very true, but I would argue that Tweets are very much like coming across vomit on the sidewalk in the morning. You know that something happened there the night before, but none of the details. And you don’t have the option to avoid the sick on the sidewalk, it’s kinda thrust upon you. Here, dear reader, the choice is yours. And I don’t even own a cat.

Sunday, March 22, 2020

Factorio Log

I started playing Factorio again after a fairly long break from it. I played it a bit in 2018 and while I didn’t progress very far in the tech tree, I really enjoyed it. It’s a deep game and describing it is tough because it can be played in so many different ways.

At its core, the game casts you as a stranded space explorer on a world with exposed natural resources, not unlike Earth. Your crashed ship provides you with a few basic tools and raw materials and your task is to gradually build up your manufacturing capability until you reestablish space flight. Along the way, you work through various ages of technology - coal and steam, steel and petroleum, etc. Getting through these various stages is no picnic. Each level of technology requires more and more resources and in order to move forward, you have to find ways of automating resource extraction and complex material production, like steel girders and plastic blocks. And then there’s the natives.

Oh yes, this world is inhabited. In fact, it has a healthy population of indigenous life split between 3 species: Biters, Spitters and Worms. These aliens are peaceful at first, but as you start to generate more pollution, it spreads and starts to affect them in 2 ways - it royally pisses them off AND it makes them evolve and get stronger. Eventually they start attacking you and your infrastructure and the game takes on a whole new strategic dimension that forces you to include military research and development in your gameplay.

This game is not easy. It tries to give you some basic concepts via tutorials, but really, it expects you to figure things out for yourself. Fortunately, there are a ton of resources on the Web that you can use to help you out when you get stuck. One way to work through it is to look up the materials needed to build the next level of tech and then gradually build each one. Below is a list I found on Reddit that suggests a path forward. It’s not bad, but you’re going to have to find your own path.

Suggested Strategy

  1. Electricity
  2. Iron + copper plates automation
  3. Green circuits automation
  4. Red and green science
  5. Oil processing
  6. Red circuits
  7. Steel automation
  8. Blue science
  9. Oil products cracking (Look here when you get to this point)
  10. Sulfuric acid
  11. Batteries (component) and solar panels + batteries (product) for solar electricity generation shift
  12. Bots
  13. Military and purple science
  14. Processors (blue chips)
  15. Yellow science
  16. Modules
  17. Space science
  18. Uranium processing and uranium based products
  19. Nuclear reactor
  20. Mega base projecting

Play Log

Below is a log of how the game has revealed itself to me.

Day 7: Regroup

  • Built some solar panels and powered all the pumpjacks for oil with them. Won’t pump at night, but I have enough storage (I think) that it won’t matter.
  • Going to mess around with some trains. I really dislike this centralized factory. Would prefer to have small separate factories that I can tune individually. Will see if I can make that work.

Day 6: Plastic

  • Need plastic in order to make blue science packs
  • Built a chemical plant to create sulfur from petroleum gas
  • Created red circuit board assembly line
  • Rebuilt automated assembly lines for:
    • Red science packs
    • Green science packs
    • Red circuits
    • Blue science packs
  • Research blue science
  • Research “all of the things” (Red, Blue and Green tech)
  • Added 6 more steam engines and 3 new boilers to meet power demands.

Discovered a few places that caused slowdowns in production.

  1. Iron plates were running low which caused a ripple down chain reaction. Fixed by reducing the amount of steel I was producing and replacing the stone burner smelters with steel furnaces.
  2. Green and red circuit production needed to be balanced and increased in order to supply enough blue science packs.

Day 5: Oil

  • Explored the map to hell-and-gone to find some oil fields. Finally found a couple to the North of my main area. (Note to self: Next time, recon the map first and find an area with all resources kinda close)
  • Got a small steel plate production line set up. I need it to build some of the tech that I’ve unlocked.
  • Built a couple pumpjacks and set them up on the oil-field. Set up 5 to start.
  • Built some medium power towers and strung power lines out from Factory to oil field
  • Built a BUNCH of pipe (straight and below ground) and built a pipeline from oil-field to Factory
  • Want to build red circuits and that requires plastic. Plastics are under the Oil Processing tech tree.
    • How do I make plastic? (In a Chemical Plant with coal and petroleum gas)
  • How do I deal with the oil? Where do I store it and how do I turn it into petroleum gas?
    • Need to make brick
    • Need some storage tanks
    • Need a refinery
    • If I want some pumps, will need to automate some engine production.

Day 4: Research

  • Completed all the research tasks I could do with Red and Green Science packs
  • Tore down the science pack automation. It was sufficient to get this far, but struggled to keep up with 10 Labs running
  • Rebuilt the Iron and Copper extraction areas.
    • Replaced Burner drills with electric ones
    • Straightened belt runs to smelters
    • Added stone smelters to create more iron and copper plate
  • Added a couple more boilers and steam engines in anticipation of more power load

Day 3:

  • Decided to scrap my game and start over. This time I have a plan.
  • Started Producing Electricity
    A. Gather the resources needed to produce steam
    1. Automate coal production
    2. Automate iron production
    3. Automate iron plate production
    4. Automate copper ore extraction
    5. Automate copper plate production
    Can all be done with coal as fuel. In fact, given that the burners all consume coal initially, I think it makes sense to have everything from extraction through plate production run on coal.

Day 2:

  • Completed a couple areas of Research, but it’s still a bit opaque to me. Didn’t realize that completed Research goes to the bottom of the menu, but there are still items in it to do.
  • Realized that in using the Mod for Daytime-Only, I screwed up my ability to earn achievements
  • Realized that I need to enable the setting that allows for a Research Queue.
  • Did lame coal and iron mining that isn’t sustainable, but gave enough resources to start Research.
  • Got electricity from a steam boiler.
  • Started a new game in V 0.17 with a mod to turn night time off.
  • Reinstalled V 0.16. Played it for awhile, it’s ok, but the new graphics in 0.17 are really nice. And I miss the new hot-swappable inventory menu. Got through making electricity with steam and scrapped my game.

Day 1:

  • How the hell do I play this game again? The mouse doesn’t move the character? How do I harvest a tree for wood?
  • Played the tutorial in V 0.17. What the hell is this Compilatron thing? Time to just play in Sandbox mode.
  • Figured out how to have a self-feeding burner drill.

Thursday, February 20, 2020

Nim CSV Tool

Background

I have a little project to analyze 10 years of Seattle crime incident data. It’s just a personal itch I need to scratch, but it does give me an excuse to look at some data sets in R and see if I can answer some questions. The incident data comes as a 363 MB CSV file that contains 1.4 million records, one per incident. Each record looks like this:

new row: 
> CAD CDW ID:  15738  
> CAD Event Number:  10000246255 
> General Offense Number:  2010246255
> Event Clearance Code:  250
> Event Clearance Description:  MISCHIEF, NUISANCE COMPLAINTS
> Event Clearance SubGroup:  NUISANCE, MISCHIEF COMPLAINTS
> Event Clearance Group:  NUISANCE, MISCHIEF 
> Event Clearance Date:  07/17/2010 08:55:00 PM
> Hundred Block Location:  21XX BLOCK OF 3RD AVE
> District/Sector:  M
> Zone/Beat:  M2
> Census Tract:  7200.2025
> Longitude:  -122.342843234
> Latitude:  47.613551471
> Incident Location:  (47.613551471, -122.342843234)
> Initial Type Description:  
> Initial Type Subgroup:  
> Initial Type Group:  
> At Scene Time:  

Now this data looks pretty normal to me. It’s a mix of numeric and string data. Some of the fields aren’t filled in and the field names, while useful, are in a format that makes them a PITA to use in a database. That made me think a useful little tool would be a standalone utility to reformat CSV files - something that would do the following:

  1. Drop columns
  2. Rename columns
  3. List a subset of records

Now granted, this can all be done in Google Sheets, or Excel, or in R, but I think having something small and compiled, like a C program, would be nice. It would be fast and wouldn’t have to rely on anything else to work. So I’m going to write it in Nim and document the process here.

Project Breakdown

In order to be sure that I don’t waste my time, I’m going to reiterate my requirements. I want to be able to do the following:

  • Specify the name of a CSV file and have the program open it and perform the following tasks:
    1. List the 1st few records in horizontal order (like the R dplyr::glimpse function)
    2. Specify a subset of existing fields to keep
    3. Specify new names for the remaining fields
    4. Print the modified version of the file to the screen
    5. Optionally, write the modified version out to a new file

So, what are the things I need to figure out? These are basically the tasks that I need to implement to make this work.

Tasks

  1. How to pass in the name of a file and have Nim recognize it?
  2. How to pass in arguments and have the program do different things based on what they are?
  3. How to open a CSV file, potentially with different delimiters?
  4. How to recognize the column headers in the CSV?
  5. How to keep only specific columns?
  6. How to rename the columns that are left?
  7. How to list records horizontally?
  8. How to print data only for the columns I’ve chosen to keep?
  9. How to save the modified data to a new file?

Function signature

Before I get ahead of myself, what’s this thing going to look like to a user when they try to do something with it? Here’s a guess at what that might look like (NOTE: I’ve added line breaks for clarity that won’t be there when used for real.)

  <filename.csv> 
  --keep "Event Clearance Code|
          Event Clearance SubGroup|
          Event Clearance Group|
          Event Clearance Date|
          District/Sector|
          Zone/Beat|
          Census Tract|
          Longitude|
          Latitude"  
  --rename "Event Clearance Code=clearance_code|
            Event Clearance SubGroup=subgroup|
            Event Clearance Group=group|
            Event Clearance Date=date|
            District/Sector=district|
            Zone/Beat=beat|
            Census Tract=census_tract|
            Longitude=lon|
            Latitude=lat"
  --list         

Which should produce something that looks like this.

> clearance_code:  250
> subgroup:  NUISANCE, MISCHIEF COMPLAINTS
> group:  NUISANCE, MISCHIEF 
> date:  07/17/2010 08:55:00 PM
> district:  M
> beat:  M2
> census_tract:  7200.2025
> lon:  -122.342843234
> lat:  47.613551471

Plenty of stuff to figure out, so might as well get on with it.

How to pass a filename as an argument to Nim?

I have 2 options here. Option 1 is to try and re-use the mechanism I’ve used before in the anagram_redux.nim code that I wrote. It would look something like this:

if os.paramCount() < 1:
  quit("Usage: <program> <filename.csv>")
let filename = os.paramStr(1)

That would basically check to see if arguments had been passed in along with the program name and if one (or more) had been, then the 1st argument would be treated as the filename. The use of positional arguments like this (1st argument = file, 2nd argument = something_else) is ok, but I don’t like it for something like this where there are many other arguments that can be passed in as well.

So that leaves option 2, which is to use Nim’s parseopt module. parseopt is nice because I can still have an argument that isn’t named anything to specify the filename, and I can have named options for everything else. Another benefit of using parseopt is that I can combine items 1 and 2 from my task list and solve them both at once.

How to pass in arguments by name into the program?

I have 4 arguments that I want to pass into the program.

  • <filename>: mandatory, name of the CSV file to process
  • --keep: optional, original names of columns to keep
  • --rename: optional, original column name followed by new column name
  • --list: optional, show 1 record in horizontal format (may want to consider showing “n” records)

According to the doc, I should be able to do the following:

import parseopt

var
  filename: string
  keep_fields: string
  rename_fields: string
  list_records: bool

proc parseOptions() =
  var options = initOptParser()
  for kind, key, val in getopt(options):
    if kind == cmdArgument: filename = key # a single statement can sit on the same line
    elif kind == cmdLongOption:
      if key == "keep": keep_fields = val
      elif key == "rename": rename_fields = val
      elif key == "list": list_records = true # flag it; the actual listing happens later
    elif kind == cmdEnd:
      assert(false) # cannot happen
# NOTE: Should use the 'case' syntax that's in the docs, but that's not yet familiar to me.

So that takes care of the bare minimum code needed to parse my options. It’s brittle and will need to be fleshed out, but it works.
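Out of curiosity, here’s roughly what that case version would look like, based on the parseopt docs (a sketch I haven’t actually compiled yet):

proc parseOptionsCase() =
  var options = initOptParser()
  for kind, key, val in getopt(options):
    case kind
    of cmdArgument:
      filename = key
    of cmdLongOption, cmdShortOption:
      case key
      of "keep": keep_fields = val
      of "rename": rename_fields = val
      of "list": list_records = true
      else: discard # ignore options we don't know about
    of cmdEnd:
      assert(false) # cannot happen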

Next: Figure out CSV parsing.
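To get a head start on that: the standard library ships a parsecsv module, and judging from its docs, reading the header row and walking the records should look something like the sketch below. This is untested; the filename is a placeholder and the column name is just one from my data set. Note that open takes a separator parameter, which should cover task 3 (different delimiters).

import parsecsv

var parser: CsvParser
parser.open("incidents.csv", separator = ',')
parser.readHeaderRow()
echo parser.headers # the column names from the header row

while parser.readRow():
  # rowEntry() pulls a field out of the current row by column name
  echo parser.rowEntry("Event Clearance Code")

parser.close()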


Monday, February 10, 2020

Nim Test Drive

I recently read about the Nim programming language in an article that posited it might take the place of Python. Despite the article being both badly written and patently inaccurate (Nim does NOT rely on an interpreter of any kind), I was intrigued and decided to give the language a try.

What is Nim

Nim is a statically-typed compiled programming language originally designed and developed by Andreas Rumpf in 2005. The project was made public in 2008 and is now a Free and Open Source Software project being worked on by a group of volunteers along with Andreas. Version 1.0 of the language was released on September 23, 2019, which guarantees a stable base to build future releases on. Nim’s goals are to “be as fast as C, as expressive as Python, and as extensible as Lisp.”

One of the unique features of Nim is that it compiles your code into optimized C code by default, which is then compiled by gcc or Clang on your system.
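In practice, building a program is a single command (hello.nim is just a placeholder here), and the --cc switch lets you pick the backend C compiler explicitly:

nim compile hello.nim
nim compile --cc:clang hello.nim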

Anagram Redux

Having just spent some time working in both Python and R on an anagram implementation, I decided to try writing one in Nim. The goal was to write a program that takes a word as input from the user, reads a local file of dictionary words, and returns all of the words in the dictionary that are anagrams of the user’s original word. This would allow me to explore the following:

  1. Argument parsing
  2. File IO
  3. String processing
  4. Basic data structures
  5. Code organization using procs (functions)

First Impressions

I spent the better part of a weekend putting the program together. By far the most time-consuming aspect was determining the right way to declare the various types while instantiating variables. Frankly, it felt more difficult than it needed to be, but that can probably be chalked up to my lack of experience working with statically typed languages.

Type madness!

I needed to store multiple sorted words in the same data structure. In Python this would be a list. In Nim there are 2 types of containers I could use, an array or a sequence. An array is a fixed-size container - apparently a C array under the hood - with very fast lookup times. A sequence is a dynamically allocated container that can grow in size. Nim allows you to instantiate either of these in one of 2 ways: you can define the empty object you want to use first and then write to it later, OR you can just create one on-the-fly with data.
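To make that concrete, here’s a quick sketch of both containers and both instantiation styles (throwaway names, just for illustration):

# Fixed-size array: the length is part of the type, declared up front
var letters: array[4, string]
letters[0] = "l"

# Sequence: define it empty and write to it later...
var words: seq[string] = @[]
words.add("silo")

# ...or create one on-the-fly with data (the @ makes it a seq)
let more_words = @["oils", "soil", "Lois"]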

Coming up with the right combination of types to hold my data took awhile, but I finally ended up with this to define an empty hash table that would store sequences.

var anagram_table = tables.initTable[string, seq[string]]()

In English, this means:

  • var anagram_table – create a mutable variable named ‘anagram_table’
  • tables.initTable – use the ‘tables’ module’s ‘initTable’ proc to create a table…
  • [string, seq[string]] – …with “string” as the key type and “sequence of strings” as the value type

Then I could write to anagram_table in the following manner.

anagram_table[key] = @[value]

The @ is extremely important in the initial value declaration, as it makes the literal a sequence (rather than a fixed-size array) that I can add to later. Afterwards, new entries can be added to the same sequence like this.

anagram_table[key].add(value)

The rest of the implementation was pretty easy. Control flow and procedures were easy to implement. File IO made sense and it was easy to use an iterator to walk through the file. Apparently there is a CSV parsing module and several modules to wrap access to various database flavors, so it looks like there are plenty of options available for data processing.
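I haven’t tried the database wrappers yet, but db_sqlite, for example, looks about this simple from skimming the docs (a sketch; the file and table names are made up):

import db_sqlite

let db = open("words.db", "", "", "") # user/password/db name are unused for SQLite
db.exec(sql"CREATE TABLE IF NOT EXISTS words (word TEXT)")
db.exec(sql"INSERT INTO words (word) VALUES (?)", "silo")
for row in db.fastRows(sql"SELECT word FROM words"):
  echo row[0]
db.close()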

Amazing community!

As I was finishing this post, I posted a question on the Nim forum about how to fix an issue I was having with Unicode string parsing. I got several useful answers back within the space of a few hours. That, in combination with the online documentation means that it’s possible to get unstuck pretty fast.

Performance

My initial run of the program did not yield impressive performance. However, looking into compiler optimizations yielded some pretty amazing improvements.

# Initial timing with just "nim compile"
real    0m2.543s
user    0m2.527s
sys     0m0.008s

Using --opt:speed did speed things up - a lot!

real    0m0.650s
user    0m0.625s
sys     0m0.009s

Finally, using the -d:release option yielded the best results.

nim compile --opt:speed -d:release anagram_redux.nim 
real    0m0.286s
user    0m0.269s
sys     0m0.012s

That’s a clear improvement over my Python implementation, which gets these numbers:

real    0m0.350s
user    0m0.329s
sys     0m0.016s

So the moral of the story is that, yes, Nim can be fast, but it is highly dependent on compiler settings.

Final Thoughts

My other forays into compiled languages (C and Go) were confined to rewriting simple examples from books and some basic numerical conversion programs. I quickly got to the point where I hated the syntax, or the complexity needed to manage simple things. Nim seems much more legible to me and appears to have many available modules to help address specific domains more easily.

I expected Nim to blow Python away, and that was the case. The fact that I was able to copy my compiled executable onto a machine that lacked any sort of development environment is also attractive for certain situations.

It seems like Nim might be a good candidate as a language to use for data pre-processing and wrangling, where speed is needed to do the initial heavy cleanup in a data set. I will continue to look into this use.

Code

Here is the code I came up with for this challenge.

# anagram_redux.nim

# ---- Imports ----------------------------------------------------------------#
import strutils
import algorithm
import os
import strformat
import tables
import unidecode

# ---- Converts a word into its lower-case, sorted "signature" ----------------#
proc makeSignature(word: string): string =
  let ascii = unidecode(word)
  let lower = strutils.toLowerAscii(ascii)
  let sorted_word = algorithm.sorted(lower, system.cmp)
  let signature = sorted_word.join()
  result = signature

# ---- Main body of program ---------------------------------------------------#
proc main =
  # -- Parse the lookup word from the argument passed in
  if os.paramCount() < 1:
    quit("Usage: anagram_redux <word>")
  let lookup_word = os.paramStr(1)
  let lookup_signature = makeSignature(lookup_word)
  echo strformat.fmt("Looking up '{lookup_word}'")
  
  # -- Create empty table to store results in
  var anagram_table = tables.initTable[string, seq[string]]()

  # -- Open the text file and process it
  let file = open("dict.txt")
  
  for word in file.lines():
    let signature = makeSignature(word)
    if anagram_table.hasKey(signature):
      anagram_table[signature].add(word)
    else:
      anagram_table[signature] = @[word]
      
  file.close()

  # -- Look up the user's word via its signature in the anagram_table
  if not anagram_table.hasKey(lookup_signature) or
      anagram_table[lookup_signature].len == 1:
    echo strformat.fmt("'{lookup_word}' does not have any anagrams")
  else:
    echo anagram_table[lookup_signature]
    
main()


Friday, February 7, 2020

In Search of a Fast R Anagram Solution

The Problem

A couple years ago, Frank Warmerdam challenged me to solve a little logic problem. Since then, I’ve found it to be a useful little experiment to implement in different ways to see whether it’s possible to improve on the solution I came up with then. Here is essentially what he wanted.

Find all of the anagrams in a dictionary text file in which there are at least 4 letters in the word and at least as many anagrams as there are letters in the word. The dictionary should be a file on disk with one line per word.

Output should be in this format:

emit, item, mite, time  
merit, miter, mitre, remit, timer  
reins, resin, rinse, risen, serin, siren  
inert, inter, niter, retin, trine  
inset, neist, snite, stein, stine, tsine

The Key

What is an anagram? It’s defined as “a word, phrase, or name formed by rearranging the letters of another word.” The key, literally, is to convert every word into a set of ordered letters and then look to see if any other words convert into the same set of ordered letters. The ordered letters become the index value and all of the words that share it are anagrams of each other. In Python, you have a data representation that looks like this:

words = ["Lois", "silo", "oils", "soil"]
word_index = {'ilos': ["Lois", "silo", "oils", "soil"]}

Then when you want to see what anagrams exist for “soil”, you simply convert it into its key, “ilos”, and then look it up in the index.

word_index['ilos']
['Lois', 'silo', 'oils', 'soil']

A simple Python solution for the indexing problem can be written like this.

word_file = open("dict.txt", 'r')
word_data = word_file.readlines()
word_file.close()

signature_index = {}

for word in word_data:
  word = word.strip()
  word = word.lower()
  signature = ''.join(sorted(word))
  if signature not in signature_index:
    signature_index[signature] = [word]
  else:
    signature_index[signature].append(word)

What about in R?

The Python solution above generates the indexed values for a text file with 99,179 words in 0.222s. It does this consistently and without breaking a sweat. At its core, it relies on a loop in which every word in the file is converted into its “signature”. Now if you read about R for more than 5 minutes, you come across this advice, “Avoid loops as much as possible.” You also see a great deal of justification like, “Computing time doesn’t matter as much as thinking time”. This translates to, “If the code is easy to write and understand, it’s worth a few extra seconds of computing time.” Basically, “Yes, R is slow but it makes up for that with other advantages.”

The advice to avoid loops is good, because R has a concept of “vectorization” that is extremely fast. What that means is that R can do certain types of operations - namely ones applied to an entire column at once - much more quickly than others. So for example, if you want to compute the square root of every value in a column, rather than loop through each row, you would just apply the sqrt function to the entire column at once, like this.

column_1 <- as.integer(c(4,9,16,25,36,49,64,81,100,144))
column_2 <- sqrt(column_1)

# yields something like...
   column_1 column_2
        4        2
        9        3
       16        4
       25        5
       36        6
       49        7
       64        8
       81        9
      100       10
      144       12

This is a neat concept and apparently is possible because, “Vectorized functions usually involve a behind-the-scenes loop in a low-level language (C or Fortran), which runs way faster than a pure R loop.” As long as your problem can be approached using this concept, you’re golden. So what about this anagram problem?

Here are the things we need to do:

  1. Read the data into R
  2. Convert the words into lower case
  3. Sort the lower case letters
  4. Reassemble the letters into a string

Ideally, we’d like to end up with something like this, and we’d like to get there using only vectorized functions.

  words signature
1  Lois      ilos
2  soil      ilos
3  oils      ilos
4  silo      ilos

The problem is, we can’t. If we walk through the code, we get to a point where we can’t use a vectorized function.

words <- c("Lois", "soil", "silo", "oils")
lower_case <- tolower(words)
strsplit(lower_case, "")  <======= BOOOM!!!
[[1]]
[1] "l" "o" "i" "s"

[[2]]
[1] "s" "o" "i" "l"

[[3]]
[1] "s" "i" "l" "o"

[[4]]
[1] "o" "i" "l" "s"

The R strsplit function works similarly to the Python sorted function in that it returns a list of elements. In Python, sorted takes the string as input and spits out a sorted list of elements that you then have to join together. In R, you split the string using strsplit, then you sort the elements with sort, then you join them back together. Unfortunately, there is no way (that I can figure out) to assign that list output as a new column that I can continue to process.

So I have to loop. I mean there’s no way around it - “for each word, split it into its elements, sort the elements, then join them back together”. This is where the apply family of functions comes into play in R. Using them, I end up with this.

# R Implementation
  word_df <- read.csv("dict.txt", 
                      col.names = c("word"),
                      header = FALSE,
                      stringsAsFactors = FALSE, 
                      encoding = "UTF-8")

  word_df$signature <- lapply(word_df$word,
                              function(x)
                              {paste0(sort(unlist(strsplit(tolower(x), ""))), collapse = "")}) 

The cost?

user  system elapsed 
3.16    0.01    3.24 

So 3.16s in R versus 0.222s in Python… That’s significant, and after that time has elapsed the R version still hasn’t gotten as far in the process as the Python one.

stringi To the Rescue!

It’s clear that one line of code is doing all of the heavy lifting in this script.

{paste0(sort(unlist(strsplit(tolower(x), ""))), collapse = "")}

Some Web research revealed that the stringi package had potential for improving the situation. After testing all of the possible string functions in stringi, it seems that the best performance comes from replacing the Base R sort with stri_sort from stringi. All of the other base functions are faster than their stringi equivalents.

{paste0(stri_sort(unlist(strsplit(tolower(x), ""))), collapse = "")}

Speed benefit?

   user  system elapsed 
   0.92    0.00    0.97

That’s not bad at all!

Next Steps

Given that the Tidyverse is built on top of stringr, which itself is built on top of stringi, it probably would be worth rewriting this entirely using Tidyverse functions to see if I can eke out even more performance. I have to say that these results are unexpected and definitely encouraging.