Friday, February 7, 2020

In Search of a Fast R Anagram Solution

The Problem

A couple years ago, Frank Warmerdam challenged me to solve a little logic problem. Since then, I’ve found it to be a useful little experiment to implement in different ways to see whether it’s possible to improve on the solution I came up with then. Here is essentially what he wanted.

Find all of the anagrams in a dictionary text file in which there are at least 4 letters in the word and at least as many anagrams as there are letters in the word. The dictionary should be a file on disk with one line per word.

Output should be in this format:

emit, item, mite, time  
merit, miter, mitre, remit, timer  
reins, resin, rinse, risen, serin, siren  
inert, inter, niter, retin, trine  
inset, neist, snite, stein, stine, tsine

The Key

What is an anagram? It’s defined as being, “a word, phrase, or name formed by rearranging the letters of another word.” The key, literally, is to convert every word into a set of ordered letters and then look to see if any other words convert into the same set of ordered letters. The ordered letters become the index value and all of the words that share it are anagrams of each other. In Python, you have a data representation that looks like this:

words = ["Lois", "silo", "oils", "soil"]
index = {'ilos': ["Lois", "silo", "oils", "soil"]}

Then when you want to see what anagrams exist for “soils”, you simply convert it into its key, “ilos” and then look it up in the index.

word_index['ilos']
['Lois', 'silo', 'oils', 'soil']

A simple Python solution for the indexing problem can be written like this.

word_file = open("dict.txt", 'r')
word_data = word_file.readlines()
word_file.close()

signature_index = {}

for word in word_data:
  word = word.strip()
  word = word.lower()
  signature = ''.join(sorted(word))
  if signature not in signature_index:
    signature_index[signature] = [word]
  else:
    signature_index[signature].append(word)

What about in R?

The Python solution above generates the indexed values for a text file with 99,179 words in 0.222s. It does this consistently and without breaking a sweat. At its core, it relies on a loop in which every word in the file is converted into its “signature”. Now if you read about R for more than 5 minutes, you come across this advice, “Avoid loops as much as possible.” You also see a great deal of justification like, “Computing time doesn’t matter as much as thinking time”. This translates to, “If the code is easy to write and understand, it’s worth a few extra seconds of computing time.” Basically, “Yes, R is slow but it makes up for that with other advantages.”

The advice to avoid loops is good, because R has a concept of “vectorization” that is extremely fast. What that means is that R can do certain types of operations much more quickly, namely ones that are cross-column, than others. So for example, if you want to compute the square root of every value in a column, rather than loop through each row, you would just apply the sqrt function to the entire column at once, like this.

column_1 <- as.integer(c(4,9,16,25,36,49,64,81,100,144))
column_2 <- sqrt(column_1)

# yields somthing like...
   column_1 column_2
        4        2
        9        3
       16        4
       25        5
       36        6
       49        7
       64        8
       81        9
      100       10
      144       12

This is a neat concept and apparently is possible because, “Vectorized functions usually involve a behind-the-scenes loop in a low-level language (C or Fortran), which runs way faster than a pure R loop.” As long as your problem can be approached using this concept, you’re golden. So what about this anagram problem?

Here are the things we need to do:

  1. Read the data into R
  2. Convert the words into lower case
  3. Sort the lower case letters
  4. Reassemble the letters into a string

Ideally, we’d like to end up with something like this, and we’d like to get there using only vectorized functions.

  words signature
1  Lois      ilos
2  soil      ilos
3  oils      ilos
4  silo      ilos

The problem is, we can’t. If we walk through the code, we get to a point where we can’t use a vectorized function.

words <- c("Lois", "soil", "silo", "oils")
lower_case <- tolower(words)
strsplit(lower_case, "")  <======= BOOOM!!!
[[1]]
[1] "l" "o" "i" "s"

[[2]]
[1] "s" "o" "i" "l"

[[3]]
[1] "s" "i" "l" "o"

[[4]]
[1] "o" "i" "l" "s"

The R strsplit function works similarly to the Python sorted function in that it returns list of elements. In Python, sorted takes the string as input and spits out a sorted list of elements that you then have to join together. In R, you split the string using strsplit, then you sort the elements with sort, then you join them back together. Unfortunately, there is no way (that I can figure out) to assign that list output as a new column that I can continue to process.

So I have to loop. I mean there’s no way around it - “for each word, split it into its elements, sort the elements then join them back together”. This is where the apply family of functions come into play in R. Using them, I end up with this.

# R Implementation
  word_df <- read.csv("dict.txt", 
                      col.names = c("word"),
                      header = FALSE,
                      stringsAsFactors = FALSE, 
                      encoding = "UTF-8")

  word_df$signature <- lapply(word_df$word,
                              function(x)
                              {paste0(sort(unlist(strsplit(tolower(x), ""))), collapse = "")}) 

The cost?

user  system elapsed 
3.16    0.01    3.24 

So 3.16s in R versus 0.222s in Python… That’s significant, and I haven’t even gotten as far in the process after that time has elapsed.

stringi To the Rescue!

It’s clear that one line of code is doing all of the heavy lifting in this script.

{paste0(sort(unlist(strsplit(tolower(x), ""))), collapse = "")}

Doing some Web research revealed that the stringi package seemed to have some potential for improving the situation. By testing all of the possible string functions in stringi, it seems that the best performance comes from replacing the Base R sort with stri_sort from stringi. All of the other base functions are faster than the equivalent ones in stringi.

{paste0(stri_sort(unlist(strsplit(tolower(x), ""))), collapse = "")})

Speed benefit?

   user  system elapsed 
   0.92    0.00    0.97

That’s not bad at all!

Next Steps

Given that the Tidyverse is built on top of stringr, which itself is built on top of stringi, it probably would be worth rewriting this entirely using Tidyverse functions to see if I can eke out even more performance. I have to say that these results are unexpected and definitely encouraging.

No comments: