A second note on loops… more looping with lists

As I continue to plod through Matloff’s text (The Art of R Programming: A Tour of Statistical Software Design), which is very appropriate for where I happen to be in my journey with R, I have come upon another piece of looping code.  I wanted to take some time to jot down my thoughts and parse my way through each step.

This code does something that I imagine would be an important data munging step for a spam filtering application or any application that does text analysis.  It takes text input (which you first need to convert into a character vector, one word per element) and then loops over that vector to record the positions at which each word occurs; the count for each word is then simply the number of positions recorded.  It is important to note, again, that I do realize that experienced and proficient R programmers develop ways to tackle tasks like this without looping constructs.  This is, for me, purely an academic exercise.

First, as always, the code (adapted from Chapter 3 of Matloff’s text):

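findwords <- function(tf) {  # tf: character vector of words
  wl <- list()  # word list: one named element per distinct word
  for (i in seq_along(tf)) {  # seq_along() is safe even if tf is empty
    wrd <- tf[i]  # ith word in the input vector
    wl[[wrd]] <- c(wl[[wrd]], i)  # append position i to that word's entry
  }
  return(wl)
}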

Let’s assume we are going to run the function on this character vector:

txt <- c("now", "is", "the", "the", "time", "or", "time", "now", "is", "all")

The really interesting stuff happens, in my opinion, inside the for loop.  Let’s start with the first line of the loop body, the one that assigns the current word to wrd, and do a bit of slow cognitive mastication.  The loop variable i starts at 1 and runs up to the index of the last item in the character vector passed in as tf.  So, for example, during the first iteration wrd will hold the value of the 1st item in the vector (for txt that would be "now").

Holding that thought, we move to the next line of code (wl[[wrd]] <- c(wl[[wrd]], i)).  This is where the magic happens.  Here we are telling the loop to fill the list (wl) that we created above.  The first interesting thing, to the left of the assignment operator (<-), is that we are creating names for items in the list, and those names come from the character vector we are looping over.  During the first iteration wrd holds "now", so the first item in the list will be named "now".  The second interesting thing, to the right of the assignment operator, is a concatenation that builds up a vector, adding an item each time the loop meets that word.  What gets added is i, the index (within txt) of the word currently held in wrd (1 in the case of iteration #1).  So for iteration #1, wl[["now"]] does not exist yet and evaluates to NULL; c(NULL, 1) drops the NULL, and the loop ends up creating the list item wl$now as the single-item vector 1.
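To convince myself of that last step, here is a minimal sketch of the first iteration done by hand.  The name first_lookup is my own, used only for illustration; everything else is base R.

```r
wl <- list()                 # empty list, as at the start of the function

first_lookup <- wl[["now"]]  # no element named "now" yet, so this is NULL
is.null(first_lookup)        # TRUE

# c() drops NULL, so c(NULL, 1L) is just the one-element vector 1L.
# Assigning to wl[["now"]] creates the element and names it in one step.
wl[["now"]] <- c(wl[["now"]], 1L)

names(wl)                    # "now"
wl$now                       # 1
```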

The other cool thing to keep in mind is that when the loop encounters a word it has already created a list item for (take item #8 in txt, the second instance of "now"), it concatenates the new index onto the vector wl$now that was already created (in the first iteration of the loop).  So, for example, after the 8th iteration wl$now is c(1, 8)!
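A quick way to watch this happen is to run the loop body outside the function and report whenever the "now" entry changes.  This is only a tracing sketch of my own: the cat() call is not part of Matloff's code.

```r
txt <- c("now", "is", "the", "the", "time", "or", "time", "now", "is", "all")

wl <- list()
for (i in seq_along(txt)) {
  wrd <- txt[i]
  wl[[wrd]] <- c(wl[[wrd]], i)
  if (wrd == "now") {
    # Print the entry each time a "now" is encountered
    cat("iteration", i, ": wl$now =", wl$now, "\n")
  }
}

# Cross-check against a direct lookup of the positions of "now"
which(txt == "now")
```

The trace prints twice, once at iteration 1 and once at iteration 8, and the final wl$now matches what which() reports.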

Then, of course, the function needs to hand back the list it built, which is what return(wl) does.
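Putting it all together, here is a self-contained run on the example vector.  I am assuming a version of the function that indexes its own argument (tf) inside the loop; the names result and counts are mine.

```r
findwords <- function(tf) {
  wl <- list()
  for (i in seq_along(tf)) {
    wrd <- tf[i]                  # ith word in the input vector
    wl[[wrd]] <- c(wl[[wrd]], i)  # append position i to that word's entry
  }
  return(wl)
}

txt <- c("now", "is", "the", "the", "time", "or", "time", "now", "is", "all")

result <- findwords(txt)
str(result)                # one integer vector of positions per distinct word

# The word counts are just the lengths of those position vectors
counts <- lengths(result)
counts
```

The list items appear in order of first occurrence (now, is, the, time, or, all), and lengths() turns the positions into the per-word counts a text analysis would typically want.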

Very cool!
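For comparison with the loop, and as a nod to the earlier remark that proficient R programmers avoid looping constructs for this kind of task, here is one loop-free possibility using base R's split() and table().  This is my own sketch, not something taken from Matloff's text, and the name positions is just for illustration.

```r
txt <- c("now", "is", "the", "the", "time", "or", "time", "now", "is", "all")

# split() groups the indices 1..length(txt) by the word found at each index
positions <- split(seq_along(txt), txt)
positions$now              # 1 8
positions$time             # 5 7

# table() gives the per-word counts directly
word_counts <- table(txt)
word_counts
```

One difference from the loop version: split() orders the list items by sorted word rather than by first occurrence.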
