gdata for importing .xlsx files on 64-bit Windows running 64-bit R


While in most instances it is a much better bet to convert Excel data into a *.csv file and use read.csv, there are rare times when it actually makes sense to grab the data directly out of Excel.  This is the case, for example, when a client sends you multiple Excel files (with multiple sheets per file) that need to be imported.

There are many different packages that do this; however, I kept running into issues because certain packages are only compatible with 32-bit R (or 32-bit Windows), and I run a 64-64 setup.  My solution: the gdata package.  While it is not very fast (especially for large Excel files), you can accomplish the above task with gdata’s read.xls function.  In my case the code looks like this:

library(gdata)
DF <- read.xls(xls="~/yourpath/filename.xlsx",sheet=1)
View(DF)

I plan on using a loop (with assign()) to pull in the data from 3 sheets in each of many .xlsx files and create the correct names for all of them in R; a sketch of what I have in mind is below. A word of warning: each of my sheets contained around 60K rows, and read.xls took quite a bit of time to run an import, so plan accordingly if you are dealing with big files. In my case, system.time() told me that my import of ~65K rows took around 3.5 minutes.
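
Here is a minimal sketch of that loop. The file paths, the three-sheets-per-file count, and the naming scheme are all hypothetical placeholders, so adjust them to match your own files:

library(gdata)

files <- c("~/yourpath/file1.xlsx", "~/yourpath/file2.xlsx")  ##  hypothetical paths
for (f in files) {
  for (s in 1:3) {  ##  assuming 3 sheets per file
    ##  builds names like "file1_sheet_1", "file1_sheet_2", ...
    obj.name <- paste(sub("\\.xlsx$", "", basename(f)), "sheet", s, sep = "_")
    assign(obj.name, read.xls(xls = f, sheet = s))
  }
}
ls()  ##  one data.frame per file/sheet combination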

BTW, while this does not warrant a full post, I just recently discovered the function file.choose(), which has R build the file path for you (if you happen to be too lazy to type in an actual file name)… or, to put my code above another way (and if you don’t care about looping), you can do something like this:

library(gdata)
DF <- read.xls(xls = file.choose(), sheet = 1)
View(DF)

Take it to the limit… with par(‘usr’)

Another example of exploring the limits of your tools…

Recently I was running a script that produced a time series plot and needed to draw some vertical reference lines highlighting when certain programs were implemented. Usually this is a task handled quite efficiently with the abline() function. The problem was that, due to some custom formatting (a user-generated legend placed outside the plotting area by setting xpd=TRUE in par()), when I called abline() the plotting device would draw the line through the entire plotting surface (i.e., the line was not confined to the plotting area).

So what I had to do was use the segments() function to draw my lines. The only problem was that I needed the precise upper and lower coordinates to feed to the segments() call. Even though in my case I supplied a user-defined ylim (the max and min of y), the plot adds a little bit of a fudge factor to those upper and lower limits (by default, R expands the axis range by roughly 4% on each side). Enter the usr graphical parameter. My answer came in the form of this very useful R help posting. Querying par("usr") returns a vector of the exact corner coordinates of your plotting area (very handy!).

Here is another nice post about the ‘usr’ option to par.

Here’s how it all came together.


##  model of my problem
par(xpd=TRUE)  ##  allow drawing outside the plot region (set here for the custom legend)
x <- stats::runif(12); y <- stats::rnorm(12)
i <- order(x, y); x <- x[i]; y <- y[i]
plot(x, y, main = "Notice how the line extends beyond plot margins")
abline(v=0.4)  ##  with xpd=TRUE the line spills past the plot region into the margins

##  my solution
par(xpd=TRUE)  ##  same setup as above
x <- stats::runif(12); y <- stats::rnorm(12)
i <- order(x, y); x <- x[i]; y <- y[i]
plot(x, y, main = "The segment stays inside the plot region")
u <- par("usr")  ##  c(x1, x2, y1, y2): the exact extremes of the plot region
segments(0.4, u[3], 0.4, u[4])  ##  draw only from the bottom (u[3]) to the top (u[4])

Picking one observation per ‘subject’ based on max (or min)…there can be only one!


Today, I came across a post from the ‘What you’re doing is rather desperate’ blog that dealt with a common issue, something I deal with on an (almost) daily basis. It is, in fact, so common an issue that I have a script that does all the work for me, and it was good to dive back in and refresh something I wrote quite a while ago.

N. Saunders posts a much cleaner solution than mine, but mine avoids the issues that can arise when you have non-unique values as maximums (or minimums). Plus, my solution avoids the merge() function, which, in my experience, can sometimes be a memory and time hog. See below for my take on solving his issue.

## First lets create some data (and inject some gremlins)
df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))
df.orig <- rbind(df.orig, data.frame(vars = 'A', obs1 = 6, obs2 = 15))  ##  create some ties
df.orig <- rbind(df.orig, data.frame(vars = 'A', obs1 = 6, obs2 = 16))  ##  more ties

df.orig <- df.orig[order(df.orig$vars, df.orig$obs1, df.orig$obs2),]  ##  my solution requires that you order your data first
row.names(df.orig) <- seq(1,nrow(df.orig))  ##  since the row.names get scrambled by the order() function we need to re-establish some neatness
x1 <- match(df.orig$vars, df.orig$vars)  ##  a group id: each row mapped to the first row with the same vars value
index <- as.numeric(tapply(row.names(df.orig), x1, FUN=tail, n=1)) ## here's where the magic happens
df.max <- df.orig[index,]

Implementing R-assigned variables as part of RODBC SELECT statements

I do a lot of work now that involves cutting data out of my company’s SQL database and bringing it directly into R via the RODBC package. It’s a really nice way to tidy up my workflow, and by having access to all of the aggregation (and join) functions of SQL, I can do much of the heavy lifting before bringing the data into R. As you can imagine, this really beats the pants off of exporting *.csv tables out of an SQL database and then importing them (which is what I did until I learned how to use the RODBC package).

Recently my R-RODBC workflow productivity took a bit of a quantum leap when I learned how to incorporate variables that I assign in R directly into an RODBC SQL SELECT query. The problem I was confronted with was that, when I go into a database, I often want to cut a year’s worth (or more) of data by complete months (i.e., I do not want half a month of data). In cases where the most recent entry leaves you with less than a full month of data on the tail end (or beginning), I needed a way for the SELECT statement to pull data up to the end of the PREVIOUS month (essentially ignoring the hanging chad of the partial month). If, on the other hand, the data just so happened to run up to the last day of a month, I needed the data pull to end on the last day we had data in the SQL database. Because my ‘last day’ was dependent upon what was in the database, I needed a way for R to: 1) check what the last day (most recent entry) in the data was, 2) evaluate whether that last day happened to be an end-of-month value, and 3) feed the right value into my SQL SELECT statement.

The key to how I implemented this solution was the frac argument that zoo’s as.Date method for yearmon objects provides. Once the evaluation was complete, I used if statements to assign the correct first and last date values; lastly, I used the paste function to create my SQL statement and then tidied it up with the strwrap function so that it could be fed to RODBC’s sqlQuery function. It all goes down like this:


library(zoo)
library(RODBC)
ch <- odbcConnect("THE_DSN_YOU_CREATE")
head(sqlTables(ch), n=20) ## just a test to make sure the connection worked

lastfill <- sqlQuery(ch, "SELECT (MAX(DATE)) AS LASTFILL FROM [YOUR SQL DB]") ## pull the last date

x1 <- as.Date(lastfill[1, 1])  ##  pull the last fill date as a single element
x2 <- as.Date(as.yearmon(x1), frac=1) ##  create element for last day of month 
x3 <- as.numeric(x2-x1) ##  test, result = 0 if lastfill goes to last day of month


if (x3 > 0) {  ##  lastfill is NOT at the end of a month, so back up to the previous month
    lastyearmon <- as.yearmon(x1) - 1/12  ##  the most efficient way for me to cut the data by date is the "YEAR MONTH" variable in our DB
    firstyearmon <- lastyearmon - (1 - 1/12)  ##  11 months earlier, giving 12 complete months
} else {  ##  lastfill IS at the end of a month
    lastyearmon <- as.yearmon(x1)
    firstyearmon <- lastyearmon - (1 - 1/12)
}

 

sql.select <- paste("  ## notice my use of tabs and hard returns to make the statement more legible (highly recommended), we fix these below
                    SELECT 
                    *
FROM  [YOUR SQL DB]
WHERE (YEARMONTH>= '", firstyearmon ,"' AND YEARMONTH <= '", lastyearmon, "')", sep="") ##  Simple SQL SELECT statement, but you can get really fancy here if you like.


data <- sqlQuery(ch, 
                 strwrap(sql.select, width=nchar(sql.select))) ##  use strwrap to clean up hard returns 

close(ch)

Confronting my fea-R-s, the list-indexing edition

I don’t know why I find these twins scary… but I do…

Lists are very powerful data structures, but since diving into the R pool, I’ve cycled through a spectrum of emotions/opinions about the list data structure. My responses have ranged from awe, confusion, wonder, and excitement to, most recently, fear and frustration. The latter two spring, in large part, from a lack of comfort around how to efficiently extract items that either I, or a well-meaning function (such as strsplit() or gregexpr(), or heck, even lm() and just about every other statistical function), put into a list structure. Coming to R from (I’m ashamed to say, especially after reading this) SPSS mainly, with some SAS, I don’t have the benefit of the formal object-oriented programming training and comfort with list structures that someone coming to R from Python or C++ might have.

BOTTOM LINE: While I feel quite comfortable extracting elements from data.frame structures, I’m all thumbs when it comes to lists, which is kind of ironic because the data.frame structure is itself a kind of list.
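
A quick console check makes the irony concrete (using the built-in mtcars data, just as an illustration):

is.list(mtcars)  ##  TRUE -- a data.frame really is a list under the hood
identical(mtcars$mpg, mtcars[["mpg"]])  ##  TRUE -- list-style indexing works on a data.frame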

All that said, I’m a strong believer in the power of confronting fears and addressing one’s weaknesses, and today an occasion presented itself for me to address some of my list-indexing issues. In my example I was using strsplit() to manipulate some string data. What I wanted to do was extract the third word from a vector of strings where each element contained a name with 2+ words. After doing some hunting around, I found this most helpful post, which contained a solution but also provided me with some insight on lists. Enough of that, however; here’s what I ended up doing:

V1 <- c("Mr. Hooper's Store", "Oscar the Grouch's Can", "Big Bird's Nest", "Maria's pad", "Elmo's Secret Hideout")
V2 <- sapply(strsplit(V1, " "), function(x) x[3])  ##  the result of strsplit is a list
##  this is also insightful: recursive indexing pulls the 3rd word of the 1st element
strsplit(V1, " ")[[c(1, 3)]]
##  to parse things out, the magic happens with sapply
V3 <- strsplit(V1, " ")
V3
##  let's just change our request to the second word now, just for fun!
sapply(V3, function(x) x[2])  ##  magic!

So, what I learned is that to extract specific elements from list structures, I’m going to have to lean on looping constructs and helpful R-ish loop-like functions (of the apply ilk) to get what I want.
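
The same pattern pays off well beyond string handling. As a quick illustration of my own (not from the post above), the identical sapply() trick pulls a single element out of a list of lm() fits:

##  fit one regression per cylinder group, then grab each slope from the list of fits
fits <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))
sapply(fits, function(m) coef(m)["wt"])  ##  one wt coefficient per group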