Creating a string or numeric vector from a factor vector using the factor's levels: a transformation of sorts

Posted on September 16, 2013 by Healthoutcomesguy

I’ve used this bit of code quite a bit. It comes in handy when I want to work with something that has been coerced to a factor. I run into this often after a table() call, when I want to do something more interesting with the result (e.g., convert it to a data.frame and sort or select rows by the bin name).
vec <- round(12 + rnorm(100, 5, 7), 0)
t1 <- table(vec)
df1 <- data.frame(t1)
df1
class(df1$vec)
df1$vec <- as.numeric(levels(df1$vec))[df1$vec]
df1
class(df1$vec)
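To see why the indexing idiom above is needed, here is a minimal sketch with a toy factor (the values are made up for illustration and are not the post's simulated data). Calling as.numeric() directly on a factor returns the internal integer codes rather than the values the labels spell out.

```r
# Hypothetical toy values, not the post's simulated data
f <- factor(c("10", "25", "3", "25"))

# as.numeric() on a factor returns the internal integer codes,
# which follow the sorted order of the levels: "10" "25" "3"
codes <- as.numeric(f)

# Indexing the numeric levels by the factor recovers the original values
values <- as.numeric(levels(f))[f]

# An equivalent route converts to character first
values_chr <- as.numeric(as.character(f))

print(codes)
print(values)
```

Both routes give the same result; the levels-based one converts each distinct level once instead of converting every element.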

Histogram with density plot overlay (and fancy ggplot-esque background + summary data where legend goes)

Posted on September 13, 2013 by Healthoutcomesguy

I recently had to visualize some data for a client that involved identifying the number of members who were under the age of 18. Thanks to some help from Robert Kabacoff’s Quick-R site, I put together a histogram with a density plot overlay. This is how I did it:

set.seed(123)
agedata <- round(20 + rnorm(1000, 25, 20), 0)

xrange <- "Jan-2013 to Jun-2013"
par(mar = c(5.1, 4.1, 4.1, 11.5)) ## wide right margin leaves room for the summary
## first pass: bare histogram, kept in x1 for its breaks and counts
x1 <- hist(agedata, breaks = 100, xlim = range(0, 100), lwd = 2, yaxt = 'n', xaxt = 'n', ylab = "", xlab = "", main = "")
u <- par("usr")
rect(u[1], u[3], u[2], u[4], col = "gray88", border = "black") ## gray background
abline(h = pretty(x1$counts), col = 'white') ## draw h lines
abline(v = pretty(range(1, 100)), col = 'white') ## draw v lines
par(new = TRUE)
hist(agedata, breaks = 100, xlim = range(0, 100), ylim = range(u[3:4]), col = "yellow", xlab = "Age", main = paste("Drug Users by Age fills from", xrange))
## I will use this next chunk of code to "color in" only those bars that are below 18 years to highlight them
xx1 <- data.frame(table(agedata))
xx1$agedata <- as.numeric(levels(xx1$agedata))[xx1$agedata]
xx1 <- xx1[order(xx1$agedata), ]
xx1 <- xx1[xx1$agedata < 18, ]
vec <- NULL
for (i in seq_len(nrow(xx1))) { ## this is where I create my new data
  run <- rep(xx1$agedata[i], xx1$Freq[i])
  vec <- c(vec, run)
}
par(new = TRUE)
## notice I have to use breaks from the previous hist call to ensure that the breaks for the new hist line up with the previous
hist(vec, breaks = c(x1$breaks[x1$breaks < 18]), xlim = range(0, 100), ylim = range(u[3:4]), col = "darksalmon", xlab = "", main = "")
abline(v = 17, lwd = 2, lty = 2, col = "red")
count <- length(agedata) ## the original used nrow(genage), an object that is not defined in this post
avage <- round(mean(agedata), 2)
medage <- round(median(agedata), 2)
sdage <- round(sd(agedata), 2)
# text(xpd=TRUE, x= 35, y=35, offset=-.2, labels=paste("Count of Users:", count, "\n", "Mean Age:", avage, "\n", "Median Age:", medage, "\n", "St. Dev.:", sdage), font=2)
## you are going to have to create the strings that go into the legend yourself, I used some that I had previously created
legend(xpd = TRUE, 'topright', inset = c(-0.3, 0), legend = paste(" Count of Users:", count, "\n", "Mean Age:", avage, "\n", "Median Age:", medage, "\n", "St. Dev.:", sdage), pch = NA, title = "Summary", bty = 'n', cex = 1)
d <- density(agedata)
par(new = TRUE)
plot(d, main = "", xlim = range(0, 100), xlab = "", ylab = "", xaxt = 'n', yaxt = "n")
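As a side note, the loop that rebuilds the under-18 ages from the frequency table can probably be written without a loop. The sketch below uses a small made-up table standing in for xx1 (the three ages and counts are hypothetical) and checks that rep() with a vector of counts gives the same result as the loop.

```r
# Hypothetical frequency table standing in for xx1 (ages under 18 and their counts)
xx1 <- data.frame(agedata = c(3, 10, 17), Freq = c(2, 1, 3))

# Loop version, as in the post
vec_loop <- NULL
for (i in seq_len(nrow(xx1))) {
  vec_loop <- c(vec_loop, rep(xx1$agedata[i], xx1$Freq[i]))
}

# Vectorised version: rep() repeats each value by the matching count
vec_rep <- rep(xx1$agedata, times = xx1$Freq)

print(vec_rep)
print(identical(vec_loop, vec_rep))
```

With the real data this would amount to vec <- rep(xx1$agedata, times = xx1$Freq), assuming xx1 has already been filtered to ages below 18.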

importing data into R directly from a Github repository

Posted on September 4, 2013 by Healthoutcomesguy

(insert cool sound effect)… I think doing so while running the code also will vastly improve performance

I have recently started to get serious about reviewing time series forecasting. Partly, my push into the topic is nudged along by Rob J Hyndman and George Athanasopoulos’s online text Forecasting: principles and practice. In addition, there was a really nice contributed blog post on R-bloggers by Corey Chivers @ bayesianbiologist.com that took a stab at forecasting bicycle collision rates in Montreal based on 3 years of data.

Corey’s post encouraged me to explore a lot of really great stuff, but one thing it required was importing some data directly from Corey’s GitHub repository into R, so I thought I would quickly write a post on how that can be done.

Corey’s GitHub repository can be found here, and in it there is a data folder that contains all the data required to replicate his time series forecast of bicycle collisions. Corey has also graciously provided R code for the latter in the same repository.

To import the data, you will need to have the RCurl package installed.

library(RCurl)
url <- 'https://raw.github.com/cjbayesian/collisions/timeseries/data/Bike%20Accidents.csv'
bike.data <- getURL(url, ssl.verifypeer = FALSE) ## ssl.verifypeer = FALSE works around an SSL certificate error you get otherwise
d <- read.table(textConnection(bike.data), header = TRUE, sep = '|', row.names = 'id', na.strings = ' ') ## file uses bar separator
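The parsing step can be tried without a network connection. The sketch below feeds read.table() a small made-up string in the same bar-separated shape (the column names date and borough and all values are hypothetical, not the real collision data). It uses the text argument, which plays the same role as wrapping the downloaded string in textConnection().

```r
# Made-up sample in a bar-separated shape; not the real collision data
raw <- "id|date|borough\n1|2010-05-01|Plateau\n2|2010-05-02|Verdun\n3|2010-05-03|Rosemont"

# header = TRUE takes column names from the first line,
# sep = "|" splits fields on the bar character,
# row.names = "id" uses the id column as row names instead of a data column
d_demo <- read.table(text = raw, header = TRUE, sep = "|",
                     row.names = "id", stringsAsFactors = FALSE)

print(d_demo)
print(rownames(d_demo))
```

Because id becomes the row names, the resulting data frame has only the remaining two columns.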

It’s as easy as that! Now if I can only get that pesky shapefile (montreal_borough_borders.shp) to play nicely with readShapePoly()! So much to learn, so little time! 🙂

##UPDATE##

OK, so I’ve revisited this as I start to implement a workflow that incorporates Git for version control and such… As I’ve done so, I wanted to post another set of example code:

url <- "https://raw.githubusercontent.com/connerpharmd/LDLCVE/master/secprevstatin.csv"
statins <- getURL(url,  ssl.verifypeer = FALSE)                
statins<- read.csv(textConnection(statins))
statins <- statins[ which(statins$Study != "AtoZ"),]
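One detail in the last line is the which() wrapper around the comparison. A minimal sketch with a made-up data frame (the study names other than "AtoZ" and the events column are hypothetical) suggests why it can matter when the column contains missing values.

```r
# Hypothetical stand-in for the statins data
demo <- data.frame(Study = c("AtoZ", "TrialB", NA, "TrialC"),
                   events = c(10, 20, 30, 40),
                   stringsAsFactors = FALSE)

# Plain logical indexing keeps a row of NAs where the comparison is NA
keep_logical <- demo[demo$Study != "AtoZ", ]

# which() returns only the positions that are TRUE, so the NA row is dropped
keep_which <- demo[which(demo$Study != "AtoZ"), ]

print(nrow(keep_logical))
print(nrow(keep_which))
```

Whether dropping rows with a missing Study is the right call depends on the data, so it is worth checking for NAs before filtering this way.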

failuretoconverge

father, husband, son, brother, data hacker, seeking optimization
