Creating a vector of string or numeric from a factor vector using the factors levels: a transformation of sorts

tuxedo_warefolf

I’ve used this handy bit of code quite a bit. It comes in handy when I want to use something that has been coerced to type=factor. I run into this quite a bit after a table() call when I want to do something more interesting with the result (e.g., convert to a data.frame and sort or select rows by the bin name).

vec <- round(12 + rnorm(100, 5, 7), 0)
t1 <- table(vec)
df1 <- data.frame(t1)
df1
class(df1$vec)
df1$vec <- as.numeric(levels(df1$vec))[df1$vec]
df1
class(df1$vec)

Advertisements

Histogram with density plot overlay (and fancy ggplot-esque background + summary data where legend goes)

testplot_2013_09_13I recently had to visualize some data for a client that involved identifying the number of members that were under the age of 18. Thanks to some help from Robert Kabacoff’s Quick-R site, I put together a histogram with density plot overlay. This is how I did it:


set.seed(123)
agedata <- round(20 + rnorm(1000, 25, 20),0)

xrange <- "Jan-2013 to Jun-2013"
par(mar=c(5.1, 4.1, 4.1, 11.5))
x1 <- hist( agedata, breaks = 100, xlim=range(0, 100), lwd=2,yaxt='n', xaxt='n', ylab="", xlab="", main="")
hist( agedata, breaks = 100, xlim=range(0, 100), lwd=2,yaxt='n', xaxt='n', ylab="", xlab="", main="")
u <- par("usr")
rect(u[1], u[3], u[2], u[4], col = "gray88", border = "black")
abline(h = pretty(x1$counts), col='white') ## draw h lines
abline(v = (pretty(range(1,100))), col='white') ## draw v lines
par(new=T)
hist(agedata, breaks=100, xlim=range(0, 100), ylim=range(u[3:4]), col="yellow", xlab="Age", main=paste("Drug Users by Age fills from", xrange) )
xx1 <- data.frame(table(agedata)) ## I will use this next chunk of code to "color in" only those bars that are below 18years to highlight them
xx1$agedata <- as.numeric(levels(xx1$agedata))[xx1$agedata]
xx1 <- xx1[order(xx1$agedata),]
xx1 <- xx1[xx1$agedata < 18,]
vec <- NULL
for( i in 1:nrow(xx1)){ ## this is where I create my new data
run <- rep(xx1$agedata[i], xx1$Freq[i])
vec <- c(vec, run)
}
par(new=T)
hist(vec, breaks=c(x1$breaks[x1$breaks<18]) ,xlim=range(0, 100), ylim=range(u[3:4]), col="darksalmon", xlab="", main= "" ) # notice I have to use breaks from the previous hist call to ensure that the breaks for the new hist line up with the previous.
abline(v=17, lwd=2, lty=2, col="red" )
count <- nrow(genage)
avage <- round(mean(agedata),2)
medage <- round(median(agedata), 2)
sdage <- round(sd(agedata),2)
# text(xpd=TRUE, x= 35, y=35, offset=-.2, labels=paste("Count of Users:", count, "\n", "Mean Age:", avage, "\n", "Median Age:", medage, "\n", "St. Dev.:", sdage), font=2)
legend(xpd=TRUE,'topright', inset=c(-0.3,0), legend=paste(" Count of Users:", count, "\n", "Mean Age:", avage, "\n", "Median Age:", medage, "\n", "St. Dev.:", sdage), pch=NA, ,title="Summary", bty='n' , cex=1) ## you are going to have to create the strings that go into the legend youself, I used some that I had previously created
d <- density(agedata)
par(new=T)
plot(d, main="", xlim=range(0,100), xlab="", ylab="", xaxt='n', yaxt="n")

importing data into R directly from a Github repository

(insert cool sound effect)... I think doing so while running the code also will vastly improve performance

(insert cool sound effect)… I think doing so while running the code also will vastly improve performance

I have recently started to get serious about reviewing time series forecasting. Partly, my push into the topic is nudged along by Rob J Hyndman and George Athanasopoulos’s online text Forecasting: principles and practice. In addition, there are was a really nice contributed blog post on R-bloggers by Corey Chivers @ bayesianbilogist.com that took a stab at forecasting bicycle collision rates in Montreal based on 3 years of data.

There was a lot of really great stuff that Corey’s post encouraged me to explore, but one of the things that it required was to import some data directly from Corey’s github repository into R, and I thought I would quickly write a post on how that can be done.

Corey’s github repository can be found here, and in it there is a data folder that contains all the data required to replicate his time series forecast of bicycle collisions. Corey has also graciously provided R code for the later in the same repository.

To import the data, you will need to have the RCurl library installed.

library(RCurl)
url <- 'https://raw.github.com/cjbayesian/collisions/timeseries/data/Bike%20Accidents.csv'
bike.data <- getURL(url, ssl.verifypeer = FALSE) ##  ssl.verifypeer is to subverte an SSL error you get otherwise               
d <-  read.table(textConnection(bike.data), header=TRUE, sep='|', row.names='id', na.strings=' ') ##  file uses bar separator 

It’s as easy as that! Now if I can only get that pesky shapefile (montreal_borough_borders.shp) to play nicely with readShapePoly()! So much to learn, so little time! 🙂

##UPDATE##

OK, so I’ve revisited this as I start to implement a workflow that incorporates Git for version control and such… As I’ve done so, I wanted to post another set of example code:

url <- "https://raw.githubusercontent.com/connerpharmd/LDLCVE/master/secprevstatin.csv"
statins <- getURL(url,  ssl.verifypeer = FALSE)                
statins<- read.csv(textConnection(statins))
statins <- statins[ which(statins$Study != "AtoZ"),]