Sunday, June 12, 2011

Animated Plots with R



Using ImageMagick it's pretty easy to make an animated gif from a set of plots. Essentially the way to do it is to save a plot for each frame of the animation and then convert them into a .gif. Here's a simple example that plots binomial density's for two different success rates and n between 1 and 50.


#set working directory
setwd('~/Documents/R/images/')

frames = 50

for(i in 1:frames){
# creating a name for each plot file with leading zeros
if (i < 10) {name = paste('000',i,'plot.png',sep='')}
if (i < 100 && i >= 10) {name = paste('00',i,'plot.png', sep='')}
if (i >= 100) {name = paste('0', i,'plot.png', sep='')}

x = seq(0, i, 1)
f.3 = dbinom(x, size = i, prob=.3)
f.7 = dbinom(x, size = i, prob=.7)

#saves the plot as a .png file in the working directory
png(name)
plot(x, f.3, type='h', xlim = c(0,frames), ylim = c(0,.7), ylab ='probability',
main = paste('Binomial density with n = ', i), col = 'red')
lines(x,f.7,type='h',col='blue')
text(45, .6, 'p = .3', col='red')
text(45, .6, 'p = .7', col='blue', pos=1)
dev.off()
}


The important part of the code is the png function. This function saves the plot as a .png file in the working directory (there are aslo jpeg, bmp, and tiff functions that work the same way). Anything between the png and dev.off() will be saved in the plot.

After running this code there will be 50 .png files in your working directory. Now it's time to use ImageMagick. From a command line navigate to the directory where the .png files are saved and enter the following command.


$ convert *.png -delay 3 -loop 0 binom.gif


This will convert all the .png files into a animated .gif file. The -delay 3 option sets the delay between frames. The -loop 0 option makes it so the .gif repeats forever, -loop 5 would repeat the animation 5 times.


An After Thought: I made this plot just as a simple example but I think it's also a nice visualization of hypothesis testing. In the animation we can see the power of the corresponding simple versus simple hypothesis test goes to 1 as n increases. Also, I haven't tried it, but it might be neat to look at the animation with 3D glasses.

Thursday, June 9, 2011

Rejection Sampling


An interesting sampling method that was covered briefly in my Bayesian statistics course was rejection sampling. Since I have nothing better to do, I thought it would be fun to make an acceptance-rejection algorithm using R. FUN!

The Rejection Sampling method is usually used to simulate data from an unknown distribution. To do this one samples from a distribution that covers the suport of the unknown distribution and use certain criteria for accepting/rejecting the sampled values. One way to do this is as follows (Rice, p 92).

Step 1: Generate T with density m.

Step 2: Generate U, uniform on [0,1] and independent of T. If M(T)*U ≤ ƒ(T), then let X = T (accept T). Otherwise, go to Step 1 (Reject T).

Where M(x) is a function such that M(x) ≥ ƒ(x) on [a,b].

To keep things simple for myself I will be simulating a Beta distribution with parameters 6 and 3 (ƒ). To do this I will sample T's from a scaled uniform distribution (M), and reject sampled values where M(T)*U ≥ ƒ(T).

In a plot of the beta distribution with parameters 6 and 3 we can see that the ƒ(x) never goes above 3. For this reason I chose to scale the uniform distribution M by multiplying it by 3.
Here is the R code to implement rejection sampling for 100,000 observations in this example.

sample.x = runif(100000,0,1)
accept = c()

for(i in 1:length(sample.x)){
U = runif(1, 0, 1)
if(dunif(sample.x[i], 0, 1)*3*U <= dbeta(sample.x[i], 6, 3)) {
accept[i] = 'Yes'
}
else if(dunif(sample.x[i],0,1)*3*U > dbeta(sample.x[i], 6, 3)) {
accept[i] = 'No'
}
}

T = data.frame(sample.x, accept = factor(accept, levels= c('Yes','No')))

We can plot the results along with the true distribution with the following code.

hist(T[,1][T$accept=='Yes'], breaks = seq(0,1,0.01), freq = FALSE, main = 'Histogram of X', xlab = 'X')
lines(x, dbeta(x,6,3))


With 100,000 observations sampled, the data fit very well.
We can look at the densities of both the accepted and rejected values to get an idea of what's going on.

library(ggplot2)
print(qplot(sample.x, data = T, geom = 'density', color = accept))


Looking at a stacked histogram of all the sampled values together we can really see how much wasted data there are in this example.

print(qplot(sample.x, data = T, geom = 'histogram', fill = accept, binwidth=0.01))


In fact, when I ran this example I got 33,114 accepted values and 66,886 rejected values. I probably could have chosen a better value than 3 to scale the uniform distribution, but ideally rejection sampling uses a known distribution that is only slightly different from the unknown distribution we're trying to estimate.

Thursday, May 5, 2011

Accessing MySQL through R

Connecting to MySQL is made very easy with the RMySQL package. To connect to a MySQL database simply install the package and load the library.

install.packages("RMySQL")
library(RMySQL)


Connecting to MySQL:
Once the RMySQL library is installed create a database connection object.

mydb = dbConnect(MySQL(), user='user', password='password', dbname='database_name', host='host')


Listing Tables and Fields:
Now that a connection has been made we list the tables and fields in the database we connected to.

dbListTables(mydb)

This will return a list of the tables in our connection.

dbListFields(mydb, 'some_table')

This will return a list of the fields in some_table.

Running Queries:
Queries can be run using the dbSendQuery function.

dbSendQuery(mydb, 'drop table if exists some_table, some_other_table')

In my experience with this package any SQL query that will run on MySQL will run using this method.

Making tables:
We can create tables in the database using R dataframes.

dbWriteTable(mydb, name='table_name', value=data.frame.name)


Retrieving data from MySQL:
To retrieve data from the database we need to save a results set object.

rs = dbSendQuery(mydb, "select * from some_table")

I believe that the results of this query remain on the MySQL server, to access the results in R we need to use the fetch function.

data = fetch(rs, n=-1)

This saves the results of the query as a data frame object. The n in the function specifies the number of records to retrieve, using n=-1 retrieves all pending records.

Saturday, April 30, 2011

Hofstader's Chaotic Sequence

About a year ago I was reading Godel, Escher, Bach by Douglas Hofstadter. In a section on recursion he presents a sequence that he calls "A Chaotic Sequence" defined as:

Q(n) = Q(n - Q(n - 1)) + Q(n - Q(n - 2)) for n > 2
Q(1) = Q(2) =1

It's similar to a Fibunacci sequence in that it references earlier values to calculate new ones. But rather than simply adding the previous two values it uses the previous two values as an index to see how far back in the sequence to find the numbers to add. The results are interesting.

1, 1, 2, 3, 3, 4, 5, 5, 6, 6, 6, 8, 8, 8, 10, 9, 10, ...

The next number would be 11 (found by adding the 5 that's found 10 numbers back to the 6 that's found 9 numbers back).

Recreating this sequence in R is quite simple.

q = c()

q[1] = q[2] = 1

n = 3

while(n < 100){
q[n] = q[n - q[n-1]]+q[n - q[n-2]]
n = n+1
}


The first 100 values of the sequence are listed and plotted below.

> q
[1] 1 1 2 3 3 4 5 5 6 6 6 8 8 8 10 9 10 11 11 12 12
[22] 12 12 16 14 14 16 16 16 16 20 17 17 20 21 19 20 22 21 22 23 23
[43] 24 24 24 24 24 32 24 25 30 28 26 30 30 28 32 30 32 32 32 32 40
[64] 33 31 38 35 33 39 40 37 38 40 39 40 39 42 40 41 43 44 43 43 46
[85] 44 45 47 47 46 48 48 48 48 48 48 64 41 52 54

It's clear that the values are increasing.

Looking at the first 10,000 values we again see that the values are increasing with periodic increases in variance.

If we look at the first difference of the values we can see the periodic variance more pronounced.

plot(c(1:(length(q)-1)),diff(q), type='l')


Now looking at the first 150,000 values we can see that the pattern continues... it's almost like a fractal.

Tuesday, April 26, 2011

Plotting Maps with R



I stumbled upon this tutorial while not studying and I thought it would be fun to try and plot maps of the San Francisco Bay Area household income, education, population density, poverty, etc...

To do this I needed a Shapefile for the Bay Area zip codes similar to the London borough file used in the tutorial. I found a .shp file for the Bay Area zip code boundaries here.

I have a demographic database on my computer that I used to add the needed information to the Shapefile. In the code I included a vector of the median household incomes for the corresponding zipcodes in the the .shp file.

The following code is only slightly modified from the code included in the Spatial Analysis tutorial.


library(maptools)
library(RColorBrewer)
library(classInt)

## set the working directory.
setwd("/the/folder/where/bayarea_zipcode.shp/is/saved/")

## load the shapefile
zip<- readShapePoly("bayarea_zipcodes.shp")

#add the median household incomes for each zipcode to the .shp file
med.income = c(55995, 54448, 50520, 43846, 50537, 67705, 45142, 36518, NA, 65938, 50500, 61022, 65959, 17188, 53881, 66970, 35699, 76194, 63777, 63838, 66010, 48523, 70758, 41002, 60082, 48672, 61429, 60402, 75707, 58333, 60769, 60375, 76627, 68515, 64485, 60971, 60833, 54732, 60804, 33962, 57601, 43649, 59889, NA, 61494, 67824, 64429, 82528, 101555, 96658, 50300, 41573, 75747, 53750, 56905, 85479, 77455, 64389, 91283, 119832, 103791, 100590, 85109, 88184, 57153, 34951, 51418, 38613, 43640, 19750, 139997, 39290, NA, 76808, 98525, 106492, 68112, 68853, 77952, 34398, 75026, 33556, NA, 109771, 142459, 80959, 21124, 49066, 95588, 55321, 20034, 57976, 40990, 73571, 84710, 43444, 32273, 56569, 54174, 105393, 51896, 31542, 33152, 54879, 88976, 14609, 61609, 61776, 22351, 31131, 47288, NA, 63983, 60733, 29181, 75727, 61362, 53795, 76044, 34755, 66627, 37146, 92644, 87855, 95313, 50888, 55000, 57629, 54342, 77122, 44723, 64534, 65658, 60711, 57214, 54594, 48523, 90107, 69014, 49452, 72288, 56973, 81923, 61289, 71863, 61939, 0, 82735, 68067, 82188, 70026, 101977, 55112, 84442, 82777, 82796, 92989, 67152, 68121, 69350, 104958, 49279, 80973, 89016, 96677, 89572, 64256, 84565, 16250, 64839, 200001, 82072, 58304, 66807, 97758, 68721, 77539, 41313, NA, 82314, 164479, 69087, 145425, NA, 71056, 128853, 84856)
zip$INCOME = med.income

#select color palette and the number colors (levels of income) to represent on the map
colors <- brewer.pal(9, "YlOrRd")

#set breaks for the 9 colors
brks<-classIntervals(zip$INCOME, n=9, style="quantile")
brks<- brks$brks

#plot the map
plot(zip, col=colors[findInterval(zip$INCOME, brks,all.inside=TRUE)], axes=F)

#add a title
title(paste ("SF Bay Area Median Household Income"))

#add a legend
legend(x=6298809, y=2350000, legend=leglabs(round(brks)), fill=colors, bty="n",x.intersp = .5, y.intersp = .5)




This is the map that results from the code above.

Here is a visualization of the population density in the Bay Area.

Finally, here is a visualization of the percent of adults (25 years or older) with a bachelors degree or higher.


I'm not familiar with GIS software so I imagine that this is sort of a round about way of doing something pretty simple, but I think the results are cool.