Today’s lab session introduced a function buildup method:
Moving from Scripts to Functions
The lab uses the 1973 New York Air Quality Measurements (airquality
) dataset as an example.
This document is heavily-commented and step-by-step featured.
The \(function\) is to plot median lines among data points.
It contains two layers.
The first layer is to slice a vector, \(airquality\$Wind\), into \(batchSize\) pieces and calculate the \(median\) of each piece. Then, the function assigns the \(median\) of each piece to an empty vector with same structure as the original data vector (same sequence and length). The return of this \(function\) would be a vector that contains \(batchSize\) pieces of medians.
The second layer is to shift the \(batchSize\) of data points to the right, one point each, and repeat \(batchSize\) times. This is realized by adding an NA
data point at the beginning of the \(index\) vector.
To sum up, the \(function\) calls the following steps:
The syntax is:
batchSize <- 11
dim(airWind)[1]
)The syntax is:
airWind <- airquality$Wind
The syntax is:
leng <- length(airWind)
Update: 10/08/2019 Step 2 and 3 can be combined with this line: leng <- length(airquality$Wind)
The syntax is index <- rep(1:leng, each = batchSize)
– For example:
if \(batchSize = 5\), we will get:
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6
…
[760] 152 153 153 153 153 153
Since the index is used for indexing the data points in our target dataset, there is no need to list the index vector all the way to the end. We only need a portion of the index which is equal to \(leng\) (the length of the target dataset). We can use [1: leng]
to select the values we want.
The updated syntax is:
index <- rep(1:leng, each = batchSize)[1: leng]
The syntax is:
medResults <- vector("list", length = batchSize)
tapply
function. The tapply
function computes a measure (mean, median, min, max, etc..) or a function for each factor variable in a vector. In this case, we will calculate the median of the \(airWind\) in a way as specified by the \(index\). The syntax is: tapply(airWind, INDEX = index, FUN = "median")
.– For example:
If \(batchSize\) = 5, then the \(index\) looks like this:
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6 … 153 153 153 153 153
The tapply
function will first find values in \(airWind\) with orders equal to 1 (the first five values). Then the function will calculate the median of these five values. Then, the function moves to the second five and so on.
Each time we perform the tapply
, we need to store the result to the preset list \(medResults\). We use double brackets to index the list set.
– For example:
To store the first tapply
result to the first list in \(medResults\), we can use: medResults[[1]] <- tapply(airWind, INDEX = index, FUN = "median")
.
for loop
to automate the process. Inside the loop, we can add an NA
value at the beginning to the \(index\) each time to simulate the shift. (tapply
does not process NA
values in INDEX
.)Like this: ([n] represents repeat times, \(batchSize\) = 5)
[1] 1 1 1 1 1 2 2 2 2 2 …
[2] NA 1 1 1 1 1 2 2 2 2 2 …
[3] NA NA 1 1 1 1 1 2 2 2 2 2 …
[4] NA NA NA 1 1 1 1 1 2 2 2 2 2 …
[5] NA NA NA NA 1 1 1 1 1 2 2 2 2 2 …
The tapply
kicks in after NA
values. Thus, at the 5th loop, the tapply
will calculate the median of [5:9] instead of [1:5].
As NA
values being added at the beginning, the length of \(index\) will increase each time by 1. So we can add [1:leng]
at the end to control the length.
The for loop syntax is:
for (i in 1:batchSize) {
medResults[[i]] <- tapply(airWind, INDEX = index, FUN = "median")
index <- c(NA, index)[1:leng]
}
Inside the loop, we create a local variable called \(i\). \(i\) is used for counting loops. The syntax (i in 1: batchSize)
means we want to perform the loop with \(batchSize\) times.
\(i\) is also used for indexing the list layers in the \(medResults\). Each time, the loop will run the tapply
function and generate a vector of median results. Then, the loop assigns the result to the \(i\)th list in \(medResults\). In this way, \(medResults\) will contain all results from the loop, layer by layer, in different indexed lists.
for loop
for iteration.First, create a matrix called \(smoothMatrix\) that is filled up with NA
values. To match the dimension of \(medResults\), we set the number of row to be equal to \(\frac{leng}{batchSize}\). We use ceiling
function to round the value to the next upper integer. The number of columns should be equal to \(batchSize\). Then, we use a for loop to export each list inside the \(medResults\) to the \(smoothMatrix\).
The syntax is:
smoothMatrix <- matrix(NA, nrow = ceiling(leng/batchSize), ncol = batchSize)
for (i in 1:batchSize) {
vals <- medResults[[i]]
smoothMatrix[1:length(vals), i] <- vals
}
# The local variable, vals, inside the loop is just a bridge and has no
# global meanings.
smoothMatrix
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] 11.5 11.5 11.5 10.9 10.90 10.9 10.90 11.5 11.50 10.9 10.9
## [2,] 11.5 11.5 11.5 12.0 12.00 12.0 12.00 12.0 12.00 12.0 12.0
## [3,] 9.7 12.0 9.7 9.2 9.20 9.7 9.20 9.2 9.70 9.7 9.7
## [4,] 9.7 9.7 10.9 11.5 11.50 11.5 11.50 11.5 10.90 10.3 10.3
## [5,] 10.3 9.2 8.0 8.0 8.00 8.0 8.00 8.0 8.00 8.0 9.2
## [6,] 9.2 9.2 9.2 9.2 8.00 7.4 7.40 8.6 8.60 8.6 8.6
## [7,] 8.6 8.6 8.6 8.6 10.30 10.3 10.30 10.3 9.70 8.6 8.6
## [8,] 8.6 8.6 8.6 8.6 8.60 8.6 8.60 8.0 7.40 7.4 7.4
## [9,] 7.4 7.4 7.4 7.4 7.40 8.0 8.00 8.6 9.70 10.3 10.3
## [10,] 10.3 10.3 10.3 10.9 10.90 10.9 10.30 10.3 10.30 9.7 9.7
## [11,] 9.7 9.7 8.0 6.9 6.30 6.3 5.70 6.3 6.30 6.3 6.3
## [12,] 6.9 7.4 9.7 10.3 10.30 10.9 10.90 10.9 10.90 10.9 10.9
## [13,] 10.3 10.9 10.3 10.3 10.30 10.3 10.30 10.3 10.30 10.3 10.3
## [14,] 10.9 10.3 10.9 11.5 12.35 11.5 12.35 11.5 9.75 11.5 NA
We want to transfer the matrix to a vector with result values in correct order. To realize this, we use t
function to transpose the matrix and use as.vector
to switch the data type to vector. For convenience, we assign a name \(smoothAirWind\) to it.
The syntax is:
smoothAirWind <- as.vector(t(smoothMatrix))
The syntax is:
plot (airWind, pch = 16, col = 8)
# pch - point character style
# col - color
# lwd - line width
lines(smoothAirWind, lwd = 3, col = "blue")
myRunMeds
.It has two variables:
\(x\) - the target data (as a vector)
\(batchSize\) - the slice size, set dafault to 5
The syntax is:
myRunMeds <- function(x, batchSize = 5) {
leng <- length(x)
index <- rep(1:leng, each = batchSize)[1:leng]
medResults <- vector("list", length = batchSize)
for (i in 1:batchSize) {
medResults[[i]] <- tapply(x, INDEX = index, FUN = "median")
index <- c(NA, index)[1:leng]
}
smoothMatrix <- matrix(NA, nrow = ceiling(leng/batchSize), ncol = batchSize)
for (i in 1:batchSize) {
vals <- medResults[[i]]
smoothMatrix[1:length(vals), i] <- vals
}
return(as.vector(t(smoothMatrix)))
}
plot(airquality$Wind, pch = 16, col = 8, ann = FALSE)
title(main = "Wind Distribution of NYC Air Quality in 1973 with Median Lines",
xlab = "Wind Speed", cex = 0.8)
# pch - point character
# col - color
# ann - default annotation (title, x-label, y-label), omit if set to FALSE
# xlab - x-label
# cex - character expansion ratio
lines(myRunMeds(airquality$Wind, batchSize = 3), col = "red", lwd = 2)
lines(myRunMeds(airquality$Wind, batchSize = 9), col = "green")
lines(myRunMeds(airquality$Wind, batchSize = 15), col = "blue")
lines(myRunMeds(airquality$Wind, batchSize = 25), col = "black", lwd = 2)
As \(batchSize\) increases, the median line will be smoother. This is because more values are involved in the calculation of median, causing much less fluctuation.
The following code is commented by professor on week 6 class lecture.
myRunMeds <- function(x, batchSize = 5) {
# myRunMeds: take medians of subsets of y of size 'batchSize' (sequentially)
# and return. vector of those medians. x: series of points. batchSize:
# window or subset size.
leng <- length(x)
# need length of x to ensure we don't reference outside the length of the
# original input.
index <- rep(1:leng, each = batchSize)[1:leng]
# set up batch lables for `tapply`.
medResults <- vector("list", length = batchSize)
# allow for different number of medians in each iteration.
for (i in 1:batchSize) {
# each iteration builds 1/batchSize of the output
medResults[[i]] <- tapply(x, INDEX = index, FUN = "median")
# generate a set of medians based on batch names given by index
index <- c(NA, index)[1:leng]
# shift bach names given by index one to the right
}
# fill columns of new matrix with bathes of medians so that we can assign
# each batch to the position of the first element of the batch. PUT MEDIANS
# IN PROPER PLACE OUTPUT
smoothMatrix <- matrix(NA, nrow = ceiling(leng/batchSize), ncol = batchSize)
for (i in 1:batchSize) {
vals <- medResults[[i]]
smoothMatrix[1:length(vals), i] <- vals
}
return(as.vector(t(smoothMatrix))[1:leng])
# need to transpose to weave together the proper medians in the proper
# position when `as.vector` unwinds the matrix or TRANSPOSE TO GET POSITION
# RIGHT.
}
myRunMeds <- function(x, batchSize = 5) {
# myRunMeds: take medians of subsets of y of size 'batchSize' (sequentially)
# and return. vector of those medians. x: series of points. batchSize:
# window or subset size.
leng <- length(x)
# need length of x to ensure we don't reference outside the length of the
# original input.
index <- rep(1:leng, each = batchSize)[1:leng]
# set up batch lables for `tapply`.
medResults <- vector("list", length = batchSize)
# allow for different number of medians in each iteration.
for (i in 1:batchSize) {
# each iteration builds 1/batchSize of the output
medResults[[i]] <- tapply(x, INDEX = index, FUN = "median")
# generate a set of medians based on batch names given by index
index <- c(NA, index)[1:leng]
# shift bach names given by index one to the right
}
# fill columns of new matrix with bathes of medians so that we can assign
# each batch to the position of the first element of the batch. PUT MEDIANS
# IN PROPER PLACE OUTPUT
smoothMatrix <- matrix(NA, nrow = ceiling(leng/batchSize), ncol = batchSize)
for (i in 1:batchSize) {
vals <- medResults[[i]]
smoothMatrix[1:length(vals), i] <- vals
}
return(as.vector(t(smoothMatrix))[1:leng])
# need to transpose to weave together the proper medians in the proper
# position when `as.vector` unwinds the matrix or TRANSPOSE TO GET POSITION
# RIGHT.
}
Below is a more advanced code, in the sense that it uses lots of complex manipulations of strings, and unusual ways to assign to an array. It’s not particularly elegant, but you could learn a lot from trying to “parse” it. For those with more knowledge of R’s vector/list manipulating features as well as those who want a “puzzle”:
# names(medResults) <- 1: batchSize vals <- unlist(medResults[[i]])
# smoothMatrix <- matrix(NA, ncol = ceiling(leng/batchSize), nrow =
# batchSize) assignLocation <-
# matrix(as.numeric(unlist(strsplit(names(vals), '\\.'))), ncol = 2, byrow
# = T) smoothMatrix[assignLocation] <- vals as.numeric(smoothMatrix)[1: n]
There is also a way to avoid the matrix altogether, since you have the proper row/col to assign to right there in the assignLocation line, and that can be converted to a location in the proper permutation of 1:153
You could also assign a matrix for the results in the FIRST LOOP and avoid the list structure.
Here is a way to do it all in one loop, using the idea that you can construct the proper location to assign the result if you think about how the terms in each batch are created and where they “belong”:
# leng <- length(y) index <- rep(1: leng, each = batchSize)[1: leng]
# medResults <- rep(NA, leng) #where you store the answer for (i in
# 1:batchSize) { runMeds <- tapply(y, index, FUN = 'median') batchLength <-
# length(runMeds) # to handle end condition properLocations <- i + batchSize
# * (0: (batchLength- 1)) medResults[properLocations] <- runMeds index <-
# c(NA, index)[1: leng] }
One of my classmates provided an updated version of the code. It improves performance by avoiding transposing the matrix. Here is the code:
Credit to Jingya Wang
smoothMatrix <- matrix(NA, nrow = batchSize, ncol = ceiling(leng/batchSize))
# switching contents in rows and columns simplifies the code
for (i in 1:batchSize) {
vals <- medResults[[i]]
smoothMatrix[i, 1:length(vals)] <- vals
# switch contents
}