Today’s lab session introduced a function buildup method:

Moving from Scripts to Functions

The lab uses the 1973 New York Air Quality Measurements (airquality) dataset as an example.

This document is heavily-commented and step-by-step featured.

Function

The \(function\) is to plot median lines among data points.

It contains two layers.

The first layer is to slice a vector, \(airquality\$Wind\), into \(batchSize\) pieces and calculate the \(median\) of each piece. Then, the function assigns the \(median\) of each piece to an empty vector with same structure as the original data vector (same sequence and length). The return of this \(function\) would be a vector that contains \(batchSize\) pieces of medians.

The second layer is to shift the \(batchSize\) of data points to the right, one point each, and repeat \(batchSize\) times. This is realized by adding an NA data point at the beginning of the \(index\) vector.

Details

To sum up, the \(function\) calls the following steps:

  1. Create a variable called \(batchSize\). It determines how large each slice is. We can create plot lines with different slice size through changing the value of this variable. In this case, we preset the value to \(11\).

The syntax is:

batchSize <- 11
  1. Create a variable called \(airWind\) and assign the \(Wind\) vector in the \(airquality\) dataset to it, so that we don’t need to type \(airquality\$Wind\) every time we need this vector. THe \(airWind\) variable has 153 data. (dim(airWind)[1])

The syntax is:

airWind <- airquality$Wind
  1. Assign the length of \(airWind\) to a new variable called \(leng\). We need this to ensure we don’t reference outside the length of the original input.

The syntax is:

leng <- length(airWind)

Update: 10/08/2019 Step 2 and 3 can be combined with this line: leng <- length(airquality$Wind)

  1. Create an \(index\) vector with the length equal to the length of target data. The content of the index will be a repeated vector with 153 original values, each being repeated by \(batchSize\) times.

The syntax is index <- rep(1:leng, each = batchSize)

– For example:

if \(batchSize = 5\), we will get:

[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6

[760] 152 153 153 153 153 153

Since the index is used for indexing the data points in our target dataset, there is no need to list the index vector all the way to the end. We only need a portion of the index which is equal to \(leng\) (the length of the target dataset). We can use [1: leng] to select the values we want.

The updated syntax is:

index <- rep(1:leng, each = batchSize)[1: leng]
  1. Create a list vector called \(medResults\) and set the number of lists contained to \(batchSize\). This list set is used to store the calculation results performed in the second layer of the \(function\).

The syntax is:

medResults <- vector("list", length = batchSize)
  1. We now need to perform the first layer by running a tapply function. The tapply function computes a measure (mean, median, min, max, etc..) or a function for each factor variable in a vector. In this case, we will calculate the median of the \(airWind\) in a way as specified by the \(index\). The syntax is: tapply(airWind, INDEX = index, FUN = "median").

– For example:

If \(batchSize\) = 5, then the \(index\) looks like this:

[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6 … 153 153 153 153 153

The tapply function will first find values in \(airWind\) with orders equal to 1 (the first five values). Then the function will calculate the median of these five values. Then, the function moves to the second five and so on.

Each time we perform the tapply, we need to store the result to the preset list \(medResults\). We use double brackets to index the list set.

– For example:

To store the first tapply result to the first list in \(medResults\), we can use: medResults[[1]] <- tapply(airWind, INDEX = index, FUN = "median").

  1. As we calculate the medians, we want to shift the slices to the right to include a new data point and dump one each time so that we can plot the trend of medians as the data spread. To save the efforts of calculating medians repeatedly, we use a for loop to automate the process. Inside the loop, we can add an NA value at the beginning to the \(index\) each time to simulate the shift. (tapply does not process NA values in INDEX.)

Like this: ([n] represents repeat times, \(batchSize\) = 5)

[1] 1 1 1 1 1 2 2 2 2 2 …

[2] NA 1 1 1 1 1 2 2 2 2 2 …

[3] NA NA 1 1 1 1 1 2 2 2 2 2 …

[4] NA NA NA 1 1 1 1 1 2 2 2 2 2 …

[5] NA NA NA NA 1 1 1 1 1 2 2 2 2 2 …

The tapply kicks in after NA values. Thus, at the 5th loop, the tapply will calculate the median of [5:9] instead of [1:5].

As NA values being added at the beginning, the length of \(index\) will increase each time by 1. So we can add [1:leng] at the end to control the length.

The for loop syntax is:

for (i in 1:batchSize) {
    medResults[[i]] <- tapply(airWind, INDEX = index, FUN = "median")
    index <- c(NA, index)[1:leng]
}

Inside the loop, we create a local variable called \(i\). \(i\) is used for counting loops. The syntax (i in 1: batchSize) means we want to perform the loop with \(batchSize\) times.

\(i\) is also used for indexing the list layers in the \(medResults\). Each time, the loop will run the tapply function and generate a vector of median results. Then, the loop assigns the result to the \(i\)th list in \(medResults\). In this way, \(medResults\) will contain all results from the loop, layer by layer, in different indexed lists.

  1. As we fill up the \(medResults\), we now have \(batchSize\) lists of the median results. The length of each list is equal to \(\frac{leng}{batchSize}\). Since \(medResults\) is a list set, we need to export the values inside to a vector in order to plot a median line. Therefore, we need to create another for loop for iteration.

First, create a matrix called \(smoothMatrix\) that is filled up with NA values. To match the dimension of \(medResults\), we set the number of row to be equal to \(\frac{leng}{batchSize}\). We use ceiling function to round the value to the next upper integer. The number of columns should be equal to \(batchSize\). Then, we use a for loop to export each list inside the \(medResults\) to the \(smoothMatrix\).

The syntax is:

smoothMatrix <- matrix(NA, nrow = ceiling(leng/batchSize), ncol = batchSize)
for (i in 1:batchSize) {
    vals <- medResults[[i]]
    smoothMatrix[1:length(vals), i] <- vals
}
# The local variable, vals, inside the loop is just a bridge and has no
# global meanings.
  1. Now we have a matrix, \(smoothMatrix\), that contains median results by column.
smoothMatrix
##       [,1] [,2] [,3] [,4]  [,5] [,6]  [,7] [,8]  [,9] [,10] [,11]
##  [1,] 11.5 11.5 11.5 10.9 10.90 10.9 10.90 11.5 11.50  10.9  10.9
##  [2,] 11.5 11.5 11.5 12.0 12.00 12.0 12.00 12.0 12.00  12.0  12.0
##  [3,]  9.7 12.0  9.7  9.2  9.20  9.7  9.20  9.2  9.70   9.7   9.7
##  [4,]  9.7  9.7 10.9 11.5 11.50 11.5 11.50 11.5 10.90  10.3  10.3
##  [5,] 10.3  9.2  8.0  8.0  8.00  8.0  8.00  8.0  8.00   8.0   9.2
##  [6,]  9.2  9.2  9.2  9.2  8.00  7.4  7.40  8.6  8.60   8.6   8.6
##  [7,]  8.6  8.6  8.6  8.6 10.30 10.3 10.30 10.3  9.70   8.6   8.6
##  [8,]  8.6  8.6  8.6  8.6  8.60  8.6  8.60  8.0  7.40   7.4   7.4
##  [9,]  7.4  7.4  7.4  7.4  7.40  8.0  8.00  8.6  9.70  10.3  10.3
## [10,] 10.3 10.3 10.3 10.9 10.90 10.9 10.30 10.3 10.30   9.7   9.7
## [11,]  9.7  9.7  8.0  6.9  6.30  6.3  5.70  6.3  6.30   6.3   6.3
## [12,]  6.9  7.4  9.7 10.3 10.30 10.9 10.90 10.9 10.90  10.9  10.9
## [13,] 10.3 10.9 10.3 10.3 10.30 10.3 10.30 10.3 10.30  10.3  10.3
## [14,] 10.9 10.3 10.9 11.5 12.35 11.5 12.35 11.5  9.75  11.5    NA

We want to transfer the matrix to a vector with result values in correct order. To realize this, we use t function to transpose the matrix and use as.vector to switch the data type to vector. For convenience, we assign a name \(smoothAirWind\) to it.

The syntax is:

smoothAirWind <- as.vector(t(smoothMatrix))
  1. Finally, we can plot a median line with the vector.

The syntax is:

plot (airWind,       pch = 16, col = 8)
# pch - point character style

# col - color

# lwd - line width
lines(smoothAirWind, lwd = 3,  col = "blue")

  1. As all scripts above are functional, we can pack them into a function: myRunMeds.

It has two variables:

\(x\) - the target data (as a vector)

\(batchSize\) - the slice size, set dafault to 5

The syntax is:

myRunMeds <- function(x, batchSize = 5) {
    leng <- length(x)
    index <- rep(1:leng, each = batchSize)[1:leng]
    medResults <- vector("list", length = batchSize)
    
    for (i in 1:batchSize) {
        medResults[[i]] <- tapply(x, INDEX = index, FUN = "median")
        index <- c(NA, index)[1:leng]
    }
    
    smoothMatrix <- matrix(NA, nrow = ceiling(leng/batchSize), ncol = batchSize)
    
    for (i in 1:batchSize) {
        vals <- medResults[[i]]
        smoothMatrix[1:length(vals), i] <- vals
    }
    
    return(as.vector(t(smoothMatrix)))
}
  1. We can now test the function.
plot(airquality$Wind, pch = 16, col = 8, ann = FALSE)
title(main = "Wind Distribution of NYC Air Quality in 1973 with Median Lines", 
    xlab = "Wind Speed", cex = 0.8)
# pch - point character

# col - color

# ann - default annotation (title, x-label, y-label), omit if set to FALSE

# xlab - x-label

# cex - character expansion ratio

lines(myRunMeds(airquality$Wind, batchSize = 3), col = "red", lwd = 2)
lines(myRunMeds(airquality$Wind, batchSize = 9), col = "green")
lines(myRunMeds(airquality$Wind, batchSize = 15), col = "blue")
lines(myRunMeds(airquality$Wind, batchSize = 25), col = "black", lwd = 2)

As \(batchSize\) increases, the median line will be smoother. This is because more values are involved in the calculation of median, causing much less fluctuation.

Updates

Date: 10/08/2019

The following code is commented by professor on week 6 class lecture.

myRunMeds <- function(x, batchSize = 5) {
    # myRunMeds: take medians of subsets of y of size 'batchSize' (sequentially)
    # and return.  vector of those medians.  x: series of points.  batchSize:
    # window or subset size.
    
    leng <- length(x)
    # need length of x to ensure we don't reference outside the length of the
    # original input.
    
    index <- rep(1:leng, each = batchSize)[1:leng]
    # set up batch lables for `tapply`.
    
    medResults <- vector("list", length = batchSize)
    # allow for different number of medians in each iteration.
    
    for (i in 1:batchSize) {
        # each iteration builds 1/batchSize of the output
        
        medResults[[i]] <- tapply(x, INDEX = index, FUN = "median")
        # generate a set of medians based on batch names given by index
        
        index <- c(NA, index)[1:leng]
        # shift bach names given by index one to the right
    }
    
    # fill columns of new matrix with bathes of medians so that we can assign
    # each batch to the position of the first element of the batch.  PUT MEDIANS
    # IN PROPER PLACE OUTPUT
    smoothMatrix <- matrix(NA, nrow = ceiling(leng/batchSize), ncol = batchSize)
    
    for (i in 1:batchSize) {
        vals <- medResults[[i]]
        smoothMatrix[1:length(vals), i] <- vals
    }
    
    return(as.vector(t(smoothMatrix))[1:leng])
    # need to transpose to weave together the proper medians in the proper
    # position when `as.vector` unwinds the matrix or TRANSPOSE TO GET POSITION
    # RIGHT.
}

Updates

Date: 10/12/2019

myRunMeds <- function(x, batchSize = 5) {
    # myRunMeds: take medians of subsets of y of size 'batchSize' (sequentially)
    # and return.  vector of those medians.  x: series of points.  batchSize:
    # window or subset size.
    
    leng <- length(x)
    # need length of x to ensure we don't reference outside the length of the
    # original input.
    
    index <- rep(1:leng, each = batchSize)[1:leng]
    # set up batch lables for `tapply`.
    
    medResults <- vector("list", length = batchSize)
    # allow for different number of medians in each iteration.
    
    for (i in 1:batchSize) {
        # each iteration builds 1/batchSize of the output
        
        medResults[[i]] <- tapply(x, INDEX = index, FUN = "median")
        # generate a set of medians based on batch names given by index
        
        index <- c(NA, index)[1:leng]
        # shift bach names given by index one to the right
    }
    
    # fill columns of new matrix with bathes of medians so that we can assign
    # each batch to the position of the first element of the batch.  PUT MEDIANS
    # IN PROPER PLACE OUTPUT
    smoothMatrix <- matrix(NA, nrow = ceiling(leng/batchSize), ncol = batchSize)
    
    for (i in 1:batchSize) {
        vals <- medResults[[i]]
        smoothMatrix[1:length(vals), i] <- vals
    }
    
    return(as.vector(t(smoothMatrix))[1:leng])
    # need to transpose to weave together the proper medians in the proper
    # position when `as.vector` unwinds the matrix or TRANSPOSE TO GET POSITION
    # RIGHT.
}

Below is a more advanced code, in the sense that it uses lots of complex manipulations of strings, and unusual ways to assign to an array. It’s not particularly elegant, but you could learn a lot from trying to “parse” it. For those with more knowledge of R’s vector/list manipulating features as well as those who want a “puzzle”:

# names(medResults) <- 1: batchSize vals <- unlist(medResults[[i]])
# smoothMatrix <- matrix(NA, ncol = ceiling(leng/batchSize), nrow =
# batchSize) assignLocation <-
# matrix(as.numeric(unlist(strsplit(names(vals), '\\.'))), ncol = 2, byrow
# = T) smoothMatrix[assignLocation] <- vals as.numeric(smoothMatrix)[1: n]

There is also a way to avoid the matrix altogether, since you have the proper row/col to assign to right there in the assignLocation line, and that can be converted to a location in the proper permutation of 1:153

You could also assign a matrix for the results in the FIRST LOOP and avoid the list structure.

Here is a way to do it all in one loop, using the idea that you can construct the proper location to assign the result if you think about how the terms in each batch are created and where they “belong”:

# leng <- length(y) index <- rep(1: leng, each = batchSize)[1: leng]
# medResults <- rep(NA, leng) #where you store the answer for (i in
# 1:batchSize) { runMeds <- tapply(y, index, FUN = 'median') batchLength <-
# length(runMeds) # to handle end condition properLocations <- i + batchSize
# * (0: (batchLength- 1)) medResults[properLocations] <- runMeds index <-
# c(NA, index)[1: leng] }

Updates

Date: 10/22/2019

One of my classmates provided an updated version of the code. It improves performance by avoiding transposing the matrix. Here is the code:

Credit to Jingya Wang

smoothMatrix <- matrix(NA, nrow = batchSize, ncol = ceiling(leng/batchSize))
# switching contents in rows and columns simplifies the code

for (i in 1:batchSize) {
    vals <- medResults[[i]]
    smoothMatrix[i, 1:length(vals)] <- vals
    # switch contents
}