Statistics 385 - Statistical Programming Methods

STAT 385: One Course to Help Synthesize My Experiences with R

The Course

University Course Catalog Course Description:

Statisticians must be savvy in programming methods useful to the wide variety of analysis that they will be expected to perform. This course provides the foundation for writing and packaging statistical algorithms through the creation of functions and object oriented programming. Fundamental programming techniques and considerations will be emphasized. Students will also create dynamic reports that encapsulate their implemented algorithms.

This course is instructed by the fantastic Ha Khanh Nguyen, visiting instructor and a data science support specialist for the Department of Statistics.

My History with R

I decided to take STAT 385 in my final semester at Illinois to help solidify my knowledge and abilities with R. My experience with R started in PS 230 with Professor Jake Bowers. We were introduced to a lot of things in R, and took a deep dive very quickly into data wrangling, manipulation, and regression. Frankly, I didn’t learn a ton. I was confused most of the semester, but I loved what I was experiencing. Growing up with a software architect for a father, I sort of “rebelled” by deciding to pursue a liberal arts degree, but the second I was introduced to coding I was hooked.

That summer, I ended up working for Professor Bowers and Professor Cara Wong, assisting with data wrangling and modeling. Fortunately they were very patient with me, because I really was in over my head. I Googled a lot of code and did the best I could. I was soon hired by the Cline Center for Advanced Social Research and have been working there on quantitative social science ever since (read more about my experiences at the Cline Center here). After Googling enough code to write a few papers, I decided I needed to more firmly ground myself in the methodology. After taking a few calculus and statistics courses, I felt ready to ground myself in R, which led me to STAT 385. In the Fall of 2018 I took IS 457, which blew past all of the general information and went quickly to modeling, visualization, and some more advanced features of R. This experience showed me that I needed a bit more time with the basics.

What I’ve Learned

This course has been great for showing me the proper way to do a lot of basic things in R, and reinforcing the concepts so I don’t feel like I have to google simple things like subsetting and data manipulation!

Important note: this blog post is NOT meant to be a comprehensive introduction to R. This is simply a brief synopsis of what I’ve learned during the first half of the semester in STAT 385. There are a bunch of great resources for learning R, such as this book by RStudio educator Garrett Grolemund. For a comprehensive collection of my lecture notes for this course, please visit my STAT 385 Github Repository.

Data Types in R

I am going to gloss over information regarding the R user interface and basic interaction. This was information I was already familiar with, and plenty of tutorials can be found online. After that introduction, one of the first things we covered was the data types that R works with. A lot of this information I already knew, but it was a good refresher having it all laid out again. We covered:

Vectors

Atomic vectors store only one type of data, and the function c() combines values into a vector.

# Double Vectors
die <- c(1, 2, 3, 4, 5, 6)
die
## [1] 1 2 3 4 5 6

typeof(die)
## [1] "double"

# Any number written without the L suffix is stored as a double vector of length 1
typeof(4.5)
## [1] "double"

# Integer vectors denoted by 'L'
my_integer <- c(-1L, 2L, 4L)
my_integer
## [1] -1  2  4

typeof(my_integer)
## [1] "integer"

# Character Vectors
text <- c("Hello", 'World')
text
## [1] "Hello" "World"

typeof(text)
## [1] "character"

# Logical Vectors
my_logic <- c(TRUE, TRUE, FALSE)
my_logic
## [1]  TRUE  TRUE FALSE

typeof(my_logic)
## [1] "logical"

# Complex Vectors
my_complex <- complex(real = 1, imaginary = 2)
my_complex
## [1] 1+2i

typeof(my_complex)
## [1] "complex"

# Raw Vectors (bytes of data)
my_raw <- raw(length = 3)
my_raw
## [1] 00 00 00

typeof(my_raw)
## [1] "raw"
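
Because an atomic vector can hold only one type, c() coerces mixed inputs to a single common type rather than raising an error. This was not one of the examples above; it is a small sketch of base R's usual coercion order (logical, then integer, then double, then character), and the variable names are made up for illustration.

```r
# Mixing types in c() coerces everything to a single common type
mixed_num <- c(1L, 2.5)         # integer + double -> double
typeof(mixed_num)
## [1] "double"

mixed_lgl <- c(TRUE, 2L)        # logical + integer -> integer
mixed_lgl
## [1] 1 2

mixed_chr <- c(1, "one", TRUE)  # anything + character -> character
typeof(mixed_chr)
## [1] "character"
```

This is worth keeping in mind when a numeric column unexpectedly turns into text: a single stray string is enough to change the type of the whole vector.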

We also covered indexing items from a vector using [ ]. Indexing starts at 1.

die <- c(1, 2, 3, 4, 5, 6)

die[1]
## [1] 1
die[4]
## [1] 4
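
Beyond a single position, [ ] also accepts a vector of positions, negative positions, or a logical test. The following is a brief sketch of those standard base R forms using the same die vector.

```r
die <- c(1, 2, 3, 4, 5, 6)

# Several positions at once
die[c(1, 3)]
## [1] 1 3

# Negative indices drop elements instead of selecting them
die[-1]
## [1] 2 3 4 5 6

# A logical vector keeps the positions where the test is TRUE
die[die > 4]
## [1] 5 6
```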

We also talked about lists

Lists can group R objects, such as vectors and other lists, into a one-dimensional structure! We talked about creating lists using list().

my_list <- list(1, "one", TRUE)
my_list
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "one"
## 
## [[3]]
## [1] TRUE

big_list <- list(1:12, "2020", list("january", 2:8))
big_list
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
## 
## [[2]]
## [1] "2020"
## 
## [[3]]
## [[3]][[1]]
## [1] "january"
## 
## [[3]][[2]]
## [1] 2 3 4 5 6 7 8

We also talked about how to index lists using [ ], which returns a list, and [[ ]], which returns the element itself (for example, a vector), and how these can be combined to access any item in a list.

big_list[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12

big_list[1]
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12

big_list[[3]][[1]]
## [1] "january"
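
To make the difference between [ ] and [[ ]] concrete, here is a short sketch that checks what each one returns and then drills into the nested list. The named list at the end is a hypothetical addition, included only to show the $ shortcut.

```r
big_list <- list(1:12, "2020", list("january", 2:8))

# [ ] keeps the list wrapper; [[ ]] pulls out the element itself
is.list(big_list[1])
## [1] TRUE
is.list(big_list[[1]])
## [1] FALSE

# Chain [[ ]] and [ ] to reach a single value inside a nested list
big_list[[3]][[2]][1]
## [1] 2

# Named elements (these names are hypothetical) can also be reached with $
named_list <- list(months = 1:12, year = "2020")
named_list$year
## [1] "2020"
```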

We talked briefly about matrices and arrays

Matrices store values of the same data type in a two-dimensional array, and arrays are n-dimensional. These are created with the functions matrix() and array(). For matrix(), the layout can be specified using nrow =, ncol =, and byrow = TRUE. We also talked about indexing matrices and arrays using [ , ], with one index per dimension. For a matrix, the first index is the row and the second is the column. Leave an index blank if you want every item along that dimension.

m <- matrix(data = die, nrow = 2, byrow = TRUE)
m
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

m[1, 2]
## [1] 2

m[2, ]
## [1] 4 5 6

my_array <- array(data = 1:12, dim = c(2, 2, 3))
my_array
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
## 
## , , 3
## 
##      [,1] [,2]
## [1,]    9   11
## [2,]   10   12

my_array[ , 1, 1]
## [1] 1 2
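
The byrow = TRUE argument matters because matrix() fills column by column by default. A minimal sketch of the default behavior, using the same die values (the name m_col is made up for this example):

```r
die <- c(1, 2, 3, 4, 5, 6)

# Without byrow = TRUE, matrix() fills column by column
m_col <- matrix(data = die, nrow = 2)
m_col
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# dim() reports the number of rows, then columns
dim(m_col)
## [1] 2 3

# Leaving the row index blank returns the whole second column
m_col[, 2]
## [1] 3 4
```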

Data Frames

A data frame is a list of vectors of the same length. We discussed how to make data frames using data.frame(), and how to get information about a data frame using str(). We also discussed reading data using read.csv() to make data frames from .csv files, either on our machines or online.

cards <- data.frame(face = c("ace", "two", "six"),
                    suit = c("clubs", "clubs", "clubs"),
                    value = c(1, 2, 3))
cards
##   face  suit value
## 1  ace clubs     1
## 2  two clubs     2
## 3  six clubs     3

str(cards)
## 'data.frame':    3 obs. of  3 variables:
##  $ face : Factor w/ 3 levels "ace","six","two": 1 3 2
##  $ suit : Factor w/ 1 level "clubs": 1 1 1
##  $ value: num  1 2 3

gpa <- read.csv("https://raw.githubusercontent.com/wadefagen/datasets/master/gpa/uiuc-gpa-dataset.csv")
head(gpa)
##   Year   Term YearTerm Subject Number                 Course.Title A.  A A..1
## 1 2019 Spring  2019-sp     AAS    100 Intro Asian American Studies  2 13    4
## 2 2019 Spring  2019-sp     AAS    100 Intro Asian American Studies  1  5    9
## 3 2019 Spring  2019-sp     AAS    100 Intro Asian American Studies  0  7    9
## 4 2019 Spring  2019-sp     AAS    100 Intro Asian American Studies  0  8    7
## 5 2019 Spring  2019-sp     AAS    100 Intro Asian American Studies  0  7    4
## 6 2019 Spring  2019-sp     AAS    100 Intro Asian American Studies  1  7    3
##   B. B B..1 C. C C..1 D. D D..1 F W  Primary.Instructor
## 1  3 3    2  0 3    1  0 0    0 2 0 Espiritu, Augusto F
## 2  1 3    1  0 0    1  0 0    0 1 0     Thomas, Merin A
## 3  3 2    3  0 0    0  0 0    0 2 0     Thomas, Merin A
## 4  1 4    5  1 0    1  1 0    0 0 1         Lee, Sang S
## 5  2 5    5  0 2    0  1 0    0 2 0         Lee, Sang S
## 6  1 3    2  3 3    0  0 0    1 0 0      Kang, Yoonjung
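
One thing worth flagging about the str(cards) output: face and suit show up as factors, presumably because this post was rendered with a version of R older than 4.0.0, where data.frame() converted strings to factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so the same call gives character columns. A small sketch that makes the choice explicit either way (the object names are made up for illustration):

```r
# Explicitly request factors (the default behavior before R 4.0.0)
cards_fct <- data.frame(face = c("ace", "two", "six"),
                        suit = c("clubs", "clubs", "clubs"),
                        value = c(1, 2, 3),
                        stringsAsFactors = TRUE)
class(cards_fct$face)
## [1] "factor"

# Explicitly keep strings as character vectors (the default since R 4.0.0)
cards_chr <- data.frame(face = c("ace", "two", "six"),
                        suit = c("clubs", "clubs", "clubs"),
                        value = c(1, 2, 3),
                        stringsAsFactors = FALSE)
class(cards_chr$face)
## [1] "character"
```

Setting the argument explicitly keeps the result the same no matter which version of R runs the code.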

We also talked about indexing dataframes using [ , ] and either numbers or logical values. For the sake of brevity in this blog posting, I am not going to provide any examples. I have a LOT of course material to cover!

Functions, if else, and loops

Functions

Pretty early on, we talked about writing our own functions using this notation: my_function <- function(arguments) {code}. A function includes a name, arguments, and a body of code that acts on those arguments. These can get quite complex, so here’s a simple example that sums the value of a roll of two dice:

roll_die <- function() {
  die <- 1:6
  dice <- sample(x = die, size = 2, replace = TRUE)
  sum(dice)
}

roll_die()
## [1] 10

You can include arguments to do things like specify the number of dice. To set a default value, make the argument equal to a value in function(). The first function call will sum the roll of a pair of dice, as we did not specify a value, but the second function call will sum the rolls of 10 dice.

roll_dice <- function(num_die = 2) {
  die <- 1:6
  dice <- sample(x = die, size = num_die, replace = TRUE)
  sum(dice)
}

roll_dice()
## [1] 7

roll_dice(num_die = 10)
## [1] 31
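
Because the rolls are random, the outputs above will differ from run to run. The sketch below is a hypothetical extension, not something from the course notes: it adds a second argument, sides, shows that arguments can be matched by position or by name, and uses set.seed() to make the rolls repeatable.

```r
# Hypothetical extension of roll_dice() with a second argument, `sides`
roll_dice_sided <- function(num_die = 2, sides = 6) {
  die <- seq_len(sides)
  dice <- sample(x = die, size = num_die, replace = TRUE)
  sum(dice)
}

# Arguments can be matched by position or by name
set.seed(385)   # fix the random seed so the rolls are repeatable
first <- roll_dice_sided(3, 20)
set.seed(385)
second <- roll_dice_sided(sides = 20, num_die = 3)

identical(first, second)
## [1] TRUE
```

Resetting the same seed before each call is what makes the two results match; without it, the two calls would usually give different sums.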

if and else statements

An if statement is used to tell R when to do something. If a condition is TRUE, then R will run the body of code.

num <- -2

if (num < 0) {
  num <- num * -1
}

num
## [1] 2

If the condition is not satisfied, nothing happens.

num <- 4

if (num < 0) {
  num <- num * -1
}

num
## [1] 4

An else statement tells R what to do when the condition is FALSE.

a <- 3.15

decimal <- a - trunc(a)
decimal
## [1] 0.15

if (decimal >= 0.5) {
  a <- trunc(a) + 1
} else {
  a <- trunc(a)
}

a
## [1] 3
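
When there are more than two cases, if and else can be chained with else if. R checks the conditions in order and runs the first branch whose condition is TRUE. A minimal sketch, using a hypothetical helper function that is not from the course material:

```r
# Hypothetical helper: classify a single number with if / else if / else
describe_sign <- function(num) {
  if (num < 0) {
    "negative"
  } else if (num == 0) {
    "zero"
  } else {
    "positive"
  }
}

describe_sign(-2)
## [1] "negative"
describe_sign(0)
## [1] "zero"
describe_sign(3.15)
## [1] "positive"
```

Note that if () expects a single TRUE or FALSE, so this helper is meant for one number at a time rather than a whole vector.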

Loops

There are a few different types of loops in R. A for loop repeats a chunk of code once for each element in the input.

for (value in c("My", "second", "for", "loop")) {
  print(value)
}
## [1] "My"
## [1] "second"
## [1] "for"
## [1] "loop"
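
A for loop does not save its output on its own, so to keep results you need to store them in an object. A common pattern, sketched here with made-up variable names, is to create the output vector first and fill one position per iteration:

```r
words <- c("My", "second", "for", "loop")

# Pre-allocate the output, then fill one position per iteration
lengths_out <- integer(length(words))
for (i in seq_along(words)) {
  lengths_out[i] <- nchar(words[i])
}

lengths_out
## [1] 2 6 3 4
```

Using seq_along(words) instead of 1:length(words) means the loop simply does nothing when the input is empty.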

A while loop executes code while the condition remains true. For example:

cash <- 20
n <- 0

while (cash > 0) {
  cash <- cash - 1
  n <- n + 1
}

n
## [1] 20

A repeat loop executes until it hits a break statement or you tell it to stop by hitting esc. Again, for brevity, no example is included. However, examples can be found in my STAT 385 lecture notes on my Github.

Vectorized Code

One of the benefits of R is using logical tests, subsetting, and element-wise execution to speed up code. We can vectorize loops to make them run faster. Here’s an example:

abs_loop <- function(vec){
  for (i in seq_along(vec)) {
    if (vec[i] < 0) {
      vec[i] <- -vec[i]
    }
  }
  vec
}

abs_vectorized <- function(vec){
  negs <- vec < 0
  vec[negs] <- vec[negs] * -1
  vec
}

long <- rep(c(-1, 1), 5000000)

system.time(abs_loop(long))
##    user  system elapsed 
##   0.620   0.025   0.654

system.time(abs_vectorized(long))
##    user  system elapsed 
##   0.251   0.047   0.311

When we time the execution, the vectorized function is roughly twice as fast in this run (0.311 versus 0.654 seconds elapsed). Differences like this add up at scale!
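
Speed only matters if both versions give the same answer, so it is worth checking that as well. The sketch below repeats the two functions so it can run on its own, then compares them on a small made-up vector and against base R's built-in abs().

```r
abs_loop <- function(vec) {
  # seq_along(vec) is a safer spelling of 1:length(vec) for empty input
  for (i in seq_along(vec)) {
    if (vec[i] < 0) {
      vec[i] <- -vec[i]
    }
  }
  vec
}

abs_vectorized <- function(vec) {
  negs <- vec < 0
  vec[negs] <- vec[negs] * -1
  vec
}

short <- c(-3, 0, 2.5, -1)

# Both versions should agree with each other and with base R's abs()
identical(abs_loop(short), abs_vectorized(short))
## [1] TRUE
identical(abs_vectorized(short), abs(short))
## [1] TRUE
```

In practice abs() itself is already vectorized, so the hand-written versions are mainly useful for seeing how the technique works.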

Version Control

After learning about working with base R, we spent a few weeks exploring version control using Git and Github, and using R Markdown to present results. I do not feel qualified to try and explain Git and Github: although I am comfortable with them, I find them difficult to explain, especially through written text. The course resource we used is linked here, and I believe it presents one of the best overviews of integrating Git and Github.

Note: I was experienced with Git and Github before coming into this class and have a lot of material, both in R and Python, on my Github Page.

R Markdown and blogdown

Next, we covered R Markdown, which allows us to create formatted reports with integrated code that are easy to present and share with colleagues. This document that you are currently reading was completely written and formatted in R Markdown! For a great overview and introduction, please see this tutorial by Professor David Dalpiaz.

This entire website is created using an R package called blogdown. It allows for the creation of formatted websites using Hugo packages and modifications. It is incredibly powerful, but slightly complicated. Once you are familiar with the system, however, you are able to build fantastic personal and project websites. For the best introduction and explanation, please visit this manual by the creator of blogdown!

Jonathan Bonaguro
Undergraduate

I am an undergraduate student interested in the intersections of social science and data science, currently seeking full-time employment.