Review of basic syntax in R
library(tidyverse)
theme_set(theme_classic() +
theme(panel.grid.major.y = element_line(color = "grey92")))
# Create an object named `x`, using the assignment operator `<-`
x <- c(1, 2)
# Generally, nothing returns after an assignment. Type `x` to print the object
x
#> [1] 1 2
# Check the type
typeof(x)
#> [1] "double"
# Check the length
length(x)
#> [1] 2
# A quick look of the structure of an object
str(x)
#> num [1:2] 1 2
Another example
y <- as.character(c(seq(1, to = 10, by = 0.5), 99, 99))
typeof(y)
#> [1] "character"
str(y)
#> chr [1:21] "1" "1.5" "2" "2.5" "3" "3.5" "4" "4.5" "5" "5.5" "6" ...
To find out what the seq()
function does, use
?seq
A pipe operator is an alternative way to do multiple operations on an
object. For example, the following code (a) transforms y
to
numbers, (b) recodes the value 99 to missing, and (c) obtains the mean.
The traditional way to do it in R is through nested parentheses:
# 3. Get the mean
mean(
# 2. Recode 99 to missing
na_if(
# 1. transform to numbers
as.numeric(y), 99
),
na.rm = TRUE
)
#> [1] 5.5
What’s inconvenient is that the last operation needs to be put first. Also, it’s hard to keep track of the parentheses. Some users, including myself, prefer the alternative way of doing the same as the above code:
y %>%
# 1. transform to numbers
as.numeric() %>%
# 2. Recode 99 to missing
na_if(99) %>%
# 3. Get the mean
mean(na.rm = TRUE)
#> [1] 5.5
The pipe operator works by taking the object before
%>%
as the first argument for the function after
%>%
. So y %>% as.numeric()
is the same
as as.numeric(y)
, and y2 %>% na_if(99)
is
the same as na_if(y2, 99)
.
# Vector
(vector_a <- 1:10)
#> [1] 1 2 3 4 5 6 7 8 9 10
typeof(vector_a)
#> [1] "integer"
is.vector(vector_a) # check whether it is a vector
#> [1] TRUE
vector_a[2:3] # select the second and the third elements
#> [1] 2 3
(vector_b <- c("classical", "frequentisti", "Bayesian"))
#> [1] "classical" "frequentisti" "Bayesian"
typeof(vector_b) # character vector
#> [1] "character"
logical_b <- vector_b == "Bayesian"
typeof(logical_b) # logical vector
#> [1] "logical"
# Matrix
matrix_c <- matrix(1:10, nrow = 5, ncol = 2)
rownames(matrix_c) <- paste0("row", 1:5)
colnames(matrix_c) <- c("var1", "var2")
matrix_c
#> var1 var2
#> row1 1 6
#> row2 2 7
#> row3 3 8
#> row4 4 9
#> row5 5 10
typeof(matrix_c)
#> [1] "integer"
class(matrix_c)
#> [1] "matrix" "array"
str(matrix_c)
#> int [1:5, 1:2] 1 2 3 4 5 6 7 8 9 10
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:5] "row1" "row2" "row3" "row4" ...
#> ..$ : chr [1:2] "var1" "var2"
is.vector(matrix_c) # check whether it is a vector
#> [1] FALSE
is.matrix(matrix_c) # check whether it is a matrix
#> [1] TRUE
matrix_c[c("row1", "row5"), ] # select rows 1 and 5
#> var1 var2
#> row1 1 6
#> row5 5 10
matrix_c[, 1] # select first column (converted to a vector)
#> row1 row2 row3 row4 row5
#> 1 2 3 4 5
set.seed(123) # for reproducibility
# Array: An array can have one, two, or more dimensions.
# A matrix is also an array.
array_d <- array(rnorm(8), dim = c(2, 2, 2))
array_d
#> , , 1
#>
#> [,1] [,2]
#> [1,] -0.5604756 1.55870831
#> [2,] -0.2301775 0.07050839
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] 0.1292877 0.4609162
#> [2,] 1.7150650 -1.2650612
str(array_d)
#> num [1:2, 1:2, 1:2] -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
typeof(array_d) # numeric type
#> [1] "double"
class(array_d)
#> [1] "array"
is.matrix(array_d)
#> [1] FALSE
is.array(array_d)
#> [1] TRUE
array_d[1, 1,] # use 3 indices for a 3-D array
#> [1] -0.5604756 0.1292877
# List
# An array is a vector where each component can have a different type
list_e <- list("abc", 5, 2.57, TRUE)
typeof(list_e)
#> [1] "list"
class(list_e) # `list` for both type of class
#> [1] "list"
str(list_e)
#> List of 4
#> $ : chr "abc"
#> $ : num 5
#> $ : num 2.57
#> $ : logi TRUE
list_e[1] # extract first element, and put it in a list
#> [[1]]
#> [1] "abc"
list_e[[1]] # extract first element
#> [1] "abc"
list_f <- list(name = "abc", age = 5) # named list
# Can also extract element by name (as opposed to by position)
list_f$name
#> [1] "abc"
is.vector(list_f) # a one-dimension list is a special type of vector
#> [1] TRUE
# Data Frame
# A data frame is a special type of list with two dimensions.
# It is a list of multiple column vectors
(dataframe_c <- as.data.frame(matrix_c))
#> var1 var2
#> row1 1 6
#> row2 2 7
#> row3 3 8
#> row4 4 9
#> row5 5 10
typeof(dataframe_c) # type is list
#> [1] "list"
class(dataframe_c) # data.frame
#> [1] "data.frame"
dataframe_c[c(1, 2),] # subset using matrix method
#> var1 var2
#> row1 1 6
#> row2 2 7
dataframe_c$var1 # extract columns using list method
#> [1] 1 2 3 4 5
dataframe_c[[1]] # same as above
#> [1] 1 2 3 4 5
# select first column, and put it as a data.frame(list)
dataframe_c[1]
#> var1
#> row1 1
#> row2 2
#> row3 3
#> row4 4
#> row5 5
The popular ggplot2
package uses the “grammar of
graphics” approach to plotting. It is an elegant and comprehensive
system for graphics but requires mastering some vocabulary. As a quick
start, each plot requires specifying
For example, consider the airquality
data set:
head(airquality)
#> Ozone Solar.R Wind Temp Month Day
#> 1 41 190 7.4 67 5 1
#> 2 36 118 8.0 72 5 2
#> 3 12 149 12.6 74 5 3
#> 4 18 313 11.5 62 5 4
#> 5 NA NA 14.3 56 5 5
#> 6 28 NA 14.9 66 5 6
One can show the distribution of Ozone
. On a 2-D
Cartesian coordinate system, if we want to show the data in points, each
point needs the x- and the y-coordinates. Therefore, the following gives
an error as it only maps Ozone
to the x-axis:
ggplot(data = airquality) +
# Add a layer of points
geom_point(aes(x = Ozone))
#> Error in `check_required_aesthetics()`:
#> ! geom_point requires the following missing aesthetics: y
While not very interesting, we can instead specify every point to have a y-coordinate of 0:
ggplot(data = airquality) +
# Add a layer of points
geom_point(aes(x = Ozone, y = 0))
Some geometric objects (geom
) assign the y-coordinate
automatically. For example, geom_histogram
uses bars, with
the x-coordinate based on the variable and the y-coordinate based on the
counts of each bin.
ggplot(data = airquality) +
# Add a layer of a histogram (a set of bars)
geom_histogram(aes(x = Ozone))
A scatter plot maps the x- and the y-axes to two variables
ggplot(data = airquality) +
# Add a layer of a points
geom_point(aes(x = Wind, y = Ozone),
# alpha = 0.3 makes points transparent
alpha = 0.5)
The x- and y-coordinates are not the only attributes for the
points
geometric object. For example, we can map the sizes
of the points to another variable:
ggplot(data = airquality) +
# Add a layer of a points
geom_point(aes(x = Wind, y = Ozone, size = Temp),
alpha = 0.3)
To change the labels of axis and legend, use labs
:
ggplot(data = airquality) +
# Add a layer of a points
geom_point(aes(x = Wind, y = Ozone, size = Temp),
alpha = 0.3) +
labs(x = "Wind (mph)", y = "Ozone (ppb)", size = "Temperature (degree F)")
geom
We can use other geometric objects. For example,
geom_smooth
adds a smoothing trend
# By putting `aes()` in the first line, all subsequent layers use the same
# mapping
ggplot(aes(x = Wind, y = Ozone), data = airquality) +
# Add a layer of points
geom_point() +
# Add a layer of lines
geom_smooth()
In many situations, we want to split the data into multiple plots
through facet
.
ggplot(aes(x = Wind, y = Ozone), data = airquality) +
# Add a layer of a points
geom_point(alpha = 0.3) +
# Split by `Month`
facet_wrap(~ Month)
for
LoopUsing a loop allows one to perform some actions multiple times. See https://r4ds.had.co.nz/iteration.html#introduction-14
The following example shows an example of drawing a random variable from a normal distribution with SD = 1, mean = the previously generated number.
set.seed(2208) # for reproducibility
# Simulate data
random_numbers <- rep(NA, 100) # initialize an empty vector
random_numbers[1] <- 10 # set the first value to 10
# i = 2 in the first iteration, then 3, then 4, until i = 100
for (i in 2:100) {
# `rnorm(1)` simulate one value from a normal distribution;
# `mean = random_numbers[i - 1]` says the mean of the distribution is
# from the previously simulated value.
random_numbers[i] <- rnorm(1, mean = random_numbers[i - 1])
}
# Show the simulated values using ggplot
ggplot(
tibble(iter = 1:100, random_numbers),
aes(x = iter, y = random_numbers)
) +
geom_line() +
geom_point()
A function in R is a named object that takes in some inputs, performs some operations, and gives some outputs.
Here is an example of how you can define your own function, from p. 64 of the text
asqplusb <- function(a, b = 1) {
c <- a ^ 2 + b # c = a^2 + b
c # the last line is usually the output
}
The above function asqplusb()
takes two inputs:
a
and b
. The names of these two are called
arguments. So we say a
and b
are the
arguments of asqplusb()
. We can invoke the function
like
asqplusb(2, 3) # a = 2, b = 3, 2^2 + 3 = 7
#> [1] 7
# This is equivalent to
asqplusb(a = 2, b = 3)
#> [1] 7
# But not the same as
asqplusb(b = 2, a = 3)
#> [1] 11
Also, when we define the function, we define the first argument as
a
, and the second argument as b = 1
. The
latter means the default value of b
is 1. So when one calls
the function without specifying b
, like
asqplusb(4) # a = 4, b = 1, 4^2 + 1 = 17
#> [1] 17
It is equivalent to
asqplusb(4, b = 1) # a = 4, b = 1, 4^2 + 1 = 17
#> [1] 17
Writing one’s own function is very common in R programming, especially with Bayesian analysis, because often one needs to perform some procedures more than once, and it is tedious to write similar code every time. A rule of thumb is that if you have a few lines of code that you find yourself typing more than two times, you should write a function for it. As an example, we can wrap the simulation code above into a function that you can try out with different numbers of iterations:
my_simulate <- function(nsim = 100) {
# Simulate data
random_numbers <- rep(NA, nsim) # initialize an empty vector
random_numbers[1] <- 10 # set the first value to 10
# i = 2 in the first iteration, then 3, then 4, until i = 100
for (i in 2:nsim) {
# `rnorm(1)` simulate one value from a normal distribution;
# `mean = random_numbers[i - 1]` says the mean of the distribution is
# from the previously simulated value.
random_numbers[i] <- rnorm(1, mean = random_numbers[i - 1])
}
# Show the simulated values using ggplot
ggplot(
tibble(iter = 1:nsim, random_numbers),
aes(x = iter, y = random_numbers)
) +
geom_line() +
geom_point()
}
Try to call the above function.
#> [1] "January 13, 2022"
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. Source code is available at https://github.com/marklhc/20221-psyc573-usc, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".