Introduction to R
This is a 4 x 2hr training workshop.
PowerPoint presentation
The first hour is a PowerPoint presentation, describing R and RStudio:
Main R script
The remaining sessions are based on an R script that the students write and annotate themselves. Here is the R script I work on to teach that:
#### INTRODUCTION. IMPORTANT. PLEASE READ. #####
# This is an R script file for the QUT eResearch Introduction to R workshop.
# You can simply open it in a text reader (e.g. Notepad) and read through the text and R practical exercises.
# First: do you have R and RStudio installed?
# If not, go to https://cran.r-project.org/ and download and install the version of R for your operating system - Mac OS, Windows or Linux.
# Then download and install RStudio (https://rstudio.com/products/rstudio/download/)
# Note that the above comments all begin with a hash (#). In R this denotes a comment. R doesn't read any text after a #.
# I strongly recommend you add your own comments throughout this script, explaining to yourself what you did.
# Good commenting or annotation is extremely important when writing an R script, so you can look back later and remember what you did, as well as explaining it to anyone else who may read or want to follow your script.
# this is a comment
##### STARTING TO USE R ######
# Open RStudio
# Do you remember the RStudio environment I described in the lecture? If not, you might want to pause this video and have a look over that section of the lecture.
# Check the version of R you have installed
# Type:
version
# In the console
# Note that you can click on the line above (where it says 'version') and click the 'run' button in the source window. This will run the line of code your cursor is on, or multiple lines if you have multiple lines selected.
# Try a couple of other commands in the console.
6 + 9
print("Hello world")
######## Operators #######
# R is for analysing data. Can do basic maths, calculations
# Arithmetic Operators
4 + 10
5680 * 8324
65 / 435
3 ^ 6
# Logical Operators. Boolean (TRUE or FALSE)
6 == 6
6 == 5
6 > 5
6 < 5
6 < 6
6 <= 6
6 != 5
6 != 6
# Most important operator: the assignment operator
# <-
# Keyboard shortcut in RStudio: Alt and minus (Option and minus on a Mac)
a <- 5 # Will appear in Environment tab
b <- 6
b + a
b + 9
# Can also store characters (text). These need quotes ("").
a1 <- "QUT"
a2 <- "fred"
a3 <- "3"
##### Vectors ######
# R is vectorised (a vector = a sequence of numbers, characters, etc), i.e. calculations are done on the entire vector.
# Make a vector
d <- c(12, 4, 8, 2)
d + 3
d * 8
# Try some logical operators on that vector
d == 4
d > 6
d != 8
d[d > 4]
# Remember, R is vectorised.
d1 <- c(2, 3, 8, 7)
d1 * d
d1 == d
d1 > d
# Above are numeric vectors. Can also do character vector
e <- c("Fred", "Helen", "Bob", "Smittywebermanjensen")
# Since calculations are done on vectors, R requires that all ELEMENTS in a vector are the same type
f <- c(7, "Sally", 9, "Reginald")
# Type 'f' into console and you'll see numbers now have ""
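# Now try doing a calculation on the 'f' object
f * 5 # Error: non-numeric argument to binary operator ('f' is now a character vector)
# As an example of character vs numeric objects:
f1 <- "5"
f1 * 6 # Same error: "5" is a character, not a number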
f2 <- 5
f2 * 6
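# A minimal sketch of converting between types, using the base R function as.numeric() ('f1' is re-created here so these lines run on their own):
# If a number has been stored as a character, as.numeric() converts it so it can be used in calculations
f1 <- "5"
as.numeric(f1) * 6
# Elements that cannot be read as numbers become NA (with a warning)
as.numeric(c("7", "Sally", "9"))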
tr <- c(5, 7, 3, 8, 23, 9)
tr1 <- c(3, 4)
tr * tr1 #Different length vectors
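# A short sketch of what happens with different length vectors (R's 'recycling' rule):
# The shorter vector is repeated until it matches the longer one, so tr * tr1 is the same as:
tr <- c(5, 7, 3, 8, 23, 9)
tr1 <- c(3, 4)
tr * rep(tr1, times = 3)
# If the longer length is not a multiple of the shorter length, R still recycles but gives a warning:
c(1, 2, 3) * c(10, 20)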
# Again, R is for analysing data. Can do more complex analysis than basic maths.
# Calculate mean?
# Sum divided by count
# How would you achieve this?
(12 + 4 + 8 + 2)/4
# Yes, but better way: use a function
##### Functions ######
# Built in functions (base R)
# Functions imported from packages
# User created functions
# I mentioned the class function
# Have a look at the vector objects that we created before
d <- c(12, 4, 8, 2)
e <- c("Fred", "Helen", "Bob", "Smittywebermanjensen")
f <- c(7, "Sally", 9, "Reginald")
# Now identify what type of objects these are using the class() function
class(d)
class(e)
class(f)
str(d)
class(cars)
str(cars)
# Help on functions: use '?'
?class
# How would you calculate mean on this set of numbers?
# 2.23, 3.45, 1.87, 2.11, 7.33, 18.34, 19.23
mean(2.23, 3.45, 1.87, 2.11, 7.33, 18.34, 19.23)
#[1] 2.23 <- WRONG
mean(c(2.23, 3.45, 1.87, 2.11, 7.33, 18.34, 19.23))
#[1] 7.794286 <- RIGHT
# Or to separate:
av <- c(2.23, 3.45, 1.87, 2.11, 7.33, 18.34, 19.23)
mean(av)
##### Base R functions ######
# Create some vectors to work with
d <- c(12, 4, 8, 2)
e <- c("Fred", "Helen", "Bob", "Smittywebermanjensen")
# Some other maths/statistical functions to try out
log(12)
log(d) # Natural log
log2(12)
log10(12)
max(d)
min(d)
sd(d)
var(d)
sum(d)
length(d)
seq(2, 10, 2)
# https://cran.r-project.org/doc/contrib/Short-refcard.pdf
# https://www.povertyactionlab.org/sites/default/files/r-cheat-sheet.pdf
# Lots of other good 'cheat sheets'
# Practical exercise:
# Make a vector, called 'data', using the numbers in 'cars$speed':
data <- cars$speed
# Find the mean
# Find the maximum number
# Find the minimum
# How many elements are in the vector?
# Say you only want results greater than 12. How many elements are > 12?
# How many elements are less than 12? How many ways can you calculate this with R?
data < 12 # Boolean (TRUE/FALSE). TRUE is counted as 1, FALSE as 0
sum(data < 12)
length(data) - sum(data > 12) # Not the same answer: this also counts the elements equal to 12
length(data) - sum(data >= 12)
length(data) - sum(data > 12) - sum(data == 12)
# Do the above using stored objects
x <- length(data)
y <- sum(data >= 12)
x - y
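# One more way, sketched with the base R function table(), which counts the TRUEs and FALSEs in one step:
data <- cars$speed
table(data < 12)
# And because TRUE = 1 and FALSE = 0, the mean() of a logical vector gives the proportion that are TRUE
mean(data < 12)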
# Creating a function
myfun <- function(a, b){a ^ b}
myfun(4, 7)
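# Another small sketch: writing the mean calculation from earlier (sum divided by count) as your own function.
# 'mymean' is just a name made up for this example; in practice you would use the built-in mean()
mymean <- function(x) {
  sum(x) / length(x)
}
mymean(c(12, 4, 8, 2))
mean(c(12, 4, 8, 2)) # Same answer as the built-in function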
##### R built-in datasets ######
# Vector, dataframe, matrix, list
# Built-in datasets
cars
iris
# What class are these?
# R has lots of built-in datasets to play with
# List the datasets
data()
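# Datasets in the built-in 'datasets' package (cars, iris, mtcars, etc) can be used straight away. data() loads a named dataset into your environment, which is needed for datasets in some other packages:
data(mtcars)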
# Good dataset for practicing plotting
##### Plotting ######
# https://bookdown.org/rdpeng/exdata/the-base-plotting-system-1.html
# Scatter plot: plot()
# Bar plot: barplot()
# Box plot: boxplot()
# Histogram: hist()
# Pull out a numeric vector from one of the built-in datasets
mydata <- cars$speed
# Plot a single numerical vector
hist(mydata)
plot(mydata)
# plot() is more useful when comparing two vectors
# We could create a couple of vectors to demonstrate
a <- c(6, 8, 10, 12, 14)
b <- c(5, 6, 7, 8, 9)
# Now plot each vector on x-y axis plot
plot(a, b)
# Better to use built in dataset, such as 'iris'
iris
# Each column can be pulled out separately using the $ symbol
iris$Sepal.Length
# You could do a histogram of the sepal width
hist(iris$Sepal.Width)
# Plot sepal length vs petal length:
plot(iris$Sepal.Length, iris$Petal.Length)
# This is comparing two numeric variables: sepal length vs petal length. But what if we wanted to plot the petal length for each species?
plot(iris$Petal.Length, col = iris$Species)
# Boxplot plots all numeric variables (columns) in a data frame or matrix
boxplot(cars)
boxplot(iris)
# Function arguments
# Modify the plot by adding function arguments
# See ?plot for parameters (inc. defaults)
# First, instead of using the lengthy 'iris$Sepal.Length' and 'iris$Petal.Length', let's store these as named vectors
s <- iris$Sepal.Length
p <- iris$Petal.Length
plot(p, s)
# Change the colour of the points (col =)
plot(p, s, col = "royalblue")
# http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
# Change the size of the points (cex =)
plot(p, s, col = "royalblue", cex = 2)
# Change the type of point (pch =)
# http://www.sthda.com/english/wiki/r-plot-pch-symbols-the-different-point-shapes-available-in-r
plot(p, s, col = "green", cex = 3, pch = 19)
# With a 'filled' point (21:25) you can define the outer (col =) and inner (bg =) colours:
plot(p, s, col = "green", bg = "red", cex = 3, pch = 22)
# Change the x and y axis labels (xlab = and ylab =)
plot(p, s, col = "green", bg = "red", cex = 3, pch = 22, xlab = "Petal length", ylab = "Sepal length")
# Give it a title
plot(p, s, col = "green", bg = "red", cex = 3, pch = 22, xlab = "Petal length", ylab = "Sepal length", main = "Really ugly plot")
# Add a line, define its width (lwd =)
lines(p, s, col = "blue", lwd = 3)
# Or you could plot it as just a line plot (type = "l"), but make it a dotted line (lty = "dotted")
plot(p, s, col = "green", bg = "red", cex = 3, pch = 22, xlab = "Petal length", ylab = "Sepal length", main = "Really ugly plot", type = "l", lty = "dotted", lwd = 3)
# Notice in iris there is a categorical variable (a factor) of species names
iris
iris$Species
# You can colour your points by this categorical data
plot(p, s)
plot(p, col = iris$Species)
plot(s, col = iris$Species)
plot(p, s, col = iris$Species)
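# A sketch of adding a legend to this kind of plot with the base R function legend().
# Assumption: the default colour palette is in use, so the three species are drawn in colours 1, 2 and 3 (in the order given by levels())
p <- iris$Petal.Length
s <- iris$Sepal.Length
plot(p, s, col = iris$Species)
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 1)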
# Try it with other plots and datasets
hist(iris$Sepal.Width, col = "firebrick", xlab = "Sepal width", main = "Histogram of Iris Sepal Widths")
# Some practice at home:
# https://rstudio-pubs-static.s3.amazonaws.com/7953_4e3efd5b9415444ca065b1167862c349.html
# Export a plot
#jpeg, png, tiff, pdf
jpeg("rplot.jpg")
plot(p, s)
dev.off()
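# The other formats work the same way. A sketch using png(), setting the image size and resolution
# (the file name and sizes here are only examples):
png("rplot.png", width = 1600, height = 1200, res = 200)
plot(iris$Petal.Length, iris$Sepal.Length)
dev.off() # Always close the file with dev.off(), otherwise the image is not written properly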
##### Coordinates and subsetting #####
# Dollar sign
# Square brackets
# subset() function
# R is vectorised
# Each column can be thought of as an individual vector
# I.e. a column can only contain one type of element - must be all numeric, all character, etc and all columns/vectors must be the same length
# Type 'cars'
cars
# What are the two column names in this object?
# You can just type 'cars' and scroll up to see column names
# - or we can use the function colnames()
colnames(cars)
# What are the row names?
row.names(cars)
# Two things to note here - first that the row names are numbers, second, that they are characters, not numeric (i.e. "1", "2", "3", etc)
# Note: colnames() and row.names() can also be used to change column or row names, e.g.:
mycars <- cars
colnames(mycars) <- c("how fast", "how far")
# To look at just the first column, use the $:
cars$speed
# This will just output the 'speed' column to your console
# You can also put this column into a vector. In this way (and other ways - next section) you can pull out a subset of your data and work on that.
speed <- cars$speed
# Try this with the iris dataset
# I.e. Find the column and row names, pull out a single named column, store this in a vector
# Find the structure of cars and iris
str(cars)
str(iris)
## R coordinates and square brackets
# R objects are coordinate based.
dim(cars)
dim(iris)
# Using square brackets you can extract data from defined coordinates
# E.g. In the 'speed' vector, pull out just the 4th element
speed[4] # (remember, 'speed' is a vector: cars$speed)
# Elements 3-5 (for sequential elements use ':')
speed[3:5]
# Elements 2, 4, 5 (for separate elements, combine them with c())
speed[c(2,4,5)]
# Note that things like colnames() and row.names() produce a vector of column names and row names
colnames(iris)
# So you can use the square brackets to just pull out, say, the 2nd and 5th column names:
colnames(iris)[c(2, 5)]
# The 'speed' object that we created above is just a vector - 1 dimension (dim() returns NULL for a vector; use length() instead to see how many elements)
# Dataframes and matrices have two dimensions: Rows and columns
# If you wanted to pull out the second column you could use $
cars$dist
# Or the same using square brackets:
cars[ ,2]
# Comma separated: row number, column number
# You can pull out data using both row and column
cars[5,2] # Pulls out just the element in row 5, column 2
cars[2,] # Just row 2
cars[1:3,] # just rows 1:3
# And, as with the vector examples, you can pull out multiple rows and columns
cars[4:12,2] # Rows 4-12, column 2
cars[c(1,3,4), c(1,2)] # Rows 1, 3 and 4, columns 1 and 2 (cars only has two columns)
# you can pull out whatever data you want by using the coordinates
# But the real power is when you combine it with logical operators
# Recall:
# Logical Operators. Boolean (TRUE or FALSE)
6 == 6
6 == 5
6 > 5
6 < 5
6 < 6
6 <= 6
6 != 5
6 != 6
# Try it on vectors first
speed
speed > 15
# Gives a Boolean TRUE/FALSE
# So to pull out all elements in speed that are > 15, you need to combine this with square brackets
speed[speed > 15]
speed[speed < 15]
speed[speed == 15]
speed[speed != 15]
# Try it with a categorical vector (a factor)
species <- iris$Species
species[species == "setosa"] # Note the double ==
# Pull out two categories - using 'or' operator (|)
species[species == "setosa" | species == "versicolor"]
# Note: you can see how many unique elements there are by using unique()
unique(species)
unique(speed)
length(speed)
length(unique(speed))
# Using subset()
# Same as above, pulling out elements in speed that are > 15
subset(speed, speed > 15)
# Subset is better for working on data frames
# Use iris dataset
# Pull out only setosa rows
subset(iris, Species == "setosa")
# Pull out rows where petal length > 5cm
subset(iris, Petal.Length > 5)
# Pull out rows where petal length > 5, but only for versicolor species
subset(iris, Petal.Length > 5 & Species == "versicolor")
# Pull out rows where petal length > 5, but only for virginica species
subset(iris, Petal.Length > 5 & Species == "virginica")
# You can also select certain columns, rather than using the whole dataset
# Pull out rows where petal length > 5, but only the Petal.Length and Species columns
subset(iris, Petal.Length > 5, select = c(Petal.Length, Species))
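# For comparison, a sketch of the same kind of filtering using square brackets instead of subset().
# The logical test goes in the 'row' position and the column names in the 'column' position:
iris[iris$Petal.Length > 5 & iris$Species == "virginica", ]
iris[iris$Petal.Length > 5, c("Petal.Length", "Species")]
# Counting the rows is a quick check that both approaches pull out the same number of rows
nrow(subset(iris, Petal.Length > 5))
nrow(iris[iris$Petal.Length > 5, ])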
##### Getting data into and out of R #####
# Working directory. R reads in and writes out files to and from your working directory.
getwd()
# setwd() sets the working directory. You can add it to your script, so you're always working from the correct directory. E.g. (use your own path):
setwd("C:/Users/whatmorp/OneDrive - Queensland University of Technology/Desktop/Teaching/Introduction to R workshops")
# Set the working directory using RStudio menu choices:
# Session -> Set Working Directory -> Choose Directory
# You can see what files are in your working directory in the 'files' tab in the bottom right pane
# Or you can:
dir()
# Note: in a Linux terminal, the command for checking what is in your directory is ls. In R, ls() does something different:
ls()
# In R, ls() shows your current objects. You can see them in RStudio in your environment window (top right).
## Importing your data
# Using base R tools you can import a text file that contains separators (e.g. a comma) to separate data elements
# The standard command to import a text file is read.table()
?read.table
# Create a basic text file called 'Book1.txt' in Excel and then 'save as' -> 'tab delimited text file'. Remember - R looks in your working directory!
a <- read.table("Book1.txt", header = TRUE)
# IMPORTANT!: the default separators (AKA delimiters) used by read.table are 'white space', i.e. spaces, tabs.
# This can produce extra columns or rows if you're not careful
read.table("Book1.txt")
# You could make sure your data has no mixed tabs and spaces, but this is difficult with big datasets
# Better to define the separator in the read.table() arguments
read.table("Book1.txt", sep = "\t") # "\t" indicates tab-separated data. Very common.
# Any separator will work E.g. substituting tabs for 'x'
read.table("Book1.txt", sep = "x")
# Another common data format is comma separated
# There is a built-in R command called read.csv (csv = comma separated values)
# This is basically the same as read.table, but it automatically assumes a comma separator
?read.csv
read.csv("Book1.csv")
# Same as:
read.table("Book1.csv", sep = ",")
# without the delimiter:
read.table("Book1.csv")
## Headers
# Typically the data you import will have headers
# read.table treats these as an ordinary row of data by default (see ?read.table - header = FALSE is default)
read.table("Book1.txt")
# To include headers you need to add 'header = TRUE'
read.table("Book1.txt", sep = "\t", header = TRUE)
# If your first column is row names, you can use 'row.names = 1'
read.table("Book1.txt", sep = "\t", header = TRUE, row.names = 1)
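# If you don't have a Book1.txt file yet, here is a self-contained sketch you can run instead.
# It writes a small tab-delimited file to a temporary location and reads it back in
# ('tmp' and the values in it are made up for this example):
tmp <- tempfile(fileext = ".txt")
writeLines(c("name\tone\ttwo", "a\t1\t4", "b\t2\t5", "c\t3\t6"), tmp)
read.table(tmp, sep = "\t") # Header row is treated as data
read.table(tmp, sep = "\t", header = TRUE) # Header row becomes the column names
read.table(tmp, sep = "\t", header = TRUE, row.names = 1) # First column becomes the row names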
## Missing data
# In Notepad, remove a couple of data points from Book1.txt file
read.table("Book1.txt", sep = "\t", header = TRUE, row.names = 1)
# Recall that R is vector-based and that in data frames or matrices, all columns (vectors) must be the same length.
# In addition, all must contain data
# If data is missing, R replaces a 'blank' entry with NA
# You can find the NAs in a data frame or matrix with the is.na() function
# E.g. make Book1.txt an object called 'natest'
natest <- read.table("Book1.txt", sep = "\t", header = TRUE, row.names = 1)
# Then check for NA's
is.na(natest)
# Note this is boolean, so gives TRUE or FALSE
# To count NA's, use sum() as well
sum(is.na(natest))
# You can also count NA's just on one or selected columns, using subsetting ($ or [])
# NAs can be a problem. They can affect calculations, depending on the calculations you're doing.
# With element-wise calculations only the missing elements are affected (they stay NA)
3 * natest$two
log(natest$two)
# But if you're combining data (e.g. a mean), a single NA makes the whole result NA
rowMeans(natest)
# There are a few ways of dealing with this
# You could remove NA's
na.omit(natest)
# But this removes any row that contains an NA. The entire row. You lose all that data
# Or you could substitute NAs with zero (careful here though - will affect calculations)
natest[is.na(natest)] <- 0
natest
rowMeans(natest)
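# A third option, sketched here on a small made-up data frame ('nademo'): keep the NAs but tell the function to skip them.
# Many base R functions (mean(), sum(), rowMeans(), etc) have an 'na.rm' argument for this
nademo <- data.frame(one = c(1, 2, NA), two = c(4, NA, 6))
colSums(is.na(nademo)) # Number of NAs in each column
rowMeans(nademo) # Rows containing an NA give NA
rowMeans(nademo, na.rm = TRUE) # NAs are dropped before the mean is calculated
mean(nademo$one, na.rm = TRUE)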
## Exporting your data
# So you've managed to filter, analyse, subset, etc your data and you have a table of your results as an object in R
carsdat <- cars[cars$speed > 16, ]
# You can write this out (to your working directory, remember) as a file using write.table, write.csv
# These are essentially the reverse of read.table and have, for the most part, the same rules, such as defining separators, etc
write.table(carsdat, "carsdat.txt", sep = "\t")
write.csv(carsdat, "carsdat.csv")
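# A sketch of two arguments that are often useful when exporting (both are standard write.table() arguments):
# row.names = FALSE leaves out the row names column, quote = FALSE leaves out the "" around text
carsdat <- cars[cars$speed > 16, ]
write.table(carsdat, "carsdat.txt", sep = "\t", row.names = FALSE, quote = FALSE)
# Reading the file back in is a good way to check that it was written as you expected
read.table("carsdat.txt", sep = "\t", header = TRUE)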
## If you want to cut and paste
# There is a nice addin called datapasta that allows you to directly copy and paste data
install.packages("datapasta")
library(datapasta)
## Importing other types of data requires a package to be installed
# ChatGPT in R studio
# package 'gptstudio'
# https://cran.r-project.org/web/packages/gptstudio/index.html
# https://youtu.be/QQfDTLExoNU?t=233
##### Installing packages #####
# The primary repository for R packages is CRAN (The Comprehensive R Archive Network)
# https://cran.r-project.org/
# There are many thousands of packages
# https://cran.r-project.org/web/packages/available_packages_by_name.html
# The standard function for installing CRAN R packages is:
# install.packages("package_name")
# NOTE: this downloads and installs a package, but to use a package you have to load it by using the library function:
# library(package_name)
# For example, to import an Excel file into R, you need to install a package that can do this.
# https://readxl.tidyverse.org/
install.packages("readxl")
library(readxl)
# Note that when installing a package, the package name must be in quotes, e.g. install.packages("package_name"), but the quotes can be left out when loading it, e.g. library(package_name)
# Packages come with a collection of functions written for that package
# In the readxl package, the command to import an Excel file is read_excel()
read_excel("excel_file.xlsx")
# CRAN is the main repository for R packages, but there are other repositories, that use different package installation methods.
# E.g. Bioconductor
# https://bioconductor.org/packages/
# The instructions to install a Bioconductor package are on the package website
# E.g. DESeq2:
# https://bioconductor.org/packages/release/bioc/html/DESeq2.html
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("DESeq2")
library(DESeq2)
##### Statistics with base R #####
# R was designed as a statistical package. Base R contains many statistical functions. This section will give an overview of some of the commonly used functions. A full explanation of R's statistical functions would require a full course!
# If you want to deep dive further into learning how to use R for statistics, there are a multitude of online resources
# E.g. https://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf
# QCIF courses
# https://www.qcif.edu.au/trainingcourses/statistics-for-comparisons/
# https://www.qcif.edu.au/trainingcourses/exploring-and-predicting-using-linear-regression/
# I'm using built-in datasets below, but you can import your own dataset and use that if you like.
# Have a look at the structure of the iris dataset
str(iris)
# 'summary' gives you some basic statistics of an entire matrix or data table, for each column (i.e. each vector - remember, columns = vectors in R)
# For a numeric vector, summary gives you the quartiles and mean; for a factor (categorical) vector it gives counts.
summary(iris)
# You can get just the quantiles with the quantile function (only works on numeric vectors)
quantile(iris$Sepal.Length)
# If you do a summary of this vector, you'll see it's the same information, but summary also includes the mean
summary(iris$Sepal.Length)
# Quantiles are good for creating box and whisker plots, to compare the range, distribution, outliers
boxplot(iris)
# Note that this will plot the Species column (a factor) too, which is not informative: a categorical variable makes no sense on a box plot. You can use subsetting (square brackets) to plot just the columns you want.
boxplot(iris[1:4])
# These summary stats can be calculated individually with their own functions:
min(iris$Sepal.Length)
median(iris$Sepal.Length)
mean(iris$Sepal.Length)
max(iris$Sepal.Length)
# Variance
var(iris[1:4]) # On a data frame var() gives a variance-covariance matrix, so use only the numeric columns
var(iris$Sepal.Length)
# Standard deviation function
sd(iris$Sepal.Length)
# Same as the square root of the variance
sqrt(var(iris$Sepal.Length))
# Correlations
?cor
# Pearson's correlation measures linear correlation (its significance test assumes your data is normally distributed)
# cor(Vector1, Vector2, method = "pearson")
# Spearman's correlation is rank-based: it does not assume normality and measures monotonic (not necessarily linear) relationships
# method = "spearman"
# Kendall's correlation is also rank-based: it does not assume normality and measures monotonic relationships
# method = "kendall"
# Can use a histogram to visualise if a variable follows a normal distribution
hist(iris$Sepal.Length)
hist(iris$Sepal.Width)
hist(iris$Petal.Length)
hist(iris$Petal.Width)
# Then you can compare correlations between variables, choosing the correct method. Sepal length and width look roughly normally distributed, so Pearson's method is used
cor(iris$Sepal.Length, iris$Sepal.Width, method = "pearson")
# Pretty low correlation between sepal length and width
# Which can be visualised with a scatter plot
plot(iris$Sepal.Length, iris$Sepal.Width)
# Petal length and width are not normally distributed. So we use Spearman.
cor(iris$Petal.Length, iris$Petal.Width, method = "spearman")
# Unlike sepals, strong correlation between petal length and width
# Which, again, can be visualised
plot(iris$Petal.Length, iris$Petal.Width)
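# cor() only gives the correlation coefficient. A sketch of getting a p-value as well, using the base R function cor.test().
# Note: 'exact = FALSE' is set here because the iris measurements contain tied values, so an exact Spearman p-value can't be calculated
ct <- cor.test(iris$Petal.Length, iris$Petal.Width, method = "spearman", exact = FALSE)
ct
ct$estimate # The correlation coefficient (same value as cor() gives)
ct$p.value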
# You can add a regression line to this plot using another statistical function: linear model (lm)
# Note: lm is y ~ x
lm(iris$Petal.Width ~ iris$Petal.Length)
# The above generates the linear model. Use the plotting function 'abline' to plot this
abline(lm(iris$Petal.Width ~ iris$Petal.Length))
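# Finally, you can add the r squared figure (i.e. the squared correlation) by using 'cor' and the plotting function 'text'
# Note: the r squared of a straight-line fit is the square of Pearson's correlation, so use method = "pearson" here
rsq <- (cor(iris$Petal.Length, iris$Petal.Width, method = "pearson"))^2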
text(3.5, 2, rsq)
# Or to tidy it up a bit:
text(3.5, 2, paste0("r2 = ", round(rsq, 2)))
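# A sketch of pulling the numbers out of the linear model itself ('fit' is just a name chosen for this example).
# coef() gives the intercept and slope of the regression line, and summary() holds the R squared of the model
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
coef(fit)
summary(fit)$r.squared
# For a straight-line model with one predictor, R squared equals the square of Pearson's correlation
cor(iris$Petal.Length, iris$Petal.Width, method = "pearson")^2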
# Another example of combining some of the above stats in a plot, we can fit a normal curve over a histogram
hist(iris$Sepal.Width, prob=TRUE)
# Note the 'prob=TRUE' argument. This plots density, rather than frequency
# Now calculate the mean
m <- mean(iris$Sepal.Width)
# The standard deviation
std <- sd(iris$Sepal.Width)
# And finally use this to plot a normal curve (using the plotting function 'curve' and the 'dnorm' function to calculate normal distribution)
curve(dnorm(x, mean=m, sd=std), add = TRUE)
# Note the 'add = TRUE' argument - adds curve to current plot
## Other built-in statistical tests
# t Test: t.test()
# Chi-squared Test: chisq.test()
# Generalized Linear Model: glm()
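# Sketches of how these three functions can be called, using built-in datasets (the object names are only examples):
# t test: compare sepal length between two species
setosa <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]
t.test(setosa, versicolor)
# Chi-squared test: is transmission type (am) independent of engine shape (vs) in mtcars?
chisq.test(table(mtcars$am, mtcars$vs))
# Generalized linear model: logistic regression of transmission type on car weight
glm(am ~ wt, data = mtcars, family = binomial)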
# anova
# https://www.scribbr.com/statistics/anova-in-r/
model <- lm(Sepal.Length ~ Species, data = iris)
anova(model)
# https://dynamicecology.wordpress.com/2014/10/02/interpreting-anova-interactions-and-model-selection/
# The summary function can also be used on statistical models to generate higher-level stats
summary(model)
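# Gives:
# F-statistic
# R squared
# Standard errors
# P values - Pr(>|t|) for each coefficient and an overall p-value for the F-statistic (Pr(>F) is shown in the anova() table)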
plot(model)
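# plot(model) draws four diagnostic plots, one after the other. A sketch of showing them together on one page
# using the base R function par() (mfrow = c(2, 2) means a 2 x 2 grid of plots):
model <- lm(Sepal.Length ~ Species, data = iris)
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1)) # Reset to one plot per page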
# Stats using ggpubr package
# http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/78-perfect-scatter-plots-with-correlation-and-marginal-histograms/
# Install and load ggpubr package
install.packages("ggpubr")
library(ggpubr)
# Load mtcars data
data("mtcars")
# Make a new object
df <- mtcars
# Convert cyl as a grouping variable (i.e. factor)
df$cyl <- as.factor(df$cyl)
# Compare mpg with weight
ggscatter(df, x = "wt", y = "mpg")
# Add regression line
ggscatter(df, x = "wt", y = "mpg", add = "reg.line")
# Add correlation stats
ggscatter(df, x = "wt", y = "mpg", add = "reg.line") + stat_cor()
# Note: adjust label positions with the label.x = .. and label.y = .. arguments
# Colour by cylinder (remember, we made this variable a factor)
ggscatter(df, x = "wt", y = "mpg", add = "reg.line", color = "cyl") + stat_cor()
# Change the colour palette (see: ggsci colour palettes. https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html)
ggscatter(df, x = "wt", y = "mpg", add = "reg.line", color = "cyl", palette = "jco") + stat_cor()
# Split correlation stats by cylinder too
ggscatter(df, x = "wt", y = "mpg", add = "reg.line", color = "cyl", palette = "jco") + stat_cor(aes(color = cyl))
# Split into separate sub-plots (facets)
ggscatter(df, x = "wt", y = "mpg", add = "reg.line", color = "cyl", palette = "jco", facet.by = "cyl") + stat_cor(aes(color = cyl))
# Alternatively, draw ellipses around groups (default = 0.95 CI)
ggscatter(df, x = "wt", y = "mpg", color = "cyl", palette = "jco", ellipse = TRUE)
# http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/76-add-p-values-and-significance-levels-to-ggplots/
# https://qcif-training.github.io/StatisticalComparisonsUsingR/fig/05-fig1.png
##### More Plotting #####
# Another very popular alternative to plotting with built-in R functions is the ggplot2 package
# Part of the Tidyverse collection of packages
# https://tidyverse.tidyverse.org/
# ggplot2, for data visualisation.
# dplyr, for data manipulation.
# tidyr, for data tidying.
# readr, for data import.
# purrr, for functional programming.
# tibble, for tibbles, a modern re-imagining of data frames.
# stringr, for strings.
# forcats, for factors.
# lubridate, for date/times.
# ggplot can make very beautiful and functional plots, but uses some different methods to base R plotting
# The gg in ggplot2 means Grammar of Graphics
# First install and load ggplot
install.packages("ggplot2")
library(ggplot2)
# To illustrate the difference between base R plotting and ggplot, we'll plot the same data: from the built-in dataset, Iris
# Plotting Sepal.Length vs Sepal.Width using base R plot
plot(iris$Sepal.Length, iris$Sepal.Width)
# Plotting the same data using ggplot
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
# You can see the plotting style differs, as does the information you need to provide to plot() vs ggplot()
# ggplots are divided into three components: data + aesthetics + geometry
# (this is the basis for the Grammar of Graphics)
# iris is the dataset you're working with
# The aesthetics is the information in that dataset you want to plot (in this case Sepal.Length vs Sepal.Width) - note that you don't have to subset the data with a $ or []
# You need to finish it off with the type of plot you want. In this case it's a scatter plot, so: geom_point()
# Note the '+' after the main ggplot command.
# ggplot adds plot components piece by piece, using the '+'
# For example, if you just ran the main ggplot command:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))
# It would produce a blank plot. It's waiting for you to tell it HOW to plot the data
# A useful way to modify your plot 'on the fly' is to make your plot an object, then you can add (+) things to your plot
# E.g.
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))
# Now the basic plot is stored as an object called 'p', which you can run just by typing p
p
# To add a dot plot to this, you can:
p + geom_point()
# You can plot it as a line plot instead, using:
p + geom_line()
# The plot types are:
# Scatter plot: geom_point()
# Box plot: geom_boxplot()
# Violin plot: geom_violin()
# Strip chart: geom_jitter()
# Dot plot: geom_dotplot()
# Bar chart: geom_bar() or geom_col()
# Line plot: geom_line()
# Histogram: geom_histogram()
# Density plot: geom_density()
# You need to select the correct type of plot for your data.
# For example:
p + geom_bar()
# Won't work (it gives an error), as a bar plot needs just one variable as input
# So you'd have to redo the ggplot
p <- ggplot(iris, aes(x = Sepal.Length))
p + geom_bar()
# Then you can plot it as a plot type that uses one variable, e.g. a density plot:
p + geom_density()
# The ggplot cheatsheet is good for determining which kind of geometry you should use
# https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf
# Colours, point shapes, size, etc can also be set
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))
p + geom_point()
p + geom_point(size = 3)
p + geom_point(size = 3, colour = "red")
p + geom_point(size = 3, colour = "red", shape = 21)
# Point shapes are the same as in base R
# https://www.datanovia.com/en/wp-content/uploads/dn-tutorials/ggplot2/figures/003-introduction-to-ggplot2-plotting-symbol-1.png
# You can change labels by using labs()
# First, let's assume you have decided on your colours and points. So create an object with all of this info:
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(size = 3, colour = "red", shape = 21)
# Then use labs() to change the labels
p + labs(title = "Iris sepal length vs width", x = "Sepal Length (cm)", y = "Sepal Width (cm)")
# The default ggplot theme is the grey background and white lines. You can change themes:
p + theme_bw()
p + theme_dark()
p + theme_minimal()
# Several others
# https://ggplot2.tidyverse.org/reference/ggtheme.html
# It may seem that base R plotting is simpler and easier
# Again, the same data with base R:
plot(iris$Sepal.Length, iris$Sepal.Width)
# Vs ggplot
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))+
geom_point()
# Base R plotting tools (plot(), hist(), etc) tend to be better for quickly checking your data and ggplot is better for working with more complex datasets and producing more beautiful, publishable plots.
# The additive nature of ggplot (+) is one benefit over base R plotting
# But also that ggplot loads in the whole dataset and can subsequently use multiple aspects or levels of it
# Make each species a different colour and point shape
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species, shape = Species))+
geom_point()
p
# Note that ggplot automatically adds a legend as well. To do the same in the above base R plot you'd need to provide each point type, colour and name as a separate vector - quite a lot of code.
# Last lesson we looked at adding regression lines to a plot
# In ggplot you can use the geom_smooth() command
p <- p + geom_smooth(method='lm')
p
# This results in some overlap, so you can add facet_wrap() to split into a plot with multiple panels
p <- p + facet_wrap(~Species)
p
# There are many ways to change colours in ggplot. One example is using 'colour brewer' palettes. These are ready-made colour palettes:
# https://www.datanovia.com/en/wp-content/uploads/dn-tutorials/ggplot2/figures/0101-rcolorbrewer-palette-rcolorbrewer-palettes-1.png
p + scale_color_brewer(palette="Set1")
p + scale_color_brewer(palette="Dark2")
# The above is generating a dot or scatter plot. Let's look at some different plot types.
# A box and whisker plot:
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot() +
theme_bw()
# A violin plot
ggplot(iris, aes(x = Species, y = Sepal.Length, fill=Species)) +
geom_violin() +
theme_bw()
# A stacked barplot
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
geom_bar() +
theme_bw()
# Separate barplots (for histograms, just substitute geom_bar() with geom_histogram())
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
geom_bar() +
theme_bw() +
facet_wrap(~Species)
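# For example, a sketch of the histogram version (binwidth = 0.25 is an arbitrary choice; try other values):
library(ggplot2)
p_hist <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.25) +
  theme_bw() +
  facet_wrap(~Species)
p_hist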
# Density curves
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
geom_density(alpha = 0.5) +
theme_bw() +
scale_fill_brewer(palette="Set1")
# Note that in the above I've changed the colour palette and the transparency ('alpha = 0.5')
# 2d density plot
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
geom_density2d() +
theme_bw()
# Note that for colours in box plots, bar plots, etc, the colour is a 'fill', so instead of using scale_color_brewer() you use scale_fill_brewer()
p <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill=Species)) +
geom_violin() +
theme_bw()
p
p + scale_fill_brewer(palette="Dark2")
# As I say, there are TONS of colour options. Far too many to go over here
# This is a great website for ggplot colours:
# https://www.datanovia.com/en/blog/ggplot-colors-best-tricks-you-will-love/
##### Bringing it all together #####
mtcars
iris
data("swiss")
data("PlantGrowth") # long format
# Read in your own dataset
# What's your working directory? Set it if you haven't already done so.
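# For example (the folder path below is a made-up placeholder; use a folder that exists on your computer):
getwd()
# setwd("C:/Users/your_name/Documents/r_workshop")
# List the files in the working directory to check your data file is there
list.files()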
# Then read in your data using:
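# (the file names here are placeholders; replace them with the name of your own file)
read.table("dsdfsfd.txt", sep = "\t", header = TRUE)
read.csv("your_file.csv")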
# Or if it's an Excel file
# Install and load the readxl package
install.packages("readxl")
library(readxl)
# And then read in your file
read_excel("excel_file.xlsx")
# Make sure you read it in as an object
irisdata <- iris
# Have a look at the structure of your data. What does this tell you?
str(irisdata)
# What type of object is it?
class(irisdata)
# How many rows does it have? How many columns?
dim(irisdata)
nrow(irisdata)
ncol(irisdata)
# What are the column names? The row names?
colnames(irisdata)
row.names(irisdata)
# Pull out just one of the columns, using the column name
irisdata$Sepal.Length
# Pull out columns 1, 3 and 5 (iris only has 5 columns, so asking for column 6 would give an 'undefined columns selected' error)
irisdata[c(1,3,5)]
# Pull out row 1 to 20
irisdata[c(1:20),]
# Pull out just the last two columns for rows 1 to 20
newiris <- irisdata[c(1:20),]
ncol(newiris)
newiris[c(4:5)]
# or
irisdata[c(1:20),c(4:5)]
# or
irisdata[c(1:20),c((ncol(irisdata)-1):ncol(irisdata))]
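# You can also select columns by name rather than by position, which keeps working if the column order changes. A small sketch using the built-in iris data:
by_name <- iris[1:20, c("Petal.Width", "Species")]
by_position <- iris[1:20, 4:5]
# Check that both approaches give exactly the same result
identical(by_name, by_position)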
# Which columns are numeric vectors?
str(irisdata)
# What is the mean of one of the numeric columns?
mean(irisdata$Sepal.Length)
# What is the minimum? The maximum?
max(irisdata$Sepal.Length)
min(irisdata$Sepal.Length)
# Get the statistical summary for a numeric vector, a factor (Species is stored as a factor rather than a character vector), and the entire object
summary(irisdata$Sepal.Length)
summary(irisdata$Species)
summary(irisdata)
# For one of the numerical vectors, how many entries are larger than the median? How many are smaller?
median(irisdata$Sepal.Length)
irisdata$Sepal.Length > median(irisdata$Sepal.Length)
sum(irisdata$Sepal.Length > median(irisdata$Sepal.Length))
sum(irisdata$Sepal.Length < median(irisdata$Sepal.Length))
# Which of the entries are larger or smaller than the median? (i.e. output the actual numbers)
irisdata$Sepal.Length[irisdata$Sepal.Length > median(irisdata$Sepal.Length)]
irisdata$Sepal.Length[irisdata$Sepal.Length < median(irisdata$Sepal.Length)]
# Subset all the rows where the entries in that one vector are greater than the median
irisdata[irisdata$Sepal.Length > median(irisdata$Sepal.Length), ]
# Make this a new object
newdata <- irisdata[irisdata$Sepal.Length > median(irisdata$Sepal.Length), ]
# For one of the numerical vectors of the new object, calculate the log value of each number. Add these results to the object as a new column.
newdata$log <- log(newdata$Sepal.Length)
# Write this object out as a text file.
write.csv(newdata, "newdata.csv")
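# By default write.csv() also writes the row names as an extra first column. A sketch of writing without them and reading the file back in (tempfile() is used here only so that the example doesn't clutter your working directory):
example_data <- iris[iris$Sepal.Length > median(iris$Sepal.Length), ]
example_file <- tempfile(fileext = ".csv")
write.csv(example_data, example_file, row.names = FALSE)
# Read the file back in and check it has the same shape and column names
reread <- read.csv(example_file)
dim(reread)
colnames(reread)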
# Plot a histogram, using one variable
hist(newdata$log)
# Plot a basic dot plot, comparing two variables
plot(newdata$Sepal.Length, newdata$Petal.Length)
# Bring up the help file for the plot function
# Have a look at the arguments you can use
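# For example:
?plot
# Many of the graphical settings (such as col, pch and cex) are described in the help files for par and points
?par
# A quick way to list the named arguments of plot.default(), the function that plot() uses for simple x-y plots
names(formals(plot.default))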
# Change the colour of the dots
plot(newdata$Sepal.Length, newdata$Petal.Length, col = "red")
# Change the size of the dots. Change them from circles to squares
plot(newdata$Sepal.Length, newdata$Petal.Length, col = "red", cex = 2, pch = 15)
# Give the plot a title. Change the x and y labels
plot(newdata$Sepal.Length, newdata$Petal.Length, col = "red", cex = 2, pch = 15, main = "My Plot", xlab = "Sepal Length", ylab = "Petal Length")
# Make the above dot plot (colour, size, etc), but use ggplot.
#https://www.publichealth.columbia.edu/sites/default/files/media/fdawg_ggplot2.html
ggplot(iris, aes(Sepal.Length, Petal.Length)) +
geom_point(colour = "red") +
theme_bw()
ggplot(iris, aes(Sepal.Length, Petal.Length, colour = Species, shape = Species)) +
geom_point() +
theme_bw()
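# A fuller sketch of the ggplot version of the base R plot above, using the rows above the median, red square points, a title and axis labels (size = 4 and shape = 15 are arbitrary choices to roughly match cex = 2 and a filled square):
library(ggplot2)
above_median <- iris[iris$Sepal.Length > median(iris$Sepal.Length), ]
p_custom <- ggplot(above_median, aes(Sepal.Length, Petal.Length)) +
  geom_point(colour = "red", size = 4, shape = 15) +
  labs(title = "My Plot", x = "Sepal Length", y = "Petal Length") +
  theme_bw()
p_custom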
##### Putting it in a report #####
# File -> New File -> R Markdown
R markdown
The last hour of the final session is writing an R markdown script and ‘knitting’ it into a HTML report:
---
title: "Intro to R"
output:
  html_document:
    code_folding: hide
    theme: cerulean
    toc: true
    toc_depth: 4
    toc_float:
      collapsed: false
    number_sections: TRUE
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(plotly)
library(DT)
library(ggsci)
# https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html
```
# R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
```{r cars}
summary(cars)
```
## R Markdown cheat sheet
https://rmarkdown.rstudio.com/lesson-15.html
https://github.com/rstudio/cheatsheets/raw/main/rmarkdown-2.0.pdf
Lots of text options.
### Headers begin with a hash (the number of hashes indicates the header level)
**Bold text is bracketed by two asterisks**
> Quotes are preceded by a greater than symbol.
`Example code is bracketed by backticks (under the tilde, top left corner of keyboard)`
And many more. See [R markdown cheatsheet](https://github.com/rstudio/cheatsheets/raw/main/rmarkdown-2.0.pdf)
Lines can be added with 3 dashes
---
<br></br>
## HTML report options
These options are added to the YAML header at the top of the R Markdown script.
https://bookdown.org/yihui/rmarkdown/html-document.html
https://www.rdocumentation.org/packages/rmarkdown/versions/2.8/topics/html_document
## Themes
https://www.datadreaming.org/post/r-markdown-theme-gallery/
---
<br></br>
<br></br>
# Plots and images
## Images
A static image can be inserted using `![](http://image.url/image.png)`:
![](https://1000logos.net/wp-content/uploads/2019/07/Queensland-University-of-Technology-logo.jpg)
## Static ggplot
You can also embed plots, for example:
```{r pressure, echo=FALSE}
p <- ggplot(iris, aes(Sepal.Length, Petal.Length, colour = Species, shape = Species)) +
geom_point() +
theme_bw() +
scale_color_startrek() +
facet_wrap(~Species)
p
```
Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.
## Interactive plots
```{r }
ggplotly(p)
```
---
<br></br>
<br></br>
# Tables
## Basic table
The first 20 rows of the iris dataset.
```{r }
iris[1:20,]
```
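As a sketch of a slightly tidier static table, `kable()` from the knitr package (the package that knits this document) formats a data frame as a table in the report. Here `head(iris, 20)` is just another way of writing `iris[1:20,]`.

```{r }
knitr::kable(head(iris, 20))
```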
## Including a cooler table
https://www.r-bloggers.com/2021/05/datatable-editor-dt-package-in-r/
Interactive table (using the 'DT' package) of the mpg dataset, which comes with ggplot2.
```{r }
# datatable(iris)
datatable(mpg)
```