Intro to R

Usage Rights

These materials were created for educational purposes. They are licensed under Creative Commons and can be used with the corresponding citation.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Learning outcomes

By the end of this activity, you will be able to:

  • Create variables, assign and read values to/from these variables
  • Compute simple calculations among variables/values
  • Convert the data type of the variables
  • Create and invoke functions that compute simple calculations

Variables

The symbol <- assigns a value to a variable. For instance, if we want to assign the value 5 to a variable ‘y’, we write the following line of code:

y <- 5
# Let's make sure it has the corresponding value - please execute the following line
y
Now we can use the variable y. In the following box, let’s multiply y by the value 2, and assign it to the variable x.
y*2
# Since we did not assign it to any variable, 'y' will keep the same value
y
# Now, if we want to save the result in a variable 'x':
x <- y*2
# Check out the value of variable 'x' now
x

Practice Activity

  • Divide ‘x’ by ‘y’ and assign it to a variable a

  • Make sure that a has the correct value in it

Now, let’s create a small data set called a vector. A vector is a sequence of values that we store in a single variable. We can create a vector using the function c():
z <- c(1.1, 9, pi, 5 , 6, 7, 8) 

# What is the meaning of having 'pi' in this vector?

# Check out the value of the variable 'z' 
z
This vector can be used to create larger vectors. For example, if we want to create a vector with all the values in ‘z’, but also the values from ‘x’ and ‘y’, we can use:
c(x,z,y)

Practice Activity

  • Create a vector with all the values from z twice, and x’s value in the middle. This should look something like: 1.1, 9, 3.141593, 5, 6, 7, 8, 10, 1.1, 9, 3.141593, 5, 6, 7, 8

  • What happens if we multiply a vector by a scalar number?

Arithmetic

Other operators that can be used are: +, -, /, and ^. For example, x^2 means x squared.

x^2

#if you want to find the square root of a variable, you will need to use the function sqrt().

# A function is a group of instructions that we assign a name to
sqrt(x)

Help

Any time you have a question about a function from R, you can access its documentation using ‘?’. For instance, if you want to know what the mean function does:

?mean

You can invoke this function using the vector z as a parameter, like this:

mean(z)

We can also make calculations between vectors. For instance:

# The following instruction will multiply one by one the elements from both vectors
z * c(2, 1, 2,3,5,6,4)

# When the length is different for both vectors, R 'recycles' the numbers from the smaller vector
# This recycling process reuses the numbers in the same order. For instance:
c(2, 2, 4, 4) + c(1, 2)

# The outcome is a vector including four elements, just like executing:
# c(2, 2, 4, 4) + c(1, 2, 1, 2)
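
Recycling also applies when the longer length is not a multiple of the shorter one, although R then issues a warning. The short sketch below illustrates this with made-up vectors (longVec and shortVec are illustrative names, not part of the workshop data).

```r
# Illustrative vectors (not part of the workshop data)
longVec  <- c(2, 2, 4, 4, 6)
shortVec <- c(1, 2)

# shortVec is recycled as 1, 2, 1, 2, 1
# R warns because 5 is not a multiple of 2; we silence the warning here
recycled <- suppressWarnings(longVec + shortVec)
recycled   # 3 4 5 6 7
```

A warning of this kind is usually a sign that the two vectors were not meant to have different lengths, so it is worth checking them with length().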

Matrices

Now, if we want to have more than one column in a single variable, we can use a matrix or a data frame. The main difference between these two data structures is that a matrix can only store one data type (e.g., numbers), while a data frame can hold a different data type in each column (e.g., text in one column and numbers in another).

# Let's first create a vector called mi_vector containing all integers from 1 to 20.
# We can do that using the operator ':' 
mi_vector <-1:20
# Check out the content of mi_vector
mi_vector

# To validate the data type, we can use the function 'class'
class(mi_vector)

# If we want to know how many elements are in there, we can use the function 'length'
length(mi_vector)

Practice Activity

How many elements are in z?

#  Now we can turn our vector into a 4x5 matrix
dim(mi_vector)<-c(4,5)

# What we just did was to tell R that mi_vector will have four rows and five columns
mi_vector

Practice Activity

Check out the data type of mi_vector again

# Let's use an appropriate name for our variable
mi_matrix <- mi_vector

# Another approach to create this matrix is:
matrix(1:20, nrow=4, ncol=5)
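
It may help to see how matrix() fills its cells. By default, R fills a matrix column by column; the standard argument byrow = TRUE fills it row by row instead. The names byCol and byRow below are only illustrative.

```r
# Default: values fill the matrix column by column
byCol <- matrix(1:6, nrow = 2, ncol = 3)
byCol

# byrow = TRUE: values fill the matrix row by row
byRow <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
byRow

# Both have the same dimensions: 2 rows and 3 columns
dim(byCol)
```

Setting dim() on a vector, as we did with mi_vector, follows the same column-by-column order.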

Now, let’s assume that the numbers in each row correspond to individual students about whom we know five values: age, height, grade, GPA, and weight.

It would be good to know which student corresponds to each row, so we can create this additional column:

students <- c('Charlie','Hayden','Alex','Ben')
mi_data <- cbind (students, mi_matrix)

# Now all the values in mi_matrix have quotation marks around them
mi_data

Now all the values in mi_matrix have quotation marks around them. This is because a matrix can only have a single data type: in this case, all the values are treated as text. To keep the numbers as numbers, we can build a data frame instead:

mi_data <- data.frame(students, mi_matrix)
mi_data

# Now, to assign names to the columns, we use the function 'colnames'
colnames(mi_data)<- c("Name", "age", "height", "grade", "GPA", "weight")


# If you want to access a single column from the data frame, use the '$' symbol
# Uncomment (remove the #) the line you would like to execute
# mi_data
# mi_data$age
# mi_data$Name
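
One of the learning outcomes is converting the data type of a variable. A minimal sketch with base R conversion functions follows; the values and the names agesAsText and ages are made up for illustration.

```r
# A character vector that holds numbers as text (illustrative values)
agesAsText <- c("21", "19", "23")
class(agesAsText)             # "character"

# Convert the text to numbers so we can compute with it
ages <- as.numeric(agesAsText)
class(ages)                   # "numeric"
mean(ages)                    # 21

# Convert the numbers back to text
as.character(ages)

# Text that is not a number becomes NA (R also issues a warning)
suppressWarnings(as.numeric("abc"))
```

The same functions work on a single column of a data frame, for example a column accessed with the '$' symbol.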

Functions

As you have seen, we can invoke existing functions such as mean or cbind to perform certain tasks. There are many of these functions in R, and any time you want to do something, you should try to find an existing function first. For instance, to get a vector with 20 normally distributed random numbers:

randNumbers <- rnorm(20)
randNumbers

Practice Activity

Find the function that returns the minimum value in randNumbers

Creating my Functions

If we don’t find an existing function for our task, we can create our own. This is similar to creating a variable, in the sense that we assign the function to a name with <-.

The following function multiplies a by b and divides the result by c:

myFunction <- function(a,b,c)
{
  x <- (a*b)
  y <- x/c
  y
}

# We can invoke this function with the parameters a=10, b=5, c=2 like this:
myFunction(10,5,2)
# Likewise
myFunction(b=5, c=2, a=10)
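
Parameters can also be given default values, so that callers may leave them out. The sketch below uses a hypothetical variant called myFunctionDefault, which is not part of the original materials, to show the idea.

```r
# Hypothetical variant of myFunction: c defaults to 1 when it is not supplied
myFunctionDefault <- function(a, b, c = 1)
{
  (a * b) / c
}

myFunctionDefault(10, 5)          # c uses its default: (10*5)/1 = 50
myFunctionDefault(10, 5, c = 2)   # (10*5)/2 = 25
```

The value of the last expression in the body is what the function returns, just like the final y in myFunction.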

The Apply Function in R

The apply functions are often challenging to understand, but they are very useful. They are a family of functions that execute a given function on each element of a given data set.

Explore: ??apply

# For instance, if we create a function to print numbers from a list one by one:
printNumber <- function(number)
{
  textToPrint <- paste('The number is: ',number)
  print(textToPrint)
}

printNumber(3)
printNumber(5)

# We can invoke this function for all the elements from randNumbers 
output <- lapply(randNumbers, printNumber)

Note that we pass the name of our function (printNumber) as a parameter to lapply. What is the output?
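
Other members of the apply family differ mainly in the shape of what they return. As a small sketch with an illustrative function (squareNumber and smallNumbers are made-up names), compare lapply with sapply:

```r
# Illustrative function: squares a single number
squareNumber <- function(number)
{
  number^2
}

smallNumbers <- c(1, 2, 3)

# lapply always returns a list with one element per input value
asList <- lapply(smallNumbers, squareNumber)
class(asList)     # "list"

# sapply simplifies the result to a vector when it can
asVector <- sapply(smallNumbers, squareNumber)
asVector          # 1 4 9
```

A list is the safer choice when the function may return results of different lengths; a vector is more convenient for further arithmetic.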

Now that you are comfortable using R, we are going to explore the methods we brought for this workshop. Go to the Next Topic, or choose from the menu on the left.

Creating Plots in R

GGPlot

There are different ways to create plots in R. For example, you can create a histogram with the function hist. However, many of the plots created in R today are made using a library called ggplot2, which provides the ggplot function. We first need to load it with library(ggplot2).

We now want to load some data, which is stored on a server.

# The data is stored in a variable called litReview

litReview <- read.csv2("https://grupoinformaticaeducativa.uninorte.edu.co/shiny/datos/litReviewData.csv", header=TRUE, sep=";")
litReview 

Activity

Let’s see the first rows of our data set. Modify the code above to see what is in the data set.

The data set contains five columns. The first three (CVG, CET and SoV) are numerical scores, while the other two are categories and descriptive text:

  • Connection to Visualization Background (CVG)

  • Connection to Educational Theories (CET)

  • Sophistication of the Visualization (SoV)

  • Venue - Journal or conference where the paper was published

  • Author - First author of the paper

Plotting

On its own, the function ggplot only creates a blank plot with the axes we specify.

For example, in the following instruction, we set up the CET as the x axis, and the SoV as the y-axis.

Check out what happens when you run it.

ggplot(litReview, aes(x=CET, y=SoV))

Activity

Why did we include that function ‘aes()’ in the code?

?aes() # Describes how variables in the data are mapped to visual properties of geoms.
       # Everything we want to include into the visualization should be included there.

We now need to tell ggplot what kind of plot we want.

ggplot(litReview, aes(x=CET, y=SoV))+
  geom_point()  # Adding this line, we are saying we want a scatter plot

Here are some other ideas:

  • geom_point() ==> Scatter Plot

  • geom_bar() ==> Bar Plot

  • geom_line() ==> Line Plot

  • geom_histogram() ==> Histogram

Each plot would require specific data and columns within the aes function. For instance, to plot a histogram, we only need x variable:

ggplot(litReview, aes(x=CET))+
  geom_histogram()

Here is a useful cheatsheet for different plots:

https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Activity

Change the plot to a line plot
# write your code here.

Let’s now get back to our scatter plot, which seems to be the best fit for our purpose: identifying gaps in this literature.

ggplot(litReview, aes(x=CET, y=SoV))+
      geom_point()  # Adding this line we are saying we want a scatter plot

There are still many points, and they come from different venues. So, let’s use the venue to differentiate them with colors:

ggplot(litReview, aes(x=CET, y=SoV))+
  geom_point(aes(color=Venue), size=3)  # Do you remember the aes() function? Here it is again

Aesthetics

You can change the color palette, edit the legend on the right, and many other things, but we will leave these aesthetics functions for some other time. If you want to explore, here are a few functions you could use:

  • theme(legend.position="none") # remove legend

  • scale_colour_brewer(palette = "Set1") # change color palette

This is a nice simple scatter plot, but the scales in each axis seem odd. Let’s adjust the axes to cover the whole range of our scale (0 to 10), starting with the y-axis:
ggplot(litReview, aes(x=CET, y=SoV))+
  geom_point(aes(color=Venue), size=3)+
  ylim(c(0, 10))

Did you notice that we just added these new lines of code to the plot?

That’s how ggplot works: we can keep adding new lines to modify our plot.

Activity

Set the limits 0 to 10 to the x-axis.

#Write your code here.
If we don’t want to write the whole thing again, we can just store it in a variable
myPlot <- ggplot(litReview, aes(x=CET, y=SoV))+
  geom_point(aes(color=Venue), size=3)+
  ylim(c(0, 10))
myPlot

And then, use that variable to add new things to our plot

myPlot+
  ggtitle("Two Dimensional Comparison", subtitle="Gap between Education and Visualization Researchers") + # Add a title and subtitle
  xlab("Connection to Edu. Theories") # Set the axis names

Practice Activity

Set the name for the y-axis

#Write your code here.

We can also change the breaks in each axis. The following code uses two parameters, breaks and limits:

myPlot+
scale_y_continuous(breaks=seq(0, 10, 1), # Set a break in the y axis every unit
                   limits=c(0, 10)) # Set the limits

Also note that we used scale_y_continuous because y is a continuous variable, but there are other types:

  • scale_x_continuous, scale_y_continuous ==> For numbers

  • scale_x_discrete, scale_y_discrete ==> For categories

  • scale_x_date, scale_y_date ==> For dates
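
The breaks above are generated with the base R function seq(). A short sketch of how it behaves (everyTwo is an illustrative name):

```r
# From 0 to 10 in steps of 1 (the breaks used above)
seq(0, 10, 1)

# From 0 to 10 in steps of 2
everyTwo <- seq(0, 10, by = 2)
everyTwo                      # 0 2 4 6 8 10

# Five evenly spaced values between 0 and 10
seq(0, 10, length.out = 5)    # 0.0 2.5 5.0 7.5 10.0
```

Any numeric vector can be passed as breaks, so c(0, 5, 10) would work as well.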

Activity

Change the x-axis so that it has a break every unit

#Write your code here.

Now, there is an issue with our plot: more than one point may overlap at a single position.

We can use the size as an indication of how many papers are at each point. To do that, we use geom_count instead of geom_point:

myPlot <- ggplot(litReview, aes(x=CET, y=SoV))+
  geom_count(aes(color=Venue))+
  scale_x_continuous(breaks=seq(0, 10, 1), limits=c(0, 10))+
  scale_y_continuous(breaks=seq(0, 10, 1), limits=c(0, 10))+ 
  scale_size_continuous(breaks = seq(0, 5, 1))+
  labs(title="Two Dimensional Comparison", 
       subtitle="Gap between Education and Visualization Researchers",
       y="Sophistication of the Visualization", 
       x="Connection to Educational Theories", 
       size="# of Studies")
myPlot
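
Conceptually, geom_count counts how many observations share each (x, y) position and maps that count to the point size. A rough base R sketch of that counting step, on made-up scores (toyScores and positionCounts are illustrative names, and this is not how ggplot2 implements it internally):

```r
# Made-up scores: two papers share the position (5, 6)
toyScores <- data.frame(
  CET = c(5, 5, 7, 2),
  SoV = c(6, 6, 4, 3)
)

# Add a column of ones and sum it for each (CET, SoV) combination
positionCounts <- aggregate(n ~ CET + SoV,
                            data = cbind(toyScores, n = 1),
                            FUN = sum)
positionCounts
```

The n column plays the role of the size legend ("# of Studies") in the plot above.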

Activity

Create a scatter plot for Connection to Educational Theories (CET) vs. Connection to Visualization Background (CET vs CVG)

#Write your code here.

Student Explanations

We are going to first read some sample data to work on the visualizations and clustering. The rows represent the students, while the columns correspond to the codes from the coding scheme.

Remember that each example the students explained included more than one section. In this dataset, we have two sections.

heatM <- read.csv("https://grupoinformaticaeducativa.uninorte.edu.co/shiny/datos/twoSampleSections.csv",header=TRUE,sep=",") 
heatM

Preparing the set-up variables

It will be handy to have the categories in a vector, so let’s create a vector with all the categories from the coding scheme

categories <- c('SIM','INC','LIM','PHR','COA','VAR','PAR','DAT', 'COD',
                   'HOW','EXE','PRO','GOA','BGK','WHY','RAG','INS','CON','CHK',
                   'MON','OWN')
categories

To get only the data from the first section, we can take a subset of the columns, 1:22. Let’s create a simple heatmap of this section. If you have doubts about how to create plots, go back to the section Creating Plots in R.

# Select the columns corresponding to the first section
firstSection <- heatM[,1:22]

# "Melt" the dataset into a *Vertical* format to plot it.
meltedHM<- melt(firstSection,id.vars=c("STD")) 


ggplot(meltedHM, aes(x = variable, y = STD, fill = factor(value)))+
    geom_tile(color = "grey") +
    labs(y= 'Student', x='Explanation Type')+
    scale_fill_manual(values = c('white','blue'))+
    theme(text = element_text(size=10), axis.text.x = element_text(angle=90, hjust=1)) 
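
The "melt" step above turns one column per code into one row per student-code pair. melt() is not part of base R (it is commonly provided by the reshape2 package, which the workshop environment presumably loads). To show what the transformation does, here is a hand-made base R sketch on toy data; wideToy, longToy and the values are made up.

```r
# Toy "wide" data: one row per student, one column per code (made-up values)
wideToy <- data.frame(
  STD = c("s1", "s2"),
  SIM = c(1, 0),
  INC = c(0, 1)
)

codeColumns <- c("SIM", "INC")

# Build the "long" format by hand: one row per student-code pair
longToy <- data.frame(
  STD      = rep(wideToy$STD, times = length(codeColumns)),
  variable = rep(codeColumns, each = nrow(wideToy)),
  value    = unlist(wideToy[codeColumns], use.names = FALSE)
)
longToy
```

The long format is what ggplot expects here: one column for the x axis (variable), one for the y axis (STD), and one for the fill (value).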

Practice Activity

Create the heatmap plot for section #2. Hint: use heatM[,c(1,23:43)] to get the data for the corresponding section

#Write your code here.

Finding Clusters

Now we will find the clusters (i.e. different approaches to self-explain) and visualize them with colors

We need to:

  • Create a vector with the number of sections, and
  • Aggregate the types of knowledge in all sections

sections<- 1:2
# Aggregate the types of knowledge in all sections
knowTypeDF <- as.data.frame(do.call("cbind",lapply(sections,aggregateKnowSubType, heatM=heatM )))

# Include the students column
rownames(knowTypeDF)<-heatM$STD
# Assign the column names
names(knowTypeDF)<- rep(c('LK', 'CK1','CK2','PK','SK','SK2','TK','TK2'),2)

# Check out the resulting dataset
knowTypeDF

We will now identify and visualize the number of clusters.

How many clusters?

We can try different numbers of clusters and choose one based on the dendrograms, as follows:

# We pass the 16 knowledge-type columns; the students are stored as row names
# The resulting dendrograms help us choose the number of clusters
createDendograms(knowTypeDF[,1:16])

Four clusters seem to distribute the distances evenly, so we call computeClusters with four clusters. The resulting stdClusters will hold the list of students with the number of the corresponding cluster:

# The first parameter is removing the column of student 'name' because that is irrelevant for our purpose
# The second parameter of the 'computeClusters' allows us to try different number of clusters
stdClusters <- computeClusters(knowTypeDF[,1:16], 4)
stdClusters
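
The helper functions createDendograms and computeClusters come from the workshop environment, and their code is not shown here. As a rough sketch of the general technique only (hierarchical clustering with base R, which is an assumption and not necessarily what the helpers do), using made-up data:

```r
# Made-up data: six "students" described by two scores
toyStudents <- data.frame(
  score1 = c(1, 1.2, 0.9, 8, 8.3, 7.9),
  score2 = c(2, 2.1, 1.8, 9, 9.2, 8.7),
  row.names = c("s1", "s2", "s3", "s4", "s5", "s6")
)

# Pairwise distances between students, then hierarchical clustering
toyTree <- hclust(dist(toyStudents))

# Draw the dendrogram
plot(toyTree)

# Cut the tree into two clusters; the result is a named vector of cluster numbers
toyClusters <- cutree(toyTree, k = 2)
toyClusters
```

Changing k in cutree() is the base R counterpart of trying a different number of clusters.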

Let’s now prepare the color spectrum, and assign a value for the color of each cluster

# Each number will correspond to a color in our "heatmap"
colorsLevels <- as.character(c(0,seq( from = 10 , to = 90, by = 10 )))
# Assign an arbitrary number within the range defined in colorsLevels
stdClusters[stdClusters==1,]$stdClusters <- 10
stdClusters[stdClusters==2,]$stdClusters <- 30
stdClusters[stdClusters==3,]$stdClusters <- 50
stdClusters[stdClusters==4,]$stdClusters <- 70

# Students are ordered by the clusters for visualization purposes
# i.e., Update the ordering to respect the colors
students <- stdClusters[order(stdClusters$stdClusters),]$std

# Creating the heat maps for each of the sections
section1 <- createHeatMapData(heatM[,c(1,2:22)], 
                              stdClusters, 
                              students, 
                              categories, 
                              '(a) Section 1 - Creating the Function', 
                              11,
                              colorsLevels, 
                              FALSE)
section1

Practice Activity

  • Are these clusters meaningful?
  • Modify the code above to see the heatmap for section 2

Types of Knowledge

As usual, we first want to load some data, so we need to tell the computer where to get it from. For this example, we need to load the dataset in two different formats to avoid dealing with data transformation. Explore both data sets. How are they different from each other?

typesOfKnowledge <- read.csv("https://grupoinformaticaeducativa.uninorte.edu.co/shiny/datos/typesOfKnowledge.csv",header=TRUE,sep=",")
meltedToK <- read.csv("https://grupoinformaticaeducativa.uninorte.edu.co/shiny/datos/meltedData.csv",header=TRUE,sep=",")

Let’s get to the clusters now. We need to set a seed to always get the same set of clusters for the same data. We then use the kmeans function to identify four clusters in the data set.

set.seed(1234)

stdClusters<- kmeans(typesOfKnowledge[,-1], 4)
# Check out the cluster for each student
stdClusters
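
To see why the seed matters, the following sketch runs kmeans twice on made-up data (toyData, firstRun and secondRun are illustrative names). kmeans starts from randomly chosen centers, so resetting the seed before each call makes the two runs identical.

```r
# Made-up data with two obvious groups
toyData <- data.frame(
  x = c(1, 1.5, 1.2, 9, 9.5, 9.1),
  y = c(2, 1.8, 2.2, 8, 8.4, 7.9)
)

# Same seed, same data, same number of clusters: same result
set.seed(1234)
firstRun <- kmeans(toyData, 2)

set.seed(1234)
secondRun <- kmeans(toyData, 2)

identical(firstRun$cluster, secondRun$cluster)  # TRUE

# Cluster number assigned to each row, and the center of each cluster
firstRun$cluster
firstRun$centers
```

The $cluster element is the part we use in the next step to label each student.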
We need to transform the matrix into a data frame to map the clusters, and then assign the cluster to the data frame we just created:
groups<- as.data.frame(typesOfKnowledge)

groups$stdClusters<- as.factor(stdClusters$cluster)

# The students are as row names and we need to have them as column to merge them later
groups <- rownames_to_column(groups, "std")

# Assign the clusters to each student in the melted format (which we will use to plot)
meltedToK <- merge(meltedToK,groups[,c('X','stdClusters')], by.x = 'Student', by.y = 'X')

# Make sure R recognizes the cluster as a categorical variable (it should not be a number!)
meltedToK$stdClusters <- factor(meltedToK$stdClusters, levels=c("1", "2","3","4"))
meltedToK
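
The merge step above matches rows from two data frames whose key columns have different names. A small sketch on made-up data (toyMelted, toyGroups and toyMerged are illustrative) shows how by.x and by.y work:

```r
# Made-up long-format data: two rows per student
toyMelted <- data.frame(
  Student = c("s1", "s1", "s2", "s2"),
  sum     = c(3, 1, 0, 2)
)

# Made-up cluster assignments, with the student id stored in column X
toyGroups <- data.frame(
  X           = c("s1", "s2"),
  stdClusters = c(2, 1)
)

# Match toyMelted$Student against toyGroups$X
toyMerged <- merge(toyMelted, toyGroups, by.x = "Student", by.y = "X")

# Turn the cluster number into a categorical variable
toyMerged$stdClusters <- factor(toyMerged$stdClusters, levels = c("1", "2"))
toyMerged
```

Every row of the long table receives the cluster of its student, which is why the merged result still has four rows.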

Visualizing the Clusters

Now we can create the plot using ggplot. Note that our x-axis will have an interaction between the Types of Knowledge and the steps of the modeling process (variable):

ggplot(meltedToK, aes(x=interaction(ToK,variable),Student))+
  geom_point(aes(size=sum, shape=stdClusters))+
  theme(text = element_text(size=10),
    axis.text.x = element_text(angle=90, hjust=1)) 


# The plot is ready, but it is better to see it organized, so we can do the following:
# Get the list of students in order based on the clusters.
students <- groups[order(groups$stdClusters),]$X
# And use that to organize the students in the data frame
meltedToK$Student <- factor(meltedToK$Student, levels=students)

#Plot it again:
ggplot(meltedToK, aes(x=interaction(ToK,variable),Student))+
  geom_point(aes(size=sum, shape=stdClusters))+
  theme(text = element_text(size=10),
    axis.text.x = element_text(angle=90, hjust=1)) 
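
The reordering above works because the order of a factor's levels decides the order of the categories on the axis. A minimal sketch with made-up students (toyGroups, orderedStudents and studentFactor are illustrative names):

```r
# Made-up students and cluster numbers
toyGroups <- data.frame(
  X           = c("s1", "s2", "s3"),
  stdClusters = c(2, 1, 2),
  stringsAsFactors = FALSE
)

# Student ids sorted by cluster: s2 (cluster 1) comes first
orderedStudents <- toyGroups[order(toyGroups$stdClusters), ]$X
orderedStudents

# Using that order as the factor levels controls the order on the axis
studentFactor <- factor(c("s1", "s2", "s3"), levels = orderedStudents)
levels(studentFactor)
```

Without explicit levels, factor() would sort the student ids alphabetically instead.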

Practice Activity

  • How different does our data look if we choose three or four clusters?

Hint: modify the call to kmeans, but make sure you start the script from the beginning because our structures have been modified.

  • What else would you need to change?