These materials were created for educational purposes. They are licensed under Creative Commons and can be used with the corresponding citation.
This work is licensed under a Creative Commons Attribution 4.0 International License.
By the end of this activity, you will be able to:
The symbol <- assigns a value to a variable. For instance, if we want to assign the value 5 to a variable ‘y’, we write the following line of code:
y <- 5
# Let's make sure it has the corresponding value - please execute the following line
y
y <- 5
y*2
# Since we did not assign it to any variable, 'y' will keep the same value
y
# Now, if we want to save the result in a variable 'x':
x <- y*2
# Check out the value of variable 'x' now
x
y <- 5
x <- y*2
Divide ‘x’ by ‘y’ and assign the result to a variable ‘a’.
Make sure that ‘a’ holds the correct value.
z <- c(1.1, 9, pi, 5 , 6, 7, 8)
# What is the meaning of having 'pi' in this vector?
# Check out the value of the variable 'z'
z
y <- 5
x <- y*2
z <- c(1.1, 9, pi, 5 , 6, 7, 8)
c(x,z,y)
Create a vector with all the values from z twice, and x’s value in the middle. This should look something like: 1.1, 9, 3.141593, 5, 6, 7, 8, 10, 1.1, 9, 3.141593, 5, 6, 7, 8
What happens if we multiply a vector by a scalar number?
Other operators that can be used are: +, -, / and ^. For example, x^2 means x-squared.
x^2
# If you want to find the square root of a variable, you will need to use the function sqrt().
# A function is a group of instructions that we assign a name to
sqrt(x)
Any time you have a question about an R function, you can access its documentation using ‘?’. For instance, if you want to know what the mean function does:
?mean
You may invoke this function using vector z as a parameter, like this:
mean(z)
We can also make calculations between vectors, for instance:
# The following instruction will multiply one by one the elements from both vectors
z * c(2, 1, 2,3,5,6,4)
# When the two vectors have different lengths, R 'recycles' the numbers from the shorter vector
# This recycling process re-uses the numbers in the same order. For instance:
c(2, 2, 4, 4) + c(1, 2)
# The outcome is a vector including four elements, just like executing:
# c(2, 2, 4, 4) + c(1, 2, 1, 2)
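As a minimal sketch of the recycling rule described above, the following lines treat a scalar as a vector of length one and compare a recycled sum with its written-out equivalent. The vector `v` and its values are illustrative assumptions, not part of the workshop data.

```r
# Illustrative vector (assumed values, not from the workshop data)
v <- c(2, 2, 4, 4)

# A scalar is a vector of length one, so it is recycled for every element
scaled <- v * 3

# A shorter vector is recycled in the same order until it matches the longer one
recycled <- v + c(1, 2)
explicit <- v + c(1, 2, 1, 2)

scaled
recycled
identical(recycled, explicit)
```

If the longer length is not a multiple of the shorter one, R still recycles the values but issues a warning.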
Now, if we want to have more than one column in a single variable, we can use a matrix or a data frame. The main difference between these two data structures is that a matrix can only store one data type (e.g., numbers), while a data frame can hold a different data type in each column (e.g., numbers in one column and text in another).
# Let's first create a vector called mi_vector containing all integers from 1 to 20.
# We can do that using the operator ':'
mi_vector <-1:20
# Check out the content of mi_vector
mi_vector
# To validate the data type, we can use the function 'class'
class(mi_vector)
# If we want to know how many elements are in there, we can use the function 'length'
length(mi_vector)
How many elements are in z?
# Now we can turn our vector into a 4x5 matrix
dim(mi_vector)<-c(4,5)
# What we just did was to tell R that mi_vector will have four rows and five columns
mi_vector
Check out the data type of mi_vector again
# Let's use an appropriate name for our variable
mi_matrix <- mi_vector
# Another approach to create this matrix is:
matrix(1:20, nrow=4, ncol=5)
Now, let’s assume that each row corresponds to an individual student for whom we know five values: age, height, grade, GPA, and weight.
It would be good to know which student corresponds to each row, so we can add a column with the students’ names.
students <- c('Charlie','Hayden','Alex','Ben')
mi_data <- cbind (students, mi_matrix)
# Now all the values in mi_matrix have quotation marks around them
mi_data
Now all the values in mi_matrix have quotation marks around them. This is because a matrix can only have a single data type: in this case all the values are considered to be text. A data frame avoids this problem, because each column keeps its own data type:
mi_data <- data.frame(students, mi_matrix)
mi_data
# Now, to assign names to the columns, we use the function 'colnames'
colnames(mi_data)<- c("Name", "age", "height", "grade", "GPA", "weight")
# If you want to access a single column from the data frame, use the '$' symbol
# Uncomment (remove the #) the line you would like to execute
# mi_data
# mi_data$age
# mi_data$Name
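To make the difference between the two structures concrete, here is a minimal sketch that builds the same small data set both ways and inspects the resulting types. The names `students_demo`, `numbers_demo`, `as_matrix` and `as_frame` are hypothetical and only used for illustration.

```r
# Illustrative data (hypothetical values)
students_demo <- c("Charlie", "Hayden")
numbers_demo <- matrix(1:4, nrow = 2)

as_matrix <- cbind(students_demo, numbers_demo)       # everything becomes text
as_frame  <- data.frame(students_demo, numbers_demo)  # each column keeps its type

class(as_matrix[1, 2])   # type of one originally numeric cell in the matrix
sapply(as_frame, class)  # type of every column in the data frame
```

The matrix cell is reported as text, while the numeric columns of the data frame remain numeric and can be used directly in calculations.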
As you have seen, we can invoke existing functions such as mean or cbind to perform certain tasks. There are many of these functions in R, and any time you want to do something, you should try to find an existing function first. For instance, to get a vector with 20 normally distributed random numbers:
randNumbers <- rnorm(20)
randNumbers
Find the function that returns the minimum value in randNumbers.
If we don’t find an existing function for our task, we can create our own. This is similar to creating a variable, in the sense that we assign the function to a name with ‘<-’.
The following function multiplies a by b and divides the result by c:
myFunction <- function(a,b,c)
{
x <- (a*b)
y <- x/c
y
}
# We can invoke this function with the parameters a=10, b=5 and c=2 like this:
myFunction(10,5,2)
# Likewise
myFunction(b=5, c=2, a=10)
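The two calls above give the same result because R matches named arguments by name and unnamed ones by position. As a small sketch of that behavior (the function is repeated so the example runs on its own, and the variable names `positional`, `named` and `swapped` are illustrative):

```r
# Same definition as in the text, repeated so the sketch is self-contained
myFunction <- function(a, b, c) {
  x <- a * b
  y <- x / c
  y
}

positional <- myFunction(10, 5, 2)              # matched by position: a=10, b=5, c=2
named      <- myFunction(b = 5, c = 2, a = 10)  # matched by name, in any order
swapped    <- myFunction(5, 2, 10)              # position matters: a=5, b=2, c=10

positional
named
swapped
```

Naming the arguments is usually the safer choice when a function takes several parameters of the same type.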
The apply functions are often challenging to understand, but they are very useful. They are a family of functions that execute a given function on each element of a given data set.
Explore: ??apply
# For instance, if we create a function to print numbers from a list one by one:
printNumber <- function(number)
{
textToPrint <- paste('The number is: ',number)
print(textToPrint)
}
printNumber(3)
printNumber(5)
# We can invoke this function for all the elements from randNumbers
output <- lapply(randNumbers, printNumber)
Note that we pass the name of our function (printNumber) as a parameter to lapply. What is the output?
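As a further sketch of how this family of functions behaves, the lines below apply a small function with both lapply and sapply. The function `square` and the input values are illustrative assumptions; the point is only the shape of what each call returns.

```r
# Illustrative function and input (not part of the workshop data)
square <- function(number) {
  number^2
}

as_list   <- lapply(c(1, 2, 3), square)  # always returns a list
as_vector <- sapply(c(1, 2, 3), square)  # simplifies to a vector when it can

class(as_list)
as_vector
```

lapply is the more predictable of the two because its result is always a list with one entry per input element.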
Now that you are comfortable using R, we are going to explore the methods we brought for this workshop.
There are different ways to create plots in R. For example, you can create a histogram with the function hist. However, most of the plots created in R today are made using a library called ggplot2. We first need to load ggplot2.
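The loading step itself is not shown in these materials, so the following is only a sketch of how it could look. It assumes the package is ggplot2 and that it may not be installed yet; the variable `has_ggplot2` is a hypothetical name.

```r
# Assumption: the package is named ggplot2 and may need to be installed once
# with install.packages("ggplot2") before it can be loaded.
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (has_ggplot2) {
  library(ggplot2)  # makes ggplot(), aes(), geom_point(), ... available
} else {
  message("ggplot2 is not installed; run install.packages('ggplot2') first")
}
```

In an environment where the package is already installed, a single `library(ggplot2)` line is enough.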
We now want to load some data, which is stored on a server.
# The data is stored in a variable called litReview
litReview <- read.csv2("https://grupoinformaticaeducativa.uninorte.edu.co/shiny/datos/litReviewData.csv", header=TRUE, sep=";")
litReview
Let’s see the first rows of our data set. Modify the code above to see what is in the data set.
The data set contains five columns. The first three (CVG, CET and SoV) represent a numerical score, while the other two columns are categories and descriptive text:
Connection to Visualization Background (CVG)
Connection to Educational Theories (CET)
Sophistication of the Visualization (SoV)
Venue - Journal or conference where the paper was published
Author - First author of the paper
The function ggplot on its own only creates a blank plot with the axes we specify.
For example, in the following instruction, we set up the CET as the x axis, and the SoV as the y-axis.
Check out what happens when you run it.
ggplot(litReview, aes(x=CET, y=SoV))
Why did we include that function ‘aes()’ in the code?
?aes() # Describes how variables in the data are mapped to visual properties of geoms.
# Everything we want to include into the visualization should be included there.
We now need to tell ggplot what kind of plot we want.
ggplot(litReview, aes(x=CET, y=SoV))+
geom_point() # Adding this line, we are saying we want a scatter plot
geom_point() ==> Scatter Plot
geom_bar() ==> Bar Plot
geom_line() ==> Line Plot
geom_histogram() ==> Histogram
Each type of plot requires specific data and columns within the aes function. For instance, to plot a histogram, we only need the x variable:
ggplot(litReview, aes(x=CET))+
geom_histogram()
Here is a useful cheatsheet for different plots:
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
# write your code here.
Let’s now get back to our scatter plot, which seems to be best for our purpose: identifying gaps in this literature
ggplot(litReview, aes(x=CET, y=SoV))+
geom_point() # Adding this line we are saying we want a scatter plot
There are still many points, and they come from different venues. So, let’s use the venue to differentiate them with colors.
ggplot(litReview, aes(x=CET, y=SoV))+
geom_point(aes(color=Venue), size=3) # Do you remember the aes() function? Here it is again
You can change the color palette, edit the legend on the right, and many other things, but we will leave these aesthetics functions for some other time. If you want to explore, here are a few functions you could use:
theme(legend.position="none") # remove legend
scale_colour_brewer(palette = "Set1") # change color palette
ggplot(litReview, aes(x=CET, y=SoV))+
geom_point(aes(color=Venue), size=3)+
ylim(c(0, 10))
That’s how ggplot works: we can just continue adding new lines to modify our plot.
Set the limits of the x-axis to go from 0 to 10.
#Write your code here.
myPlot <- ggplot(litReview, aes(x=CET, y=SoV))+
geom_point(aes(color=Venue), size=3)+
ylim(c(0, 10))
myPlot
Since we stored the plot in the variable myPlot, we can then use that variable to add new things to our plot:
myPlot+
ggtitle("Two Dimensional Comparison", subtitle="Gap between Education and Visualization Researchers") + # Add a title and subtitle
xlab("Connection to Edu. Theories") # Set the axis names
Set the name for the y-axis
#Write your code here.
We can also change the breaks in each axis. The following call uses two parameters, breaks and limits:
myPlot+
scale_y_continuous(breaks=seq(0, 10, 1), # Set a break in the y axis every unit
limits=c(0, 10)) # Set the limits
Also note that we used scale_y_continuous because y is a continuous variable, but there are other types:
scale_x_continuous, scale_y_continuous ==> For numbers
scale_x_discrete, scale_y_discrete ==> For categories
scale_x_date, scale_y_date ==> For dates
Change the x-axis so that it has a break every 1 unit.
#Write your code here.
Now, there is an issue with our plot: more than one point may overlap at a single position.
We can use the size as an indication of how many papers are at each point. To do that, we use geom_count instead of geom_point.
myPlot <- ggplot(litReview, aes(x=CET, y=SoV))+
geom_count(aes(color=Venue))+
scale_x_continuous(breaks=seq(0, 10, 1), limits=c(0, 10))+
scale_y_continuous(breaks=seq(0, 10, 1), limits=c(0, 10))+
scale_size_continuous(breaks = seq(0, 5, 1))+
labs(title="Two Dimensional Comparison",
subtitle="Gap between Education and Visualization Researchers",
y="Sophistication of the Visualization",
x="Connection to Educational Theories",
size="# of Studies")
myPlot
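geom_count sizes each point by the number of observations that share its position. As a rough sketch of that idea done by hand in base R, the lines below count how many rows fall on each (CET, SoV) pair. The data frame `demo` and its values are hypothetical, not the litReview data.

```r
# Illustrative scores (hypothetical values, not the litReview data)
demo <- data.frame(CET = c(1, 1, 2, 2, 2),
                   SoV = c(3, 3, 5, 5, 5))

# Count how many rows share each (CET, SoV) position
counts <- as.data.frame(table(CET = demo$CET, SoV = demo$SoV))

# Keep only the positions that actually occur
counts <- counts[counts$Freq > 0, ]
counts
```

Each remaining row corresponds to one point in the plot, and the Freq column plays the role of the point size.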
Create a scatter plot for Connection to Educational Theories (CET) vs. Connection to Visualization Background (CVG).
#Write your code here.
We are going to first read some sample data to work on the visualizations and clustering. The rows represent the students, while the columns correspond to the codes from the coding scheme.
Remember that each example the students explained included more than one section. In this dataset, we have two sections.
heatM <- read.csv("https://grupoinformaticaeducativa.uninorte.edu.co/shiny/datos/twoSampleSections.csv",header=TRUE,sep=",")
heatM
It will be handy to have the categories in a vector, so let’s create a vector with all the categories from the coding scheme
categories <- c('SIM','INC','LIM','PHR','COA','VAR','PAR','DAT', 'COD',
'HOW','EXE','PRO','GOA','BGK','WHY','RAG','INS','CON','CHK',
'MON','OWN')
categories
To get only the data from the first section, we can take a subset with columns 1:22. Let’s create a simple heatmap of this section. If you have doubts about how to create plots, go back to the section Creating Plots in R.
# Select the columns corresponding to the first section
firstSection <- heatM[,1:22]
# "Melt" the dataset into a *vertical* (long) format to plot it.
# Note: melt() is assumed here to come from the reshape2 package, loaded beforehand.
meltedHM<- melt(firstSection,id.vars=c("STD"))
ggplot(meltedHM, aes(x = variable, y = STD, fill = factor(value)))+
geom_tile(color = "grey") +
labs(y= 'Student', x='Explanation Type')+
scale_fill_manual(values = c('white','blue'))+
theme(text = element_text(size=10), axis.text.x = element_text(angle=90, hjust=1))
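To illustrate what the "melted" layout used above looks like, here is a minimal sketch that builds it by hand with base R on a tiny data set. The data frame `wide`, its values, and the helper name `code_columns` are hypothetical; the column names `variable` and `value` follow the ones used in the plotting code above.

```r
# Illustrative wide data (hypothetical students and codes)
wide <- data.frame(STD = c("s1", "s2", "s3"),
                   SIM = c(1, 0, 1),
                   INC = c(0, 0, 1),
                   stringsAsFactors = FALSE)

# Build the long ("melted") layout by hand: one row per student and code
code_columns <- setdiff(names(wide), "STD")
long <- data.frame(
  STD      = rep(wide$STD, times = length(code_columns)),
  variable = rep(code_columns, each = nrow(wide)),
  value    = unlist(wide[code_columns], use.names = FALSE),
  stringsAsFactors = FALSE
)
long
```

In this layout every cell of the heatmap is one row, which is the shape ggplot expects for the x, y and fill aesthetics.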
Create the heatmap plot for section #2. Hint: use heatM[,c(1,23:43)] to get the data for the corresponding section
#Write your code here.
Now we will find the clusters (i.e. different approaches to self-explain) and visualize them with colors
We need to create a vector with the section numbers, and then aggregate the types of knowledge across all sections.
sections<- 1:2
# Aggregate the types of knowledge in all sections
knowTypeDF <- as.data.frame(do.call("cbind",lapply(sections,aggregateKnowSubType, heatM=heatM )))
# Include the students column
rownames(knowTypeDF)<-heatM$STD
# Assign the column names
names(knowTypeDF)<- rep(c('LK', 'CK1','CK2','PK','SK','SK2','TK','TK2'),2)
# Check out the resulting dataset
knowTypeDF
We will now identify and visualize the number of clusters.
How many clusters?
We can try different numbers of clusters and choose among them using the dendrograms, as follows:
# The parameter selects the 16 aggregated knowledge-type columns; the student names are kept as row names
createDendograms(knowTypeDF[,1:16])
We may say that four clusters seem to distribute the distances evenly, so we call computeClusters with four clusters. The result, stdClusters, will hold a list of the students with the number of the corresponding cluster.
# The first parameter is removing the column of student 'name' because that is irrelevant for our purpose
# The second parameter of the 'computeClusters' allows us to try different number of clusters
stdClusters <- computeClusters(knowTypeDF[,1:16], 4)
stdClusters
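The helpers createDendograms and computeClusters belong to the workshop and their internals are not shown here. As a sketch of the general idea only, and not necessarily what those helpers do, base R can build a dendrogram with hclust and cut it into a chosen number of groups with cutree. The data frame `demo` and its values are hypothetical.

```r
# Illustrative data: six students, two aggregated scores (hypothetical values)
demo <- data.frame(LK = c(1, 1, 2, 8, 9, 9),
                   CK = c(0, 1, 1, 7, 8, 9),
                   row.names = paste0("s", 1:6))

distances <- dist(demo)    # Euclidean distance between students
tree <- hclust(distances)  # hierarchical clustering (complete linkage by default)

# plot(tree) would draw the dendrogram; cutree() cuts it into k groups
clusters <- cutree(tree, k = 2)
clusters
```

Changing k and looking at where the tree is cut is one way to judge how many clusters are reasonable for the data.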
Let’s now prepare the color spectrum, and assign a value for the color of each cluster
# Each number will correspond to a color in our "heatmap"
colorsLevels <- as.character(c(0,seq( from = 10 , to = 90, by = 10 )))
# Assign an arbitrary number within the range defined in colorsLevels
stdClusters$stdClusters[stdClusters$stdClusters == 1] <- 10
stdClusters$stdClusters[stdClusters$stdClusters == 2] <- 30
stdClusters$stdClusters[stdClusters$stdClusters == 3] <- 50
stdClusters$stdClusters[stdClusters$stdClusters == 4] <- 70
# Students are ordered by the clusters for visualization purposes
# i.e., Update the ordering to respect the colors
students <- stdClusters[order(stdClusters$stdClusters),]$std
# Creating the heat maps for each of the sections
section1 <- createHeatMapData(heatM[,c(1,2:22)],
stdClusters,
students,
categories,
'(a) Section 1 - Creating the Function',
11,
colorsLevels,
FALSE)
section1
As usual, we first want to load some data, so we need to tell the computer where to get it from. For this example, we need to load the dataset in two different formats to avoid dealing with data transformation. Explore both data sets. How are they different from each other?
typesOfKnowledge <- read.csv("https://grupoinformaticaeducativa.uninorte.edu.co/shiny/datos/typesOfKnowledge.csv",header=TRUE,sep=",")
meltedToK <- read.csv("https://grupoinformaticaeducativa.uninorte.edu.co/shiny/datos/meltedData.csv",header=TRUE,sep=",")
Let’s get to the clusters now. We need to set a seed to always get the same set of clusters for the same data. We then use the kmeans function to identify four clusters in the data set.
set.seed(1234)
stdClusters<- kmeans(typesOfKnowledge[,-1], 4)
# Check out the cluster for each student
stdClusters
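The reason for setting a seed is that kmeans starts from randomly chosen centers. As a minimal sketch of that point, the lines below run kmeans twice with the same seed on a small data set and compare the assignments. The data frame `demo` and its values are hypothetical, not the workshop data.

```r
# Illustrative data (hypothetical values): two well-separated groups
demo <- data.frame(a = c(1, 1.2, 0.8, 9, 9.3, 8.7),
                   b = c(2, 2.1, 1.9, 7, 7.2, 6.8))

set.seed(1234)
first_run <- kmeans(demo, 2)

set.seed(1234)  # resetting the seed repeats the same random start
second_run <- kmeans(demo, 2)

identical(first_run$cluster, second_run$cluster)
```

Without the seed, the cluster numbers (and sometimes the clusters themselves) can change from one run to the next.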
set.seed(1234)
stdClusters<- kmeans(typesOfKnowledge[,-1], 4)
groups<- as.data.frame(typesOfKnowledge)
groups$stdClusters<- as.factor(stdClusters$cluster)
# The students are as row names and we need to have them as column to merge them later
groups <- rownames_to_column(groups, "std")
# Assign the clusters to each student in the melted format (which we will use to plot)
meltedToK <- merge(meltedToK,groups[,c('X','stdClusters')], by.x = 'Student', by.y = 'X')
# Make sure R recognizes the cluster as a categorical variable (it should not be a number!)
meltedToK$stdClusters <- factor(meltedToK$stdClusters, levels=c("1", "2","3","4"))
meltedToK
Now we can create the plot using ggplot. Note that our x-axis will show an interaction between the Types of Knowledge and the Steps of the modeling process (variable).
ggplot(meltedToK, aes(x=interaction(ToK,variable),Student))+
geom_point(aes(size=sum, shape=stdClusters))+
theme(text = element_text(size=10),
axis.text.x = element_text(angle=90, hjust=1))
# The plot is ready, but it is better to see it organized, so we can do the following:
# Get the list of students in order based on the clusters.
students <- groups[order(groups$stdClusters),]$X
# And use that to organize the students in the data frame
meltedToK$Student <- factor(meltedToK$Student, levels=students)
#Plot it again:
ggplot(meltedToK, aes(x=interaction(ToK,variable),Student))+
geom_point(aes(size=sum, shape=stdClusters))+
theme(text = element_text(size=10),
axis.text.x = element_text(angle=90, hjust=1))
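The reordering above works because the levels of a factor define the order in which categories are drawn on an axis. Here is a small sketch of that mechanism on its own; the data frame `groups_demo` and its values are hypothetical, and only the column names X and stdClusters follow the code above.

```r
# Illustrative data (hypothetical students and clusters)
groups_demo <- data.frame(X = c("Ana", "Ben", "Cal", "Dee"),
                          stdClusters = c(2, 1, 2, 1),
                          stringsAsFactors = FALSE)

# Students sorted by cluster number
ordered_students <- groups_demo[order(groups_demo$stdClusters), ]$X

# The levels of a factor define the order used on a plot axis
student_factor <- factor(groups_demo$X, levels = ordered_students)
levels(student_factor)
```

Students in the same cluster end up next to each other in the levels, so they also appear next to each other in the plot.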
Hint: modify the call to kmeans, but make sure you start the script from the beginning because our structures have been modified.