R tutorial

From Sinfronteras

Data types


Checking the type of a value

  • We can check the type of a variable with the class function:
x <- 28
class(x)

y <- "R is Fantastic"
class(y)

z <- TRUE
class(z)

To assign a value to a variable, use <- or =



Numeric

  • 4 is an integer value. In R this data type is called numeric (R stores a bare 4 as a double; write 4L to get a true integer).
  • 4.5 is a decimal value. It is also called numeric in R.
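
As a quick sketch of the point above, the following lines check how R classifies whole and decimal numbers. The L suffix and is.integer() are standard base R but do not appear in the text above.

```r
# A bare 4 is stored as a double, so its class is "numeric"
class(4)        # "numeric"
class(4.5)      # "numeric"

# Appending L requests a true integer
class(4L)       # "integer"

# Integers still count as numeric, but a bare 4 is not an integer
is.numeric(4L)  # TRUE
is.integer(4)   # FALSE
```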



Character - String

  • A value inside " " or ' ' is text (a string). In R this data type is called character.



Logical - Boolean

  • TRUE or FALSE is a Boolean value, which is called logical in R.



Vector - One-dimensional array

A vector is a one-dimensional array. We can create a vector from any of the basic data types we learnt before. The simplest way to build a vector in R is to use the c() function.

vec_num <- c(1, 10, 49)

vec_chr <- c("a", "b", "c")

vec_bool <-  c(TRUE, FALSE, TRUE)


We can do arithmetic calculations on vectors:

vect_1 <- c(1, 3, 5)
vect_2 <- c(2, 4, 6)

sum_vect <- vect_1 + vect_2
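
A small sketch of what the arithmetic above produces. The multiplication and the scalar example are additions for illustration; they rely on R applying operators element by element.

```r
vect_1 <- c(1, 3, 5)
vect_2 <- c(2, 4, 6)

# Addition is done element by element
sum_vect <- vect_1 + vect_2
sum_vect            # 3 7 11

# Multiplication works the same way
vect_1 * vect_2     # 2 12 30

# A single number is recycled across the whole vector
vect_1 * 10         # 10 30 50
```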


We can use [1:5] to extract the elements at positions 1 to 5:

slice_vector <- c(1,2,3,4,5,6,7,8,9,10)
slice_vector[1:5]


We can write c(1:10), or simply 1:10, to create a vector of the values from one to ten:

c(1:10)
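
A few more ways to index and build vectors, offered as a sketch. Negative indices and seq() are base R features that the text above does not cover.

```r
slice_vector <- 1:10            # same values as c(1:10)

slice_vector[c(1, 3, 5)]        # elements at positions 1, 3 and 5
slice_vector[-(1:5)]            # a negative index drops positions 1 to 5: 6 7 8 9 10

# seq() builds a sequence with a step other than 1
even_numbers <- seq(2, 10, by = 2)
even_numbers                    # 2 4 6 8 10
```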



Matrix - Two-dimensional array

Note: A matrix has two dimensions. It is also possible to create arrays with more than two dimensions in R.

# Construct a matrix with 5 rows that contain the numbers 1 up to 10 and byrow =  TRUE:
matrix_a <-matrix(1:10, byrow = TRUE, nrow = 5)

# Print dimension of the matrix with dim()
dim(matrix_a)


# Construct a matrix with 5 rows that contain the numbers 1 up to 10 and byrow =  FALSE
matrix_b <-matrix(1:10, byrow = FALSE, nrow = 5)

Note: The command matrix_b <- matrix(1:10, byrow = FALSE, ncol = 2) has the same effect as above.
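
The sketch below compares the two fill orders and checks the equivalence mentioned in the note. The variable names are illustrative, and the array() call is an assumption about how one might build the higher-dimensional case; it is not taken from the text.

```r
matrix_by_row <- matrix(1:10, byrow = TRUE,  nrow = 5)
matrix_by_col <- matrix(1:10, byrow = FALSE, nrow = 5)

matrix_by_row[1, ]   # first row when filling by row: 1 2
matrix_by_col[1, ]   # first row when filling by column: 1 6

# nrow = 5 and ncol = 2 describe the same shape for ten values
same_shape <- identical(matrix_by_col, matrix(1:10, byrow = FALSE, ncol = 2))
same_shape           # TRUE

# array() builds more than two dimensions; here 2 x 3 x 2
arr <- array(1:12, dim = c(2, 3, 2))
dim(arr)             # 2 3 2
```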


You can also create a 4x3 matrix using ncol. R will create 3 columns and fill each column from top to bottom. Check an example:

# Construct a matrix with 3 columns that contains the numbers 1 up to 12 and byrow = FALSE
matrix_c <- matrix(1:12, byrow = FALSE, ncol = 3)
matrix_c

Output:
##       [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12

You can add a column to a matrix with the cbind() command. cbind() means column binding. cbind() can concatenate as many matrices or columns as specified. For example, matrix_a above is a 5x2 matrix. We concatenate a third column and verify that the dimension is 5x3:

# concatenate c(1:5) to the matrix_a
matrix_a1 <- cbind(matrix_a, c(1:5))

# Check the dimension
dim(matrix_a1)

# Output:
# [1] 5 3

matrix_a1
# Output
#       [,1] [,2] [,3]
# [1,]    1    2    1
# [2,]    3    4    2
# [3,]    5    6    3
# [4,]    7    8    4
# [5,]    9   10    5


We can also add more than one column. Let's put the next sequence of numbers (13 to 24) into the matrix_a2 matrix and bind it to a 4x3 matrix holding 1 to 12. The dimension of the new matrix will be 4x6, with the numbers from 1 to 24.

matrix_a2 <-matrix(13:24, byrow = FALSE, ncol = 3)

# Output:
#      [,1] [,2] [,3]
# [1,]   13   17   21
# [2,]   14   18   22
# [3,]   15   19   23
# [4,]   16   20   24

matrix_c <-matrix(1:12, byrow = FALSE, ncol = 3)
matrix_d <- cbind(matrix_a2, matrix_c)
dim(matrix_d)

# Output:
# [1] 4 6

NOTE: The matrices must have the same number of rows for cbind() to work.
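
To illustrate the note, here is a sketch that binds two matrices with different row counts and captures the resulting error with tryCatch(). The matrix names are illustrative only.

```r
m_4rows <- matrix(1:12, ncol = 3)   # 4 x 3
m_5rows <- matrix(1:10, ncol = 2)   # 5 x 2

# tryCatch() captures the error message instead of stopping the script
msg <- tryCatch(
  {
    cbind(m_4rows, m_5rows)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg

# Matrices with matching row counts bind without complaint
dim(cbind(m_4rows, m_4rows))        # 4 6
```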

cbind() concatenates columns, rbind() appends rows.

Let's add one row to our matrix_c matrix and verify the dimension is 5x3

matrix_c <- matrix(1:12, byrow = FALSE, ncol = 3)

# Create a vector with 3 values, one per column
add_row <- c(1:3)

# Append to the matrix
matrix_c <- rbind(matrix_c, add_row)

# Check the dimension
dim(matrix_c)

# Output:
# [1] 5 3



Factor

Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables.

In a dataset, we can distinguish two types of variables: categorical and continuous:

  • In a categorical variable, the value is limited and usually based on a particular finite group. For example, a categorical variable can be country, year, gender or occupation.
  • A continuous variable, however, can take any value, from integer to decimal. For example, we can have the revenue, the price of a share, etc.



Categorical variables

R stores categorical variables in a factor. Let's check the code below, which converts a character variable into a factor variable. Many machine learning algorithms do not accept character input, so strings are typically converted to factors, which R stores internally as integer codes. Let's create a factor.

# Create gender vector
gender_vector <- c("Male", "Female", "Female", "Male", "Male")
class(gender_vector)

# Convert gender_vector to a factor
factor_gender_vector <-factor(gender_vector)
class(factor_gender_vector)

# Output:
# [1] "character"
# [1] "factor"

It is important to transform strings into factors when we perform a machine learning task.
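
The sketch below looks inside the factor created above to show the integer codes that stand behind the labels. levels(), as.integer() and table() are base R functions not used in the text above.

```r
gender_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_gender_vector <- factor(gender_vector)

# Levels are the distinct labels, sorted alphabetically
levels(factor_gender_vector)        # "Female" "Male"

# Each value is stored as the position of its label in the levels
as.integer(factor_gender_vector)    # 2 1 1 2 2

# Count how often each level occurs
table(factor_gender_vector)
```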

A categorical variable can be divided into nominal categorical variable and ordinal categorical variable:



Nominal categorical variable

A nominal categorical variable has several values but the order does not matter. For instance, a male or female categorical variable has no ordering.

# Create a color vector
color_vector <- c('blue', 'red', 'green', 'white', 'black', 'yellow')
# Convert the vector to factor
factor_color <- factor(color_vector)
factor_color

# Output:
# [1] blue   red    green  white  black  yellow
# Levels: black blue green red white yellow

From the factor_color, we can't tell any order.



Ordinal categorical variable

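Ordinal categorical variables do have a natural ordering. We specify the order, from the lowest to the highest, through the levels argument, and mark the factor as ordered with ordered = TRUE (which may be abbreviated to order = TRUE thanks to R's partial argument matching). With ordered = FALSE the factor is treated as unordered.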

We can use summary to count the values for each factor.

# Create Ordinal categorical vector 
day_vector <- c('evening', 'morning', 'afternoon', 'midday', 'midnight', 'evening')
# Convert `day_vector` to a factor with ordered level
factor_day <- factor(day_vector, order = TRUE, levels =c('morning', 'midday', 'afternoon', 'evening', 'midnight'))
# Print the new variable
factor_day

Output:
## [1] evening   morning   afternoon midday    midnight  evening  
## Levels: morning < midday < afternoon < evening < midnight

# Count the number of occurrences of each level
summary(factor_day)

Output:
##   morning    midday afternoon   evening  midnight
##         1         1         1         2         1

R ordered the levels from 'morning' to 'midnight' as specified in the levels argument.
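
One practical consequence of an ordered factor is that its values can be compared. The following sketch rebuilds the factor so that it runs on its own; the comparisons are additions for illustration.

```r
day_vector <- c('evening', 'morning', 'afternoon', 'midday', 'midnight', 'evening')
factor_day <- factor(day_vector, ordered = TRUE,
                     levels = c('morning', 'midday', 'afternoon', 'evening', 'midnight'))

is.ordered(factor_day)               # TRUE

# Comparisons follow the order of the levels
factor_day[1] > factor_day[2]        # TRUE: 'evening' comes after 'morning'
max(factor_day)                      # midnight

# Keep only the values at or after 'afternoon'
late_values <- factor_day[factor_day >= 'afternoon']
late_values
```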



Continuous variables

Continuous variables are the default in R. They are stored as numeric or integer. We can see it in the dataset below. mtcars is a built-in dataset that gathers information on different types of car. We can load it by using mtcars and check the class of the variable mpg (miles per gallon). It returns numeric, indicating a continuous variable.

dataset <- mtcars
class(dataset$mpg)

# Output:
# [1] "numeric"
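
As a sketch, the lines below contrast the class of the whole dataset with the class of a single column, then check every column at once with sapply(), a base R function not used in the text above.

```r
dataset <- mtcars

class(dataset)             # "data.frame": the class of the whole table
class(dataset$mpg)         # "numeric": the class of one column
is.numeric(dataset$mpg)    # TRUE

# Class of every column; all columns of mtcars are numeric
column_classes <- sapply(dataset, class)
column_classes
```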



Data Frame

A data frame is a list of vectors which are of equal length. A matrix contains only one type of data, while a data frame accepts different data types (numeric, character, factor, etc.).

When we print a data frame in R, the result is shown as a table. If we create it out of vectors, each vector becomes a column, and the column name is usually the name of the corresponding vector.



Create a data frame

We can create a data frame with the data.frame() function:

data.frame(data, stringsAsFactors = TRUE)
  • data can be a matrix to convert to a data frame or a collection of variables (vector for example) to join.
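  • stringsAsFactors: whether to convert strings to factors
Before R 4.0.0, data.frame() converted string variables to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE. The outputs below assume the strings are converted to factors.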
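
To see the effect of stringsAsFactors directly, here is a minimal sketch that builds the same one-column data frame both ways. The names df_factor and df_string are illustrative.

```r
items <- c('book', 'pen', 'textbook', 'pencil_case')

df_factor <- data.frame(items, stringsAsFactors = TRUE)
df_string <- data.frame(items, stringsAsFactors = FALSE)

class(df_factor$items)   # "factor"
class(df_string$items)   # "character"
```

Setting the argument explicitly makes the result independent of the R version in use.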

We can create our first data frame by combining four vectors of the same length:

a <- c(10, 20, 30, 40)
b <- c('book', 'pen', 'textbook', 'pencil_case')
c <- c(TRUE, FALSE, TRUE, FALSE)
d <- c(2.5, 8, 10, 7)

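# Join the vectors to create a data frame
# (stringsAsFactors = TRUE is set explicitly because R 4.0.0 and later default to FALSE)
df <- data.frame(a, b, c, d, stringsAsFactors = TRUE)
df

## Output:
##    a           b     c    d
## 1 10        book  TRUE  2.5
## 2 20         pen FALSE  8.0
## 3 30    textbook  TRUE 10.0
## 4 40 pencil_case FALSE  7.0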


We can see the column headers have the same name as the variables. We can change the column name with the names() function. Check the example below:

# Name the data frame
names(df) <- c('ID', 'items', 'store', 'price')
df

## Output:
##   ID       items store price
## 1 10        book  TRUE   2.5
## 2 20         pen FALSE   8.0
## 3 30    textbook  TRUE  10.0
## 4 40 pencil_case FALSE   7.0


# Print the structure
str(df)

## Output:
## 'data.frame':    4 obs. of  4 variables:
##  $ ID   : num  10 20 30 40
##  $ items: Factor w/ 4 levels "book","pen","pencil_case",..: 1 2 4 3
##  $ store: logi  TRUE FALSE TRUE FALSE
##  $ price: num  2.5 8 10 7



Tibble

A tibble is a modern class of data frame within R, available in the dplyr and tibble packages, that has a convenient print method, will not convert strings to factors, and does not use row names.

tibble() function:

We can also create a data frame using the tibble() function from library(dplyr)

a <- c(10, 20, 30, 40)
b <- c('book', 'pen', 'textbook', 'pencil_case')
c <- c(TRUE, FALSE, TRUE, FALSE)
d <- c(2.5, 8, 10, 7)

library(dplyr)
tible <- tibble(a,b,c,d)
tible

## Output:
## A tibble: 4 x 4
#      a    b           c     d
#  <dbl> <chr>       <lgl> <dbl>
#1    10 book        TRUE    2.5
#2    20 pen         FALSE   8  
#3    30 textbook    TRUE   10  
#4    40 pencil_case FALSE   7



Slice Data Frame

It is possible to slice the values of a data frame. We select the rows and columns to return inside brackets preceded by the name of the data frame.

A data frame is composed of rows and columns, df[A, B]. A represents the rows and B the columns. We can slice either by specifying the rows and/or columns.

In the following picture, the left part represents the rows, and the right part is the columns. Note that the symbol : means to. For instance, 1:3 intends to select values from 1 to 3.


[Figure: selecting rows and columns in R]


The diagram below shows how to access different selections of the data frame:

  • The yellow arrow selects the row 1 in column 2
  • The green arrow selects the rows 1 to 2
  • The red arrow selects the column 1
  • The blue arrow selects the rows 1 to 3 and columns 3 to 4

Note that if we leave the left part blank, R will select all the rows. By analogy, if we leave the right part blank, R will select all the columns.

[Figure: selecting rows and columns in R, part 2]


We can run the code in the console:

## Select row 1 in column 2
df[1,2]

Output:
## [1] book
## Levels: book pen pencil_case textbook

## Select Rows 1 to 2
df[1:2,]

Output:
##   ID items store price
## 1 10  book  TRUE   2.5
## 2 20   pen FALSE   8.0

## Select Columns 1
df[,1]

Output:
## [1] 10 20 30 40

## Select Rows 1 to 3 and columns 3 to 4
df[1:3, 3:4]

Output:
##   store price
## 1  TRUE   2.5
## 2 FALSE   8.0
## 3  TRUE  10.0

It is also possible to select the columns with their names. For instance, the code below extracts two columns: ID and store.

# Slice with columns name
df[, c('ID', 'store')]

Output:
##   ID store
## 1 10  TRUE
## 2 20 FALSE
## 3 30  TRUE
## 4 40 FALSE
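
Besides positions and names, rows can be selected with a logical condition. The sketch below rebuilds the data frame so that it runs on its own; the drop = FALSE argument and the negative index are base R features not covered above.

```r
df <- data.frame(
  ID    = c(10, 20, 30, 40),
  items = c('book', 'pen', 'textbook', 'pencil_case'),
  store = c(TRUE, FALSE, TRUE, FALSE),
  price = c(2.5, 8, 10, 7)
)

df[df$store, ]                          # rows where store is TRUE
df[df$price > 5, c('items', 'price')]   # condition on rows, names on columns
df[, 'ID', drop = FALSE]                # drop = FALSE keeps a one-column data frame
df[-1, ]                                # a negative index drops row 1
```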



Append a Column to Data Frame

You can also append a column to a Data Frame. You need to use the symbol $ to append a new variable.

# Create a new vector
quantity <- c(10, 35, 40, 5)

# Add `quantity` to the `df` data frame
df$quantity <- quantity
df

Output:
##   ID       items store price quantity
## 1 10        book  TRUE   2.5       10
## 2 20         pen FALSE   8.0       35
## 3 30    textbook  TRUE  10.0       40
## 4 40 pencil_case FALSE   7.0        5


Note: The number of elements in the vector has to be equal to the number of rows in the data frame. Executing the following statement:

quantity <- c(10, 35, 40)
# Add `quantity` to the `df` data frame
df$quantity <- quantity

Gives error:
Error in `$<-.data.frame`(`*tmp*`, quantity, value = c(10, 35, 40)) 
 replacement has 3 rows, data has 4
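
One way to avoid this error is to compare lengths before assigning. The sketch below is an assumption about how one might handle a short vector: it pads the missing entry with NA, which may or may not suit a real dataset. The name quantity_padded is hypothetical.

```r
df <- data.frame(ID = c(10, 20, 30, 40), price = c(2.5, 8, 10, 7))
quantity <- c(10, 35, 40)                 # one value short

# Compare lengths before assigning
length(quantity) == nrow(df)              # FALSE

# Hypothetical fix: pad the missing entries with NA so the lengths match
quantity_padded <- c(quantity, rep(NA, nrow(df) - length(quantity)))
df$quantity <- quantity_padded
df
```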



Select a Column of a Data frame

Sometimes, we need to store a column of a data frame for future use or perform operation on a column. We can use the $ sign to select the column from a data frame:

# Select the column ID
df$ID

Output:
## [1] 10 20 30 40




Subset a Data frame

In the previous section, we selected an entire column without condition. It is possible to subset based on whether or not a certain condition was true.

We use the subset() function:

subset(x, condition)

Arguments:

  • x: data frame used to perform the subset
  • condition: define the conditional statement

To return only the items with a price above 5, we can do:

# Select price above 5
subset(df, subset = price > 5)

Output:
##   ID       items store price
## 2 20         pen FALSE     8
## 3 30    textbook  TRUE    10
## 4 40 pencil_case FALSE     7
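
subset() also accepts a select argument to keep only some columns, and conditions can be combined with &. The following sketch rebuilds the data frame so that it runs on its own and compares the result with plain bracket indexing; the variable names are illustrative.

```r
df <- data.frame(
  ID    = c(10, 20, 30, 40),
  items = c('book', 'pen', 'textbook', 'pencil_case'),
  store = c(TRUE, FALSE, TRUE, FALSE),
  price = c(2.5, 8, 10, 7)
)

# Rows with price above 2 that are in store; keep two columns only
in_store_items <- subset(df, subset = price > 2 & store, select = c(items, price))
in_store_items

# The same rows and columns with bracket indexing
same_rows <- df[df$price > 2 & df$store, c('items', 'price')]
isTRUE(all.equal(in_store_items, same_rows))
```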



Built-in Data frame

Before creating our own data frame, we can have a look at an R data set available online. The prison dataset has a 714x5 dimension. We can get a quick look at the bottom of the data frame with the tail() function. By analogy, head() displays the top of the data frame. You can specify the number of rows shown, for example head(df, 5). We will learn more about the function read.csv() in a future tutorial.

# Print the head of the data
PATH<-'https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/wooldridge/prison.csv'
df <- read.csv(PATH)[1:5]
head(df, 5)

## Output:
##   X state year govelec black
## 1 1     1   80       0 0.2560
## 2 2     1   81       0 0.2557
## 3 3     1   82       1 0.2554
## 4 4     1   83       0 0.2551
## 5 5     1   84       0 0.2548
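
The same inspection functions work without a network connection on a dataset that ships with R. This sketch uses mtcars as a stand-in for the downloaded file.

```r
head(mtcars, 3)    # first three rows
tail(mtcars, 3)    # last three rows
dim(mtcars)        # 32 11: rows, then columns
str(mtcars[1:3])   # structure of the first three columns
```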



List

A list is a great tool to store many kinds of objects in the order expected. We can include matrices, vectors, data frames or lists. We can imagine a list as a bag in which we want to put many different items. When we need to use an item, we open the bag and use it. A list is similar; we can store a collection of objects and use them when we need them.

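  • Step 1: Create a Vector:
# Vector with numeric from 1 up to 5
vect  <- 1:5

  • Step 2: Create a Matrix:
# A 2x5 matrix. 1:9 supplies only nine values, so R recycles the first value
# into the last cell and issues a warning
mat  <- matrix(1:9, ncol = 5)
dim(mat)

Output:
## [1] 2 5

  • Step 3: Create a Data Frame:
# Select the first 10 rows of the built-in R data set EuStockMarkets
# (EuStockMarkets is a multivariate time series, so the result is a matrix;
# wrap it in as.data.frame() to get a true data frame)
df <- EuStockMarkets[1:10,]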
  • Step 4: Create a List:

Now, we can put the three objects into a list.

# Construct a list with vect, mat, and df:
my_list <- list(vect, mat, df)
my_list

Output:
## [[1]]
## [1] 1 2 3 4 5

## [[2]]
##       [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8    1

## [[3]]
##          DAX    SMI    CAC   FTSE
##  [1,] 1628.75 1678.1 1772.8 2443.6
##  [2,] 1613.63 1688.5 1750.5 2460.2
##  [3,] 1606.51 1678.6 1718.0 2448.2
##  [4,] 1621.04 1684.1 1708.1 2470.4
##  [5,] 1618.16 1686.6 1723.1 2484.7
##  [6,] 1610.61 1671.6 1714.3 2466.8
##  [7,] 1630.75 1682.9 1734.5 2487.9
##  [8,] 1640.17 1703.6 1757.4 2508.4
##  [9,] 1635.47 1697.5 1754.0 2510.5
## [10,] 1645.89 1716.3 1754.3 2497.4



Select elements from list

After we build our list, we can access its elements quite easily. We use the index to select an element in a list. The value inside the double square brackets represents the position of the item we want to extract. For instance, if we pass 2 inside the brackets, R returns the second element. To select the second item of the list named my_list, we use my_list[[2]]:

# Print second element of the list
my_list[[2]]

## Output:
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8    1
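
The difference between single and double brackets is easy to miss, so here is a short sketch. The element names vect and mat are hypothetical; the list built above is unnamed.

```r
my_list <- list(vect = 1:5, mat = matrix(1:10, ncol = 5))

my_list[[1]]          # the vector itself: 1 2 3 4 5
my_list[1]            # a list of length one that contains the vector
class(my_list[[1]])   # "integer"
class(my_list[1])     # "list"

# Named elements can also be reached with $ or by name
my_list$mat[2, 3]     # row 2, column 3 of the matrix: 6
my_list[['vect']][2]  # second value of the vector: 2
```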



Arithmetic Operators

  • +, -, *, /
  • Exponentiation: ^ or **
  • Modulo: %%
(5+5)/2

Modulo:

28%%6
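
A few more worked examples of the operators listed above. The integer division operator %/% is a base R operator that the list does not mention.

```r
(5 + 5) / 2    # 5
2^3            # 8
2**3           # 8, ** is treated as ^
28 %% 6        # 4, the remainder of 28 divided by 6
28 %/% 6       # 4, integer division
(-7) %% 3      # 2, the result takes the sign of the divisor
```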



Logical Operators

  • < , > , >= , <=
  • == : Exactly equal to
  • != : Not equal to
  • !x : Not x
  • x | y : x OR y
  • x & y : x AND y
  • isTRUE(x) : Test if x is TRUE

Logical conditions in R can be placed inside [] to filter a vector. We can combine as many conditions as we like, but each one should be wrapped in parentheses. We can follow this structure to write a conditional selection:

 variable_name[(conditional_statement)]

# Create a vector from 1 to 10
logical_vector <- c(1:10)
logical_vector>5

Output:
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

In the example below, we want to extract only the values that meet the condition 'is strictly greater than five':

# Print value strictly above 5
logical_vector[(logical_vector>5)]

Output:
## [1]  6  7  8  9 10

# Print 5 and 6
logical_vector <- c(1:10)
logical_vector[(logical_vector>4) & (logical_vector<7)]
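
The remaining operators from the list can be tried the same way. This sketch adds which(), a base R function not mentioned above, to return positions instead of values.

```r
logical_vector <- 1:10

# OR: values below 3 or above 8
logical_vector[(logical_vector < 3) | (logical_vector > 8)]   # 1 2 9 10

# NOT: everything except the values above 5
logical_vector[!(logical_vector > 5)]                         # 1 2 3 4 5

# which() returns the positions where the condition holds
which(logical_vector > 5)                                     # 6 7 8 9 10

# isTRUE() expects a single value, so reduce the vector first
isTRUE(all(logical_vector > 0))                               # TRUE
```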



Functions

A function:

  • is written to carry out a specified task
  • may or may not include arguments
  • contains a body
  • may or may not return one or more values.

We will see three groups of functions in action:

  • General function
  • Maths function
  • Statistical function
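
Before looking at built-in functions, it may help to see the parts of a function named above (arguments, body, return value) in a user-defined one. rescale01 is a hypothetical helper written for this sketch; it is not part of R.

```r
# Hypothetical helper: rescale a numeric vector to the range 0 to 1
rescale01 <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)        # smallest and largest value
  (x - rng[1]) / (rng[2] - rng[1])      # the last expression is the return value
}

rescale01(c(10, 20, 30))                # 0.0 0.5 1.0
```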



General functions

We are already familiar with general functions like cbind(), rbind(), range(), sort() and order(). Each of these functions has a specific task and takes arguments to return an output. The following are important functions one must know.



diff function

If you work on time series, you often need to make the series stationary by differencing it, that is, by subtracting lagged values. A stationary process has a constant mean, variance and autocorrelation over time, which generally improves the prediction of a time series. Differencing can be done easily with the function diff(). We can build random time-series data with a trend and then use diff() to make the series stationary. The diff() function takes a vector, plus optional lag and differences arguments, and returns suitably lagged and iterated differences.

Note: We often need to create random data, but for learning and comparison we want the numbers to be identical across machines. To ensure we all generate the same data, we use the set.seed() function with an arbitrary value, here 123. set.seed() fixes the starting state of R's pseudorandom number generator, so the same seed yields the same sequence of numbers on every computer (given the same R version and generator settings). If we don't use set.seed(), we will all get different sequences of numbers.
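
A minimal sketch of the reproducibility described in the note. The variable names are illustrative.

```r
set.seed(123)
first_draw <- rnorm(3)

set.seed(123)              # resetting the seed restarts the sequence
second_draw <- rnorm(3)

identical(first_draw, second_draw)   # TRUE

third_draw <- rnorm(3)     # no reset: the sequence simply continues
identical(first_draw, third_draw)    # FALSE
```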


set.seed(123)

## Create the data
x = rnorm(1000)
ts <- cumsum(x)

## Make the series stationary
diff_ts <- diff(ts)
par(mfrow=c(1,2))

## Plot the series
plot(ts, type='l')
plot(diff_ts, type='l')


[Figure: the original series (left) and the differenced series (right)]
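
On a short vector the effect of diff() can be checked by hand. This sketch also shows the optional lag and differences arguments; small_series is an illustrative name.

```r
small_series <- c(1, 4, 9, 16, 25)

diff(small_series)                    # 3 5 7 9
diff(small_series, lag = 2)           # 8 12 16
diff(small_series, differences = 2)   # 2 2 2

# cumsum() undoes diff() once the first value is put back
rebuilt <- cumsum(c(small_series[1], diff(small_series)))
rebuilt                               # 1 4 9 16 25
```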



length function

In many cases, we want to know the length of a vector for computation or to be used in a for loop. The length() function counts the number of elements in vector x. The following code loads the cars dataset and returns the number of columns and the number of rows.

Note: length() returns the number of elements in a vector. If the function is passed a data frame, the number of columns is returned; for a matrix, it returns the total number of elements.

dt <- cars

## number columns
length(dt)

## Output:
## [1] 2

## number rows
length(dt[,1])

## Output:
## [1] 50
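
For tables, nrow(), ncol() and dim() state the intent more directly than length(). A short sketch comparing them on cars and on a small matrix:

```r
dt <- cars

length(dt)      # 2: a data frame is a list of columns
ncol(dt)        # 2
nrow(dt)        # 50
dim(dt)         # 50 2

# For a matrix, length() counts every cell
m <- matrix(1:10, nrow = 5)
length(m)       # 10
```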




Math functions

R has an array of mathematical functions.

  • abs(x): Takes the absolute value of x
  • log(x,base=y): Takes the logarithm of x with base y; if base is not specified, returns the natural logarithm
  • exp(x): Returns the exponential of x
  • sqrt(x): Returns the square root of x
  • factorial(x): Returns the factorial of x (x!)


# sequence of numbers from 45 to 55, both included, incremented by 1
x_vector <- seq(45,55, by = 1)
#logarithm
log(x_vector)

Output:
##  [1] 3.806662 3.828641 3.850148 3.871201 3.891820 3.912023 3.931826
##  [8] 3.951244 3.970292 3.988984 4.007333


#exponential
exp(x_vector)

#square root
sqrt(x_vector)

Output:
##  [1] 6.708204 6.782330 6.855655 6.928203 7.000000 7.071068 7.141428
##  [8] 7.211103 7.280110 7.348469 7.416198


#factorial
factorial(x_vector)

Output:
##  [1] 1.196222e+56 5.502622e+57 2.586232e+59 1.241392e+61 6.082819e+62
##  [6] 3.041409e+64 1.551119e+66 8.065818e+67 4.274883e+69 2.308437e+71
## [11] 1.269640e+73
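
The same functions are easier to verify on small inputs. This sketch also shows the base argument of log(); log10() is a base R shortcut that the list above does not include.

```r
abs(-3.5)               # 3.5
log(100, base = 10)     # 2
log10(100)              # 2, shortcut for base 10
log(exp(1))             # 1, natural logarithm by default
sqrt(49)                # 7
factorial(5)            # 120
```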




Subsetting Vectors, Matrices and Data Frames using the subset function

https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/subset



The R Graph Gallery

Examples of plots, graphs, ... are shown here:

https://www.r-graph-gallery.com/



Embedding R In A Website

http://fabian-kostadinov.github.io/2015/09/21/embedding-r-in-a-website/



Shiny

http://shiny.rstudio.com/

http://rstudio.github.io/shiny/tutorial/#hello-shiny

http://shiny.rstudio.com/tutorial/

https://shiny.rstudio.com/articles/js-build-widget.html


Examples of Shiny web apps: https://www.rstudio.com/products/shiny/shiny-user-showcase/



Hosting and deployment

http://shiny.rstudio.com/deploy/

https://docs.rstudio.com/shinyapps.io/getting-started.html#deploying-applications

Run R Shiny App on Apache Server (Not possible): https://stackoverflow.com/questions/43527041/run-r-shiny-app-on-apache-server/43528264



How to Deploy Interactive R Apps with Shiny Server

https://www.linode.com/docs/development/r/how-to-deploy-rshiny-server-on-ubuntu-and-debian/

  • Shiny Server serves the apps from:
/srv/shiny-server/


  • By default it uses port 3838



  • For my application to work, it was necessary to run:
chown -R shiny:shiny gofaaas/


  • After making changes in the application directory, it is necessary to restart Shiny Server:
sudo systemctl restart shiny-server



Deploy to the cloud with Shinyapps

Open in GoogleChrome: