
Table of Contents

Basic Syntax
Managing objects
Getting help
Data Types
NA and NULLs
Dealing with NAs
Vectors
Using [] brackets
Vectorized operations
Naming vector elements
Vector Element Sorting
Vectorized operation example
Lists
Matrices
Creating a matrix with the matrix() function
Creating a matrix with rbind() - row binds
Creating a matrix with cbind() - column binds
Changing the size of matrix
Named vectors
Naming Matrix Dimensions
Subsetting
Matrix Computations
Matrix functions – apply()
Linear Algebra on Matrices
Plots
Creating graphs - plot()
Creating plots with qplot()
Creating plots with the ggplot() function
Saving a plot to a file
Arrays
Factors
Data Frames
CREATING A DATA FRAME
Using $ sign
Operations with data frames
Analyzing Data Frame
data exploration with dplyr package
data exploration and visualization with dplyr and ggplot2
Working with time series data
Variables
Importing data into R
Reading text files
Reading csv data from file
Reading from online csv files
Writing to csv
Reading excel file
Saving to excel
Reading from xml file
Reading JSON data
Accessing the json data from the web
Reading HTML files (rvest library)
reading data from online HTML tables
Exploring imported data
Input / Output
Input – scan() – reading text files
Printing to screen
Arithmetic Operators
Assignment operators
Specific Purpose Operators
Sets Operations
Functions
Packages
Strings
Regular expressions
Grep()
The gsub() function
LOOPS
Statistical Distributions
MEASURE OF CENTER
MEASURE OF VARIATION
CORRELATION
Testing normal distribution
CONFIDENCE INTERVAL for NORMAL DISTRIBUTION
CONFIDENCE INTERVAL FOR t-Distribution
T Test
Linear regression
Multiple Linear Regression
Defining Class
NBA DATA
Housing market

Basic Syntax
# - single-line comment
R does not support multi-line comments, but you can use a trick:

if(FALSE) {
"This is a demo for multi-line comments and it should be put inside either a
single OR double quote"
}
myString <- "Hello, World!"
print ( myString)
A comment placed in quotation marks is evaluated by the interpreter as a string, but it does not disrupt the program

"Writing text in quotation marks"

> myString <- "Hello, World!"


> print ( myString)
[1] "Hello, World!"

Save the above code in a file named test.R and run it from the Linux command prompt as shown below. The syntax is the same on
Windows or any other system.
$ Rscript test.R

x <- c(1, 2, 3)
mode(x) # "numeric"
length(x) #3

y <- c("abc")
mode(y) #character
length(y) #1

u <- paste ("abc", "de", "f") #concatenating the strings, separated by blanks
u #"abc de f"
v <- strsplit(u, " ") #splitting the string at blanks
v #a list whose single element is "abc" "de" "f"

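%>% explanation - the pipe operator (from the magrittr package, also available after loading dplyr) passes the value on its left as the first argument to the function on its right

#iris %>% head() is equivalent to head(iris)
#Thus iris %>% head() %>% summary() is equivalent to summary(head(iris))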
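
The equivalence stated above can be checked directly. This is a minimal sketch that assumes R 4.1 or newer, where the native pipe |> is built in and behaves like %>% in this simple case; %>% itself needs the magrittr (or dplyr) package to be loaded.

```r
# Assumption: R >= 4.1, so the native pipe |> is available without any package
nested <- summary(head(iris))          # classic nested call
piped  <- iris |> head() |> summary()  # same computation written as a pipeline

# Both forms should produce the same summary table
identical(nested, piped)
```

Reading a pipeline left to right mirrors the order in which the functions are applied, which is the main reason to prefer it over deep nesting.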

Managing objects
listing objects
ls()
xxxtwoxxx<-c(1,2)
#list all objects that have "two" in the name
ls(pattern="two") #"two" "two2" "xxxtwoxxx"
removing objects with rm()
rm(x)
#removing multiple objects
rm(y, y1)
#removing the objects returned by ls(): we need to assign them to the list argument
rm(list = ls(pattern="two"))
#removing all objects
rm(list = ls())

saving a collection of objects with save()

rm(z)
z<-rnorm(10000)
#creating a histogram
hz<-hist(z)
#saving the variable hz, which holds the histogram, to the file hzfile
save(hz, file = "hzfile")
#removing all objects
rm(list = ls())
ls() #character(0)

loading a previously saved object

load("hzfile")
ls() #"hz"
#plotting from the restored object hz
plot(hz)
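
The save()/load() round trip above can also be tried without touching the working directory. A minimal sketch, assuming a hypothetical object named scores and a temporary file created with tempfile():

```r
# Hypothetical object; the temporary path keeps the working directory clean
scores <- c(10, 20, 30)
path <- tempfile(fileext = ".RData")

save(scores, file = path)            # write the object to disk
rm(scores)                           # remove it from the current environment
exists("scores", inherits = FALSE)   # check whether it is still defined here

loaded <- load(path)                 # load() returns the names of the restored objects
loaded
scores
```

Note that load() restores objects under their original names, so it can silently overwrite an existing object with the same name.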

Getting help

##getting help

#displaying the documentation for a function
?seq
help(seq)

help(mvrnorm)
?mvrnorm
'''
No documentation for ‘mvrnorm’ in specified packages and libraries:
you could try ‘??mvrnorm’
'''

??mvrnorm
#the search result points us to MASS::mvrnorm,
#i.e. the package and the function within it

#getting help for an entire package
help(package=MASS)

#R shows an example of using the function
example(seq)

#Google-style search through the R documentation
#when we do not know which function we are looking for, we can search by keyword
help.search("histogram")

#help for general topics, for example about file manipulation
?files

Data Types

#R decides by itself which type to assign to a variable; for numbers the default is double
x1 <-1
typeof(x1)

#defining x as an integer
x2 <-2L #the L suffix forces the value to be stored as an integer
typeof(x2)
#logical
q <- T #TRUE
q2 <- F #FALSE
q3 <- TRUE
q4 <- FALSE

Data Type, Example, Verify

Logical - examples: TRUE, FALSE
v <- TRUE
print(class(v))
it produces the following result:
[1] "logical"

Numeric - examples: 12.3, 5, 999
v <- 23.5
print(class(v))
it produces the following result:
[1] "numeric"

Integer - examples: 2L, 34L, 0L
v <- 2L
print(class(v))
it produces the following result:
[1] "integer"

Complex - example: 3 + 2i
v <- 2+5i
print(class(v))
it produces the following result:
[1] "complex"

Character - examples: 'a', "good", "TRUE", '23.4'
v <- "TRUE"
print(class(v))
it produces the following result:
[1] "character"

Raw - example: "Hello" is stored as 48 65 6c 6c 6f
v <- charToRaw("Hello")
print(class(v))
it produces the following result:
[1] "raw"

NA and NULLs
R represents missing data with the value NA: the value exists but is unknown.
NULL, on the other hand, means that the value in question simply doesn’t exist,
rather than being existent but unknown.

#1. Using NA

rm(x)
x <- c(88, NA, 12, 168, 13)
x
mean(x) #NA
#the argument na.rm=T sets "NA remove" to TRUE, so NA values are skipped in the calculation
mean(x, na.rm=T) #70.25

rm(y)
y<-c(88, NULL, 12, 168, 13)
mean(y) #70.25
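
The difference between NA and NULL is easiest to see in the length of the resulting vector. A small sketch with hypothetical vector names, using only base R:

```r
with_na   <- c(88, NA, 12)    # NA keeps its position in the vector
with_null <- c(88, NULL, 12)  # NULL contributes nothing and is dropped

length(with_na)      # 3
length(with_null)    # 2

is.na(with_na)       # FALSE TRUE FALSE
sum(is.na(with_na))  # 1, the number of missing values
is.null(NULL)        # TRUE
```

This is why mean(y) above needs no na.rm argument: the NULL never became an element of y.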

#filtering
rm(z)
z <- c(5,2,-3,8)
w<-z[z*z>8]
w #5 -3 8

#filtering simply checks whether the condition is TRUE or FALSE for each vector component
z*z > 8 #TRUE FALSE TRUE TRUE
#the components for which the condition is TRUE are returned
w #5 -3 8
#we can also use filtering to assign values, e.g. replace all elements larger than 3 with 0
z[z>3] <-0
z #0 2 -3 0

#filtering vectors with NA
rm(z)
z<-c(6,1:3,NA,12)
z #6 1 2 3 NA 12
z[z>5] #6 NA 12

#using subset(): it removes the NA values from the observations while filtering
subset(z, z>5) #6 12

#using which() - returns the positions in the vector that meet the filtering condition
z <- c(5, 2, -3, NA, 8)
which(z*z > 8) #1 3 5

#which() excludes NA from the results

#finding the first occurrence of a value in a vector; in this example we are looking for 8
first1 <- function(x) return(which(x==8)[1]) #[1] keeps only the first match
first1(z) #5

Dealing with NAs

library(MASS)
#list of the built-in data sets in R
data()

data(airquality)
#getting more information about the data set that we selected
??airquality

#checking the structure of the data
str(airquality)
summary(airquality)

#removing rows containing NAs
ag <- na.omit(airquality)
head(ag)
str(ag)

#filtering for rows that have values in all columns; a row with NA in any column is filtered out

#complete.cases(data) - returns T if the entire row is non-NA, returns F if any value in the row is NA
complete.cases(airquality) #returns a vector of T and F showing whether the condition is met for each row

#we can use complete.cases to filter out rows containing NA from our data set
ag2=airquality[complete.cases(airquality),]
str(ag2)
summary(ag2)
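
The two approaches above, na.omit() and subsetting with complete.cases(), should keep the same rows. A quick check, written as a sketch with hypothetical variable names:

```r
data(airquality)  # airquality ships with base R in the datasets package

by_omit  <- na.omit(airquality)
by_cases <- airquality[complete.cases(airquality), ]

nrow(airquality)  # rows before cleaning
nrow(by_omit)     # rows after na.omit()
nrow(by_cases)    # rows after complete.cases() filtering

# Same surviving rows in the same order
identical(rownames(by_omit), rownames(by_cases))
```

One difference worth knowing: na.omit() also records which rows were dropped in an "na.action" attribute, while plain subsetting does not.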

#replacing NAs with 0

ag3 <- airquality

#is.na() returns a logical matrix with T or F for each field, T if the position in the data frame is NA
is.na(ag3)
#assigning 0 to all NA fields in the data frame
ag3[is.na(ag3)] <-0
ag3

#replacing missing values with the average value

#we compute the mean of the Ozone column; na.rm = T means NA values are ignored when computing the mean
meanozon = mean(airquality$Ozone, na.rm=T)

#ifelse(logical condition, value if true, value if false)
#we put the mean in place of the NAs in the Ozone column
agty.fix<-ifelse(is.na(airquality$Ozone), meanozon, airquality$Ozone)
agty.fix

summary(agty.fix)

Vectors

What is a vector? A sequence of elements of the same data type.

It is like an array, but in R all elements must be of the same type, i.e. only numbers or only characters
the index starts at 1
even a single number is stored as a vector of length 1

To create a vector with more than one element, use the c( ) function, which combines
the elements into a vector.

# Create a vector.
apple <- c('red','green',"yellow")
print(apple)

# Get the class of the vector.


print(class(apple))
When we execute the above code, it produces the following result −
[1] "red" "green" "yellow"
[1] "character"

Defining a vector - COMBINE c()

MyVector <- c(3, 45, 56, 732) # function c() combines numbers into a vector
print(MyVector)

is.numeric(MyVector) # returns TRUE, the vector is numeric
is.integer(MyVector) # returns FALSE
is.double(MyVector) # returns TRUE, R by default stores all numbers as double

MyVecotorInt = c(3L, 45L, 56L, 732L)
is.integer(MyVecotorInt) # returns TRUE as the numbers in the vector were defined as integers

V2 <- c("a", "b", "c") # defining a character vector
is.character(V2) # returns TRUE
is.numeric(V2) # returns FALSE
V3 <- c("a", "b", 7, 44L)
V3
is.character(V3) # returns TRUE, the numbers were converted to character as a vector can hold only 1 type of values
is.numeric(V3) # returns FALSE
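
The conversion seen in V3 follows a fixed order: logical gives way to integer, integer to double, and everything to character. A short sketch (the name mixed is hypothetical):

```r
typeof(c(TRUE, 2L))   # "integer":   TRUE becomes 1L
typeof(c(2L, 3.5))    # "double":    2L becomes 2
typeof(c(3.5, "a"))   # "character": 3.5 becomes "3.5"

mixed <- c("a", "b", 7, 44L)
mixed                       # "a" "b" "7" "44"

# The numbers can be recovered explicitly when needed
as.numeric(mixed[3:4]) + 1  # 8 45
```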

#seq() - sequence: seq(start, end, step)
seq(1, 15) # produces numbers from 1 to 15
seq(1,15,2) # produces numbers from 1 to 15 with step 2

#rep() - replicate: rep(valueToReplicate, number of times)
rep(3, 5) # returns 3 3 3 3 3

x <- c(80, 20)
y <- rep(x, 10) # vector of 10 pairs 80, 20
print(y)

Using [] brackets

To access individual values of a vector we use [ ]

w[] # accessing all elements of the vector
w[1] # accessing the 1st element of the vector
w[3] # accessing the 3rd element of the vector
w[-1] # accessing all elements except the 1st one
w[-3] # accessing all elements except the 3rd one
w[1:3] # accessing elements 1 to 3
w[-3:-5] # accessing all elements except elements 3 to 5

w[c(1,3,5)] # we can use another vector to access elements of a vector. Returns the 1st, 3rd and 5th element
w[c(-2, -4)] # accessing all elements of vector w except the 2nd and 4th element
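
The indexing forms listed above can be tried on a concrete vector. The values of w below are hypothetical, chosen so that each result is easy to read:

```r
w <- c(10, 20, 30, 40, 50, 60)  # hypothetical example vector

w[-1]         # 20 30 40 50 60
w[1:3]        # 10 20 30
w[-3:-5]      # 10 20 60  (-3:-5 expands to -3, -4, -5)
w[c(1, 3, 5)] # 10 30 50
w[c(-2, -4)]  # 10 30 50 60
w[w > 35]     # 40 50 60  (a logical vector also works as an index)
```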

Vectorized operations

a<-c(1,2,3)
b<-c(4,5,6)

#adding vectors: R adds the values element by element, a[1] + b[1] = 1 + 4 = 5; a[2] + b[2] = 2 + 5 = 7
suma = a +b

#multiplying vectors
iloczyn = a * b
iloczyn # iloczyn[1] = 1 * 4, iloczyn[2] = 2 * 5, etc.

#dividing vectors
iloraz = a / b
iloraz # iloraz[1] = 1/4, iloraz[2] = 2/5

x1 <- c(88, 5, 12, 13)


is.vector(x1)
x1
#inserting value 168 after 12 and before 13
x1 <- c(x1[1:3], 168, x1[4])
x1

#matrices and arrays as vectors


r1 <- c(1,2)
r2 <- c(3,4)
m <- rbind(r1, r2)
m

m + 10:13
'''
> m + 10:13
[,1] [,2]
r1 11 14
r2 14 17
'''
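
The result above may look surprising until you recall that R stores a matrix column by column, so 10:13 is added to the elements in that order. A minimal sketch that makes the storage order visible:

```r
m <- rbind(c(1, 2), c(3, 4))

as.vector(m)       # 1 3 2 4: first column, then second column

result <- m + 10:13
as.vector(result)  # 11 14 14 17, i.e. 1+10, 3+11, 2+12, 4+13
result[1, 2]       # 14: row 1, column 2 is the third stored element (2 + 12)
```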

#using all() and any() functions


rm(x)
x <- 1:10
x
any(x > 8) #TRUE - if any value in vector meets the condition
any(x > 88) #FALSE

all(x > 8) # FALSE - checks whether all values in the vector meet the condition

#example of using these functions - find runs of consecutive 1s in a vector

findruns <- function (x,k) {
  n <- length(x)
  runs <- NULL
  for (i in 1:(n-k+1)){
    #record i when the k elements starting at position i are all 1
    if (all(x[i:(i+k-1)] == 1)) runs<-c(runs, i)
  }
  return(runs)
}

rm(c1)
c1 <- c(1,0,0,1,1,1,0,1,1)
#runs of 1s of length 2 beginning at indices 4, 5, and 8
findruns(c1, 2)

#operations on vectors of different lengths

v1 <- c(1,2,3,4,5)
v2 <- c(10,11)

suma <- v1 + v2 # 1 + 10; 2 + 11; 3 + 10; 4 + 11; 5 + 10;


suma# 11 13 13 15 15

roznica <- v2 -v1 # 10 - 1; 11 - 2; 10 - 3; 11 - 4; 10 - 5;


roznica#9 9 7 7 5

iloczyn <- v1 * v2 # 1 * 10, 2 * 11, 3 * 10; 4 * 11; 5 * 10


iloczyn# 10 22 30 44 50

iloraz <- v2 / v1 # 10:1; 11 : 2; 10 : 3; 11 : 4; 10 : 5


iloraz# 10.000000 5.500000 3.333333 2.750000 2.000000
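
In the examples above the length of v1 (5) is not a multiple of the length of v2 (2), so R still computes the result but issues a warning. A sketch that spells out the recycling with rep_len(); the names v2_full, explicit and implicit are hypothetical:

```r
v1 <- c(1, 2, 3, 4, 5)
v2 <- c(10, 11)

# Stretch the shorter vector by hand to the length of the longer one
v2_full <- rep_len(v2, length(v1))  # 10 11 10 11 10
explicit <- v1 + v2_full            # 11 13 13 15 15

# R does the same implicitly; suppressWarnings() hides the length-mismatch warning
implicit <- suppressWarnings(v1 + v2)

identical(explicit, implicit)
```

Writing the recycling explicitly is a reasonable choice when the mismatch is intended, because it keeps the warning meaningful for the cases where it is a bug.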

When vectors differ in length, the elements of the shorter one are reused in order until both vectors have the
same length.
We cannot add non-numeric values to each other; R returns an error
litery1 <-c("a", "b", "c")
litery2 <- c("A", "B", "C")
wynik = litery1 + litery2 # ERROR: non-numeric argument to binary operator
wynik

To join two character vectors we must use the paste() function

wynik=paste(litery1, litery2)
wynik # "a A" "b B" "c C"

Naming vector elements

rm(x)
x <- c(1,2,4)
names(x)
#assigning names to each vector element
names(x) <- c("a", "b", "c")
x
'''
a b c
1 2 4
'''
#we can access a vector element either by position or by name
x[1]
x["a"]

#removing vector names


names(x) <- NULL

Vector Element Sorting

liczby <- c(1, 44, 66, -200, 100, 0, -2, 123)


liczby.sorted <- sort(liczby)
liczby.sorted#-200 -2 0 1 44 66 100 123

liczby.desc <- sort(liczby, decreasing = TRUE)


liczby.desc # 123 100 66 44 1 0 -2 -200

kolory <- c("czerwony", "zielony", "bialy", "zolty")


kolory.sort <- sort(kolory)
kolory.sort #"bialy" "czerwony" "zielony" "zolty"

Vectorized operation example

#Data
revenue <- c(14574.49, 7606.46, 8611.41, 9175.41, 8058.65, 8105.44, 11496.28, 9766.09,
10305.32, 14379.96, 10713.97, 15433.50)
expenses <- c(12051.82, 5695.07, 12319.20, 12089.72, 8658.57, 840.20, 3285.73, 5821.12,
6976.93, 16618.61, 10054.37, 3803.96)

#Solution
#Calculate Profit As The Differences Between Revenue And Expenses
profit <- revenue - expenses
profit

#Calculate Tax As 30% Of Profit And Round To 2 Decimal Points


tax <- round(0.30 * profit, 2)
tax

#Calculate Profit Remaining After Tax Is Deducted


profit.after.tax <- profit - tax
profit.after.tax

#Calculate The Profit Margin As Profit After Tax Over Revenue


#Round To 2 Decimal Points, Then Multiply By 100 To Get %
profit.margin <- round(profit.after.tax / revenue, 2) * 100
profit.margin

#Calculate The Mean Profit After Tax For The 12 Months


mean_pat <- mean(profit.after.tax)
mean_pat

#Find The Months With Above-Mean Profit After Tax


good.months <- profit.after.tax > mean_pat # returns T or F values
good.months

#Bad Months Are The Opposite Of Good Months !
bad.months <- !good.months # logical negation; -good.months would give 0 and -1 instead
bad.months

#The Best Month Is Where Profit After Tax Was Equal To The Maximum
best.month <- profit.after.tax == max(profit.after.tax)
best.month

#The Worst Month Is Where Profit After Tax Was Equal To The Minimum
worst.month <- profit.after.tax == min(profit.after.tax)
worst.month

#Convert All Calculations To Units Of One Thousand Dollars


revenue.1000 <- round(revenue / 1000, 0)
expenses.1000 <- round(expenses / 1000, 0)
profit.1000 <- round(profit / 1000, 0)
profit.after.tax.1000 <- round(profit.after.tax / 1000, 0)

Lists

A list is an R object that can contain elements of many different types, such as vectors, functions and even
other lists. Each element of the list can have a different data type.

# Create a list.
list1 <- list(c(2,5,3),21.3,sin)

# Print the list.


print(list1)
When we execute the above code, it produces the following result −
[[1]]
[1] 2 5 3

[[2]]
[1] 21.3

[[3]]
function (x) .Primitive("sin")
x <- list (u=2, v="abc") # list () function
is.list(x)#TRUE
#accessing list values
x$u
x$v

# Create a list containing a vector, a matrix and a list. – list () function


list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))

list_data

# Give names to the elements in the list – names()


names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# Show the list.


list_data

'''
$`1st Quarter`
[1] "Jan" "Feb" "Mar"

$A_Matrix
[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8

$`A Inner list`


$`A Inner list`[[1]]
[1] "green"

$`A Inner list`[[2]]


[1] 12.3
'''

# Access the first element of the list.


print(list_data[1])

# Access the third element. As it is also a list, all its elements will be printed.
print(list_data[3])

# Access the list element using the name of the element and $
print(list_data$A_Matrix)
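
Note that single brackets on a list return a smaller list, while double brackets or $ return the element itself. A short sketch with a hypothetical list named lst:

```r
lst <- list(months = c("Jan", "Feb", "Mar"), value = 12.3)  # hypothetical list

class(lst[1])        # "list":      [ ] keeps the list wrapper
class(lst[[1]])      # "character": [[ ]] extracts the element itself
lst[["months"]][2]   # "Feb":       extract the vector, then index into it
lst$value + 1        # $ also extracts the element, so arithmetic works
```

This is why print(list_data[1]) above shows the element together with its name: the result is still a list of length 1.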

# Add element at the end of the list.


list_data[4] <- "New element"
print(list_data[4])

# Remove the last element.


list_data[4] <- NULL

# Print the 4th Element.


print(list_data[4])

# Update the 3rd Element.


list_data[3] <- "updated element"
print(list_data[3])

#adding elements to a list via vector index


list_data[5:7] <-c(FALSE, TRUE, TRUE)

#Merging list
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")

# Merge the two lists.


merged.list <- c(list1,list2)

# Print the merged list.


print(merged.list)

#converting list to a vector - unlist()


# Create lists.
list1 <- list(1:5)
print(list1)

list2 <-list(10:14)
print(list2)

# Convert the lists to vectors.


v1 <- unlist(list1)
v2 <- unlist(list2)

print(v1)
print(v2)

# Now add the vectors


result <- v1+v2
print(result)

list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
list_data
#adding elements to a list via vector index
list_data[5:7] <-c(FALSE, TRUE, TRUE)

list_data

#using lapply() with lists
?lapply
#lapply(X = list, FUN = function to use), it returns a list
l1 <-list(c(1:5), c("raz", "dwa", "trzy"), c(10:15))
l1
lapply(list(1:3, 25:29), median) #[[1]] 2, [[2]] 27; a list is returned
#lapply() returns the median of each element of the list we passed

#sapply() - simplified lapply, returns a vector or matrix instead of a list
sapply(list(1:3, 25:29), median) #2 27; a vector is returned
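
Because sapply() chooses its return shape based on the results, base R also offers vapply(), where the expected type and length of each result is declared up front. A minimal sketch (the name values is hypothetical, and mean is used instead of median only to keep the result type unambiguous):

```r
values <- list(1:3, 25:29)

# numeric(1) declares that each call must return exactly one number
means <- vapply(values, mean, numeric(1))
means                                         # 2 27

# On empty input sapply() falls back to a list, vapply() keeps the declared type
class(sapply(list(), mean))                   # "list"
class(vapply(list(), mean, numeric(1)))       # "numeric"
```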

Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix function.

# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"

Matrices in R, just like vectors, must hold elements of a single data type
Referring to an element of a matrix: [row number, column number]
A[1,] - selects the entire row 1 of the matrix
A[,1] - selects the entire column 1 of the matrix

Creating a matrix with the matrix() function

?matrix
#parameters: nrow - the number of rows in the matrix, ncol - the number of columns in the matrix
my.data <- 1:20
my.data # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

A <- matrix(my.data, 4, 5)
A

'''
The data from my.data are placed column by column

[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20

'''
A2 <- matrix(my.data, 4, 5, byrow=TRUE)
A2 # byrow = TRUE makes the data from our set fill the matrix row by row

'''
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20
'''
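
One way to think about byrow = TRUE is that it gives the transpose of a column-filled matrix with swapped dimensions. A small sketch that checks this for the data used above (by_row and by_col_t are hypothetical names):

```r
my.data <- 1:20

by_row   <- matrix(my.data, 4, 5, byrow = TRUE)  # filled row by row
by_col_t <- t(matrix(my.data, 5, 4))             # filled column by column, then transposed

identical(by_row, by_col_t)
by_row[2, 5]   # 10, the same value as A2[2,5]
```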

#selecting a single value from the matrix
A2[2,5] #10

#naming each row and column with the dimnames parameter


A3 <- matrix(my.data, 4, 5, dimnames =list(c("Wiersz1", "Wiersz2", "Wiersz3",
"Wiersz4"),
c("Kolumna1", "Kolumna2", "Kolumna3",
"Kolumna4", "Kolumna5")))

A3

'''
Kolumna1 Kolumna2 Kolumna3 Kolumna4 Kolumna5
Wiersz1 1 5 9 13 17
Wiersz2 2 6 10 14 18
Wiersz3 3 7 11 15 19
Wiersz4 4 8 12 16 20

'''
Creating a matrix with rbind() - row binds
#Let's define some vectors
r1 <- c("I", "am", "happy")
r2 <- c("What", "a", "day")
r3 <- c(1,2,3)

#creating matrix by binding row by row rbind()


B <- rbind(r1, r2, r3)
B

'''
The numbers from vector r3 were converted to characters because a matrix, like a vector, must
hold data of a single type
[,1] [,2] [,3]
r1 "I" "am" "happy"
r2 "What" "a" "day"
r3 "1" "2" "3"
'''

Creating a matrix with cbind() - column binds


#creating matrix by binding columns
C <- cbind(r1, r2, r3)
C

'''
r1 r2 r3
[1,] "I" "What" "1"
[2,] "am" "a" "2"
[3,] "happy" "day" "3"
'''

Changing the size of matrix


#operations on vectors
rm(x)
x <- c(12, 5, 13, 16, 8)
#append 20 at the end of vector
x <- c(x, 20)
#insert 40 after 3rd value
x <- c (x[1:3], 40, x[4:6])
x
#adding a new value as a new element of the vector
x[8] <- 99 #12 5 13 40 16 8 20 99
x

#Changing the size of a matrix - functions rbind() and cbind()

rm(m)
m <-matrix(1:6, nrow=3)
m
'''
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
'''

one <- c(1,1,1)


#adding new column to matrix
cbind(m, one) #this does not overwrite m; it returns a new matrix with the column added
'''
one
[1,] 1 4 1
[2,] 2 5 1
[3,] 3 6 1
'''

#adding a new row to the matrix
two <- c(2,2)
rbind(m, two) #does not overwrite m; returns a new matrix with the row added
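
Because cbind() and rbind() return a new matrix, the result has to be assigned back to keep it. A small sketch of that, using a made-up matrix m_demo so it does not touch m from the notes:

```r
m_demo <- matrix(1:6, nrow = 3)

dim(cbind(m_demo, c(1, 1, 1)))   # 3 3, but m_demo itself is unchanged
dim(m_demo)                      # still 3 2

# Assign the result back to keep the new row
m_demo <- rbind(m_demo, c(2, 2))
dim(m_demo)                      # 4 2

# Shrinking works by subsetting: a negative index drops row 4 again
m_demo <- m_demo[-4, ]
```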

Named vectors
v1 <- 1:4
v1
#give names to the vector elements
names(v1) # returns NULL as no names are assigned yet

names(v1) <- c("1", "2", "3", "4")

names(v1) # returns "1" "2" "3" "4"
names(v1[1]) # returns "1"; every element of the vector has its own name

#clear names
names(v1) <- NULL

Naming Matrix Dimensions

c("a", "B", "Zz")


temp.vec <- rep(c("a", "B", "Zz"), times =3)
temp.vec # "a" "B" "Zz" "a" "B" "Zz" "a" "B" "Zz"

temp.vec <- rep(c("a", "B", "Zz"), each =3)


temp.vec # "a" "a" "a" "B" "B" "B" "Zz" "Zz" "Zz"

Bravo <-matrix(temp.vec, 3, 3)
Bravo

'''
[,1] [,2] [,3]
[1,] "a" "B" "Zz"
[2,] "a" "B" "Zz"
[3,] "a" "B" "Zz"

'''

#naming the rows of the matrix
rownames(Bravo) <- c("How", "are", "you")

#naming the columns
colnames(Bravo) <- c("x", "y", "z")
Bravo

'''
x y z
How "a" "B" "Zz"
are "a" "B" "Zz"
you "a" "B" "Zz"
'''
#accessing the data in matrix
Bravo[2,2] #returns "B"
Bravo["are", "y"] #returns "B"
Bravo[2,2] == Bravo["are", "y"] # returns TRUE
#assigning a value by row and column name (44 is coerced to the string "44")
Bravo["are", "y"] <- 44
Bravo

#clearing the row names of the matrix
rownames(Bravo) <- NULL
#clearing the column names of the matrix
colnames(Bravo) <- NULL
Bravo

'''
[,1] [,2] [,3]
[1,] "a" "B" "Zz"
[2,] "a" "44" "Zz"
[3,] "a" "B" "Zz"
'''

Subsetting

x <- c("a", "b", "c", "d", "e")


x
#extracting 1st value from vector
x[1]
#extracting 1st and 5th value from vector
x[c(1,5)] # to extract several values we pass their positions as a vector

#subsetting the matrix


#selecting a range of rows and columns - rows come first, then columns
Games[1:3, 6:10]

#selecting individual rows and columns - the row and column numbers must be passed as vectors
Games[c(1,10), ] # rows 1 and 10, all columns

Games[, c(1,5)] # all rows, columns 1 and 5

#when rows and columns are named we can select by name

Games[c("KobeBryant", "ChrisPaul"), c("2007", "2009")]

Games[1,] #selecting a single row from a matrix returns a vector
is.matrix(Games[1,]) #FALSE
is.vector(Games[1,]) #TRUE

#By default R drops unneeded dimensions when subsetting a matrix, so a single row comes back as a vector
#this can be changed by setting the drop parameter to FALSE
Games[1,,drop=FALSE]
is.matrix(Games[1,,drop=FALSE]) #TRUE
is.vector(Games[1,,drop=FALSE]) #FALSE

To exclude one or more rows by name, build a logical index from the row names (the ! operator cannot be applied directly to a character vector):

GameNew <- Games[!(rownames(Games) %in% c("KobeBryant", "ChrisPaul")), ]

GameNew will contain all rows of Games except "KobeBryant" and "ChrisPaul".
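
A self-contained sketch of excluding rows, since Games itself is not defined in this part of the notes. The matrix games_demo, its values and the third player name are hypothetical stand-ins.

```r
# Hypothetical stand-in for the Games matrix (values are made up)
games_demo <- matrix(1:6, nrow = 3,
                     dimnames = list(c("KobeBryant", "ChrisPaul", "JoeJohnson"),
                                     c("2007", "2008")))

# Exclude by position: negative indices drop rows 1 and 2
by_position <- games_demo[-c(1, 2), , drop = FALSE]

# Exclude by name: build a logical vector with %in% and negate it
keep <- !(rownames(games_demo) %in% c("KobeBryant", "ChrisPaul"))
by_name <- games_demo[keep, , drop = FALSE]

rownames(by_name)   # only the remaining player
```

drop = FALSE is used here so that the single remaining row stays a matrix.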

Matrix Computations

# Create two 2x3 matrices.


matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)

# Add the matrices.


result <- matrix1 + matrix2
cat("Result of addition","\n")
print(result) # adds the elements in the same positions of each matrix

# Subtract the matrices


result <- matrix1 - matrix2
cat("Result of subtraction","\n")
print(result) # subtracts the elements in the same positions

# Multiply the matrices.


result <- matrix1 * matrix2
cat("Result of multiplication","\n")
print(result) # multiplies the elements in the same positions (element-wise, not a matrix product)

# Divide the matrices


result <- matrix1 / matrix2
cat("Result of division","\n")
print(result) # divides the elements in the same positions

y <- matrix( c(7, 15, 10, 22), nrow=2, byrow=TRUE)
y

#mathematical matrix multiplication
y %*% y

#element [1,1] of the result = 7*7 + 15*10 = 199


'''
[,1] [,2]
[1,] 199 435
[2,] 290 634
'''

Matrix multiplication: ‘*’ vs ‘%*%’

rm(y)
y<-matrix( c(7, 15, 10, 22), nrow=2, byrow=TRUE)
y

[,1] [,2]
[1,] 7 15
[2,] 10 22

y1<-matrix( c(1, 3, 2, 4), nrow=2, byrow=TRUE)


y1

[,1] [,2]
[1,] 1 3
[2,] 2 4

w1 <- y * y1 # element-wise multiplication: w1[1,1] = y[1,1] * y1[1,1]; w1[2,1] = y[2,1] * y1[2,1]

w2 <- y %*% y1 # mathematical matrix multiplication: w2[1,1] = 7 * 1 + 15 * 2 = 37; w2[1,2] = 7 * 3 + 15 * 4 = 81

w1
[,1] [,2]
[1,] 7 45
[2,] 20 88
w2
[,1] [,2]
[1,] 37 81
[2,] 54 118
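
The difference between the two operators can be checked by computing one cell of the matrix product by hand. A minimal sketch using the same y and y1 as above; the names elementwise, by_hand_11 and matrix_product are introduced only for this example.

```r
y  <- matrix(c(7, 15, 10, 22), nrow = 2, byrow = TRUE)
y1 <- matrix(c(1, 3, 2, 4),   nrow = 2, byrow = TRUE)

# Element-wise product: each cell multiplied by the cell in the same position
elementwise <- y * y1

# Matrix product: cell [i, j] is the sum of row i of y times column j of y1
by_hand_11 <- sum(y[1, ] * y1[, 1])   # 7*1 + 15*2
matrix_product <- y %*% y1

by_hand_11 == matrix_product[1, 1]    # TRUE
```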

#mathematical multiplication by scalar


3*y
'''
[,1] [,2]
[1,] 21 45
[2,] 30 66
'''

#mathematical matrix addition


y + y
'''
[,1] [,2]
[1,] 14 30
[2,] 20 44
'''

Matrix functions – apply()

apply(m, dimcode, f, fargs)
m - matrix
dimcode - dimension: 1 if the function applies to rows, 2 if it applies to columns
f - function to be applied
fargs - optional set of arguments to be supplied to f
rm(z)
z <- matrix (c(1,2,3,4,5,6), ncol = 2)
z
#to apply the R function mean() to each column of matrix z
apply(z,2,mean)#2 5

rm(f)
f <- function(x) x/c(2,8) #the function divides a row by the vector (2,8)
rm(y)
y <- apply(z,1,f)
y

#apply() calls our function f on each row of z and divides the row by the vector (2,8)
#the first call computes (1,4) / (2,8) = (0.5, 0.5) and stores the result in the first column
'''
[,1] [,2] [,3]
[1,] 0.5 1.000 1.50
[2,] 0.5 0.625 0.75
'''

#to make the outcome of apply() more intuitive we can transpose it with the t() function
y <- t(apply(z,1,f))
y

'''
[,1] [,2]
[1,] 0.5 0.500
[2,] 1.0 0.625
[3,] 1.5 0.750
'''
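
Why the transpose is needed can be seen from the dimensions: apply() over rows stores each result as a column. A short sketch with the same z and f; res and res_t are names used only here.

```r
z <- matrix(1:6, ncol = 2)          # rows: (1,4), (2,5), (3,6)
f <- function(x) x / c(2, 8)

res <- apply(z, 1, f)
dim(res)      # 2 3: one column per row of z
res[, 1]      # result for the first row of z: 1/2 and 4/8

# Transpose so that each input row maps to an output row
res_t <- t(res)
dim(res_t)    # 3 2
```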
Linear Algebra on Matrices

#solve() solves systems of linear equations and finds matrix inverses

'''
To solve the following system
x1 - x2 = 2
x1 + x2 = 4
we can put the 2 equations into matrix form
(matrix() fills a column by column, so the first column of a holds the x1 coefficients)
'''
a <- matrix( c(1,1,-1,1), nrow=2, ncol=2)
b <- c(2,4)
#to solve the system we can use the function solve()
solve(a, b) #3 1

#if we provide only 1 argument - matrix a - solve() computes the inverse of the matrix
solve(a)
'''
[,1] [,2]
[1,] 0.5 0.5
[2,] -0.5 0.5
'''
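
The result of solve() can be verified by multiplying back. A minimal sketch with the same a and b; x and a_inv are names introduced only for this check.

```r
a <- matrix(c(1, 1, -1, 1), nrow = 2, ncol = 2)
b <- c(2, 4)

x <- solve(a, b)
a %*% x                               # reproduces b as a 2x1 matrix

a_inv <- solve(a)
all.equal(a %*% a_inv, diag(2))       # TRUE: a times its inverse is the identity
all.equal(as.vector(a_inv %*% b), x)  # TRUE: same solution via the inverse
```

all.equal() is used instead of == because the results are floating-point numbers.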

Other functions on matrix

t() – matrix transpose
qr() – QR decomposition
chol() – Cholesky decomposition
det() – determinant
diag() – extracts the diagonal of a square matrix (useful for obtaining variances from a variance-covariance matrix)
sweep() – sweeps a summary statistic out of the rows or columns of a matrix

Diagonals for matrix

rm (m)

m <- cbind(c(1,7), c(2,8))


m
'''
[,1] [,2]
[1,] 1 2
[2,] 7 8
'''
dm <- diag(m)
dm # 1 8

diag(dm)
'''
[,1] [,2]
[1,] 1 0
[2,] 0 8
'''

diag(3)
'''
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
'''
Sweep function

rm(m)
m <- cbind(c(1,4,7), c(2,5,8), c(3,6,9))
m
'''
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
'''
sweep(m,1,c(1,4,7), "+")
#sweep() works somewhat like apply()
#1st argument is the array, 2nd is the margin (1 = rows in this example), 3rd is the vector of values to sweep with, 4th is the function to use
#here 1 is added to row 1, 4 to row 2 and 7 to row 3

'''
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 8 9 10
[3,] 14 15 16
'''
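
A common use of sweep() is removing or rescaling by a summary statistic. The sketch below is an illustration on the same m, not something taken from the notes; centered and row_share are made-up names.

```r
m <- cbind(c(1, 4, 7), c(2, 5, 8), c(3, 6, 9))

# Margin 2 = columns; the default function is "-", so this subtracts each column's mean
centered <- sweep(m, 2, colMeans(m))
colMeans(centered)   # 0 0 0

# Margin 1 = rows: divide every row by its own sum so each row sums to 1
row_share <- sweep(m, 1, rowSums(m), "/")
rowSums(row_share)   # 1 1 1
```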

Plots

Creating graphs - plot()

#calling plot() with 2 vectors
plot(c(1,2,3), c(1,2,4))
#3 points are plotted: (1,1), (2,2) and (3,4) - values in the same positions of the 2 vectors are paired into points
?plot

plot(c(1,2,3), c(1,2,4), type = "l")
#passing type="l" draws a line connecting the 3 points

#naming the x and y axes with the xlab and ylab arguments
plot(c(1,2,3), c(1,2,4), type = "l", xlab="os x", ylab="os y")

#drawing an empty graph with no points or lines >>> type="n"
plot(c(-3,3), c(-1, 5), type = "n", xlab="os x", ylab="os y")
#even though the graph is empty we still have to pass some points; they set the axis ranges

x<-c(1,2,3)
y<-c(1,3,8)
plot(x,y)#showing 3 point for the data from vector

#abline() draws a line; the arguments are treated as the intercept and slope of the line
abline(c(0,1)) # adds the line y = 0 + 1 * x
abline(c(2,1)) # adds the line y = 2 + 1 * x
abline(c(3,4)) # adds the line y = 3 + 4 * x

#lines() - drawing a line between points
lines(c(1.5, 2.5), c(3,3))
#the 2 vectors are interpreted as 2 points between which the line is drawn: (1.5,3) & (2.5,3)

#adding points to the graph
plot(c(-3,3), c(-1, 5), type = "n", xlab="os x", ylab="os y")
#because the 1st point is (-3,-1) and the 2nd is (3,5), the graph covers exactly that range of values

#points are added to the empty graph by calling points() after plot()
#base graphics calls are made one after another, not chained with +; plot() returns NULL, so there is nothing to assign
plot(c(-3,3), c(-1, 5), type = "n", xlab="os x", ylab="os y")
points(c(-2, 0, 2), c(0, 4, 2)) # points added to the empty graph: (-2,0) (0,4) (2,2)

#many graphical parameters can be set with the par() function
?par

par(bg="grey") # background colour for the entire graph; it must be set before plot() is called
plot(c(-3,3), c(-1, 5), type = "n", xlab="os x", ylab="os y")
points(c(-2, 0, 2), c(0, 4, 2))

#adding a legend with the legend() function
example(legend)

Creating plots with qplot()

#Visualizing subsets

Data <- MinutesPlayed[1:3,]


#creating the chart
matplot(t(Data), type="b", pch=c(15:18), col = c(1:4,6))
#adding legend
legend("bottomleft", inset=0.01, legend=Players[1:3], col = c(1:4,6), pch=c(15:18),
horiz=F)

#to visualize a single player (a single row) we need to make sure that subsetting returns a matrix
#if a vector is returned (the default), the chart will not show what we expect
Data <- MinutesPlayed[1,,drop=F]
#creating the chart
matplot(t(Data), type="b", pch=c(15:18), col = c(1:4,6))
#adding legend (only the first player is plotted, so only one name is needed)
legend("bottomleft", inset=0.01, legend=Players[1], col = c(1:4,6), pch=c(15:18),
horiz=F)

#---introduction to qplot
install.packages("ggplot2")
library("ggplot2")
getwd() # returns the current working directory, e.g. "C:/Users/Pc/Documents"
#Set a new working directory on Windows
setwd("C:\\Users\\Pc\\Desktop\\R files")
getwd()

#-----qplot()
stats <- read.csv("DemographicData.csv")
stats
#plotting a histogram
qplot(data = stats, x=Internet.users) #providing only x gives a histogram

#providing variables for both x and y means the plot is no longer a histogram

#to increase the point size, pass the number to size as I(number); otherwise qplot will try to map it as an additional variable
qplot(data = stats, x=Income.Group, y = Birth.rate, size =I(3))

#the same applies to the point colour: pass it as I(colour), otherwise another variable is added to the plot
qplot(data = stats, x=Income.Group, y = Birth.rate, size =I(3),
colour = I("blue"))

qplot(data = stats, x=Internet.users, y = Birth.rate)

#INCREASING THE SIZE OF POINTS
#because we are increasing the size of all points, we provide the size in I()
qplot(data = stats, x=Internet.users, y = Birth.rate, size = I(4))

#CHANGING THE POINTS COLOUR
#because we are changing the colour of all points, we provide the colour in I()
qplot(data = stats, x=Internet.users, y = Birth.rate, size = I(4), colour = I("red"))

#to make the point colour depend on the income group, we map colour to Income.Group
qplot(data = stats, x=Internet.users, y = Birth.rate, size = I(4), colour = Income.Group)

#Visualization
qplot(data = merged, x=Internet.users, y = Birth.rate, colour = Region)

# 1. Shapes
#changing the symbol shown for points - e.g. circles, triangles, diamonds; the shape is passed as I()
#every point shape has its own number
qplot(data = merged, x=Internet.users, y = Birth.rate, colour = Region, size = I(4),
shape =I(19))

#2. Transparency - useful for seeing where points overlap
#transparency is controlled by the alpha parameter, which takes values from 0 to 1 (0 = fully transparent)
qplot(data = merged, x=Internet.users, y = Birth.rate, colour = Region, size = I(4),
shape =I(19), alpha =I(0.6))

#3. Title
#the plot title is set with the main parameter
qplot(data = merged, x=Internet.users, y = Birth.rate, colour = Region, size = I(4),
shape =I(19), alpha =I(0.6), main = "Birth Rate vs Internet Users")

Creating plots with the ggplot() function

#----Movie Ratings
movies <- read.csv("MovieRatings.csv")
movies
head(movies)
#changing the column names
colnames(movies) <- c("Film", "Genre", "CriticRating", "AudienceRating",
"BudgetMillions", "Year")

#checking the data frame structure
str(movies)
summary(movies)

#A FACTOR in R is a categorical variable that assigns each observation a label or a group
#Genre is a good example of a factor: it assigns a type of movie (Action, Comedy, Drama...) to a particular title

#we can force R to treat a column as a factor even if it holds numerical values
movies$Year <- factor(movies$Year)

#Year is now treated as a factor, not an integer
summary(movies)
str(movies)

# --- Aesthetics - how our data are mapped to what we see on the plot

library(ggplot2)

#in aes() we state what is mapped to which axis; + geom_point() adds points to the plot
ggplot(data = movies, aes(x=CriticRating, y=AudienceRating)) +
geom_point()

#adding a colour for each Genre
ggplot(data = movies, aes(x=CriticRating, y=AudienceRating, colour = Genre)) +
geom_point()

#mapping the point size to the budget: the bigger the point, the higher the budget
ggplot(data = movies, aes(x=CriticRating, y=AudienceRating, colour = Genre, size =
BudgetMillions)) +
geom_point()

#Plotting with layers
#assigning the plot to an object
p <- ggplot(data = movies, aes(x=CriticRating, y=AudienceRating, colour = Genre, size =
BudgetMillions))

#the object p holds the data and the mappings, but to see a plot we need to add + geom_point()
p + geom_point()

#Overriding Aesthetics
#overriding aes() parameters in a layer does not modify the original mappings stored in q
q <- ggplot(data = movies, aes(x=CriticRating, y=AudienceRating, colour = Genre, size =
BudgetMillions))
#we can override the mappings given in ggplot() by passing aes() to geom_point()
q + geom_point(aes(size=CriticRating))
q + geom_point(aes(colour=BudgetMillions))

q + geom_point(aes(x=BudgetMillions))

#adding labels to the x and y axes (x now shows the budget; y is still AudienceRating)
q + geom_point(aes(x=BudgetMillions)) + xlab("BudgetMillions $$$") + ylab("Audience Rating")
#setting (not mapping) the size: all points get size 1 in this layer
q + geom_point(size = 1)

# MAPPING vs SETTING
#basic chart
r <- ggplot(data=movies, aes(x=CriticRating, y=AudienceRating)) # assigning the plot to a variable
r + geom_point()

#TO MAP A COLOUR TO A VARIABLE USE AESTHETICS => MAPPING
#adding a colour for each level of the Genre factor
r + geom_point(aes(colour=Genre)) #the additional mapping is added via geom_point()

#TO SET A COLOUR FOR ALL POINTS DO NOT USE AESTHETICS => SETTING
#setting the point colour manually
r + geom_point(colour="DarkGreen")

#EX mapping - size is mapped to the BudgetMillions variable
r + geom_point(aes(size = BudgetMillions)) # the size of each point represents the budget of the film

#EX setting - we set the same size for all points on the plot
r + geom_point(size=10)

#HISTOGRAMS AND DENSITY CHARTS

s<-ggplot(data=movies, aes(x=BudgetMillions))
s + geom_histogram(binwidth = 10) #creating a histogram with geom_histogram()

#mapping the fill colour to Genre
s + geom_histogram(binwidth = 10, aes(fill=Genre))
#adding a border: setting the same border colour for all bins
s + geom_histogram(binwidth = 10, aes(fill=Genre), colour ="Black")

#creating a density chart
s + geom_density()
#adding colours
s + geom_density(aes(fill=Genre))
#stacking the density curves instead of letting them overlap - position
s + geom_density(aes(fill=Genre), position="stack")

#STARTING LAYER TIPS
t <- ggplot(data=movies, aes(x=AudienceRating))
#creating a histogram with geom_histogram()
t + geom_histogram(binwidth=10, fill="White", colour="Blue")

#another way: we can put the aesthetics in geom_histogram()
t <- ggplot(data=movies)
t + geom_histogram(binwidth=10,
aes(x=AudienceRating),
fill="White", colour="Blue")

#we can assign an empty ggplot() to a variable - it is called a skeleton plot
#this is useful when the layers of the plot will use different data sets
t<-ggplot()

#STATISTICAL TRANSFORMATION
#1. smoothed mean
u <- ggplot(data=movies, aes(x=CriticRating, y=AudienceRating, colour=Genre))
u + geom_point() + geom_smooth(fill=NA)

#boxplots
u <- ggplot(data=movies, aes(x=Genre, y=AudienceRating, colour=Genre))
u + geom_boxplot(size =1.2) + geom_point()
#tip / hack - using geom_jitter instead of geom_point()
u + geom_boxplot(size =1.2) + geom_jitter()

#making the boxes half transparent with the alpha parameter

u + geom_boxplot(size =1.2, alpha = 0.5) + geom_jitter()

#FACETS - dividing data into separate plots, one for each level of a factor
v <- ggplot(data=movies, aes(x=BudgetMillions))
v + geom_histogram(binwidth = 10, aes(fill=Genre), colour = "Black")

#facets:
v + geom_histogram(binwidth = 10, aes(fill=Genre), colour = "Black") +
facet_grid(Genre~., scales="free") #scales = "free": each histogram gets its own scale depending on its values
#by default the scales are the same for all plots

#scatterplots
w <- ggplot(data=movies, aes(x=CriticRating, y=AudienceRating, colour=Genre))
w + geom_point(size=3)
#adding some facets
w + geom_point(size=3) +
facet_grid(Genre~Year) #Genre~Year creates a grid of plots:
#Genre in rows, Year in columns, so for every Genre we get one plot per year

#facet_grid(Rows~Columns): the left side says what goes in rows, the right side what goes in columns
#a . on one side means there is no faceting in that direction

w + geom_point(aes(size=BudgetMillions)) +
facet_grid(Genre~Year)+
geom_smooth()

#COORDINATES - setting the range of values shown on the axes

c <- ggplot(data=movies, aes(x=CriticRating, y=AudienceRating, size=BudgetMillions,
colour=Genre))
c + geom_point()

#limiting the values presented on the axes - xlim(), ylim()
c + geom_point() +xlim(50, 100) +ylim(50,100) # x and y will show only 50 to 100
#setting xlim and ylim removes the observations outside the limits from the plot!!!

#zooming instead of cutting data off the plot
h <- ggplot(data=movies, aes(x=CriticRating, size=BudgetMillions, colour=Genre))
#coord_cartesian(ylim=c(0,50)) zooms the y axis to 0-50 without removing any data
h + geom_histogram(binwidth = 10, aes(fill=Genre), colour = "Black") +
coord_cartesian(ylim=c(0,50))

#improving facet plot
w + geom_point(aes(size=BudgetMillions)) +
facet_grid(Genre~Year)+
geom_smooth()+
coord_cartesian(ylim=c(0,100)) # zooming the plots to show only the range 0 to 100

#ADDING THEME

o <- ggplot(data = movies, aes(x=BudgetMillions))
h <- o + geom_histogram(binwidth = 10, aes(fill=Genre), colour="Black")

#1. adding axis labels - xlab(), ylab()
h + xlab("Money Axis") + ylab("Number of Movies")

#2. Label formatting
h + xlab("Money Axis") + ylab("Number of Movies") +
theme(axis.title.x = element_text(colour = "DarkGreen", size = 30),
axis.title.y=element_text(colour="Red", size =30))

#3. changing the tick labels for the x and y axes
h + xlab("Money Axis") + ylab("Number of Movies") +
theme(axis.title.x = element_text(colour = "DarkGreen", size = 30),
axis.title.y=element_text(colour="Red", size = 30),
axis.text.x = element_text(size = 20), #increase the size of the tick labels on x
axis.text.y = element_text(size = 20)) #increase the size of the tick labels on y

#4. Legend formatting
h + xlab("Money Axis") + ylab("Number of Movies") +
theme(axis.title.x = element_text(colour = "DarkGreen", size = 30),
axis.title.y=element_text(colour="Red", size = 30),
axis.text.x = element_text(size = 20), #increase the size of the tick labels on x
axis.text.y = element_text(size = 20), #increase the size of the tick labels on y
legend.title = element_text(size = 30), #increase the font size of the legend title
legend.text = element_text(size = 20), #increase the font size of the legend entries
legend.position = c(1,1), # setting position of legend
legend.justification = c(1,1)
)

#5. Adding the title of the plot
h +
xlab("Money Axis") + ylab("Number of Movies") + #naming the axes
ggtitle("Movie Budget Distribution") + #adding a title for the plot
theme(axis.title.x = element_text(colour = "DarkGreen", size = 30),
axis.title.y=element_text(colour="Red", size = 30),
axis.text.x = element_text(size = 20), #increase the size of the tick labels on x
axis.text.y = element_text(size = 20), #increase the size of the tick labels on y
legend.title = element_text(size = 30), #increase the font size of the legend title
legend.text = element_text(size = 20), #increase the font size of the legend entries
legend.position = c(1,1), # setting position of legend
legend.justification = c(1,1),
plot.title = element_text(colour="DarkBlue", size = 40, family ="Courier")
#setting the title colour, size and font type (family)
)
Saving a plot to a file

#saving the graph - GENERAL RULE

#List of drivers that can be used:
#jpeg(), png(), pdf(),
#win.metafile() - good for Word, Windows only
#postscript() - for Open Office

#1. Choose the format to be used and call the driver
jpeg("wykresB.jpg") #saving in the current working directory

#2. call the plot
par(bg="blue") # background for the entire graph; set before plot() is called
plot(c(-3,3), c(-1, 5), type = "n", xlab="os x", ylab="os y")
points(c(-2, 0, 2), c(0, 4, 2))

#3. close the device opened for saving the plot
dev.off()
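
The three steps (open the driver, draw, close the device) can be sketched end to end as below. The temporary output path and the file name wykres_demo.pdf are assumptions made for this example so that nothing is written to the working directory.

```r
# Hypothetical output path in a temporary directory (an assumption for this sketch)
out_file <- file.path(tempdir(), "wykres_demo.pdf")

pdf(out_file)                         # 1. open the file device
plot(c(-3, 3), c(-1, 5), type = "n",  # 2. draw; output goes to the file, not the screen
     xlab = "x axis", ylab = "y axis")
points(c(-2, 0, 2), c(0, 4, 2))
dev.off()                             # 3. close the device so the file is written

file.exists(out_file)                 # TRUE
```

Until dev.off() is called the file may be incomplete, which is why the device is closed before the file is checked.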

##Another way - copying the graph currently displayed on screen
dev.copy(pdf, "NowyWykres.pdf") # make a copy, passing the driver and the file name
dev.off() # close the file device that R opened automatically for the copy

#saving graphs to files


getwd()
setwd("C:\\Users\\Pc\\Desktop\\R files\\pliki")

#opening a file device - the file did not exist in the working directory before
pdf("wykres.pdf")

#checking the currently opened devices - we should see pdf in the list
dev.list()
#screen is named RStudioGD when using R studio
'''
RStudioGD png pdf
2 3 4
'''

#checking the active device


dev.cur() #it should point to pdf
'''
pdf
4
'''

#saving the displayed graph
#at first we need to re-establish the screen as the current device
dev.set(2)
#then we copy the screen to the file; the number 4 comes from the dev.list() output
dev.copy(which=4)

#closing the device
dev.off()

pdf("wykres2.pdf")
Arrays
While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes
a dim argument that sets the size of each dimension. In the example below we create an array of two
layers, each of which is a 3x3 matrix.

# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result −
, , 1

[,1] [,2] [,3]


[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"

, , 2

[,1] [,2] [,3]


[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"
[3,] "yellow" "green" "yellow"

#ARRAYS
f1 <- c(46, 21, 50)
f2 <- c (30, 25, 50)

firsttest <- cbind(f1, f2)


firsttest
colnames(firsttest) <-NULL

'''
[,1] [,2]
[1,] 46 30
[2,] 21 25
[3,] 50 50
'''

f3 <- c(46, 41, 50)


f4 <- c(43, 35, 50)
secondtest <- cbind(f3, f4)
secondtest
colnames(secondtest) <-NULL
'''
[,1] [,2]
[1,] 46 43
[2,] 41 35
[3,] 50 50
'''

#putting 2 matrices into an ARRAY data structure
tests <- array(data=c(firsttest, secondtest), dim=c(3,2,2))

#with dim=c(3,2,2) we specify 2 layers, each holding a matrix with 3 rows x 2 columns
attributes(tests)
#$dim 3 2 2
tests

'''
The array holds 2 matrices; layer (, , 1) is the one we passed first, in our case firsttest
, , 1

[,1] [,2]
[1,] 46 30
[2,] 21 25
[3,] 50 50

layer 2 of the array is the matrix we passed second, here secondtest

, , 2

[,1] [,2]
[1,] 46 43
[2,] 41 35
[3,] 50 50
'''

#accessing data from each matrix in the array
tests[3,2,1] # element in row 3, column 2 of matrix 1: 50
tests[1,2,2] # element in row 1, column 2 of matrix 2: 43
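
Leaving an index empty selects everything along that dimension, just as with matrices. A short sketch that rebuilds the same tests array so it can be run on its own:

```r
firsttest  <- cbind(c(46, 21, 50), c(30, 25, 50))
secondtest <- cbind(c(46, 41, 50), c(43, 35, 50))
tests <- array(data = c(firsttest, secondtest), dim = c(3, 2, 2))

tests[, , 2]            # the whole second layer, i.e. secondtest
tests[1, , ]            # row 1 across both layers: a 2x2 matrix
apply(tests, 3, mean)   # mean of each layer (margin 3 = the layer dimension)
```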

Factors
Factors are data objects used to categorize data and store it as levels. They can store both strings
and integers. They are useful for columns that have a limited number of unique values, such as "Male"/"Female" or
TRUE/FALSE, and they are widely used in data analysis for statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the number of levels.

# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')

# Create a factor object.


factor_apple <- factor(apple_colors)

# Print the factor.


print(factor_apple)
print(nlevels(factor_apple))
When we execute the above code, it produces the following result −
[1] green green yellow red red red green
Levels: green red yellow
[1] 3

# Create the vectors for data frame.


height <- c(132,151,162,139,166,147,122)
weight <- c(48,49,66,53,67,52,40)
gender <- c("male","male","female","female","male","female","male")

# Create the data frame.


input_data <- data.frame(height,weight,gender)
print(input_data)

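# Test if the gender column is a factor.
# (since R 4.0.0 data.frame() keeps strings as character by default, so this is TRUE only with stringsAsFactors = TRUE)
print(is.factor(input_data$gender))

# Print the gender column to see the levels.
print(input_data$gender)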

#Structure of data frame


str(input_data)

Generating Factor Levels

We can generate factor levels with the gl() function ("generate levels"). It takes two integers as
input: the number of levels and the number of times each level is repeated.

Syntax
gl(n, k, labels)

Following is the description of the parameters used −

n is an integer giving the number of levels.
k is an integer giving the number of replications.
labels is a vector of labels for the resulting factor levels.

v <- gl(3, 4, labels = c("Tampa", "Seattle","Boston"))
print(v)
'''
[1] Tampa Tampa Tampa Tampa Seattle Seattle Seattle Seattle Boston Boston Boston Boston
Levels: Tampa Seattle Boston
'''

rm(x)
x<-c(5,12,13,12)
#making a factor out of the x vector
xf<-factor(x)
is.factor(xf) #TRUE
xf #values 5 12 13 12 with levels 5 12 13

xff <-factor(x, levels=c(5,12,13,88, 100))
xff

'''
[1] 5 12 13 12
Levels: 5 12 13 88 100
'''
xff[2]<-88
xff
'''
[1] 5 88 13 12
Levels: 5 12 13 88 100
'''
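
Assigning 88 works above only because 88 was declared as a level in advance. The sketch below, which is an illustration and not part of the original notes, shows what a factor stores internally and what happens when the value is not a level; xf2 is a name introduced for the example.

```r
x <- c(5, 12, 13, 12)
xf <- factor(x)

levels(xf)        # "5" "12" "13": levels are stored as character labels
as.integer(xf)    # 1 2 3 2: internal codes, not the original numbers

# Recover the original numbers by going through the labels
as.numeric(as.character(xf))   # 5 12 13 12

# Assigning a value that is not a level produces NA (with a warning)
xf2 <- xf
suppressWarnings(xf2[2] <- 88)
is.na(xf2[2])     # TRUE
```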

#tapply() function

tapply(x,f,g)
x - vector, f - factor or list of factors, g - function
Each factor in f must have the same length as the vector x
ages<-c(25,26,55,37,21,42)
affils<-c("R", "D", "D", "R", "U", "D")
tapply(ages, affils, mean)
'''
D R U
41 31 21
'''
What happened: each element of the vector ages is assigned the corresponding group from affils,
and the mean is calculated separately for each group D, R, U
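
The same result can be reproduced in two explicit steps, which may make the grouping easier to see. A minimal sketch; groups, by_steps and by_tapply are names used only here.

```r
ages   <- c(25, 26, 55, 37, 21, 42)
affils <- c("R", "D", "D", "R", "U", "D")

# Step 1: group the ages by affiliation
groups <- split(ages, affils)   # list with elements D, R, U

# Step 2: apply the function to each group
by_steps <- sapply(groups, mean)

# tapply() does both steps in one call
by_tapply <- tapply(ages, affils, mean)
```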

d <- data.frame(list(gender=c("M", "M", "F", "M", "F", "F"),
age=c(47, 59, 21, 32, 33, 24),
income=c(55000, 88000, 32450, 76500, 123000, 45650)))
d

'''
gender age income
1 M 47 55000
2 M 59 88000
3 F 21 32450
4 M 32 76500
5 F 33 123000
6 F 24 45650
'''

#mark those over 25 years old (1 = over 25, 0 = 25 or younger)
d$over25 <- ifelse(d$age > 25, 1, 0)
d
'''
gender age income over25
1 M 47 55000 1
2 M 59 88000 1
3 F 21 32450 0
4 M 32 76500 1
5 F 33 123000 1
6 F 24 45650 0
'''

#tapply will calculate mean() for the income based on the factors gender and over25
tapply(d$income, list(d$gender, d$over25), mean)

'''
0 1
F 39050 123000.00
M NA 73166.67
'''

#the split() function

split(vector, factor)
vector - vector or data frame
factor - factor or list of factors

split(d$income, list(d$gender, d$over25))

For reference, the data frame d is:
'''
gender age income over25
1 M 47 55000 1
2 M 59 88000 1
3 F 21 32450 0
4 M 32 76500 1
5 F 33 123000 1
6 F 24 45650 0
'''
split() returns the income values for each combination of the factors provided:
Female and age <= 25
Male and age <= 25
Female and age > 25
Male and age > 25

'''
$F.0
[1] 32450 45650

$M.0
numeric(0)

$F.1
[1] 123000

$M.1
[1] 55000 88000 76500
'''

Data Frames
Data frames are tabular data objects. Unlike in a matrix, each column of a data frame can hold a different mode of data:
the first column can be numeric while the second is character and the third is logical. A data frame is a list
of vectors of equal length.
Data frames are created using the data.frame() function.

# Create the data frame.


BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26)
)
print(BMI)
When we execute the above code, it produces the following result −
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
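
That a data frame is a list of equal-length columns can be checked directly. A small sketch that rebuilds the same BMI data frame so it runs on its own:

```r
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  Age    = c(42, 38, 26)
)

is.list(BMI)          # TRUE: a data frame is a list of columns
sapply(BMI, class)    # one class per column; columns may differ in type
nrow(BMI); ncol(BMI)  # 3 rows, 4 columns

BMI$height[2]         # 171.5: column by name, then element
BMI[2, "height"]      # same value via [row, column]
```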

CREATING A DATA FRAME


#we have 3 vectors and would like to create a data frame from them
Countries_2012_Dataset <- c("Aruba","Afghanistan","Angola","Albania","United Arab Emirates",
"Argentina","Armenia","Antigua and Barbuda","Australia","Austria","Azerbaijan","Burundi",
"Belgium","Benin","Burkina Faso","Bangladesh","Bulgaria","Bahrain","Bahamas, The",
"Bosnia and Herzegovina","Belarus","Belize","Bermuda","Bolivia","Brazil","Barbados",
"Brunei Darussalam","Bhutan","Botswana","Central African Republic","Canada","Switzerland",
"Chile","China","Cote d'Ivoire","Cameroon","Congo, Rep.","Colombia","Comoros","Cabo Verde",
"Costa Rica","Cuba","Cayman Islands","Cyprus","Czech Republic","Germany","Djibouti","Denmark",
"Dominican Republic","Algeria","Ecuador","Egypt, Arab Rep.","Eritrea","Spain","Estonia",
"Ethiopia","Finland","Fiji","France","Micronesia, Fed. Sts.","Gabon","United Kingdom",
"Georgia","Ghana","Guinea","Gambia, The","Guinea-Bissau","Equatorial Guinea","Greece",
"Grenada","Greenland","Guatemala","Guam","Guyana","Hong Kong SAR, China","Honduras","Croatia",
"Haiti","Hungary","Indonesia","India","Ireland","Iran, Islamic Rep.","Iraq","Iceland","Israel",
"Italy","Jamaica","Jordan","Japan","Kazakhstan","Kenya","Kyrgyz Republic","Cambodia",
"Kiribati","Korea, Rep.","Kuwait","Lao PDR","Lebanon","Liberia","Libya","St. Lucia",
"Liechtenstein","Sri Lanka","Lesotho","Lithuania","Luxembourg","Latvia","Macao SAR, China",
"Morocco","Moldova","Madagascar","Maldives","Mexico","Macedonia, FYR","Mali","Malta",
"Myanmar","Montenegro","Mongolia","Mozambique","Mauritania","Mauritius","Malawi","Malaysia",
"Namibia","New Caledonia","Niger","Nigeria","Nicaragua","Netherlands","Norway","Nepal",
"New Zealand","Oman","Pakistan","Panama","Peru","Philippines","Papua New Guinea","Poland",
"Puerto Rico","Portugal","Paraguay","French Polynesia","Qatar","Romania","Russian Federation",
"Rwanda","Saudi Arabia","Sudan","Senegal","Singapore","Solomon Islands","Sierra Leone",
"El Salvador","Somalia","Serbia","South Sudan","Sao Tome and Principe","Suriname",
"Slovak Republic","Slovenia","Sweden","Swaziland","Seychelles","Syrian Arab Republic","Chad",
"Togo","Thailand","Tajikistan","Turkmenistan","Timor-Leste","Tonga","Trinidad and Tobago",
"Tunisia","Turkey","Tanzania","Uganda","Ukraine","Uruguay","United States","Uzbekistan",
"St. Vincent and the Grenadines","Venezuela, RB","Virgin Islands (U.S.)","Vietnam","Vanuatu",
"West Bank and Gaza","Samoa","Yemen, Rep.","South Africa","Congo, Dem. Rep.","Zambia","Zimbabwe")
Codes_2012_Dataset <-
c("ABW","AFG","AGO","ALB","ARE","ARG","ARM","ATG","AUS","AUT","AZE","BDI","BEL","BEN",
"BFA","BGD","BGR","BHR","BHS","BIH","BLR","BLZ","BMU","BOL","BRA","BRB","BRN","BTN","BWA",
"CAF","CAN","CHE","CHL","CHN","CIV","CMR","COG","COL","COM","CPV","CRI","CUB","CYM",
"CYP","CZE","DEU","DJI","DNK","DOM","DZA","ECU","EGY","ERI","ESP","EST","ETH","FIN","FJI",
"FRA","FSM","GAB","GBR","GEO","GHA","GIN","GMB","GNB","GNQ","GRC","GRD","GRL","GTM",
"GUM","GUY","HKG","HND","HRV","HTI","HUN","IDN","IND","IRL","IRN","IRQ","ISL","ISR","ITA",
"JAM","JOR","JPN","KAZ","KEN","KGZ","KHM","KIR","KOR","KWT","LAO","LBN","LBR","LBY",
"LCA","LIE","LKA","LSO","LTU","LUX","LVA","MAC","MAR","MDA","MDG","MDV","MEX","MKD","MLI",
"MLT","MMR","MNE","MNG","MOZ","MRT","MUS","MWI","MYS","NAM","NCL","NER","NGA","NIC",
"NLD","NOR","NPL","NZL","OMN","PAK","PAN","PER","PHL","PNG","POL","PRI","PRT","PRY","PYF",
"QAT","ROU","RUS","RWA","SAU","SDN","SEN","SGP","SLB","SLE","SLV","SOM","SRB","SSD",
"STP","SUR","SVK","SVN","SWE","SWZ","SYC","SYR","TCD","TGO","THA","TJK","TKM","TLS","TON",
"TTO","TUN","TUR","TZA","UGA","UKR","URY","USA","UZB","VCT","VEN","VIR","VNM","VUT",
"PSE","WSM","YEM","ZAF","COD","ZMB","ZWE")
Regions_2012_Dataset <- c("The Americas","Asia","Africa","Europe","Middle East",
"The Americas","Asia","The Americas","Oceania","Europe","Asia","Africa","Europe","Africa","Africa","Asia","Europe",
"Middle East","The Americas","Europe","Europe","The Americas","The Americas",
"The Americas","The Americas","The Americas","Asia","Asia","Africa","Africa",
"The Americas","Europe","The Americas","Asia","Africa","Africa","Africa",
"The Americas","Africa","Africa","The Americas","The Americas",
"The Americas","Europe","Europe","Europe","Africa","Europe","The Americas","Africa",
"The Americas","Africa","Africa","Europe","Europe","Africa","Europe","Oceania","Europe",
"Oceania","Africa","Europe","Asia","Africa","Africa","Africa","Africa","Africa","Europe",
"The Americas","The Americas","The Americas","Oceania","The Americas","Asia",
"The Americas","Europe","The Americas","Europe","Asia","Asia","Europe","Middle East",
"Middle East","Europe","Middle East","Europe","The Americas",
"Middle East","Asia","Asia","Africa","Asia","Asia","Oceania","Asia",
"Middle East","Asia","Middle East","Africa","Africa",
"The Americas","Europe","Asia","Africa","Europe","Europe","Europe","Asia","Africa","Europe",
"Africa","Asia",
"The Americas","Europe","Africa","Europe","Asia","Europe","Asia","Africa","Africa","Africa",
"Africa","Asia","Africa","Oceania","Africa","Africa",
"The Americas","Europe","Europe","Asia","Oceania","Middle East","Asia","The Americas",
"The Americas","Asia","Oceania","Europe","The Americas","Europe",
"The Americas","Oceania","Middle East","Europe","Europe","Africa",
"Middle East","Africa","Africa","Asia","Oceania","Africa",
"The Americas","Africa","Europe","Africa","Africa",
"The Americas","Europe","Europe","Europe","Africa","Africa",
"Middle East","Africa","Africa","Asia","Asia","Asia","Asia","Oceania",
"The Americas","Africa","Europe","Africa","Africa","Europe","The Americas",
"The Americas","Asia","The Americas","The Americas","The Americas","Asia","Oceania",
"Middle East","Oceania","Middle East","Africa","Africa","Africa","Africa")
#creating a data frame with the data.frame() function
mydf <- data.frame(Countries_2012_Dataset, Codes_2012_Dataset, Regions_2012_Dataset)
mydf

#head() shows the first 6 rows, which also lets us check the column names
head(mydf) #"Countries_2012_Dataset" "Codes_2012_Dataset" "Regions_2012_Dataset"

#changing the column names


colnames(mydf) <- c("Country", "Code", "Region")

#the same can be done when creating the data frame


rm(mydf) #removing the variable mydf

mydf <- data.frame(Country = Countries_2012_Dataset, Code = Codes_2012_Dataset,
                   Region = Regions_2012_Dataset)
head(mydf) #the column names are assigned right away: Country, Code, Region

summary(mydf)

#---MERGING DATA FRAMES


#both data frames hold different data for the same countries; Country.Code in stats
#matches Code in mydf
head(stats) # 5 columns
head(mydf) # 3 columns

#merging 2 data frames with merge() function

#we pass the data frames to join; by.x is the join column in the first data frame,
#by.y is the join column in the second data frame
merged <- merge(stats, mydf, by.x = "Country.Code", by.y="Code")
head(merged) # the result has 7 columns (stats originally had 5 columns, mydf had 3)
#R dropped Code from mydf: it was the join key, so it holds
#the same data as Country.Code

str(merged) # Country.Name and Country are duplicates holding the same data


'''
data.frame: 195 obs. of 7 variables:
$ Country.Code : Factor w/ 195 levels "ABW","AFG","AGO",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Country.Name : Factor w/ 195 levels "Afghanistan",..: 8 1 4 2 183 6 7 5 9 10 ...
$ Birth.rate : num 10.2 35.3 46 12.9 11 ...
$ Internet.users: num 78.9 5.9 19.1 57.2 88 ...
$ Income.Group : Factor w/ 4 levels "High income",..: 1 2 4 4 1 1 3 1 1 1 ...
$ Country : Factor w/ 195 levels "Afghanistan",..: 8 1 4 2 183 6 7 5 9 10 ...
$ Region : Factor w/ 6 levels "Africa","Asia",..: 6 2 1 3 4 6 2 6 5 3 ...
'''

#removing a column from the data frame
merged$Country <- NULL
str(merged)
head(merged)
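
The join above can be tried on a tiny scale. This is a minimal sketch with made-up values (the data frames `left` and `right` and their contents are hypothetical, not taken from the real data set); it shows the default inner join and the `all.x = TRUE` variant that keeps unmatched rows of the first data frame.

```r
# Two toy data frames whose key columns have different names (hypothetical values)
left <- data.frame(Country.Code = c("AGO", "POL", "USA"),
                   Birth.rate   = c(46.0, 9.6, 12.5))
right <- data.frame(Code   = c("POL", "USA", "ZWE"),
                    Region = c("Europe", "The Americas", "Africa"))

# Default merge is an inner join: only keys present in both data frames are kept
inner <- merge(left, right, by.x = "Country.Code", by.y = "Code")

# all.x = TRUE keeps every row of the first data frame;
# rows without a match get NA in the columns coming from `right`
keep_left <- merge(left, right, by.x = "Country.Code", by.y = "Code", all.x = TRUE)

print(inner)
print(keep_left)
```

As in the notes, the key column of the second data frame (`Code`) does not appear in the result; only `Country.Code` is kept.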

Using $ sign

#using $ sign

head(stat)

#selecting entire row for Angola


stat[3,]

#selecting specific value for Angola


stat[3,3]

#as the columns have names, we can select the same value by row number and column name
stat[3, "Birth.rate"]

# $ works on data frames: follow it with any column name and it returns that column as a vector
stat$Internet.users

#the same we can get by providing following


stat[,"Internet.users"]

#but with $ we can easily access a single value of the selected vector
stat$Internet.users[2]

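#levels() returns the distinct categories (levels) of a factor column
#note: since R 4.0 read.csv() reads text columns as character by default, so levels()
#returns NULL unless the file is read with stringsAsFactors = TRUE
levels(stat$Income.Group) # "High income" "Low income" "Lower middle income" "Upper middle income"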

Operations with data frames

#subsetting

stat[1:10,] #selecting first 10 rows


stat[c(4,100),] # selecting only 4th and 100th row

#when we select 1 row the result remains a data frame
is.data.frame(stat[1,]) # TRUE
#when we select 1 column it is no longer a data frame
is.data.frame(stat[,1]) # FALSE
is.data.frame(stat[,1, drop=FALSE]) # TRUE; drop=FALSE keeps the data frame structure
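
The same behaviour can be checked without the demographic file. A small sketch on a made-up data frame (`df` and its columns are hypothetical) comparing the three common ways of taking one column:

```r
df <- data.frame(id = 1:3, name = c("a", "b", "c"))  # hypothetical toy data

col_vec <- df[, "name"]                # a plain vector, the data frame structure is dropped
col_df  <- df[, "name", drop = FALSE]  # still a data frame with one column
col_dd  <- df[["name"]]                # the same vector that df$name returns

is.data.frame(col_vec)
is.data.frame(col_df)
identical(col_dd, df$name)
```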

#arithmetic operations on columns of data frame


stat$Birth.rate * stat$Internet.users
stat$Birth.rate + stat$Internet.users
stat$Birth.rate - stat$Internet.users
stat$Birth.rate / stat$Internet.users

#adding column to data frame

head(stat) # columns we have: Country.Name Country.Code Birth.rate Internet.users Income.Group

#assigning values to a column name that does not exist yet adds that column
stat$NewColum <- stat$Birth.rate * stat$Internet.users
head(stat) # columns now: Country.Name Country.Code Birth.rate Internet.users Income.Group NewColum

#removing a column
stat$NewColum <- NULL
head(stat) # columns again: Country.Name Country.Code Birth.rate Internet.users Income.Group

#adding rows to a data frame

To add rows permanently to an existing data frame, the new rows must have the same
structure (column names and types) as the existing data frame; rbind() then appends them.

In the example below we create a data frame with new rows and combine it with the
existing data frame to create the final data frame.

# Create vector objects.


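city <- c("Tampa","Seattle","Hartford","Denver")
state <- c("FL","WA","CT","CO")
#zip codes are kept as text so that the leading zero in "06161" is preserved
zipcode <- c("33602","98104","06161","80294")

# Combine above three vectors into one data frame.
#(cbind() on character vectors would return a matrix, so data.frame() is used instead)
addresses <- data.frame(city, state, zipcode, stringsAsFactors = FALSE)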

# Print a header.
cat("# # # # The First data frame\n")

# Print the data frame.


print(addresses)

# Create another data frame with similar columns


new.address <- data.frame(
city = c("Lowry","Charlotte"),
state = c("CO","FL"),
zipcode = c("80230","33949"),
stringsAsFactors = FALSE
)

# Print a header.
cat("# # # The Second data frame\n")

# Print the data frame.


print(new.address)

# Combine rows from both data frames.


all.addresses <- rbind(addresses,new.address)

# Print a header.
cat("# # # The combined data frame\n")

# Print the result.


print(all.addresses)

#filtering data frames

head(stat)
stat$Internet.users < 2 # returns a vector of TRUE and FALSE values
filter <- stat$Internet.users < 2

stat[filter,] #a logical vector in the row position keeps only the rows where it is TRUE

stat[stat$Birth.rate > 40, ] # select all rows where Birth.rate > 40
#stat$Birth.rate > 40 returns a vector of TRUE and FALSE values

stat[stat$Birth.rate > 40 & stat$Internet.users < 2, ] # shows only rows where both conditions are TRUE

stat[stat$Income.Group == "High income", ] # shows only the High income countries

#to check which values are available in column Income.Group
levels(stat$Income.Group)
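
One thing worth keeping in mind with logical filtering: if the tested column contains NA, the logical vector contains NA too, and indexing with it returns an all-NA row. A minimal sketch on made-up values (the data frame `df` is hypothetical) showing the effect and the usual `which()` workaround:

```r
df <- data.frame(country = c("A", "B", "C", "D"),
                 users   = c(1.5, NA, 40, 0.9))  # hypothetical values

mask <- df$users < 2          # TRUE NA FALSE TRUE

with_na <- df[mask, ]         # the NA in the mask produces an extra all-NA row
no_na   <- df[which(mask), ]  # which() keeps only the positions that are TRUE

nrow(with_na)
nrow(no_na)
```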

Analyzing Data Frame

Exploring data - using the built-in data set iris

data(iris)
#listing column names
names(iris)
#getting the min, max, mean, etc values for each column
summary(iris)

#examine the distribution of a quantitative variable

hist(iris$Sepal.Length)

#visualize the descriptive statistics

boxplot(iris$Sepal.Length, main="Summary of iris", xlab="Sepal Length")
#main - plot title
#xlab - x-axis label

#relation between 2 variables

plot(iris$Sepal.Length, iris$Sepal.Width)

#using barplot with the built-in mtcars data set


data(mtcars)
names(mtcars)
str(mtcars)
summary(mtcars)

#table() counts the observations in each category of a categorical variable

count <- table(mtcars$gear)
count
#in our case we get the number of cars with gear = 3, gear = 4 and gear = 5
'''
3 4 5
15 12 5
'''

barplot(count, main="cars", xlab="Number of Gears")


#changing barplot display to horizontal
barplot(count, main="cars", xlab="Number of Gears", horiz=TRUE)
#changing the color
barplot(count, main="cars", xlab="Number of Gears", horiz=TRUE, col="red")

data exploration with dplyr package

library(datasets)
library(dplyr)

#using a built-in data set


head(airquality)

#select few columns by a name


?select # function from dplyr,selecting only given columns from data set
#select(data.frame, columns)

#1. Selecting only few columns from entire data.frame


selecn=select(airquality, Ozone, Month)

head(selecn) # the new data set consists of only 2 columns: Ozone, Month

#2. Moving a column to the front of the data.frame


head(airquality)
#Ozone Solar.R Wind Temp Month Day
#we move the Temp column to the front and then select all other columns
selecn2 <- select(airquality, Temp, everything())
head(selecn2)
#Temp Ozone Solar.R Wind Month Day

#3. Drop variable / column with "-"


selecn3 <- select(airquality, -Temp)
head(selecn3) #Ozone Solar.R Wind Month Day; Temp column was removed

#filter() function in dplyr - returns rows with matching conditions
?filter

filter(airquality, Ozone > 25)


filter(airquality, Ozone > 25 & Temp >75)

#adding new column/variable with mutate()


?mutate
#TempInc holds Temp converted from Fahrenheit to Celsius
dm = mutate(airquality, TempInc = (Temp-32) * 5 /9)
head(dm)

#summarize and group by data


?summarise
summarise(airquality, mean(Ozone, na.rm=TRUE)) #42.12931, the mean value of Ozone

#mean Wind value grouped by month
summarise(group_by(airquality, Month), mean(Wind, na.rm=TRUE))
'''
1 5 11.6
2 6 10.3
3 7 8.94
4 8 8.79
5 9 10.2
'''

#using pipe operator

head(select(airquality, Ozone, Month))


#the equivalent would be
airquality %>% select(Ozone, Month) %>% head
#the %>% operator passes the result on its left as the first argument of the function on its right

#data summary with %>%


airquality %>%
summarise(avg = mean(Ozone, na.rm = TRUE),
min = min (Ozone, na.rm = TRUE),
max = max (Ozone, na.rm = TRUE),
total = n()
)
'''
avg min max total
1 42.12931 1 168 153
'''

#summarizing by month
airquality %>%
group_by(Month) %>%
summarise(mean(Temp, na.rm=TRUE))
'''
1 5 65.5
2 6 79.1
3 7 83.9
4 8 84.0
5 9 76.9
'''

#excluding month 5 (May) before grouping


airquality %>%
filter(Month !=5) %>%
group_by(Month) %>%
summarise(mean(Temp, na.rm=TRUE))
'''
1 6 79.1
2 7 83.9
3 8 84.0
4 9 76.9
'''

#filtering the data


airquality %>%
select(Month, Ozone, Wind) %>%
filter(Wind >12) %>%
head
data exploration and visualization with dplyr and ggplot2

gather() - Sometimes data is unstacked: a common attribute of interest is spread out
across several columns. To reformat the data so that this attribute becomes a single
variable, gather() takes multiple columns and collapses them into key-value pairs,
duplicating all other columns as needed.
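
Before running gather() on the real file, it may help to see it on a two-row table. This is only a sketch: the data frame `wide` and its values are made up, and it assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical wide table: one score column per year
wide <- data.frame(Country  = c("X", "Y"),
                   CPI.2012 = c(90, 40),
                   CPI.2013 = c(91, 38))

# Collapse the two year columns into key-value pairs;
# Country is repeated for every gathered value
long <- gather(wide, key = "year", value = "cpi", CPI.2012:CPI.2013)
print(long)
```

Each of the 2 rows is repeated once per gathered column, so `long` has 4 rows with the columns Country, year and cpi. In newer tidyr versions gather() is superseded by pivot_longer(), but it still works.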

setwd("C:\\Users\\Pc\\Desktop\\R files\\pliki")
getwd()

library(ggplot2)
library(tidyr)
library(dplyr)

cpi = read.csv("cpi.csv")
head(cpi)
names(cpi)

#Gather columns
?gather #from the tidyr package
#takes multiple columns and collapses them into key-value pairs

#gather(data, key="key", value="values")


cpi_history <-
gather(cpi, year, cpi, CPI.2012.Score : CPI.2016.Score, na.rm=TRUE)

head(cpi_history) #we get tidy data


'''
CPI.2016.Rank Country Country.Code Region year cpi
1 1 New Zealand NZL Asia Pacific CPI.2012.Score 90
2 1 Denmark DNK Europe and Central Asia CPI.2012.Score 90
3 3 Finland FIN Europe and Central Asia CPI.2012.Score 90
4 4 Sweden SWE Europe and Central Asia CPI.2012.Score 88
5 5 Switzerland CHE Europe and Central Asia CPI.2012.Score 86
6 6 Norway NOR Europe and Central Asia CPI.2012.Score 85
'''
#collect the top 15 countries

top2016 <- cpi_history %>% filter(year=="CPI.2016.Score") %>% top_n(15, cpi)


#adding new column
top2016$rnk = "top"

#collect bottom 15 countries


bot2016 <- cpi_history %>% filter(year=="CPI.2016.Score") %>% top_n(-15, cpi)
#adding new column
bot2016$rnk = "bot"

#combine both sets


dt2016 <- rbind(top2016, bot2016)
head(dt2016)
tail(dt2016)

#plot the data


library(ggplot2)

ggplot(dt2016, aes(reorder(Country, cpi), cpi)) +


geom_bar(stat="identity", aes(fill=rnk)) +
coord_flip() + #flipping the axes so the bars are horizontal
xlab("") +
ggtitle("Top and Bottom CPI's in 2016") +
scale_fill_manual(values = c("top" = "blue", "bot" = "red"), name="CPI") +
guides(fill=guide_legend(reverse=TRUE))
Working with time series data
#Working with temporal data - time series

getwd()
stocks = read.csv("5stocks.csv")
head(stocks) #data for stock price movement Jul 2001 - May 2017

smove <- stocks[, 2:6] #removing the column date


head(smove)
ncol(smove)
nrow(smove)

#removing the NA data


smove <- na.omit(smove)

#convert data to time series object


?ts # function used to create time series object

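#ts(data, start, end, frequency), frequency = the number of observations per unit of time

#250 = approximate number of trading days in a year
#note: with frequency=250 the second element of start/end is the position within
#the year (1..250), not a calendar month
myts <- ts(smove, start=c(2001, 7), end=c(2017, 5), frequency=250)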
plot(myts)

#getting details regarding time series data


start(myts)
end(myts)
frequency(myts)

#subsetting time series


?window
#window() - extracts the subset of the object x observed between the times start and end.
#If a frequency is specified, the series is then re-sampled at the new frequency.
myts2 <- window(myts, start=c(2014, 6), end=c(2014, 12))
plot(myts2)

plot.ts(myts2)
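
How `start = c(year, k)` is interpreted depends on `frequency`. A minimal sketch with a made-up monthly series (the values 1..36 are placeholders, not stock prices), where `k` really is the month number:

```r
# Hypothetical monthly series: 36 values starting in January 2014
x <- ts(1:36, start = c(2014, 1), frequency = 12)

# With frequency = 12 the second element of c(year, k) is the month
sub <- window(x, start = c(2014, 6), end = c(2014, 12))

start(sub)       # 2014 6
end(sub)         # 2014 12
length(sub)      # 7 observations: June through December
as.numeric(sub)  # 6 7 8 9 10 11 12
```

With daily-style data and `frequency = 250`, the same call would select observations 6 to 12 of the year rather than June to December.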

#######
install.packages("devtools")
library(devtools)
#install_github() - function from the devtools package that installs R packages hosted
#on GitHub; it needs the developer's account name as well as the package name.
#install_github("DeveloperName/PackageName")
install_github("sinhrks/ggfortify")

library(ggfortify)
autoplot(myts) #plotting the time series on facets

autoplot(myts, facets=FALSE)#plotting time series on 1 chart

######
#Sudden changes in values over a period of time
g <- read.csv("growth-in-gdp.csv")
head(g)
names(g)

n <- c("TIME", "Country", "Value")


df <- g[, colnames(g) %in% n] #selecting only the 3 columns named in the vector n
head(df)

#the same can be done by listing the column names in a vector, but %in% may be more
#convenient when the number of columns to keep is large
data<-df[,c("Country", "Value")]
head(data)

#getting the list of countries in data frame


unique(data$Country)

#getting data for a single country


jp <- subset(data, Country == "Japan")
head(jp)

#CREATING A TIME SERIES OUT OF THE DATA FRAME

#1. Unlist the data -> create a vector of values


?unlist

#unlist(x) - returns a vector with all values that are present in x


data <- as.numeric(unlist(jp[2]))
head(data)

#2. Make a time series from vector of values


data <- ts(data, start=c(1985), end=c(2015), frequency =1) #frequency = 1 because the data is yearly
plot(data)

#DETECTING THE CHANGE IN TREND IN DIFFERENT POINTS OF TIME


install.packages("changepoint")
library(changepoint)
?cpt.mean #change point detection
#Calculates the optimal positioning and (potentially) number of changepoints for data
#using the user-specified method
cm <- cpt.mean(data)
print(cm)
plot(cm) #the plot marks the points where the mean value changed suddenly

Variables

Variable naming:

Variable name         Validity  Reason
var_name2.            valid     Has letters, numbers, dot and underscore.
var_name%             invalid   Has the character '%'. Only dot (.) and underscore are allowed.
2var_name             invalid   Starts with a number.
.var_name, var.name   valid     Can start with a dot (.), but the dot should not be followed by a number.
.2var_name            invalid   The starting dot is followed by a number.
_var_name             invalid   Starts with _, which is not valid.
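
The naming rules above can be checked from R itself. This is one possible approach, not the only one: make.names() rewrites any string into a syntactically valid name, so a candidate that comes back unchanged was already valid.

```r
candidates <- c("var_name2.", "var_name%", "2var_name", ".var_name",
                "var.name", ".2var_name", "_var_name")

# make.names() repairs invalid names (e.g. by prefixing "X" or replacing characters);
# a string that is returned unchanged was a valid name to begin with
is_valid <- make.names(candidates) == candidates

print(data.frame(name = candidates, valid = is_valid))
```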

Variable assignment

Variables can be assigned values using the leftward (<-), rightward (->) and equal (=) operators. The values of the variables can
be printed using the print() or cat() function. The cat() function combines multiple items into a continuous print output.

# Assignment using equal operator.


var.1 = c(0,1,2,3)

# Assignment using leftward operator.


var.2 <- c("learn","R")

# Assignment using rightward operator.


c(TRUE,1) -> var.3

print(var.1)
cat ("var.1 is ", var.1 ,"\n")
cat ("var.2 is ", var.2 ,"\n")
cat ("var.3 is ", var.3 ,"\n")

Printing variables using cat() function

var_x <- "Hello"


cat("The class of var_x is ",class(var_x),"\n")
>>The class of var_x is character

var_x <- 34.5


cat(" Now the class of var_x is ",class(var_x),"\n")
Now the class of var_x is numeric

var_x <- 27L


cat(" Next the class of var_x becomes ",class(var_x),"\n")
Next the class of var_x becomes integer

Finding Variables
To know all the variables currently available in the workspace we use the ls() function. Also the ls() function can use
patterns to match the variable names.

print(ls())
[1] "my var" "my_new_var" "my_var" "var.1"
[5] "var.2" "var.3" "var.name" "var_name2."
[9] "var_x" "varname"

The ls() function can use patterns to match variable names.

# List the variables whose names contain the pattern "var".

print(ls(pattern = "var"))
[1] "my var" "my_new_var" "my_var" "var.1"
[5] "var.2" "var.3" "var.name" "var_name2."
[9] "var_x" "varname"
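
The pattern given to ls() is a regular expression, so "var" matches anywhere in the name. A short sketch, using a scratch environment `e` with three made-up variables (an assumption for illustration, so the real workspace is not touched), shows how to anchor the match at the start:

```r
e <- new.env()   # scratch environment, so the example does not touch the workspace
assign("var.1", 1, envir = e)
assign("my_var", 2, envir = e)
assign(".hidden_var", 3, envir = e)

anywhere <- ls(envir = e, pattern = "var")    # "var" anywhere in the name
at_start <- ls(envir = e, pattern = "^var")   # "^" anchors the match at the start
with_dot <- ls(envir = e, all.names = TRUE)   # also lists names beginning with a dot

anywhere
at_start
with_dot
```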

Variables starting with a dot (.) are hidden; they can be listed by passing the all.names = TRUE argument to ls().

print(ls(all.names = TRUE))
[1] ".cars" ".Random.seed" ".var_name" ".varname" ".varname2"
[6] "my var" "my_new_var" "my_var" "var.1" "var.2"
[11]"var.3" "var.name" "var_name2." "var_x"

Deleting variables

Variables can be deleted by using the rm() function.

rm(var.3)
print(var.3)
Error in print(var.3) : object 'var.3' not found

All the variables can be deleted by using the rm() and ls() functions together.

rm(list = ls())
print(ls())
>> character(0)

Importing data into R

Reading text files - readLines()

readLines() returns a character vector; each line of the text file becomes one element of the vector.

z1 <- readLines("z1.txt")
z1 #[1] "John 25" "Mary 28" "Jim 19"
z1[1]#"John 25"
Reading line by line with readLines()
#1. opening the connection; an open connection lets R keep track of the position in
#the file and detect when EOF is reached
c<-file("z1.txt", "r")

#2. Reading line by line by specifying the argument n=1


readLines(c, n=1)
#[1] "John 25"
readLines(c, n=1)
#[1] "Mary 28"
readLines(c, n=1)
#[1] "Jim 19"
readLines(c, n=1)
#character(0)

#Detecting the EOF


d <- file("z1.txt", "r")
while(TRUE){
line <- readLines(d, n=1)
if (length(line) == 0) {
print("EOF reached")
break
} else print(line)
}

#to move back to the beginning of the file we can use the seek() function
c<-file("z1.txt", "r")
readLines(c, n=2) #"John 25" "Mary 28"
# moving back to the beginning of the file
seek(con=c, where = 0) # where = 0 puts the file pointer zero bytes from the start of the file
readLines(c, n=1) # "John 25"
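
The examples above leave their connections open. A self-contained sketch of the same read-until-EOF loop that also closes the connection; it writes a temporary stand-in for z1.txt (the temp file is an assumption made so that the code runs anywhere):

```r
path <- tempfile(fileext = ".txt")   # temporary stand-in for z1.txt
writeLines(c("John 25", "Mary 28", "Jim 19"), path)

con <- file(path, "r")
lines_read <- character(0)
repeat {
  line <- readLines(con, n = 1)
  if (length(line) == 0) break       # a zero-length result signals EOF
  lines_read <- c(lines_read, line)
}
close(con)                           # release the connection when done
unlink(path)                         # remove the temporary file

lines_read
```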

Reading csv data from file


#importing data into R
?read.csv()
#the function returns a data frame

#1 Selecting the file manually


stat <- read.csv(file.choose()) # file.choose() opens a dialog window where we can pick the file
stat

#2 Set Working Directory and Read Data

getwd() # returns the current Working Directory # "C:/Users/Pc/Documents"
#Set new Working Directory on Windows
setwd("C:\\Users\\Pc\\Desktop\\R files")
getwd()

#once the Working Directory is changed, we can pass just the name of a file located there
stats2 <- read.csv("DemographicData.csv")
stats2
getwd() #"C:/Users/Pc/Documents"
setwd("C:\\Users\\Pc\\Desktop\\R files")
getwd()
input<-read.csv("input.csv")
input
colnames(input) = c("id", "name", "salary", "start_date", "dept")
input

Example: reading a csv file and analysing the data

#1. getting the max salary for the employees


sal <- max(input$salary); sal
#2. listing the person having max salary
persons <- subset(input, salary == max(salary)); persons
persons <- subset(input, salary == sal); persons
#3 Get all people from IT
itguys <- subset(input, dept =="IT"); itguys
#4 Get person from IT whose salary is greater than 600
itguys600 <- subset(input, dept == "IT" & salary > 600); itguys600
#5 get people who joined after 2014
persons2014 <-subset(input, as.Date(start_date) > as.Date("2014-01-01")); persons2014
'''
id name salary start_date dept
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
8 8 Guru 722.50 2014-06-17 Finance
'''

Reading from online csv files


#Using RCurl to read csv data hosted online on GitHub and other sites

#install.packages("RCurl")
library(RCurl) #allows reading data from online sources

data1= read.csv(text=getURL("https://raw.githubusercontent.com/sciruela/Happiness-Salaries/master/data.csv"))
head(data1)
summary(data1)

#sometimes the first x lines of a file do not contain data; we can
#skip them using the skip argument
rm(data2)
data2=read.csv(text=getURL("https://raw.githubusercontent.com/opetchey/RREEBES/master/Beninca_etal_2008_Nature/data/nutrients_original.csv"), skip=7, header=TRUE)
head(data2)
summary(data2)

data3=read.csv(text=getURL("https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/246663/pmgiftsreceivedaprjun13.csv"))
head(data3)

#read data from Google Sheets

##https://docs.google.com/spreadsheets/d/1X85PgCqavImGZPGdbrEfQc6uzNP98KXwZcf-8JsKZto/edit#gid=8983842

install.packages("googlesheets")
library(googlesheets)

#authentication with a Google account


gs_ls()

# get the Britain Elects google sheet


be = gs_title("Britain Elects / Public Opinion")
?gs_title

#the Google Sheet we are accessing contains multiple worksheets
# list worksheets
gs_ws_ls(be)

# get Westminster voting


west = gs_read(ss=be, ws = "Westminster voting intentions", skip=1)
?gs_read

# convert to data.frame
wdf = as.data.frame(west)

head(wdf)

Writing to csv
write.csv(persons2014, "output.csv")
checkexport <-read.csv("output.csv"); checkexport
'''
X id name salary start_date dept
1 3 3 Michelle 611.00 2014-11-15 IT
2 4 4 Ryan 729.00 2014-05-11 HR
3 5 5 Gary 843.25 2015-03-27 Finance
4 8 8 Guru 722.50 2014-06-17 Finance
'''
#the export added an extra column X holding the row names, which is meaningless here;
#it can be dropped by passing row.names = FALSE when writing the file

write.csv(persons2014, "output2.csv", row.names = FALSE)


checkexport2 <-read.csv("output2.csv"); checkexport2
'''
id name salary start_date dept
1 3 Michelle 611.00 2014-11-15 IT
2 4 Ryan 729.00 2014-05-11 HR
3 5 Gary 843.25 2015-03-27 Finance
4 8 Guru 722.50 2014-06-17 Finance
'''

#file.choose()
#selecting the file location manually
write.csv(persons2014, file.choose(), row.names = FALSE)
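
The effect of row.names can be reproduced without the employee data. A small round-trip sketch using temporary files and a made-up two-row data frame (both are assumptions for illustration):

```r
df <- data.frame(id = c(3, 4), name = c("Michelle", "Ryan"))  # small stand-in data
with_rn    <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")

write.csv(df, with_rn)                        # row names are written as a first, unnamed column
write.csv(df, without_rn, row.names = FALSE)  # row names are left out

cols_with    <- names(read.csv(with_rn))      # "X" "id" "name"
cols_without <- names(read.csv(without_rn))   # "id" "name"

cols_with
cols_without
```

The unnamed first column is what read.csv() reports as X.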

Reading excel file


#Reading excel file
getwd()
setwd("C:\\Users\\Pc\\Desktop\\R files")
#1. Install package xlsx
install.packages("xlsx")
#2 Verify that package is installed
any(grepl("xlsx", installed.packages())) # TRUE
#3. Load library into workspace
library(xlsx)

#4. Reading the excel file

#reading the 1st sheet of the employees file
data1 <- read.xlsx("employees.xlsx", sheetIndex =1)
data1
#reading the 2nd sheet of the employees file
data2 <- read.xlsx("employees.xlsx", sheetIndex =2)
data2
Saving to excel
dataout <- subset(data1, dept == "IT"); dataout

#5 saving to xlsx
write.xlsx(dataout, "FilteredData.xlsx", row.names = FALSE)

#6 We can also choose the path and name the file manually using file.choose()
write.xlsx(dataout, file.choose(), row.names = FALSE)

Reading from xml file

install.packages("XML")
#load xml library
library("XML")
#load other required packages
library("methods")

#reading the xml file xmlParse()


result<-xmlParse(file="pracownicy.xml"); result
print(result)

#check the number of nodes


#1.Extract the root node from xml
rootnode <-xmlRoot(result);
#2.Find number of nodes in the root
rootsize <-xmlSize(rootnode); rootsize#8

#Return the name of the node


xmlName(rootnode) #"RECORDS"
#how many children in node
xmlSize(rootnode)

#details for 1st node


rootnode[1]
print(rootnode[1])

#xml to data frame


xmldataframe <- xmlToDataFrame("pracownicy.xml")
xmldataframe
colnames(xmldataframe) <- c("ID", "IMIE", "ZAROBKI", "DATAZATRUDNIENIA", "DZIAL")
xmldataframe

Reading JSON data

#READING JSON DATA


#install rjson package
install.packages("rjson")

#load the package required to read JSON files


library("rjson")

#read data from JSON file - fromJSON()


dataj <- fromJSON(file="dane.json")
dataj

#convert JSON to data frame


jdataframe <-as.data.frame(dataj)
jdataframe
Accessing the json data from the web
library(rjson)

#variable pointing to the JSON data on the web


json_file <-
"http://api.worldbank.org/country?per_page=10&region=OED&lendingtype=LNX&format=json"

#reading the json file


json_data <-fromJSON(file=json_file)
json_data

jdataframe <- as.data.frame(json_data)


jdataframe

#the JSON data contains a list of 2 elements

json_data[[1]] # header
json_data[[2]] #economic data of different countries

#to access particular fields of the JSON data we can use the following:
#from the 2nd element of json_data we select id and iso2Code
d3 <- lapply(json_data[[2]], function(x) c(x["id"], x["iso2Code"]))
d3
'''
the data is still not very readable

[[1]]
[[1]]$`id`
[1] "AUS"

[[1]]$iso2Code
[1] "AU"
'''
#do.call() lets you call any R function with its arguments held in a list
#instead of writing them out one by one
d3r <- do.call(rbind, d3) # we build a matrix from all the data in d3
d3r

#let's get more data for each of the country


d4 <- lapply(json_data[[2]], function(x) c(x["id"], x["iso2Code"], x$region["id"],
x$region["value"], x["capitalCity"]))
d4 <- do.call(rbind, d4) # creating the matrix
d4
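
The lapply() plus do.call(rbind, ...) pattern can be tried offline. A minimal sketch on a hand-written nested list shaped roughly like the parsed country records (the list `records` is a made-up stand-in, not the actual API response):

```r
# Hypothetical nested list shaped like the parsed country records
records <- list(
  list(id = "AUS", iso2Code = "AU", capitalCity = "Canberra"),
  list(id = "AUT", iso2Code = "AT", capitalCity = "Vienna")
)

# Keep two fields per record; each result is a small named list
picked <- lapply(records, function(x) c(x["id"], x["iso2Code"]))

# do.call(rbind, picked) is the same as rbind(picked[[1]], picked[[2]])
m <- do.call(rbind, picked)

dim(m)              # 2 rows, 2 columns
unlist(m[, "id"])   # "AUS" "AUT"
```

The result is a matrix whose cells are list elements, which is why unlist() is used to get a plain character vector out of a column.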

Reading HTML files (rvest library)

install.packages("rvest")
library(rvest)

#we want to read the table shown on the page


url <- "https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table"

#%>% explanation - it passes the value on its left as the first argument of the function on its right


#iris %>% head() is equivalent of head(iris)
#Thus iris %>% head() %>% summary() is equivalent to summary(head(iris))

medal_tally <- url %>% read_html() %>%


html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>%
html_table(fill=TRUE)
## copy xpath
#//*[@id="mw-content-text"]/div/table[2]
medal_tally
#the entire table is kept as the first element of the list
medal_tally <- medal_tally[[1]]
head(medal_tally)

#WHS Sites in the UK

url2="https://en.wikipedia.org/wiki/List_of_World_Heritage_Sites_in_the_United_Kingdom_and_the_British_Overseas_Territories"

whsuk <- url2 %>% read_html() %>%


html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>% html_table(fill=TRUE)

whsuk <- whsuk[[1]]


head(whsuk)

How to find xpath?

Open the page and find the table. Right-click it and choose Inspect.

In the HTML code, look for the table element; hovering over it should highlight the
table on the page.
Right-click > Copy > Copy XPath

Reading data from online HTML tables
library(XML)
library(RCurl)

#path to a page that contains a table with data besides text and images


url <- "https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table"

#get data from this URL


urldata <-getURL(url)

#read HTML table


data <- readHTMLTable(urldata, stringsAsFactors = FALSE)

#checking what the data looks like

names(data)
head(data)

#besides the table we want, other content from the page was read as well

#we need to find the element that holds the table

x=data$`2016 Summer Olympics medal table`
#the variable x now holds the data from the table
head(x)
tail(x)

Exploring imported data


#Exploring imported data
stat <- read.csv("DemographicData.csv")
#checking the number of rows
nrow(stat)
#checking number of columns
ncol(stat)
#checking top 6 rows - shows column names + first 6 rows
head(stat)
#we can check more rows if needed
head(stat, n=10)
#checking last 6 rows - shows column names + last 6 rows
tail(stat)

#shows the structure of the data frame: number of rows and columns and information
#about each column
str(stat)

#shows min, max, mean, median for numeric variables; how many rows fall into each
#category for factors
summary(stat)

Input / Output
Input – scan() – reading text files

#accessing the keyboard and monitor


#1 scan() function

getwd()
setwd("C:\\Users\\Pc\\Desktop\\R files\\pliki")

'''
1.txt:
123
4 5
6

2.txt:
123
4.2 5
6

3.txt:
abc
de f
g

4.txt:
abc
123 6
y
'''
#scan() - by default the scan function expects numeric values as input from the keyboard
#the parameter what="" tells it that we will be providing character data

names=scan(what="")
joe fred bob john
sam sue robin

names#"joe" "fred" "bob" "john" "sam" "sue" "robin"

#if the what= argument is a list containing examples of the expected data types, scan
#returns a list with as many elements as there are data types provided

names2 = scan(what=list(a=0, b="", c=0))


1 dog 3
2 cat 5
3 duck 7

names2
'''
$`a`
[1] 1 2 3
$b
[1] "dog" "cat" "duck"
$c
[1] 3 5 7
'''

scan("1.txt") #we receive a numeric vector
#[1] 123 4 5 6

scan("2.txt") #we receive a vector of doubles, as one of the entries was a double
#123.0 4.2 5.0 6.0

scan("3.txt") #a non-numeric character produces an error
#Error in scan("3.txt") : scan() expected 'a real', got 'abc'
#scan() has an argument named what that specifies the mode; the default is double

scan("3.txt", what="") # what="" indicates that we want string mode
#[1] "abc" "de" "f" "g"
#by default scan assumes that the items of the vector are separated by whitespace
#we can use the optional sep argument for other situations

scan("3.txt", what="", sep = "\n")
#[1] "abc" "de f" "g" - each line is treated as one element of the vector

scan("4.txt", what="")
#[1] "abc" "123" "6" "y"
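
The scan() variants above can be reproduced without creating the text files, because scan() also accepts a `text=` argument. A short sketch, with the file contents inlined as strings and `quiet = TRUE` added to suppress the "Read N items" message:

```r
# text= lets scan() read from a string instead of the keyboard or a file
nums  <- scan(text = "123\n4.2 5\n6", quiet = TRUE)                        # numeric by default
words <- scan(text = "abc\nde f\ng", what = "", quiet = TRUE)              # split on whitespace
rows  <- scan(text = "abc\nde f\ng", what = "", sep = "\n", quiet = TRUE)  # one item per line
recs  <- scan(text = "1 dog 3\n2 cat 5",
              what = list(a = 0, b = "", c = 0), quiet = TRUE)             # list of typed columns

nums
words
rows
recs$b
```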

#reading single line from keyboard


vinput <- readline()

#typically readline() is called with its optional prompt


vinput <- readline("type your first name here: ")

Printing to screen
cat() vs print()

cat() is valid only for atomic types (logical, integer, real, complex, character) and
names. This means you cannot call cat() on a non-empty list or on an arbitrary object.
In practice it simply converts its arguments to character strings and concatenates them,
so you can think of it as something like as.character() %>% paste().
print() is a generic function, so you can define a specific implementation for a certain
S3 class.

The purpose of print() is to show values much as they are entered in source code, so
quotes and escaped characters such as "\n" are shown. cat() is intended to send
characters straight to the console, so the effects of special characters become
visible (e.g. text continues on the next line when a "\n" occurs in a string). For the
same reason cat() does not print the [1]-style element numbering.
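
The difference is easy to see on a string that contains a line break. A minimal sketch that uses capture.output() to collect what each function would print:

```r
s <- "line1\nline2"

printed <- capture.output(print(s))  # one line: the quotes and the escape are shown
catted  <- capture.output(cat(s))    # two lines: "\n" became a real line break

printed
catted
```

Here `printed` holds a single element with the [1] prefix, while `catted` holds two elements, one per output line.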

#you can use cat to redirect output directly to a file

cat('"foo"', '"bar"', file = "foobar.txt", sep="\n") # in the file the two words are written on separate lines

#we can append to the file

#"foo""bar" gets appended at the end of the file (the two calls add no line break between them)
cat('"foo"', file="foobar.txt", append = TRUE)
cat('"bar"', file="foobar.txt", append = TRUE)

Arithmetic Operators

Arithmetic operators add / divide / multiply etc. the corresponding values of two vectors or matrices, element by element.

Operator Description Example

+ Adds two vectors


v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v+t)
it produces the following result −
[1] 10.0 8.5 10.0
- Subtracts second vector from the first
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v-t)
it produces the following result −
[1] -6.0 2.5 2.0

* Multiplies both vectors


v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v*t)
it produces the following result −
[1] 16.0 16.5 24.0

/ Divide the first vector with the second


v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v/t)
it produces the following result −
[1] 0.250000 1.833333 1.500000

%% Gives the remainder of dividing the first vector by the second


v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v%%t)
it produces the following result −
[1] 2.0 2.5 2.0

%/% The result of dividing the first vector by the second (quotient)


v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v%/%t)
it produces the following result −
[1] 0 1 1

^ The first vector raised to the exponent of the second vector


v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v^t)
it produces the following result −
[1] 256.000 166.375 1296.000

#comparison and logical operators
4<5
10 > 100
4 == 5 #compares 2 values for equality
4 != 5 #not equal
4 <= 5 #less than or equal
4 >= 5 #greater than or equal

result <- !(1 > 2) # ! is "not": it negates the value in (); here (1 > 2) is FALSE, so the result is TRUE
result

(1 > 2) | (1 < 2) #logical operator "or"


(1 > 2) & (1 < 2) #logical operator "and"
isTRUE(1 > 2) #isTRUE(x) checks whether x is TRUE
Operator Description Example

> Checks if each element of the first vector is greater than the corresponding element of the
second vector.

v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v>t)
it produces the following result −
[1] FALSE TRUE FALSE FALSE

< Checks if each element of the first vector is less than the corresponding element of the
second vector.

v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v < t)
it produces the following result −
[1] TRUE FALSE TRUE FALSE

== Checks if each element of the first vector is equal to the corresponding element of the
second vector.

v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v == t)
it produces the following result −
[1] FALSE FALSE FALSE TRUE

<= Checks if each element of the first vector is less than or equal to the corresponding element
of the second vector.

v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v<=t)
it produces the following result −
[1] TRUE FALSE TRUE TRUE

>= Checks if each element of the first vector is greater than or equal to the corresponding
element of the second vector.

v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v>=t)
it produces the following result −
[1] FALSE TRUE FALSE TRUE

!= Checks if each element of the first vector is unequal to the corresponding element of the
second vector.

v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v!=t)
it produces the following result −
[1] TRUE TRUE TRUE FALSE

Operator Description Example

& It is called the Element-wise Logical AND operator. It combines each element of the first vector
with the corresponding element of the second vector and gives TRUE if both
elements are TRUE.

v <- c(3,1,TRUE,2+3i)
t <- c(4,1,FALSE,2+3i)
print(v&t)
it produces the following result −
[1] TRUE TRUE FALSE TRUE

| It is called the Element-wise Logical OR operator. It combines each element of the first vector
with the corresponding element of the second vector and gives TRUE if at least one of the
elements is TRUE.

v <- c(3,0,TRUE,2+2i)
t <- c(4,0,FALSE,2+3i)
print(v|t)
it produces the following result −
[1] TRUE FALSE TRUE TRUE

! It is called the Logical NOT operator. It takes each element of the vector and gives the opposite
logical value.

v <- c(3,0,TRUE,2+2i)
print(!v)
it produces the following result −
[1] FALSE TRUE FALSE FALSE

The logical operators && and || work on single values and return a single TRUE or FALSE. In older versions of R they
silently used only the first element of longer vectors; since R 4.3.0, passing a vector with more than one element is
an error. The examples below pass whole vectors and therefore run only on those older versions.

Operator Description Example

&& Called Logical AND operator. Takes the first element of both vectors and gives TRUE only if
both are TRUE.

v <- c(3,0,TRUE,2+2i)
t <- c(1,3,TRUE,2+3i)
print(v&&t)
it produces the following result −
[1] TRUE

|| Called Logical OR operator. Takes the first element of both vectors and gives TRUE if one of
them is TRUE.

v <- c(0,0,TRUE,2+2i)
t <- c(0,3,TRUE,2+3i)
print(v||t)
it produces the following result −
[1] FALSE
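
The difference between the element-wise and the single-value operators can be sketched as follows. This is a minimal illustration with made-up vectors; it only passes single values to && so that it also runs on current R versions.

```r
v <- c(TRUE, FALSE, TRUE)
t <- c(TRUE, TRUE, FALSE)

# & and | compare element by element and return a vector
elementwise <- v & t
elementwise

# all() / any() reduce a logical vector to the single value that if() needs
all_true <- all(v & t)
any_true <- any(v & t)

# && and || expect single values and stop at the first decisive operand:
# the right-hand side is never evaluated here, so stop() is not reached
short <- FALSE && stop("not evaluated")
short
```

Reducing a vector with all() or any() before using it in if() avoids relying on how && treats longer vectors.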

Assignment operators

These operators are used to assign values to vectors.

Operator Description Example

<-, = or <<- Called Left Assignment

v1 <- c(3,1,TRUE,2+3i)
v2 <<- c(3,1,TRUE,2+3i)
v3 = c(3,1,TRUE,2+3i)
print(v1)
print(v2)
print(v3)
it produces the following result −
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i

-> or ->> Called Right Assignment

c(3,1,TRUE,2+3i) -> v1
c(3,1,TRUE,2+3i) ->> v2
print(v1)
print(v2)
it produces the following result −
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i

Specific Purpose Operators

Operator Description Example

: Colon operator. It creates the series of numbers in sequence for a vector.

v <- 2:8
print(v)
it produces the following result −
[1] 2 3 4 5 6 7 8

%in% This operator is used to identify if an element belongs to a vector.

v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
it produces the following result −
[1] TRUE
[1] FALSE

%*% This operator performs matrix multiplication; the example multiplies a matrix by its transpose.

M = matrix( c(2,6,5,1,10,4), nrow = 2,ncol = 3,byrow = TRUE)
t = M %*% t(M)
print(t)
it produces the following result −
[,1] [,2]
[1,] 65 82
[2,] 82 117

Sets Operations

rm(x); rm(y)
x<-c(1,2,5)
y<-c(5,1,8,9)
#printing 2 variables; several calls can share one line when separated by ;
x;y
#union of 2 sets
union(x, y)# 1 2 5 8 9
#intersection of 2 sets - the common part
intersect(x, y)#1 5
#difference of 2 sets, all elements of x that are not in y
setdiff(x, y) #2
#test for equality
setequal(x, y) # FALSE
#testing membership: c %in% y
2 %in% x #TRUE
#number of possible subsets of size k from a set of size n - choose(n,k)
choose(5,2) #10

Functions
Putting ? in front of a function name opens the help page for that function
?rnorm()
?c()

A function is defined using the keyword function

function_name <- function(arg_1, arg_2, ...) {


# function body
}

We can create user-defined functions in R. They are specific to what a user wants and once created they can be used
like the built-in functions. Below is an example of how a function is created and used.

# Create a function to print squares of numbers in sequence.


new.function <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}

The arguments to a function call can be supplied in the same sequence as defined in the function or they can be supplied
in a different sequence but assigned to the names of the arguments.

# Create a function with arguments.


new.function <- function(a,b,c) {
result <- a * b + c
print(result)
}

# Call the function by position of arguments.


new.function(5,3,11)

# Call the function by names of the arguments.


new.function(a = 11, b = 5, c = 3)

#DECLARING anonymous FUNCTIONS - the function expression itself has no name; here it is assigned to f


f <- function(x, c) return((x+c)^2)

f(1:3, 0)#1 4 9
f(1:3, 2)#9 16 25
f(1:3, 1:3)#4 16 36 = (1+1)^2, (2+2)^2 (3+3)^2

Calling a Function with Default Argument


We can give arguments default values in the function definition and then call the function without supplying any
arguments to get the default result. We can also call such a function with new argument values to get a
non-default result.

# Create a function with arguments.


new.function <- function(a = 3, b = 6) {
result <- a * b
print(result)
}

# Call the function without giving any argument.


new.function()

# Call the function with giving new values of the argument.


new.function(9,5)

When we execute the above code, it produces the following result −


[1] 18
[1] 45

Arguments to functions are evaluated lazily, which means they are evaluated only when the function body actually needs them.

# Create a function with arguments.


new.function <- function(a, b) {
print(a^2)
print(a)
print(b)
}

# Evaluate the function without supplying one of the arguments.

new.function(6)

When we execute the above code, it produces the following result −

[1] 36
[1] 6
Error in print(b) : argument "b" is missing, with no default
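
Lazy evaluation also means that an argument which is never used is never evaluated at all. The sketch below illustrates this with two hypothetical helper functions (first_only and eager are names invented for this example); force() is the base R function that evaluates an argument immediately.

```r
# Hypothetical helper: uses `a`, never touches `b`
first_only <- function(a, b) {
  a * 2
}

# `b` is an expression that would fail if it were ever evaluated
result <- first_only(21, stop("b was evaluated"))
result

# force() evaluates an argument right away, so the error surfaces here
eager <- function(a, b) {
  force(b)
  a * 2
}
outcome <- tryCatch(eager(21, stop("b was evaluated")),
                    error = function(e) conditionMessage(e))
outcome
```

The first call returns normally because b is never needed; the second one fails as soon as force(b) runs.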

function()
{
#body code
}

#function for creating charts; we pass the data we want to use and which rows to
#select
myplot <- function(data, rows=1:10) #rows takes a default value
{
Data <- data[rows,,drop=FALSE]
#creating the chart
matplot(t(Data), type="b", pch=c(15:18), col = c(1:4,6))
#adding legend
legend("bottomleft", inset=0.01, legend=Players[rows], col = c(1:4,6), pch=c(15:18),
horiz=FALSE)
}

myplot(Salary, 1:4)

#charts for Salary matrix


myplot(Salary)
myplot(Salary / FieldGoals)

Packages

#1. Install package xlsx


install.packages("xlsx")
#2 Verify that package is installed
any(grepl("xlsx", installed.packages())) # TRUE
#3. Load library into workspace
library(xlsx)

Packages can be found on the CRAN website, cran.r-project.org


#installing package
install.packages("ggplot2") # the package is downloaded and installed
#activating the package
library(ggplot2)

#Get library locations containing R packages


.libPaths()

#"C:/Users/Pc/Documents/R/win-library/3.5" "C:/Program Files/R/R-3.5.1/library"

#Get the list of all the packages installed


library()

#Get all packages currently loaded in the R environment


search()

#Install package manually


install.packages(file_name_with_path, repos = NULL, type = "source")

# Install the package named "XML"


install.packages("E:/XML_3.98-1.3.zip", repos = NULL, type = "source")
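
A common pattern is to install a package only when it is missing. The sketch below assumes a hypothetical helper named is_installed built on the base function requireNamespace(); the install call itself is left commented out so that running the example changes nothing on the system.

```r
# Hypothetical helper: TRUE when the package can be loaded from the current library paths
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

has_stats <- is_installed("stats")               # ships with base R
has_fake <- is_installed("noSuchPackage12345")   # placeholder name assumed not to exist
has_stats
has_fake

# Typical guard (commented out so the sketch has no side effects):
# if (!is_installed("ggplot2")) install.packages("ggplot2")
```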

Strings

Valid Strings
a <- 'Start and end with single quote'
print(a)

b <- "Start and end with double quotes"


print(b)

c <- "single quote ' in between double quotes"


print(c)

d <- 'Double quotes " in between single quote'


print(d)

Invalid Strings
e <- 'Mixed quotes"
print(e)

f <- 'Single quote ' inside single quote'
print(f)

g <- "Double quotes " inside double quotes"
print(g)

When we run the script, it fails with the following result −
Error: unexpected symbol in:
"print(e)
f <- 'Single"
Execution halted

Concatenating Strings - paste() function
In R, strings are usually combined with the paste() function. It can take any number of arguments to be combined
together.

The basic syntax for paste function is

paste(..., sep = " ", collapse = NULL)

Following is the description of the parameters used −

• ... represents any number of arguments to be combined.

• sep represents the separator placed between the arguments. It is optional and defaults to a single space.

• collapse is an optional string used to join the elements of the resulting vector into one string. It has no
visible effect when the result already has a single element, as in the example here.

a <- "Hello"
b <- 'How'
c <- "are you? "

print(paste(a,b,c))
print(paste(a,b,c, sep = "-"))
print(paste(a,b,c, sep = "", collapse = ""))

When we execute the above code, it produces the following result −


[1] "Hello How are you? "
[1] "Hello-How-are you? "
[1] "HelloHoware you? "
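
The effect of collapse only shows up when paste() works on vectors with more than one element. A minimal sketch, using a made-up vector called words:

```r
words <- c("North", "South", "East")

# sep joins the arguments element by element, so the result still has 3 elements
by_sep <- paste(words, "Pole", sep = "-")
by_sep

# collapse then joins those elements into one single string
by_collapse <- paste(words, "Pole", sep = "-", collapse = ", ")
by_collapse
```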

Formatting numbers and strings


Numbers and strings can be formatted to a specific style using the format() function.
The basic syntax for format function is
format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))

Following is the description of the parameters used −


• x is the vector input.
• digits is the number of significant digits displayed.
• nsmall is the minimum number of digits to the right of the decimal point.
• scientific is set to TRUE to display scientific notation.
• width indicates the minimum width to be displayed by padding blanks in the beginning.
• justify is the display of the string to left, right or center.

# Total number of digits displayed. Last digit rounded off.


result <- format(23.123456789, digits = 9)
print(result)

# Display numbers in scientific notation.


result <- format(c(6, 13.14521), scientific = TRUE)
print(result)

# The minimum number of digits to the right of the decimal point.


result <- format(23.47, nsmall = 5)
print(result)

# Format treats everything as a string.


result <- format(6)
print(result)

# Numbers are padded with blank in the beginning for width.


result <- format(13.7, width = 6)
print(result)

# Left justify strings.


result <- format("Hello", width = 8, justify = "l")
print(result)

# Justify string to the center.


result <- format("Hello", width = 8, justify = "c")
print(result)

When we execute the above code, it produces the following result −

[1] "23.1234568"
[1] "6.000000e+00" "1.314521e+01"
[1] "23.47000"
[1] "6"
[1] "  13.7"
[1] "Hello   "
[1] " Hello  "

Counting number of characters in a string - nchar() function


This function counts the number of characters in a string, including spaces. The basic syntax for nchar() function is
nchar(x)

result <- nchar("Count the number of characters")


print(result)
>>30

Changing the case - toupper() & tolower() functions

These functions change the case of characters of a string.


The basic syntax for toupper() & tolower() function is

toupper(x)
tolower(x)

# Changing to Upper case.


result <- toupper("Changing To Upper")
print(result)
>>"CHANGING TO UPPER"

# Changing to lower case.


result <- tolower("Changing To Lower")
print(result)
>>"changing to lower"

Extracting parts of a string - substring() function

This function extracts parts of a String.


substring(x,first,last)
Following is the description of the parameters used −
• x is the character vector input.
• first is the position of the first character to be extracted.
• last is the position of the last character to be extracted.

# Extract characters from 5th to 7th position.


result <- substring("Extract", 5, 7)
print(result)
>>act

1. grep(pattern, x) - looking for a pattern in text

grep("error", c("warining", "error", "Error", "alarm", "warning_error",
"warning_eRRor"))
#2 5 - returns the positions in the vector where the pattern occurs
?grep

grep("error",
c("warining", "error", "Error", "alarm", "warning_error", "warning_eRRor"),
ignore.case=TRUE)
#2 3 5 6 - we can set it as case insensitive

2. grepl() – return TRUE if a string contains the pattern, otherwise returns FALSE.

x <- "line 4322: He is now 25 years old, and weights 130lbs"


y <- grepl("\\d+",x)
y #TRUE because the text contains digits; \\d+ means at least 1 digit in a row
y1<- grepl("\\d{5}",x)
y1 #FALSE because the text has no run of at least 5 consecutive digits

Vector match:

str <- c("Regular", "expression", "examples of R language")


x <- grepl("x*ress",str)
x #FALSE TRUE FALSE - for a vector, every element is checked and TRUE or FALSE is
#returned for each one

3. nchar(x) - finds the length of the string


nchar("Pawel") #5

4. paste() - concatenates several strings


paste("North", "Pole") #by default the strings are separated with space
#"North Pole"

paste("North", "Pole", sep="") #we can remove the space from separating the strings
#"NorthPole"

paste("North", "Pole", sep="-")#setting the separator as dash


#"North-Pole"

paste("North", "and", "South", "Pole")


#"North and South Pole"

5. sprintf() - prints the text with values depending on variables


rm(i)
i<-4
sprintf("the square of %d is %d", i, i^2)
#"the square of 4 is 16"
6. Selecting substring in the given character
#substr(x, start, stop)
substr("kajak", 3, 5)#"jak"

7. strsplit() - works like text to columns


strsplit("02-10-2018", split="-")
#"02" "10" "2018"

8. regexpr(pattern, text) - finds the first instance of pattern within text


regexpr("uat", "Equator")
#[1] 3 - matched at 3rd position
regexpr("Uat", "Equator")
#[1] -1 - not matched due to case sensitivity
regexpr("Uat", "Equator", ignore.case = T)
#[1] 3 - matched at 3rd position

9. gregexpr(pattern, text) - finds all instances of pattern


gregexpr("xxx", "xxxPawelxxx")
# 1 9
gregexpr("xxx", "xxxPawelXXX", ignore.case = T)
# 1 9

Regular expressions

grep()
grep("[xy]", c("Pawelx", "Kasiay", "Zigiz"))
#[xy] - looks for any string that contains either x or y

grep("x.y", c("Pawelx", "Kasiay", "Zigixzy"))#3 because of xzy


#x.y - looks for a string where x is followed by any single character and then comes y

grep("N..t", c("Pawelx", "Kasiay", "Zigixzy", "North"))#4 because of North; the matched t does not
#have to be the last character

grep("\\.", c("Pawelx", "Kasiay", "Zigixzy", "x.y"))#4 because of x.y


#after \\ we search for special characters, e.g. a dot
readLines(c, n=1) # "John 25"

gsub() function

gsub() replaces every occurrence of a pattern in a string

gsub(pattern, replacement, x, ignore.case=FALSE, perl=FALSE, fixed=FALSE, useBytes=FALSE)


pattern - string to be matched, replacement - string for replacement, x - string or vector
that we would be working on

x <- "xxxPawelxxx"
gsub("xxx", "zzz", x)#"zzzPawelzzz"

y <- "On ma 33 lata i wazy 90kg"


gsub("\\d+", "---", y)#"On ma --- lata i wazy ---kg"

Syntax Description
\\d Digit, 0,1,2 ... 9
\\D Not Digit
\\s Space
\\S Not Space
\\w Word character (letter, digit or underscore)
\\W Not a word character
\\t Tab
\\n New line
^ Beginning of the string
$ End of the string
\ Escape special characters; inside an R string the backslash is doubled, e.g. "\\\\" matches "\" and "\\+" matches "+"
| Alternation match, e.g. (e|d)n matches "en" and "dn"
. Any character, except \n or line terminator
[ab] a or b
[^ab] Any character except a and b
[0-9] All Digit
[A-Z] All uppercase A to Z letters
[a-z] All lowercase a to z letters
[A-Za-z] All uppercase and lowercase a to z letters ([A-z] would also match [ \ ] ^ _ ` which sit between Z and a in ASCII)
i+ i at least one time
i* i zero or more times
i? i zero or 1 time
i{n} i occurs n times in sequence
i{n1,n2} i occurs n1 - n2 times in sequence
i{n1,n2}? Non-greedy match: takes as few repetitions as possible
i{n,} i occurs >= n times
[:alnum:] Alphanumeric characters: [:alpha:] and [:digit:]
[:alpha:] Alphabetic characters: [:lower:] and [:upper:]
[:blank:] Blank characters: e.g. space, tab
[:cntrl:] Control characters
[:digit:] Digits: 0 1 2 3 4 5 6 7 8 9
[:graph:] Graphical characters: [:alnum:] and [:punct:]
[:lower:] Lower-case letters in the current locale
[:print:] Printable characters: [:alnum:], [:punct:] and space
[:punct:] Punctuation character: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:space:] Space characters: tab, newline, vertical tab, form feed, carriage return, space
[:upper:] Upper-case letters in the current locale
[:xdigit:] Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
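
Two entries of the table are easy to misuse, so here is a short sketch. POSIX classes such as [:digit:] only work inside a bracket expression, which is why they appear with double brackets in actual patterns, and the difference between a greedy and a non-greedy quantifier is easiest to see side by side. The sample strings are illustrative, and perl = TRUE is used for the second part as one engine that supports the ? modifier.

```r
x <- "line 4322: He is now 25 years old, and weights 130lbs"

# POSIX classes only work inside a bracket expression, hence the double brackets
numbers <- regmatches(x, gregexpr("[[:digit:]]+", x))[[1]]
numbers

# Greedy: .+ grabs as much as possible, so the match runs to the last ">"
greedy <- sub("<.+>", "", "<b>bold</b>", perl = TRUE)
# Non-greedy: .+? stops at the first ">"
lazy <- sub("<.+?>", "", "<b>bold</b>", perl = TRUE)
greedy
lazy
```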
LOOPS
WHILE

#the while loop runs as long as the condition is TRUE


while(condition)
{
body of the loop
}

#inside a loop values are not printed automatically, so we need to call print()


print("Test")

n<-0
n

while(n<10)
{
print(n)
n <- n+1
}

FOR LOOP

#FOR LOOP
for(variable in sequence)
{
body of the loop
}

for(i in 1:5)
{
print(i)
}

IF STATEMENT

#IF statement

x<-2
if (x > 0)
{
print("Number greater than 0")
} else if (x < 0) #important: else if must be on the same line as the closing }
{
print("Number less than 0")
} else #important: else must be on the same line as the closing }
{
print("Number equal to 0")
}

if-then-else -> ifelse() function in R


ifelse(b,u,v) # b- boolean vector, u,v - vectors

rm(x)
rm(y)
x<-1:10
y<-ifelse(x%%2 == 0, 5, 12)# %% - mod, the remainder of division
#if the remainder is 0 substitute 5, otherwise 12
y
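
To make the relation between if and ifelse() concrete, the sketch below computes the same labels twice: once with the vectorized ifelse() and once with an explicit loop around if/else. The labels "even" and "odd" are just illustrative values.

```r
x <- 1:10

# if() looks at a single condition; ifelse() tests every element of x
y <- ifelse(x %% 2 == 0, "even", "odd")
y

# The same result written out element by element with if/else
z <- character(length(x))
for (i in seq_along(x)) {
  if (x[i] %% 2 == 0) z[i] <- "even" else z[i] <- "odd"
}
same <- identical(y, z)
same
```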

Statistical Distributions
d - density or probability mass function (pmf)
p – cumulative distribution function (cdf)
q – quantile function (inverse of the cdf)
r – random number generation
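
For the normal distribution the four prefixes give dnorm(), pnorm(), qnorm() and rnorm(). A minimal sketch of how they relate, with arbitrarily chosen inputs:

```r
# d: density of the standard normal at 0
d0 <- dnorm(0)
d0

# p: probability that a standard normal value is <= 1.96
p <- pnorm(1.96)
p

# q: the inverse of p, returns the quantile for a given probability
q <- qnorm(p)
q

# r: random draws; set.seed() makes them reproducible
set.seed(1)
draws <- rnorm(5, mean = 3, sd = 0.25)
draws
```

The same naming scheme applies to other distributions, for example dbinom(), pt() or runif().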

MEASURE OF CENTER
#mean, median, skewness, kurtosis

#rnorm() - random number from normal distribution


?rnorm
#rnorm(number of observations, mean, standard deviation)
x <- rnorm(1000, 3, .25)
hist(x)

mean(x) #3.013375
median(x) #3.006536

#the moments library provides functions for skewness and kurtosis


install.packages("moments")
library(moments)

#skewness
skewness(x) #0.02695227

#kurtosis
kurtosis(x) #2.834572

#D'Agostino skewness test


agostino.test(x)
'''
data: x
skew = 0.026952, z = 0.350450, p-value = 0.726
alternative hypothesis: data have a skewness
'''

MEASURE OF VARIATION
#standard deviation
sd(x) #0.2492114

#CHI-SQUARE TEST
#H0 - the 2 nominal variables (rows and columns) have no association between them

#creating the data


men = c(150, 120, 45)
women = c(320, 270, 100)

#creating data frame


food.survey = as.data.frame(rbind(men, women))
food.survey

names(food.survey) <- c("Chicken", "Salad", "Cake")


food.survey
chisq.test(food.survey)
'''
Pearson's Chi-squared test

data: food.survey
X-squared = 0.13751, df = 2, p-value = 0.9336
'''
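
The test compares the observed counts with the counts expected under H0. As a sketch, the expected counts can be read from the test result and reproduced by hand; the data are the same as above, stored here as a matrix, and res and manual are names chosen for this illustration.

```r
men <- c(150, 120, 45)
women <- c(320, 270, 100)
food.survey <- rbind(men, women)
colnames(food.survey) <- c("Chicken", "Salad", "Cake")

res <- chisq.test(food.survey)

# Expected count for each cell = row total * column total / grand total
manual <- outer(rowSums(food.survey), colSums(food.survey)) / sum(food.survey)

res$expected
manual
```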

CORRELATION
#Parametric form - Pearson's correlation - should be used on data that are normally
#distributed
#Non-parametric - Spearman's Rank and Kendall Tau - should be used for data that are
#not normally distributed

data(mtcars)

#checking the correlation by plotting 2 variables


library(ggplot2)
p1 <- qplot(data=mtcars, mpg, wt,
xlab = "Miles/gallon", ylab="Weight",
main = "Miles per gallon vs. weight")
p1

#checking the distribution


hist(mtcars$mpg)
hist(mtcars$wt)

#Shapiro -Wilk normality test for mpg


#H0: data are normally distributed
shapiro.test(mtcars$mpg)

'''
Shapiro-Wilk normality test

data: mtcars$mpg
W = 0.94756, p-value = 0.1229
p > 0.05 => we fail to reject H0, so the data can be treated as normally distributed
'''
shapiro.test(mtcars$wt)
'''
Shapiro-Wilk normality test

data: mtcars$wt
W = 0.94326, p-value = 0.09265

p > 0.05 => we fail to reject H0, so the data can be treated as normally distributed
'''

#Both variables are normally distributed => we are using Pearson's correlation
cor(mtcars$mpg, mtcars$wt) #default = Pearson's
#[1] -0.8676594

#when changing the variables order we are getting the same result
cor(mtcars$wt, mtcars$mpg, method="pearson")
#[1] -0.8676594

#in case of NAs being present we should specify complete.obs to use only rows that have
#observations
cor(mtcars$wt, mtcars$mpg, method="pearson", use="complete.obs")
#[1] -0.8676594

#Is the correlation statistically significant?


r <- cor.test(mtcars$wt, mtcars$mpg)
r
'''
Pearson's product-moment correlation

data: mtcars$wt and mtcars$mpg


t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9338264 -0.7440872
sample estimates:
cor
-0.8676594

the p-value is less than 0.05, so we reject H0; our variables are correlated
'''

#CORRELATION between multiple variables


install.packages("corrplot")
library(corrplot)

data <- mtcars[, c(1,2,3,4,5)]


corr1 <-cor(data)
corr1 #returns the correlation matrix
corrplot(corr1) #draws a chart showing which correlations are the strongest
corrplot(corr1, method="color")

#CORRELATION FOR DIFFERENT DATA


#checking the correlation by plotting 2 variables

p2 <- qplot(data=iris, Petal.Length, Petal.Width,


xlab = "Petal.Length", ylab="Petal.width",
main = "Petal.Length vs Petal.width")
p2

#Computing correlation
cor(iris$Petal.Length, iris$Petal.Width)
#[1] 0.9628654
#should we use Pearson's correlation for iris?
shapiro.test(iris$Petal.Length)
'''
Shapiro-Wilk normality test

data: iris$Petal.Length
W = 0.87627, p-value = 7.412e-10

p < 0.05 => we reject H0, data are not normally distributed
'''
#we should use spearman or kendall method for computing correlation
cor(iris$Petal.Length, iris$Petal.Width, method="spearman")
#[1] 0.9376668
cor(iris$Petal.Length, iris$Petal.Width, method="kendall")
#[1] 0.8068907

Testing normal distribution

#data() loads data set


data("ToothGrowth")
head(ToothGrowth)
X <- ToothGrowth
str(X)

#1. Plotting data in a histogram to check if the distribution is "normal"-like


hist(X$len)

#2.Checking the data distribution with qqplot


#qqplot

#qqnorm() - if the data are normally distributed, the points should lie along a
#straight line
#qqnorm draws our data on a plot so we can judge whether they fit a straight line
#or not

#qqline() - works on the same assumption as qqnorm but additionally draws the straight
#line along which points from a normal distribution should lie

qqnorm(X$len)
qqline(X$len)

#3. Shapiro-Wilk test


#HO: data are normally distributed
shapiro.test(X$len)
'''
Shapiro-Wilk normality test

data: X$len
W = 0.96743, p-value = 0.1091

p > 0.05, so there are no grounds to reject H0; the data can be treated as normally distributed
'''

#to compute the Z value for the 95% confidence interval we can use qnorm()
#we are using 0.975 because 1-.95 = 0.05, and 0.05 (alpha) is the total for the
#two tails
#each tail holds 0.5 * alpha = 0.5 * 0.05 = 0.025
#1 - 0.025 = 0.975
qnorm(0.975)

CONFIDENCE INTERVAL for NORMAL DISTRIBUTION

#1. Computing the standard deviation and the standard error


S= sd(X$len) # standard deviation
n=nrow(X)

SE = S / sqrt(n)
SE

#2. Z value
Zval = qnorm(0.975)
Zval# 1.959964

#3. Margin of error


MOE = Zval * SE
MOE #1.935508

#4.Confidence interval

srednia = mean(X$len)
CI <- srednia + c(-MOE, MOE)
CI #[1] 16.87783 20.74884
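
The four steps above can be wrapped into one function so that the confidence level becomes a parameter. This is a sketch only: ci_normal is a hypothetical helper name, and it assumes the z-based interval is appropriate for the data.

```r
# Hypothetical helper: z-based confidence interval for the mean of x
ci_normal <- function(x, conf.level = 0.95) {
  alpha <- 1 - conf.level
  se <- sd(x) / sqrt(length(x))      # standard error of the mean
  zval <- qnorm(1 - alpha / 2)       # two-sided critical value
  mean(x) + c(-1, 1) * zval * se
}

ci95 <- ci_normal(ToothGrowth$len)
ci90 <- ci_normal(ToothGrowth$len, conf.level = 0.90)
ci95
ci90
```

A lower confidence level gives a smaller critical value and therefore a narrower interval.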

CONFIDENCE INTERVAL FOR t-Distribution

#(use in cases when n < 30)

n=length(X$len)
n
#1.Computing t value
tval <- qt(0.975, df=n-1)
tval #2.000995

#2.Margin of ERROR
tMOE <- tval *SE

#3 Confidence Interval
sredniat <-mean(X$len)
CIt <- sredniat + c(-tMOE, tMOE)
CIt #16.83731 20.78936

#All the details can be obtained by using the t.test() function


t.test(X$len)
'''
One Sample t-test

data: X$len
t = 19.051, df = 59, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
16.83731 20.78936
sample estimates:
mean of x
18.81333
'''
#when calling t.test() we can also specify which confidence level to use

t.test(X$len, conf.level = 0.9)

'''
One Sample t-test

data: X$len
t = 19.051, df = 59, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
17.16309 20.46358
sample estimates:
mean of x
18.81333
'''

T Test
#examine if the difference in means is statistically significant

#Data has to be normally distributed to use the t-Test


shapiro.test(X$len)#p-value = 0.1091 > 0.05 so the data can be treated as normally distributed

X <-ToothGrowth
srednia <- mean(X$len)
srednia #18.81333

#one-sample t-Test, tests whether the mean is equal to a certain number
#H0: true value of mean = 18
t.test(X$len, mu=18) #mu is the value we believe the mean to be
'''
One Sample t-test

data: X$len
t = 0.82361, df = 59, p-value = 0.4135
alternative hypothesis: true mean is not equal to 18
95 percent confidence interval:
16.83731 20.78936
sample estimates:
mean of x
18.81333

p-value > 0.05 and the CI contains 18 => we do not reject H0 that the mean = 18


'''

#independent 2-group t-test

#test the difference in means between 2 groups / 2 populations


#H0: there is no difference in population means of the 2 groups

OJ <- X$len[X$supp == "OJ"]


VC <- X$len[X$supp == "VC"]

?t.test
t.test(OJ, VC, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
#var.equal - whether the variances are equal for both groups

'''
Welch Two Sample t-test

data: OJ and VC
t = 1.9153, df = 55.309, p-value = 0.06063
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1710156 7.5710156
sample estimates:
mean of x mean of y
20.66333 16.96333

The p-value (0.06 > 0.05) indicates that the means are not statistically different from each other


such a small p-value may suggest that we should check further whether one mean
is greater than the other
'''

t.test(OJ, VC, paired = FALSE, alternative="greater")


'''
Welch Two Sample t-test

data: OJ and VC
t = 1.9153, df = 55.309, p-value = 0.03032
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.4682687 Inf
sample estimates:
mean of x mean of y
20.66333 16.96333

p-value < 0.05 so we can reject the hypothesis that the 2 means are equal; mean OJ >
mean VC
'''

#Paired observations in t-Test - the 2 samples are not independent


#we compare e.g. results before and after administering a drug; the two sets of
#observations are linked (paired)

t.test(OJ, VC, paired = TRUE, var.equal = FALSE, conf.level = 0.95)


'''
Paired t-test

data: OJ and VC
t = 3.3026, df = 29, p-value = 0.00255
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.408659 5.991341
sample estimates:
mean of the differences
3.7
the small p-value indicates that H0 should be rejected
'''
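
A paired t-test is equivalent to a one-sample t-test on the pairwise differences, which explains why df drops to 29 in the output above. The sketch below checks this on the same two groups; paired and one_sample are names chosen for this illustration.

```r
OJ <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
VC <- ToothGrowth$len[ToothGrowth$supp == "VC"]

paired <- t.test(OJ, VC, paired = TRUE)
one_sample <- t.test(OJ - VC)   # H0: mean of the differences = 0

paired$statistic
one_sample$statistic
paired$conf.int
one_sample$conf.int
```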

Linear regression

#checking the built-in data sets


library(help="datasets")

data("Orange")
head(Orange)

plot(Orange$age, Orange$circumference)

#model the variation in circumference as function of age


#H0: there is no link between age and circumference

#1. Building the linear regression equation


#lm() is used to fit linear model lm(y~x) => y = ax +b
#lm(y~x+0) => y = ax, case when the intercept is known and equal to 0

fit = lm(circumference ~ age, data=Orange)


summary(fit)

'''
Residuals:
Min 1Q Median 3Q Max
-46.310 -14.946 -0.076 19.697 45.111

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.399650 8.622660 2.018 0.0518 .
age 0.106770 0.008277 12.900 1.93e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23.74 on 33 degrees of freedom


Multiple R-squared: 0.8345, Adjusted R-squared: 0.8295
F-statistic: 166.4 on 1 and 33 DF, p-value: 1.931e-14

'''
#what to check: p-value <0.05 => reject H0, there is a relation
#R2 and adjusted R2 are different from 0

#coefficients, H0: the coefficients of slope and intercept are 0 in the population


#if p<0.05 reject H0; the p-value is given by Pr(>|t|). For the age coefficient it is very small,
#1.93e-14 ***, which indicates that the coefficient is different from zero

#plotting the regression line and the observation points


library(ggplot2)
ggplot(Orange, aes(x=age, y=circumference)) +
geom_point(color='#2980B9', size =4) +
geom_smooth(method=lm, color='#2C3e50') #adding lm line to plot

#confidence interval
#adding geom_smooth(method=lm, color='#2C3e50') to the plot by default also draws the
#confidence interval in grey in addition to the linear regression line

new.dat=data.frame(age=1500)
head(new.dat)

predict(fit, newdata=new.dat, interval='confidence')


#predict() function returns the confidence interval for Y when using newdata for X

'''
fit lwr upr
1 177.5551 164.8539 190.2564
'''

#confidence interval for the regression coefficients


confint(fit)
'''
2.5 % 97.5 %
(Intercept) -0.14328303 34.9425835
age 0.08993141 0.1236092
'''
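
The fitted value returned by predict() is simply the regression equation evaluated at the new x. A short sketch that recomputes it from the coefficients (by_hand and by_predict are names chosen for this illustration):

```r
fit <- lm(circumference ~ age, data = Orange)
b <- coef(fit)

# Point prediction by hand: intercept + slope * age
by_hand <- unname(b["(Intercept)"] + b["age"] * 1500)

by_predict <- predict(fit, newdata = data.frame(age = 1500),
                      interval = "confidence")
by_hand
by_predict
```

The lwr and upr columns additionally account for the uncertainty of both coefficients, which is why they cannot be obtained from the point estimates alone.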

Multiple Linear Regression


# Y = b1*X1 + b2*X2 + ... + bn*Xn + b0
#Y depends on more than one explanatory variable X

data(iris)
fit1 = lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=iris)

#our regression model looks as follows


#Sepal.Length = b1 *Sepal.Width + b2* Petal.Length +b0

summary(fit1)

TESTING CONDITIONS FOR LINEAR REGRESSION


install.packages("lmtest")
library(lmtest)
#1. Durbin Watson - test for autocorrelation / non-independence of errors
#H0: There is no auto-correlation between errors
dwtest(fit)

#2 Testing heteroscedasticity
#H0: constant error variance, i.e. the variance around the regression line is the same
#for all values
install.packages("car")
library(car)
ncvTest(fit)

Defining Class
Define class – setClass ( )
Create object – new ()
Reference member variable - @
Implement generic function – setMethod ()
Declare generic – setGeneric ()

#1. Creating a class using setClass()


setClass("numbers", representation(a = "numeric", b = "numeric"))
#class numbers created, which contains 2 numeric fields - a & b
#the representation() method defines slot names and their associated data types

#2. Creating new object of class using new() function


num1 = new("numbers", a=12, b=42)
num1

#3. Adding method to a class - setMethod()


#setMethod(methodName, class, definition of the method)
#method used must match the already existing function in R

setMethod("print", "numbers", function(x){


cat(paste("a=", x@a))
}
)

print(num1) # a= 12
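
The overview at the top of this section lists setGeneric(), which the steps above do not use. A minimal sketch of declaring a new generic and implementing it for the numbers class; the generic name total is invented for this example.

```r
setClass("numbers", representation(a = "numeric", b = "numeric"))

# Declare a new generic; "total" is a made-up name for this sketch
setGeneric("total", function(obj) standardGeneric("total"))

# Implement the generic for objects of class "numbers"
setMethod("total", "numbers", function(obj) {
  obj@a + obj@b
})

num1 <- new("numbers", a = 12, b = 42)
sum_ab <- total(num1)
sum_ab
has_method <- existsMethod("total", "numbers")
has_method
```

A new generic is only needed when no function of that name exists yet; for existing functions such as show, setMethod() alone is enough.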

Example
#1. Defining class employee
setClass("employee", representation (name="character", salary="numeric",
union="logical"))

#2. create new instance of class - defining new object of class employee
#new("ClassName", followed by a value for each slot defined for the class)
joe <- new("employee", name="Joe", salary = 55000, union=T)
joe
'''
An object of class "employee"
Slot "name":
[1] "Joe"

Slot "salary":
[1] 55000

Slot "union":
[1] TRUE
'''
#3. Checking the value of each slot of the object - @ or slot()
joe@salary
slot(joe, "salary")

#4 changing the value assigned to a slot of the object


joe@salary <-88000

#5.Displaying the details for object - show()


show(joe)

#6. Overriding the show() function with setMethod()

setMethod("show", "employee", function(object){

#helper local variable: "is" when union is TRUE, "is not" when it is FALSE


inorout <-ifelse(object@union, "is", "is not")

cat(object@name, "has a salary of", object@salary,


"and", inorout, "in the union", "\n")
}

)
#7. Checking how the method works after overriding it for the employee object.
joe #Joe has a salary of 88000 and is in the union

NBA DATA

#Seasons
Seasons <- c("2005","2006","2007","2008","2009","2010","2011","2012","2013","2014")

#Players
Players <- c("KobeBryant","JoeJohnson","LeBronJames","CarmeloAnthony","DwightHoward",
"ChrisBosh","ChrisPaul","KevinDurant","DerrickRose","DwayneWade")

#Salaries
KobeBryant_Salary <- c(15946875,17718750,19490625,21262500,23034375,24806250,
25244493,27849149,30453805,23500000)
JoeJohnson_Salary <- c(12000000,12744189,13488377,14232567,14976754,16324500,
18038573,19752645,21466718,23180790)
LeBronJames_Salary <- c(4621800,5828090,13041250,14410581,15779912,14500000,
16022500,17545000,19067500,20644400)
CarmeloAnthony_Salary <- c(3713640,4694041,13041250,14410581,15779912,17149243,
18518574,19450000,22407474,22458000)
DwightHoward_Salary <- c(4493160,4806720,6061274,13758000,15202590,16647180,
18091770,19536360,20513178,21436271)
ChrisBosh_Salary <- c(3348000,4235220,12455000,14410581,15779912,14500000,
16022500,17545000,19067500,20644400)
ChrisPaul_Salary <- c(3144240,3380160,3615960,4574189,13520500,14940153,
16359805,17779458,18668431,20068563)
KevinDurant_Salary <- c(0,0,4171200,4484040,4796880,6053663,15506632,16669630,17832627,18995624)
DerrickRose_Salary <- c(0,0,0,4822800,5184480,5546160,6993708,16402500,17632688,18862875)
DwayneWade_Salary <- c(3031920,3841443,13041250,14410581,15779912,14200000,
15691000,17182000,18673000,15000000)
#Matrix
Salary <- rbind(KobeBryant_Salary, JoeJohnson_Salary, LeBronJames_Salary,
CarmeloAnthony_Salary, DwightHoward_Salary, ChrisBosh_Salary, ChrisPaul_Salary,
KevinDurant_Salary, DerrickRose_Salary, DwayneWade_Salary)
#rm() - removes objects from the workspace; the individual vectors are no longer needed
#because all the data is now in the matrix
rm(KobeBryant_Salary, JoeJohnson_Salary, CarmeloAnthony_Salary, DwightHoward_Salary,
ChrisBosh_Salary, LeBronJames_Salary, ChrisPaul_Salary, DerrickRose_Salary,
DwayneWade_Salary, KevinDurant_Salary)
#colnames(matrixName) <- is used to label the columns of a matrix
colnames(Salary) <- Seasons
#rownames(matrixName) <- is used to label the rows of a matrix
rownames(Salary) <- Players
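The same pattern on a tiny scale: rbind() stacks vectors as rows and initially uses the vector names as row names, which is why the rows are relabelled afterwards. A minimal sketch with made-up players and values (PlayerA, PlayerB and the numbers are illustrative only):

```r
# Two made-up vectors, one value per season
PlayerA_PTS <- c(10, 20, 30)
PlayerB_PTS <- c(5, 15, 25)

m <- rbind(PlayerA_PTS, PlayerB_PTS)
rownames(m) # "PlayerA_PTS" "PlayerB_PTS" - taken from the vector names

# Relabel columns and rows
colnames(m) <- c("2005", "2006", "2007")
rownames(m) <- c("PlayerA", "PlayerB")

# Once labelled, the matrix can be indexed by name
m["PlayerB", "2006"] # 15
```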

#Games
KobeBryant_G <- c(80,77,82,82,73,82,58,78,6,35)
JoeJohnson_G <- c(82,57,82,79,76,72,60,72,79,80)
LeBronJames_G <- c(79,78,75,81,76,79,62,76,77,69)
CarmeloAnthony_G <- c(80,65,77,66,69,77,55,67,77,40)
DwightHoward_G <- c(82,82,82,79,82,78,54,76,71,41)
ChrisBosh_G <- c(70,69,67,77,70,77,57,74,79,44)
ChrisPaul_G <- c(78,64,80,78,45,80,60,70,62,82)
KevinDurant_G <- c(35,35,80,74,82,78,66,81,81,27)
DerrickRose_G <- c(40,40,40,81,78,81,39,0,10,51)
DwayneWade_G <- c(75,51,51,79,77,76,49,69,54,62)
#Matrix
Games <- rbind(KobeBryant_G, JoeJohnson_G, LeBronJames_G, CarmeloAnthony_G,
DwightHoward_G, ChrisBosh_G, ChrisPaul_G, KevinDurant_G, DerrickRose_G, DwayneWade_G)
rm(KobeBryant_G, JoeJohnson_G, CarmeloAnthony_G, DwightHoward_G, ChrisBosh_G,
LeBronJames_G, ChrisPaul_G, DerrickRose_G, DwayneWade_G, KevinDurant_G)
colnames(Games) <- Seasons
rownames(Games) <- Players

#Minutes Played
KobeBryant_MP <- c(3277,3140,3192,2960,2835,2779,2232,3013,177,1207)
JoeJohnson_MP <- c(3340,2359,3343,3124,2886,2554,2127,2642,2575,2791)
LeBronJames_MP <- c(3361,3190,3027,3054,2966,3063,2326,2877,2902,2493)
CarmeloAnthony_MP <- c(2941,2486,2806,2277,2634,2751,1876,2482,2982,1428)
DwightHoward_MP <- c(3021,3023,3088,2821,2843,2935,2070,2722,2396,1223)
ChrisBosh_MP <- c(2751,2658,2425,2928,2526,2795,2007,2454,2531,1556)
ChrisPaul_MP <- c(2808,2353,3006,3002,1712,2880,2181,2335,2171,2857)
KevinDurant_MP <- c(1255,1255,2768,2885,3239,3038,2546,3119,3122,913)
DerrickRose_MP <- c(1168,1168,1168,3000,2871,3026,1375,0,311,1530)
DwayneWade_MP <- c(2892,1931,1954,3048,2792,2823,1625,2391,1775,1971)
#Matrix
MinutesPlayed <- rbind(KobeBryant_MP, JoeJohnson_MP, LeBronJames_MP, CarmeloAnthony_MP,
DwightHoward_MP, ChrisBosh_MP, ChrisPaul_MP, KevinDurant_MP, DerrickRose_MP,
DwayneWade_MP)
rm(KobeBryant_MP, JoeJohnson_MP, CarmeloAnthony_MP, DwightHoward_MP, ChrisBosh_MP,
LeBronJames_MP, ChrisPaul_MP, DerrickRose_MP, DwayneWade_MP, KevinDurant_MP)
colnames(MinutesPlayed) <- Seasons
rownames(MinutesPlayed) <- Players

#Field Goals
KobeBryant_FG <- c(978,813,775,800,716,740,574,738,31,266)
JoeJohnson_FG <- c(632,536,647,620,635,514,423,445,462,446)
LeBronJames_FG <- c(875,772,794,789,768,758,621,765,767,624)
CarmeloAnthony_FG <- c(756,691,728,535,688,684,441,669,743,358)
DwightHoward_FG <- c(468,526,583,560,510,619,416,470,473,251)
ChrisBosh_FG <- c(549,543,507,615,600,524,393,485,492,343)
ChrisPaul_FG <- c(407,381,630,631,314,430,425,412,406,568)
KevinDurant_FG <- c(306,306,587,661,794,711,643,731,849,238)
DerrickRose_FG <- c(208,208,208,574,672,711,302,0,58,338)
DwayneWade_FG <- c(699,472,439,854,719,692,416,569,415,509)
#Matrix
FieldGoals <- rbind(KobeBryant_FG, JoeJohnson_FG, LeBronJames_FG, CarmeloAnthony_FG,
DwightHoward_FG, ChrisBosh_FG, ChrisPaul_FG, KevinDurant_FG, DerrickRose_FG,
DwayneWade_FG)
rm(KobeBryant_FG, JoeJohnson_FG, LeBronJames_FG, CarmeloAnthony_FG, DwightHoward_FG,
ChrisBosh_FG, ChrisPaul_FG, KevinDurant_FG, DerrickRose_FG, DwayneWade_FG)
colnames(FieldGoals) <- Seasons
rownames(FieldGoals) <- Players

#Field Goal Attempts


KobeBryant_FGA <- c(2173,1757,1690,1712,1569,1639,1336,1595,73,713)
JoeJohnson_FGA <- c(1395,1139,1497,1420,1386,1161,931,1052,1018,1025)
LeBronJames_FGA <- c(1823,1621,1642,1613,1528,1485,1169,1354,1353,1279)
CarmeloAnthony_FGA <- c(1572,1453,1481,1207,1502,1503,1025,1489,1643,806)
DwightHoward_FGA <- c(881,873,974,979,834,1044,726,813,800,423)
ChrisBosh_FGA <- c(1087,1094,1027,1263,1158,1056,807,907,953,745)
ChrisPaul_FGA <- c(947,871,1291,1255,637,928,890,856,870,1170)
KevinDurant_FGA <- c(647,647,1366,1390,1668,1538,1297,1433,1688,467)
DerrickRose_FGA <- c(436,436,436,1208,1373,1597,695,0,164,835)
DwayneWade_FGA <- c(1413,962,937,1739,1511,1384,837,1093,761,1084)
#Matrix
FieldGoalAttempts <- rbind(KobeBryant_FGA, JoeJohnson_FGA, LeBronJames_FGA,
CarmeloAnthony_FGA, DwightHoward_FGA, ChrisBosh_FGA, ChrisPaul_FGA, KevinDurant_FGA,
DerrickRose_FGA, DwayneWade_FGA)
rm(KobeBryant_FGA, JoeJohnson_FGA, LeBronJames_FGA, CarmeloAnthony_FGA,
DwightHoward_FGA, ChrisBosh_FGA, ChrisPaul_FGA, KevinDurant_FGA, DerrickRose_FGA,
DwayneWade_FGA)
colnames(FieldGoalAttempts) <- Seasons
rownames(FieldGoalAttempts) <- Players

#Points
KobeBryant_PTS <- c(2832,2430,2323,2201,1970,2078,1616,2133,83,782)
JoeJohnson_PTS <- c(1653,1426,1779,1688,1619,1312,1129,1170,1245,1154)
LeBronJames_PTS <- c(2478,2132,2250,2304,2258,2111,1683,2036,2089,1743)
CarmeloAnthony_PTS <- c(2122,1881,1978,1504,1943,1970,1245,1920,2112,966)
DwightHoward_PTS <- c(1292,1443,1695,1624,1503,1784,1113,1296,1297,646)
ChrisBosh_PTS <- c(1572,1561,1496,1746,1678,1438,1025,1232,1281,928)
ChrisPaul_PTS <- c(1258,1104,1684,1781,841,1268,1189,1186,1185,1564)
KevinDurant_PTS <- c(903,903,1624,1871,2472,2161,1850,2280,2593,686)
DerrickRose_PTS <- c(597,597,597,1361,1619,2026,852,0,159,904)
DwayneWade_PTS <- c(2040,1397,1254,2386,2045,1941,1082,1463,1028,1331)

#Matrix
Points <- rbind(KobeBryant_PTS, JoeJohnson_PTS, LeBronJames_PTS, CarmeloAnthony_PTS,
DwightHoward_PTS, ChrisBosh_PTS, ChrisPaul_PTS, KevinDurant_PTS, DerrickRose_PTS,
DwayneWade_PTS)
rm(KobeBryant_PTS, JoeJohnson_PTS, LeBronJames_PTS, CarmeloAnthony_PTS,
DwightHoward_PTS, ChrisBosh_PTS, ChrisPaul_PTS, KevinDurant_PTS, DerrickRose_PTS,
DwayneWade_PTS)
colnames(Points) <- Seasons
rownames(Points) <- Players

#we now have the following matrices


Salary
Games
MinutesPlayed
FieldGoals
FieldGoalAttempts
Points

#operations on matrices
FieldGoals / Games # each value is divided by the value at the same position in the other matrix

round(FieldGoals / Games, 2)
round(MinutesPlayed / Games, 2)
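Element-wise division has one edge case worth knowing: where both matrices hold 0 at the same position, the result is NaN. The sketch below uses the 2012 and 2013 values for two players copied from the FieldGoals and Games data above (Derrick Rose played 0 games in 2012); the object names fg, games and per_game are introduced here only for illustration.

```r
seasons <- c("2012", "2013")

# Field goals and games for two players, taken from the data above
fg    <- rbind(KobeBryant = c(738, 31), DerrickRose = c(0, 58))
games <- rbind(KobeBryant = c(78, 6),   DerrickRose = c(0, 10))
colnames(fg) <- seasons
colnames(games) <- seasons

# Division is done position by position; 0 / 0 gives NaN
per_game <- round(fg / games, 2)
per_game
#             2012 2013
# KobeBryant  9.46 5.17
# DerrickRose  NaN 5.80
```

matplot() simply leaves such NaN points out of the chart, so the line for that player has a gap.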
#visualization in R using matplot
?matplot # plots the columns of one matrix against the columns of another

FieldGoals
#t() - function to transpose a matrix
t(FieldGoals)

#creating the chart


matplot(t(FieldGoals/Games), type="b", pch=c(15:18), col = c(1:4,6))
#adding legend
legend("bottomleft", inset=0.01, legend=Players, col = c(1:4,6), pch=c(15:18), horiz=F)
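Why the transpose is needed: matplot() draws one line per column, while the matrices here keep players in rows. A minimal sketch with made-up numbers, assuming it is acceptable to draw on a throwaway device (pdf(NULL) opens a device that writes no file):

```r
# Made-up matrix: players in rows, seasons in columns
m <- rbind(PlayerA = c(1, 2, 3), PlayerB = c(3, 2, 1))
colnames(m) <- c("2005", "2006", "2007")

dim(m)    # 2 3 - plotting m directly would draw 3 lines (one per season)
dim(t(m)) # 3 2 - after t() there are 2 columns, so one line per player

pdf(NULL) # throwaway graphics device, no file is created
matplot(t(m), type = "b", pch = 15:16, col = 1:2)
legend("bottomleft", inset = 0.01, legend = rownames(m), col = 1:2, pch = 15:16)
invisible(dev.off())
```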

#Seasons
Seasons <- c("2005","2006","2007","2008","2009","2010","2011","2012","2013","2014")

#Players
Players <- c("KobeBryant","JoeJohnson","LeBronJames","CarmeloAnthony","DwightHoward",
"ChrisBosh","ChrisPaul","KevinDurant","DerrickRose","DwayneWade")

#Free Throws
KobeBryant_FT <- c(696,667,623,483,439,483,381,525,18,196)
JoeJohnson_FT <- c(261,235,316,299,220,195,158,132,159,141)
LeBronJames_FT <- c(601,489,549,594,593,503,387,403,439,375)
CarmeloAnthony_FT <- c(573,459,464,371,508,507,295,425,459,189)
DwightHoward_FT <- c(356,390,529,504,483,546,281,355,349,143)
ChrisBosh_FT <- c(474,463,472,504,470,384,229,241,223,179)
ChrisPaul_FT <- c(394,292,332,455,161,337,260,286,295,289)
KevinDurant_FT <- c(209,209,391,452,756,594,431,679,703,146)
DerrickRose_FT <- c(146,146,146,197,259,476,194,0,27,152)
DwayneWade_FT <- c(629,432,354,590,534,494,235,308,189,284)
#Matrix for free throws
FreeThrows <- rbind(KobeBryant_FT,
JoeJohnson_FT,
LeBronJames_FT,
CarmeloAnthony_FT,
DwightHoward_FT,
ChrisBosh_FT,
ChrisPaul_FT,
KevinDurant_FT,
DerrickRose_FT,
DwayneWade_FT)

FreeThrows

#naming the columns

colnames(FreeThrows) = Seasons
#naming the rows so that they show only the player's name, without the _FT suffix
rownames(FreeThrows) = Players

#Free Throw Attempts


KobeBryant_FTA <- c(819,768,742,564,541,583,451,626,21,241)
JoeJohnson_FTA <- c(330,314,379,362,269,243,186,161,195,176)
LeBronJames_FTA <- c(814,701,771,762,773,663,502,535,585,528)
CarmeloAnthony_FTA <- c(709,568,590,468,612,605,367,512,541,237)
DwightHoward_FTA <- c(598,666,897,849,816,916,572,721,638,271)
ChrisBosh_FTA <- c(581,590,559,617,590,471,279,302,272,232)
ChrisPaul_FTA <- c(465,357,390,524,190,384,302,323,345,321)
KevinDurant_FTA <- c(256,256,448,524,840,675,501,750,805,171)
DerrickRose_FTA <- c(205,205,205,250,338,555,239,0,32,187)
DwayneWade_FTA <- c(803,535,467,771,702,652,297,425,258,370)

#Matrix

FreeThrowAttempts <- rbind(KobeBryant_FTA,


JoeJohnson_FTA,
LeBronJames_FTA,
CarmeloAnthony_FTA,
DwightHoward_FTA,
ChrisBosh_FTA,
ChrisPaul_FTA,
KevinDurant_FTA,
DerrickRose_FTA,
DwayneWade_FTA)

FreeThrowAttempts

#naming the columns


colnames(FreeThrowAttempts) = Seasons
#naming the rows so that they show only the player's name, without the _FTA suffix
rownames(FreeThrowAttempts) = Players

#Re-create the plotting function


myplot <- function(z, who=1:10) {
  #drop=FALSE keeps the selection a matrix even when only one player is chosen
  matplot(t(z[who, , drop=FALSE]), type="b", pch=15:18, col=c(1:4,6),
          main="Basketball Players Analysis")
  legend("bottomleft", inset=0.01, legend=Players[who], col=c(1:4,6), pch=15:18,
         horiz=FALSE)
}
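The drop argument in the function above matters when a single player is selected. A minimal sketch on a made-up 2 x 3 matrix showing what changes:

```r
m <- matrix(1:6, nrow = 2, dimnames = list(c("A", "B"), c("x", "y", "z")))

# Selecting one row drops the result to a plain vector by default
one_row <- m[1, ]
is.matrix(one_row) # FALSE
dim(t(one_row))    # 1 3 - t() of a vector gives a single-row matrix

# With drop = FALSE the result stays a 1 x 3 matrix
kept <- m[1, , drop = FALSE]
is.matrix(kept)    # TRUE
dim(t(kept))       # 3 1 - one column, so matplot() draws one line with 3 points
```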

#Visualize the new matrices


myplot(FreeThrows)
myplot(FreeThrowAttempts)

#Part 1 - Free Throw Attempts Per Game


#(You will need the Games matrix)
myplot(FreeThrowAttempts/Games)
#Notice how Chris Paul gets few attempts per game

#Part 2 - Free Throw Accuracy


myplot(FreeThrows/FreeThrowAttempts)
#And yet Chris Paul's accuracy is one of the highest
#Chances are his team would get more points if he had more FTA's
#Also notice that Dwight Howard's FT Accuracy is extremely poor
#compared to other players. If you recall, Dwight Howard's
#Field Goal Accuracy was exceptional:
myplot(FieldGoals/FieldGoalAttempts)
#How could this be? Why is there such a drastic difference?
#We will see just now...

#Part 3 - Player Style Patterns Excluding Free Throws


myplot((Points-FreeThrows)/FieldGoals)
#Because we have excluded free throws, this plot now shows us
#the true representation of player style change. We can verify
#that this is the case because all the marks without exception
#on this plot are between 2 and 3. That is because Field Goals
#can only be for either 2 points or 3 points.
#Insights:
#1. You can see how players' preference for 2 or 3 point shots
# changes throughout their career. We can see that almost all
# players in this dataset experiment with their style throughout
# their careers. Perhaps, the most drastic change in style has
# been experienced by Joe Johnson.
#2. There is one exception. You can see that one player has not
# changed his style at all - almost always scoring only 2-pointers.
# Who is this mystery player? It's Dwight Howard!
# Now that explains a lot. The reason that Dwight Howard's
# Field Goal accuracy is so good is that he almost always
# scores 2-pointers only. That means he can be close to the basket
# or even in contact with it. Free throws, on the other hand, require
# the player to stand 15ft (4.57m) away from the hoop. That's
# probably why Dwight Howard's Free Throw Accuracy is poor.
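The "between 2 and 3" claim can be checked by hand. If a player scores n2 two-pointers and n3 three-pointers, then Points - FreeThrows = 2*n2 + 3*n3 and FieldGoals = n2 + n3, so the ratio is a weighted average of 2 and 3, and the amount above 2 is the share of three-pointers. A short worked example using the 2005 values for two players copied from the data above (the variable names are introduced here for illustration):

```r
# 2005 season values taken from the Points, FreeThrows and FieldGoals data
pts <- c(DwightHoward = 1292, JoeJohnson = 1653)
ft  <- c(DwightHoward = 356,  JoeJohnson = 261)
fg  <- c(DwightHoward = 468,  JoeJohnson = 632)

# Average points per field goal, free throws excluded
per_fg <- (pts - ft) / fg
round(per_fg, 2)     # DwightHoward 2.0, JoeJohnson 2.2

# Share of field goals that were three-pointers
round(per_fg - 2, 2) # DwightHoward 0.0, JoeJohnson 0.2
```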

Housing market

#------ HELPERS

#1. Reading the data; Perl must be installed and the path to it supplied

bk <- read.xls("rollingsales_manhattan.xls", pattern="BOROUGH", sheet=1,
perl="C:\\Perl64\\bin\\perl.exe")

#the xls file begins with a description of the file; BOROUGH is one of the column names,
#so nothing that appears before the column headers is imported

#2. Correcting the data

#gsub() replaces every match of a pattern in a string

#gsub(pattern, replacement, x, ignore.case=FALSE, perl=FALSE, fixed=FALSE, useBytes=FALSE)
#pattern - string (regular expression) to be matched, replacement - the replacement string,
#x - the string or character vector to work on

x <- "xxxPawelxxx"
gsub("xxx", "zzz", x) #"zzzPawelzzz"

y <- "He is 33 years old and weighs 90kg"
gsub("\\d+", "---", y) #"He is --- years old and weighs ---kg"

#3. as.numeric() - used on a data frame column that should be treated as numeric
#rather than, e.g., factor or character
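One pitfall when converting a factor column: as.numeric() applied directly to a factor returns the internal level codes, not the values shown on screen, so the factor has to go through as.character() first. A minimal sketch with made-up year values:

```r
# Made-up factor, similar in spirit to a year column read in as a factor
f <- factor(c("1900", "2005", "1900"))

# Directly: the integer codes of the levels ("1900" = 1, "2005" = 2)
codes <- as.numeric(f)
codes  # 1 2 1

# Via character: the actual numbers
values <- as.numeric(as.character(f))
values # 1900 2005 1900
```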

#-------
install.packages("gdata")
library("gdata")

getwd()
setwd("C:\\Users\\Pc\\Desktop\\R files\\pliki")

bk <-read.xls("rollingsales_manhattan.xls", pattern="BOROUGH", sheet=1,


perl="C:\\Perl64\\bin\\perl.exe")

#checking the details of data


head(bk)
summary(bk)
str(bk)
#str(bk) shows that SALE.PRICE is treated as a factor, but we want it to be numeric
#$ SALE.PRICE : Factor w/ 5293 levels "$0","$1","$1,000",
#the problem is that every value in SALE.PRICE has a $ prefix

#removing the $ sign from the column and converting it to numeric; SALE.PRICE.N is a new
#column that is added to the data frame
bk$SALE.PRICE.N <- as.numeric(gsub("[^[:digit:]]", "", bk$SALE.PRICE)) #replace every non-digit character with ""

str(bk)
#$ SALE.PRICE.N : num 2214693 1654656 1069162 1374637 1649565 ...
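The cleaning step can be tried without the xls file. The sketch below applies the same regular expression to a small made-up factor of prices written in the same style as the SALE.PRICE levels listed above:

```r
# Made-up prices in the same format as the SALE.PRICE column
prices <- factor(c("$0", "$1,000", "$2,214,693"))

# [^[:digit:]] matches every character that is not a digit ("$" and ",")
clean <- as.numeric(gsub("[^[:digit:]]", "", prices))
clean # 0 1000 2214693
```

Note that this pattern would also strip a decimal point or a minus sign, so it only suits columns of non-negative whole amounts.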

#checking that SALE.PRICE.N has no missing observations (NA)

sum(is.na(bk$SALE.PRICE.N)) # [1] 0 - no NA observations

#listing all column names

names(bk)
#converting the column names to lower case
names(bk) <- tolower(names(bk))

#cleaning / formatting the data with regular expressions


bk$gross.square.feet <- as.numeric(gsub("[^[:digit:]]", "", bk$gross.square.feet))
bk$land.square.feet <- as.numeric(gsub("[^[:digit:]]", "", bk$land.square.feet))
bk$sale.date <- as.Date(bk$sale.date)
bk$year.built <- as.numeric(as.character(bk$year.built))

#checking that nothing weird is going on with the sale prices
attach(bk)
#The data frame is attached to the R search path. This means that R searches it
#when evaluating a variable, so its columns can be accessed by simply giving their names.

#we no longer need to prefix the column names with bk$

hist(sale.price.n)
hist(sale.price.n[sale.price.n>0])
hist(gross.square.feet[sale.price.n==0])

detach(bk) #now the columns must again be referenced through bk$
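An alternative to the attach()/detach() pair, not used in these notes, is with(), which evaluates a single expression inside the data frame and leaves the search path untouched. A minimal sketch on a made-up data frame (the column names price and sqft are illustrative):

```r
# Made-up data frame standing in for bk
sales <- data.frame(price = c(0, 100, 250), sqft = c(500, 0, 800))

# Columns are visible by name only inside the with() call
positive <- with(sales, price[price > 0])
positive        # 100 250

zero_price_sqft <- with(sales, sqft[price == 0])
zero_price_sqft # 500
```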

#filtering the data to keep only actual sales (where the sale price is not 0)
bk.sale <- bk[bk$sale.price.n!=0, ] #creates a new data frame, bk.sale

plot(bk.sale$gross.square.feet, bk.sale$sale.price.n)
plot(log(bk.sale$gross.square.feet), log(bk.sale$sale.price.n))

#let's concentrate on 1-, 2- and 3-family homes


#grepl() returns TRUE if a string contains the pattern, otherwise FALSE;
#which() returns the indices of the elements for which the logical statement is TRUE

bk.homes <- bk.sale[which(grepl("FAMILY", bk.sale$building.class.category)), ]
#grepl() finds the rows whose building class category contains "FAMILY", and which()
#returns their row numbers
#bk.homes <- bk.sale[<row numbers where the category contains "FAMILY">, ] - so we select
#all such rows and all columns
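The two functions can be seen separately on a short vector. The category labels below are made up to resemble building class categories; the real values in the file may differ.

```r
# Made-up category labels (assumed format, for illustration only)
category <- c("01 ONE FAMILY HOMES",
              "13 CONDOS - ELEVATOR APARTMENTS",
              "02 TWO FAMILY HOMES")

# grepl() gives one TRUE/FALSE per element
hits <- grepl("FAMILY", category)
hits        # TRUE FALSE TRUE

# which() turns the logical vector into positions
rows <- which(hits)
rows        # 1 3

category[rows]
```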

plot(log(bk.homes$gross.square.feet), log(bk.homes$sale.price.n))

#listing the homes sold for less than 100000, ordered by sale price
cheap <- bk.homes[which(bk.homes$sale.price.n<100000), ]
cheap[order(cheap$sale.price.n), ]
#removing outliers that seem like they weren't actual sales

bk.homes$outliers <- (log(bk.homes$sale.price.n) <=5) + 0


bk.homes <- bk.homes[which(bk.homes$outliers==0),]

plot(log(bk.homes$gross.square.feet),log(bk.homes$sale.price.n))
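The outlier flag relies on two small tricks: adding 0 to a logical vector turns TRUE/FALSE into 1/0, and log(price) <= 5 is the same as price <= exp(5), roughly 148. A minimal sketch on made-up prices:

```r
# Made-up sale prices
prices <- c(1, 100, 500000, 2214693)

exp(5) # about 148.41 - the price threshold implied by log(price) <= 5

# Logical comparison, then + 0 to get a 0/1 flag
flag <- (log(prices) <= 5) + 0
flag   # 1 1 0 0

# Keep only the rows that are not flagged
kept <- prices[flag == 0]
kept   # 500000 2214693
```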
