Basic Syntax
Managing objects
Getting help
Data Types
NA and NULLs
Dealing with NAs
Vectors
Using [] brackets
Vectorized operations
Naming vector elements
Vector Element Sorting
Vectorized operation example
Lists
Matrices
Creating a matrix with the matrix() function
Creating a matrix with rbind() - row binds
Creating a matrix with cbind() - column binds
Changing the size of matrix
Named vectors
Naming Matrix Dimensions
Subsetting
Matrix Computations
Matrix functions – apply()
Linear Algebra on Matrices
Graphs
Creating graphs - plot()
Creating graphs with qplot()
Creating graphs with the ggplot() function
Saving a graph to a file
Arrays
Factors
Data Frames
Creating a data frame
Using $ sign
Operations with data frames
Analyzing Data Frame
Data exploration with dplyr package
Data exploration and visualization with dplyr and ggplot2
Working with time series data
Variables
Importing data into R
Reading text files
Reading csv data from file
Reading from online csv files
Writing to csv
Reading excel file
Saving to excel
Reading from xml file
Reading JSON data
Accessing the json data from the web
Reading HTML files (rvest library)
Reading data from online HTML tables
Exploring imported data
Input / Output
Input – scan() – reading text files
Printing to screen
Arithmetic Operators
Assignment operators
Specific Purpose Operators
Sets Operations
Functions
Packages
Strings
Regular expressions
grep()
The gsub() function
Loops
Statistical Distributions
Measure of center
Measure of variation
Correlation
Testing normal distribution
Confidence interval for normal distribution
Confidence interval for t-distribution
T Test
Linear regression
Multiple Linear Regression
Defining Class
NBA data
Housing market
Basic Syntax
# - single-line comment
R does not support multi-line comments, but you can use a trick:
if(FALSE) {
"This is a demo for multi-line comments and it should be put inside either a
single OR double quote"
}
myString <- "Hello, World!"
print ( myString)
A comment wrapped in quotes like this is still parsed by the interpreter, but it does not interfere with the program.
Save the above code in a file test.R and execute it at the Linux command prompt as shown below. The syntax is the same on
Windows and other systems.
$ Rscript test.R
x <- c(1, 2, 3)
mode(x) # "numeric"
length(x) #3
y <- c("abc")
mode(y) #character
length(y) #1
Managing objects
listing objects
ls()
xxxtwoxxx<-c(1,2)
#list all objects that have "two" in the name
ls(pattern="two") #"two" "two2" "xxxtwoxxx"
removing objects with rm()
rm(x)
#removing multiple objects
rm(y, y1)
#to remove the objects returned by ls(), pass them to the list argument
rm(list = ls(pattern="two"))
#removing all objects
rm(list = ls())
Getting help
#getting help
help(mvrnorm)
?mvrnorm
'''
No documentation for ‘mvrnorm’ in specified packages and libraries:
you could try ‘??mvrnorm’
'''
??mvrnorm
#the search result points to MASS::mvrnorm,
#i.e. the package and the function inside it
Data Types
#R decides by itself which type to assign to a variable; for numbers the default is double
x1 <-1
typeof(x1) #"double"
NA and NULLs
R represents missing data with the value NA: the value exists but is unknown.
NULL, on the other hand, means that the value in question simply does not exist,
rather than being existent but unknown.
#1. Using NA
rm(x)
x <- c(88, NA, 12, 168, 13)
x
mean(x) #NA
#the argument na.rm=T sets "NA remove" to TRUE, so NA values are skipped in the calculation
mean(x, na.rm=T) #70.25
rm(y)
y<-c(88, NULL, 12, 168, 13)
mean(y) #70.25
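The difference between the two shows up in the length of the vector: NA occupies a slot, while NULL contributes nothing. A minimal sketch of this (the names with_na and with_null are illustrative and do not come from the notes):

```r
# NA is a placeholder for an unknown value, so it still counts as an element
with_na <- c(88, NA, 12, 168, 13)
# NULL means "no value at all", so c() simply skips it
with_null <- c(88, NULL, 12, 168, 13)

length(with_na)    # 5
length(with_null)  # 4

is.na(with_na)     # FALSE TRUE FALSE FALSE FALSE
is.null(NULL)      # TRUE

# mean() returns NA if any element is NA, unless na.rm = TRUE is given
mean(with_na, na.rm = TRUE)  # 70.25
mean(with_null)              # 70.25
```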
#filtering
rm(z)
z <- c(5,2,-3,8)
w<-z[z*z>8]
w #5 -3 8
#filtering simply checks whether the condition is TRUE or FALSE for each vector component
z*z > 8 #TRUE FALSE TRUE TRUE
#the elements for which it is TRUE are returned
w #5 -3 8
#we can use filtering to assign values, e.g. replace all elements larger than 3 with 0
z[z>3] <-0
z #0 2 -3 0
#using subset(): unlike plain filtering, subset() drops the NA values
x <- c(6, 1:3, NA, 12)
x[x>5] #6 NA 12
subset(x, x>5) #6 12
#using which() - returns the positions in the vector that meet the filtering condition
z <- c(5, 2, -3, NA, 8)
which(z*z > 8) #1 3 5
#finding the first occurrence of a value in a vector; in this example we are looking for 8
first1 <- function(x) return(which(x==8)[1])
first1(z) #5
library(MASS)
#list the data sets built into R
data()
data(airquality)
#getting more information about the data set that we selected
??airquality
#keeping only rows that have values in all columns; a row with NA in any column is filtered out
#we can use complete.cases() to filter out rows containing NA from our data set
ag2=airquality[complete.cases(airquality),]
str(ag2)
summary(ag2)
summary(agty.fix)
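A small self-contained sketch of the same idea, using a made-up data frame (df_demo and its columns are hypothetical), so that the effect of complete.cases() is easy to follow row by row:

```r
# Hypothetical data frame with missing values in different columns
df_demo <- data.frame(
  temp  = c(67, NA, 74, 62),
  ozone = c(41, 36, NA, 18)
)

# complete.cases() returns TRUE for rows with no NA in any column
keep <- complete.cases(df_demo)
keep               # TRUE FALSE FALSE TRUE

# Keep only the complete rows
df_clean <- df_demo[keep, ]
nrow(df_clean)     # 2
```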
Vectors
To create a vector with more than one element, use the c() function, which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
MyVector <- c(3, 45, 56, 732) # function c() combines numbers into vector
print(MyVector)
Using [] brackets
w[c(1,3,5)] # we can use another vector to access elements of a vector; returns the 1st, 3rd and 5th elements
w[c(-2, -4)] # accessing all elements of vector w except the 2nd and 4th
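The indexing forms above can be tried on a short vector. This is a sketch with an illustrative vector v (not the w from the filtering example), so every position used actually exists:

```r
# Illustrative vector with five elements
v <- c(10, 20, 30, 40, 50)

v[c(1, 3, 5)]   # 10 30 50: elements at positions 1, 3 and 5
v[c(-2, -4)]    # 10 30 50: everything except positions 2 and 4
v[2:4]          # 20 30 40: a range of positions
v[v > 25]       # 30 40 50: logical indexing
```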
Vectorized operations
a<-c(1,2,3)
b<-c(4,5,6)
#adding vectors: R adds the values element by element, a[1] + b[1] = 1+4 = 5; a[2] + b[2] = 2+5 = 7
suma = a +b
#multiplying vectors
iloczyn = a * b
iloczyn # iloczyn[1] = 1 * 4, iloczyn[2] = 2 * 5, etc.
#dividing vectors
iloraz = a / b
iloraz # iloraz[1] = 1/4, iloraz[2] = 2/5
m + 10:13
'''
> m + 10:13
   [,1] [,2]
r1   11   14
r2   14   17
'''
all(x > 8) # FALSE - checks if all values in the vector meet the condition
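#findruns(x, k): start indices of runs of k consecutive 1s (body reconstructed from the call below)
findruns <- function(x, k) {
  runs <- NULL
  for (i in 1:(length(x)-k+1)) {
    if (all(x[i:(i+k-1)] == 1)) runs <- c(runs, i)
  }
  return(runs)
}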
rm(c1)
c1 <- c(1,0,0,1,1,1,0,1,1)
#runs of 1s of length 2 beginning at indices 4, 5, and 8
findruns(c1, 2)
v1 <- c(1,2,3,4,5)
v2 <- c(10,11)
When the vectors differ in length, the elements of the shorter one are reused in order (recycled) until both vectors
have the same length.
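A short sketch of recycling with the two vectors above. The warning is silenced only to keep the demonstration quiet, and the name recycled_sum is illustrative:

```r
v1 <- c(1, 2, 3, 4, 5)
v2 <- c(10, 11)

# v2 is recycled to 10 11 10 11 10; because 5 is not a multiple of 2,
# R also emits a warning, which is silenced here for the demonstration
recycled_sum <- suppressWarnings(v1 + v2)
recycled_sum    # 11 13 13 15 15

# When the longer length is a multiple of the shorter one there is no warning
c(1, 2, 3, 4) + c(10, 20)   # 11 22 13 24
```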
Non-numeric values cannot be added together; R returns an error.
litery1 <-c("a", "b", "c")
litery2 <- c("A", "B", "C")
wynik = litery1 + litery2
wynik # ERROR
rm(x)
x <- c(1,2,4)
names(x)
#assigning names to each vector element
names(x) <- c("a", "b", "c")
x
'''
a b c
1 2 4
'''
#we can access a vector element by either position or name
x[1]
x["a"]
#Data
revenue <- c(14574.49, 7606.46, 8611.41, 9175.41, 8058.65, 8105.44, 11496.28, 9766.09,
10305.32, 14379.96, 10713.97, 15433.50)
expenses <- c(12051.82, 5695.07, 12319.20, 12089.72, 8658.57, 840.20, 3285.73, 5821.12,
6976.93, 16618.61, 10054.37, 3803.96)
#Solution
#Calculate Profit As The Differences Between Revenue And Expenses
profit <- revenue - expenses
profit
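The following lines rely on profit.after.tax, which is not defined anywhere in these notes. A minimal sketch of the missing step, assuming a flat 30% tax rate (both the rate and the name tax.rate are assumptions made for illustration only):

```r
# Monthly figures repeated from above so that the block runs on its own
revenue <- c(14574.49, 7606.46, 8611.41, 9175.41, 8058.65, 8105.44, 11496.28, 9766.09,
             10305.32, 14379.96, 10713.97, 15433.50)
expenses <- c(12051.82, 5695.07, 12319.20, 12089.72, 8658.57, 840.20, 3285.73, 5821.12,
              6976.93, 16618.61, 10054.37, 3803.96)
profit <- revenue - expenses

# Assumed flat tax rate of 30% (hypothetical, not given in the notes)
tax.rate <- 0.30
tax <- round(profit * tax.rate, 2)
profit.after.tax <- profit - tax

which.max(profit.after.tax)  # 12: position of the best month
which.min(profit.after.tax)  # 3: position of the worst month
```

With a flat rate the best and worst months are the same as for profit before tax, so the choice of rate does not change which months are flagged.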
#The Best Month Is Where Profit After Tax Was Equal To The Maximum
best.month <- profit.after.tax == max(profit.after.tax)
best.month
#The Worst Month Is Where Profit After Tax Was Equal To The Minimum
worst.month <- profit.after.tax == min(profit.after.tax)
worst.month
Lists
A list is an R object that can contain elements of many different types, such as vectors, functions and even
other lists. Each element of the list can have a different data type.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
print(list1)
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x) .Primitive("sin")
x <- list (u=2, v="abc") # list () function
is.list(x)#TRUE
#accessing list values
x$u
x$v
list_data
'''
$`1st Quarter`
[1] "Jan" "Feb" "Mar"
$A_Matrix
     [,1] [,2] [,3]
[1,]    3    5   -2
[2,]    9    1    8
'''
# Access the third element. As it is also a list, all its elements will be printed.
print(list_data[3])
# Access the list element using the name of the element and $
print(list_data$A_Matrix)
#Merging lists
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")
list2 <-list(10:14)
print(list2)
print(v1)
print(v2)
list_data
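The merging and conversion steps are only partly visible in the fragments above. A minimal sketch of both, assuming the usual base R approach of c() for merging and unlist() for flattening (the names merged_list, num_list and num_vec are illustrative):

```r
list1 <- list(1, 2, 3)
list2 <- list("Sun", "Mon", "Tue")

# c() concatenates the two lists into one list of six elements
merged_list <- c(list1, list2)
length(merged_list)   # 6

# unlist() flattens a list into an atomic vector
num_list <- list(10:14)
num_vec <- unlist(num_list)
num_vec               # 10 11 12 13 14
```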
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"
Like a vector, a matrix in R must have all elements of the same data type.
Referring to an element of a matrix: [row number, column number]
A[1,] - selects the whole row 1 of the matrix
A[,1] - selects the whole column 1 of the matrix
my.data <- 1:20
A <- matrix(my.data, 4, 5)
A
'''
The data from my.data are placed into consecutive columns
'''
A2 <- matrix(my.data, 4, 5, byrow=TRUE)
A2 # byrow = TRUE makes the data fill the matrix row by row
'''
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20
'''
A3
'''
        Kolumna1 Kolumna2 Kolumna3 Kolumna4 Kolumna5
Wiersz1        1        5        9       13       17
Wiersz2        2        6       10       14       18
Wiersz3        3        7       11       15       19
Wiersz4        4        8       12       16       20
'''
Creating a matrix with rbind() - row binds
#Let's define some vectors
r1 <- c("I", "am", "happy")
r2 <- c("What", "a", "day")
r3 <- c(1,2,3)
rbind(r1, r2, r3)
'''
The numbers from vector r3 were converted to characters because a matrix, like a vector, must
hold data of a single type
   [,1]   [,2] [,3]
r1 "I"    "am" "happy"
r2 "What" "a"  "day"
r3 "1"    "2"  "3"
'''
cbind(r1, r2, r3)
'''
     r1      r2     r3
[1,] "I"     "What" "1"
[2,] "am"    "a"    "2"
[3,] "happy" "day"  "3"
'''
rm(m)
m <-matrix(1:6, nrow=3)
m
'''
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
'''
Named vectors
v1 <- 1:4
v1
#give name to vector
names(v1) # returns NULL as there is no name assigned
#clear names
names(v1) <- NULL
temp.vec <- rep(c("a", "B", "Zz"), each=3)
Bravo <-matrix(temp.vec, 3, 3)
Bravo
'''
     [,1] [,2] [,3]
[1,] "a"  "B"  "Zz"
[2,] "a"  "B"  "Zz"
[3,] "a"  "B"  "Zz"
'''
rownames(Bravo) <- c("How", "are", "you")
colnames(Bravo) <- c("x", "y", "z")
Bravo
'''
    x   y   z
How "a" "B" "Zz"
are "a" "B" "Zz"
you "a" "B" "Zz"
'''
#accessing the data in the matrix
Bravo[2,2] #returns "B"
Bravo["are", "y"] #returns "B"
Bravo[2,2] == Bravo["are", "y"] # returns TRUE
#assigning values based on row and column name
Bravo["are", "y"] = 44
Bravo
'''
    x   y    z
How "a" "B"  "Zz"
are "a" "44" "Zz"
you "a" "B"  "Zz"
'''
Subsetting
Games[1,] #selecting a single row of a matrix returns a vector
is.matrix(Games[1,]) #FALSE
is.vector(Games[1,]) #TRUE
#by default R drops unnecessary dimensions when returning a vector from a matrix; this can be changed by setting the drop parameter to F
Games[1,,drop=F]
is.matrix(Games[1,,drop=F])#TRUE
is.vector(Games[1,,drop=F])#FALSE
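Since the Games matrix is not defined in this part of the notes, here is a self-contained sketch of the same behaviour on a small stand-in matrix (scores and its dimnames are hypothetical):

```r
# Hypothetical 2 x 3 matrix standing in for the Games matrix
scores <- matrix(1:6, nrow = 2,
                 dimnames = list(c("p1", "p2"), c("y1", "y2", "y3")))

row_vec <- scores[1, ]                # dimensions dropped: a named vector
row_mat <- scores[1, , drop = FALSE]  # dimensions kept: a 1 x 3 matrix

is.matrix(row_vec)  # FALSE
is.matrix(row_mat)  # TRUE
dim(row_mat)        # 1 3
```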
Matrix Computations
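rm(y)
y<-matrix( c(7, 15, 10, 22), nrow=2, byrow=TRUE)
y
'''
     [,1] [,2]
[1,]    7   15
[2,]   10   22
'''
z <- matrix(1:4, nrow=2)
z
'''
     [,1] [,2]
[1,]    1    3
[2,]    2    4
'''
w1 <- y * z #element-wise multiplication
w1
'''
     [,1] [,2]
[1,]    7   45
[2,]   20   88
'''
w2 <- y %*% z #matrix multiplication
w2
'''
     [,1] [,2]
[1,]   37   81
[2,]   54  118
'''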
rm(f)
f <- function(x) x/c(2,8) #the function divides a row by the vector (2,8)
rm(y)
z <- matrix(1:6, nrow=3) #rows: (1,4), (2,5), (3,6)
y <- apply(z,1,f)
y
#apply() calls our function f on each row and divides it by the vector (2,8)
#the first call divides (1,4) / (2,8) = (0.5, 0.5) and stores the result in the first column
'''
     [,1]  [,2] [,3]
[1,]  0.5 1.000 1.50
[2,]  0.5 0.625 0.75
'''
#to make the outcome of apply() more intuitive we can transpose it using the t() function
y <- t(apply(z,1,f))
y
'''
     [,1]  [,2]
[1,]  0.5 0.500
[2,]  1.0 0.625
[3,]  1.5 0.750
'''
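The second argument of apply() (the margin) decides whether the function is called once per row or once per column. A small sketch, assuming the 3 x 2 matrix implied by the output above (the names row_totals, col_totals and halved are illustrative):

```r
z <- matrix(1:6, nrow = 3)   # rows: (1,4), (2,5), (3,6)

row_totals <- apply(z, 1, sum)   # MARGIN = 1: one result per row
col_totals <- apply(z, 2, sum)   # MARGIN = 2: one result per column

row_totals   # 5 7 9
col_totals   # 6 15

# When the function returns a vector, each call's result becomes a column,
# which is why a row-wise apply() on a 3 x 2 matrix yields a 2 x 3 matrix
halved <- apply(z, 1, function(r) r / c(2, 8))
dim(halved)  # 2 3
```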
Linear Algebra on Matrices
#solving the system x1 - x2 = 2, x1 + x2 = 4
a <- matrix( c(1,1,-1,1), nrow=2, ncol=2)
b <- c(2,4)
#to solve such a system of equations we can use the solve() function
solve(a, b) #3 1
#if we provide only one argument - the matrix a - solve() computes the inverse of the matrix
solve(a)
'''
     [,1] [,2]
[1,]  0.5  0.5
[2,] -0.5  0.5
'''
t ( ) – transpose matrix
qr ( ) – qr decomposition
chol ( ) – cholesky decomposition
det ( ) – determinant
diag ( ) – extract the diagonals of a square matrix – useful to obtain variances from variance – covariance matrix
sweep ( ) – numerical analysis sweep operations
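rm (m)
dm <- c(1, 8)
diag(dm) #a vector argument builds a diagonal matrix
'''
     [,1] [,2]
[1,]    1    0
[2,]    0    8
'''
diag(3) #a single number n builds the n x n identity matrix
'''
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
'''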
Sweep function
rm(m)
m <- cbind(c(1,4,7), c(2,5,8), c(3,6,9))
m
'''
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
'''
sweep(m,1,c(1,4,7), "+")
#sweep() works somewhat like apply()
#the 1st argument is the array, the 2nd is the margin (1 = rows in this example), the 4th is the function to use,
#and the 3rd holds the values passed to that function, one per row here
'''
     [,1] [,2] [,3]
[1,]    2    3    4
[2,]    8    9   10
[3,]   14   15   16
'''
Graphs
x<-c(1,2,3)
y<-c(1,3,8)
plot(x,y) #showing 3 points for the data from the vectors
#abline() draws a line, with the arguments treated as the intercept and slope of the line
abline(c(0,1)) #adds the line y = 0 + 1 * x
abline(c(2,1)) #adds the line y = 2 + 1 * x
abline(c(3,4)) #adds the line y = 3 + 4 * x
#base graphics calls are not chained with +; each call draws on the current plot
par(bg="grey") #sets the background for the entire graph; must be set before plotting
plot(c(-3,3), c(-1, 5), type = "n", xlab="x axis", ylab="y axis") #type = "n" draws an empty plot
points(c(-2, 0, 2), c(0, 4, 2))
#Visualizing subsets
#to visualize a single player (a single row) in a chart, we need to make sure that subsetting returns a matrix
#if a vector is returned (the default), the chart would not show what we expect
Data <- MinutesPlayed[1,,drop=F]
#creating the chart
matplot(t(Data), type="b", pch=c(15:18), col = c(1:4,6))
#adding a legend
legend("bottomleft", inset=0.01, legend=Players[1:3], col = c(1:4,6), pch=c(15:18),
horiz=F)
#---introduction to qplot
install.packages("ggplot2")
library("ggplot2")
getwd() # returns the current working directory # "C:/Users/Pc/Documents"
#Set a new working directory on Windows
setwd("C:\\Users\\Pc\\Desktop\\R files")
getwd()
#-----qplot()
stats <- read.csv("DemographicData.csv")
stats
#plotting a histogram
qplot(data = stats, x=Internet.users) #providing only the data and x
#likewise the colour of the points must be given as I(colour); otherwise it is treated as another variable mapped in the plot
qplot(data = stats, x=Income.Group, y = Birth.rate, size =I(3),
colour = I("blue"))
#to change the colour of the points depending on Income.Group we assign colour to Income.Group
qplot(data = stats, x=Internet.users, y = Birth.rate, size = I(4), colour =
Income.Group)
#Visualization
qplot(data = merged, x=Internet.users, y = Birth.rate, colour = Region)
# 1. Shapes
#changing the symbol shown for the points, e.g. circles, triangles, diamonds; the shape parameter is passed as I()
#each point shape has its own number
qplot(data = merged, x=Internet.users, y = Birth.rate, colour = Region, size = I(4),
shape =I(19))
#3. Title
#the plot title is passed in the main parameter
qplot(data = merged, x=Internet.users, y = Birth.rate, colour = Region, size = I(4),
shape =I(19), alpha =I(0.6), main = "Birth Rate vs Internet Users")
#a FACTOR in R is a categorical variable used to assign a label or a group
#Genre for movies is a good example of a factor: it assigns a type of movie (Action, Comedy, Drama...) to a particular title
#we can force R to treat a column as a factor even if it holds numerical values
movies$Year <- factor(movies$Year)
# --- Aesthetics - how our data are mapped to what we want to see
library(ggplot2)
#adding size for the points: the bigger the point, the higher the Budget
ggplot(data = movies, aes(x=CriticRating, y=AudienceRating, colour = Genre, size =
BudgetMillions)) +
geom_point()
#the object p holds the data and the mapping (assumed here to be the same as above), but to get a plot we need to add geom_point()
p <- ggplot(data = movies, aes(x=CriticRating, y=AudienceRating, colour = Genre, size =
BudgetMillions))
p + geom_point()
#Overriding Aesthetics
#overriding the aes() parameters does not modify the original mapping assigned to q
q <- ggplot(data = movies, aes(x=CriticRating, y=AudienceRating, colour = Genre, size =
BudgetMillions))
#we can override the parameters given in ggplot() inside geom_point()
q + geom_point(aes(size=CriticRating))
q + geom_point(aes(colour=BudgetMillions))
q + geom_point(aes(x=BudgetMillions))
# MAPPING vs SETTING
#basic chart
r <- ggplot(data=movies, aes(x=CriticRating, y=AudienceRating)) # assigning the plot object to a variable
r + geom_point()
#TO SET A COLOUR FOR ALL POINTS DO NOT USE AESTHETICS => SETTING
#setting the point colour manually
r + geom_point(colour="DarkGreen")
#example of setting - we set the size for all points on the plot
r + geom_point(size=10)
s<-ggplot(data=movies, aes(x=BudgetMillions))
s + geom_histogram(binwidth = 10) #creating histogram with geom_histogram
#STATISTICAL TRANSFORMATION
#1. smoothed mean
u <- ggplot(data=movies, aes(x=CriticRating, y=AudienceRating, colour=Genre))
u + geom_point() + geom_smooth(fill=NA)
#boxplots
u <- ggplot(data=movies, aes(x=Genre, y=AudienceRating, colour=Genre))
u + geom_boxplot(size =1.2) + geom_point()
#tip / hack - using geom_jitter instead of geom_point()
u + geom_boxplot(size =1.2) + geom_jitter()
#facets:
#v is assumed to hold the same mapping as s above
v <- ggplot(data=movies, aes(x=BudgetMillions))
v + geom_histogram(binwidth = 10, aes(fill=Genre), colour = "Black") +
facet_grid(Genre~., scales="free") #scales = "free": each histogram gets its own scale depending on its values
#by default the scales are the same for all plots
#scatterplots
w <- ggplot(data=movies, aes(x=CriticRating, y=AudienceRating, colour=Genre))
w + geom_point(size=3)
#adding some facets
w + geom_point(size=3) +
facet_grid(Genre~Year) #Genre~Year creates a grid of plots: Genre in rows, Year in columns,
#so for each Genre we get one plot per year
w + geom_point(aes(size=BudgetMillions)) +
facet_grid(Genre~Year)+
geom_smooth()
#ADDING THEME
#4 Legend formatting
h +
xlab("Money Axis") + ylab("Number of Movies") + #naming the axes
ggtitle("Movie Budget Distribution") + #adding a title to the plot
theme(axis.title.x = element_text(colour = "DarkGreen", size = 30),
axis.title.y=element_text(colour="Red", size = 30),
axis.text.x = element_text(size = 20), #increasing the size of the tick labels on x
axis.text.y = element_text(size = 20), #increasing the size of the tick labels on y
legend.title = element_text(size = 30), #increasing the font size of the legend title
legend.text = element_text(size = 20), #increasing the font size of the legend entries
legend.position = c(1,1), #setting the position of the legend
legend.justification = c(1,1),
plot.title = element_text(colour="DarkBlue", size = 40, family ="Courier") #setting the title colour, size and font type (family)
)
Saving a graph to a file
#checking the currently open devices - we should see pdf in the list
dev.list()
#the screen device is named RStudioGD when using RStudio
'''
RStudioGD       png       pdf
        2         3         4
'''
#closing the device
dev.off()
pdf("wykres2.pdf")
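The usual order of the steps is: open the file device, draw, then close the device. A minimal sketch of that sequence, writing to a temporary file (out_file is an illustrative name, and the plotted points are reused from the plot() example earlier in these notes):

```r
# Write to a temporary file so the sketch does not touch the working directory
out_file <- tempfile(fileext = ".pdf")

pdf(out_file)                   # open a PDF graphics device
plot(c(1, 2, 3), c(1, 3, 8))    # everything drawn now goes to the file
abline(c(0, 1))                 # add the line y = 0 + 1 * x
dev.off()                       # close the device; the file is complete only now

file.exists(out_file)           # TRUE
```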
Arrays
While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes
a dim argument that sets the size of each dimension. The example below creates an array made of two
3x3 matrices.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result −
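, , 1
     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"
, , 2
     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"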
#ARRAYS
f1 <- c(46, 21, 50)
f2 <- c (30, 25, 50)
firsttest <- matrix(c(f1, f2), nrow=3)
firsttest
'''
     [,1] [,2]
[1,]   46   30
[2,]   21   25
[3,]   50   50
'''
secondtest <- matrix(c(46, 41, 50, 43, 35, 50), nrow=3)
#with dim = c(3,2,2) we specify 2 layers, each a matrix with 3 rows and 2 columns
tests <- array(c(firsttest, secondtest), dim=c(3,2,2))
attributes(tests)
#$dim 3 2 2
tests
'''
The array holds 2 matrices; layer (, , 1) is the one we passed first,
in our case firsttest
, , 1
     [,1] [,2]
[1,]   46   30
[2,]   21   25
[3,]   50   50
layer 2 of the array is the matrix we passed second, in our case secondtest
, , 2
     [,1] [,2]
[1,]   46   43
[2,]   41   35
[3,]   50   50
'''
Factors
Factors are data objects used to categorize data and store it as levels. They can store both strings
and integers. They are useful for columns that have a limited number of unique values, such as "Male"/"Female" or
TRUE/FALSE, and they are widely used in statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
Syntax
gl(n, k, labels)
rm(x)
x<-c(5,12,13,12)
#making factor out of x vector
xf<-factor(x)
is.factor(xf) #TRUE
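xf #levels: 5 12 13
#declaring the extra level 88 in advance, as the output below implies
xff<-factor(x, levels=c(5,12,13,88))
xff
'''
[1] 5 12 13 12
Levels: 5 12 13 88
'''
xff[2]<-88
xff
'''
[1] 5 88 13 12
Levels: 5 12 13 88
'''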
#tapply() function
tapply(x,f,g)
x - vector, f - factor or list of factors, g - function
Each factor in f must have the same length as the vector x
ages<-c(25,26,55,37,21,42)
affils<-c("R", "D", "D", "R", "U", "D")
tapply(ages, affils, mean)
'''
D R U
41 31 21
'''
What happened: each element of ages is assigned to the group given by the corresponding element of affils.
The mean is then calculated within each of the groups D, R and U.
'''
gender age income
1 M 47 55000
2 M 59 88000
3 F 21 32450
4 M 32 76500
5 F 33 123000
6 F 24 45650
'''
#tapply will calculate mean() for the income based on the factors gender and over25
tapply(d$income, list(d$gender, d$over25), mean)
'''
0 1
F 39050 123000.00
M NA 73166.67
'''
split(vector, factor)
vector - vector or data frame
factor - factor or list of factors
'''
gender age income over25
1 M 47 55000 1
2 M 59 88000 1
3 F 21 32450 0
4 M 32 76500 1
5 F 33 123000 1
6 F 24 45650 0
'''
the split() function returns the values for each combination of the factors provided:
Female and age<25y
Male and age<25y
Female and age>25y
Male and age>25y
'''
$`F.0`
[1] 32450 45650
$M.0
numeric(0)
$F.1
[1] 123000
$M.1
[1] 55000 88000 76500
'''
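The two grouped calls above can be reproduced end to end. In this sketch the data are copied from the printed table; how the over25 column was created is not shown in the notes, so the ifelse() line is an assumption.

```r
# Data copied from the table above; the over25 construction is an assumption
d <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(47, 59, 21, 32, 33, 24),
  income = c(55000, 88000, 32450, 76500, 123000, 45650)
)
d$over25 <- ifelse(d$age > 25, 1, 0)   # 1 if older than 25, else 0

# Mean income for every gender/over25 combination (NA where a group is empty)
means <- tapply(d$income, list(d$gender, d$over25), mean)
means

# The same grouping, but returning the raw income values of each group
groups <- split(d$income, list(d$gender, d$over25))
groups
```

tapply() applies the function and returns a table of results, while split() only partitions the values, which is useful when the groups need further processing.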
Data Frames
Data frames are tabular data objects. Unlike a matrix, each column of a data frame can contain a different mode of data:
the first column can be numeric while the second column is character and the third column is logical. A data frame is a list
of vectors of equal length.
Data Frames are created using the data.frame( ) function.
#head() shows the first 6 rows, which also lets us check the column names
head(mydf) #"Countries_2012_Dataset" "Codes_2012_Dataset" "Regions_2012_Dataset"
summary(mydf)
#removing a column from the data frame
merged$Country <- NULL
str(merged)
head(merged)
Using $ sign
#using $ sign
head(stat)
#as the columns have names, we can select a value by providing the row number and the column name
stat[3, "Birth.rate"]
#$ works for data frames: we can put any column name after $ and it returns that column as a vector
stat$Internet.users
#with $ we can also easily access a single value of the selected vector
stat$Internet.users[2]
#levels() returns the level labels of a factor column (nlevels() returns how many there are)
levels(stat$Income.Group) # "High income" "Low income" "Lower middle income" "Upper middle income"
#subsetting
#1 we may simply assign values to a column name that does not exist yet and the column will be added
stat$NewColum <- stat$Birth.rate * stat$Internet.users
head(stat) # columns we get: Country.Name Country.Code Birth.rate Internet.users Income.Group NewColum
#remove a column
stat$NewColum <- NULL
head(stat) # column list: Country.Name Country.Code Birth.rate Internet.users Income.Group
In the example below we create a data frame with new rows and merge it with the
existing data frame to create the final data frame.
# Print a header.
cat("# # # # The First data frame\n")
# Print a header.
cat("# # # The Second data frame\n")
# Print a header.
cat("# # # The combined data frame\n")
stat[filter,] #if we pass a logical vector as the row index, only the rows where the value is TRUE are displayed
stat[stat$Birth.rate > 40, ] # select all rows where Birth Rate > 40
#stat$Birth.rate > 40 returns a logical vector of TRUE and FALSE values
library(datasets)
library(dplyr)
#summarizing by month
airquality %>%
group_by(Month) %>%
summarise(mean(Temp, na.rm=TRUE))
'''
1 5 65.5
2 6 79.1
3 7 83.9
4 8 84.0
5 9 76.9
'''
setwd("C:\\Users\\Pc\\Desktop\\R files\\pliki")
getwd()
library(ggplot2)
library(tidyr)
library(dplyr)
cpi = read.csv("cpi.csv")
head(cpi)
names(cpi)
#Gather columns
?gather #from the tidyr package
#takes multiple columns and collapses into key-value pairs
getwd()
stocks = read.csv("5stocks.csv")
head(stocks) #stock price movement data, Jul 2001 - May 2017
#ts(data, start, end, frequency); frequency = the number of observations per unit of time
myts <- ts(smove, start=c(2001, 7), end=c(2017, 5), frequency=250) #250 = number of trading days in a year
plot(myts)
plot.ts(myts2)
#######
install.packages("devtools")
library(devtools)
#install_github() - function from the devtools package that installs R packages hosted on GitHub.
#It needs the developer's name as well as the package name:
#install_github("DeveloperName/PackageName")
install_github("sinhrks/ggfortify")
library(ggfortify)
autoplot(myts) #plotting the time series on facets
######
#Sudden changes in values over a period of time
g <- read.csv("growth-in-gdp.csv")
head(g)
names(g)
#the same could be done by listing the values in a vector, but %in% can be more convenient when the number of columns to filter is large
data<-df[,c("Country", "Value")]
head(data)
Variables
Variable naming:
Variable name         Validity  Reason
var_name2.            valid     Has letters, numbers, dot and underscore
var_name%             invalid   Has the character '%'. Only dot (.) and underscore are allowed.
2var_name             invalid   Starts with a number
.var_name, var.name   valid     Can start with a dot (.), but the dot should not be followed by a number.
.2var_name            invalid   The starting dot is followed by a number, making it invalid.
_var_name             invalid   Starts with _, which is not valid
Variable assignment
Variables can be assigned values using the leftward (<-), rightward (->) and equals (=) operators. The values of the variables can
be printed using the print() or cat() function. The cat() function combines multiple items into a continuous print output.
print(var.1)
cat ("var.1 is ", var.1 ,"\n")
cat ("var.2 is ", var.2 ,"\n")
cat ("var.3 is ", var.3 ,"\n")
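The assignments that precede these print() and cat() calls are not shown. A minimal sketch of the three assignment operators, with values that are arbitrary and chosen only for illustration:

```r
# The values are arbitrary and chosen only for illustration
var.1 = c(0, 1, 2, 3)            # assignment using the equals operator
var.2 <- c("learn", "R")         # assignment using the leftward operator
c(TRUE, 1) -> var.3              # assignment using the rightward operator

print(var.1)
cat("var.1 is ", var.1, "\n")
cat("var.2 is ", var.2, "\n")
cat("var.3 is ", var.3, "\n")    # TRUE was coerced to 1 when combined with a number
```

All three operators create the same kind of binding; <- is the form most R style guides prefer for ordinary assignment.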
Finding Variables
To list all the variables currently available in the workspace we use the ls() function. The ls() function can also use
patterns to match the variable names.
print(ls())
[1] "my var" "my_new_var" "my_var" "var.1"
[5] "var.2" "var.3" "var.name" "var_name2."
[9] "var_x" "varname"
Variables starting with a dot (.) are hidden; they can be listed by passing the argument all.names = TRUE to ls().
print(ls(all.names = TRUE))
[1] ".cars" ".Random.seed" ".var_name" ".varname" ".varname2"
[6] "my var" "my_new_var" "my_var" "var.1" "var.2"
[11] "var.3" "var.name" "var_name2." "var_x"
Deleting variables
A single variable can be deleted using the rm() function. Below we delete var.3; printing it afterwards raises an error.
rm(var.3)
print(var.3)
Error in print(var.3) : object 'var.3' not found
All the variables can be deleted by using the rm() and ls() functions together.
rm(list = ls())
print(ls())
character(0)
z1 <- readLines("z1.txt")
z1 #[1] "John 25" "Mary 28" "Jim 19"
z1[1]#"John 25"
readLines() function
#1. opening the connection; an open connection is needed for R to track whether EOF has been reached
c<-file("z1.txt", "r")
readLines(c, n=2) #"John 25" "Mary 28"
#to move back to the beginning of the file we can use the seek() function
seek(con=c, where = 0) # where = 0 places the file pointer zero characters from the start of the file
readLines(c, n=1) # "John 25"
#once the working directory has been changed we can pass just the name of a file located there
stats2 <- read.csv("DemographicData.csv")
stats2
getwd() #"C:/Users/Pc/Documents"
setwd("C:\\Users\\Pc\\Desktop\\R files")
getwd()
input<-read.csv("input.csv")
input
colnames(input) = c("id", "name", "salary", "start_date", "dept")
input
#install.packages("RCurl")
library(RCurl)#allows to read data from online sources
data1= read.csv(text=getURL("https://raw.githubusercontent.com/sciruela/Happiness-Salaries/master/data.csv"))
head(data1)
summary(data1)
#sometimes the first x lines of a file contain no data; we can skip them using the skip argument
rm(data2)
data2=read.csv(text=getURL("https://raw.githubusercontent.com/opetchey/RREEBES/master/Beninca_etal_2008_Nature/data/nutrients_original.csv"), skip=7, header=T)
head(data2)
summary(data2)
data3=read.csv(text=getURL("https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/246663/pmgiftsreceivedaprjun13.csv"))
head(data3)
install.packages("googlesheets")
library(googlesheets)
#the Google sheet that we are accessing contains multiple worksheets
# list worksheets
gs_ws_ls(be)
# convert to data.frame
wdf = as.data.frame(west)
head(wdf)
Writing to csv
write.csv(persons2014, "output.csv")
checkexport <-read.csv("output.csv"); checkexport
'''
X id name salary start_date dept
1 3 3 Michelle 611.00 2014-11-15 IT
2 4 4 Ryan 729.00 2014-05-11 HR
3 5 5 Gary 843.25 2015-03-27 Finance
4 8 8 Guru 722.50 2014-06-17 Finance
'''
#the export added an extra, meaningless column X (the row names); it can be dropped by passing row.names = FALSE when writing the file
#file.choose()
#selecting the location of the file manually
write.csv(persons2014, file.choose(), row.names = FALSE)
#5 saving to xlsx
write.xlsx(dataout, "FilteredData.xlsx", row.names = FALSE)
#6 We can also choose the path and name the file manually using file.choose()
write.xlsx(dataout, file.choose(), row.names = FALSE)
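The effect of row.names on the exported CSV can be checked by reading the file back. This is a sketch in which the data frame is a hypothetical stand-in for persons2014 built from the values printed above, and the output path is a temporary file rather than the author's location.

```r
# Hypothetical stand-in for persons2014, using the values printed above
persons2014 <- data.frame(
  id     = c(3, 4, 5, 8),
  name   = c("Michelle", "Ryan", "Gary", "Guru"),
  salary = c(611.00, 729.00, 843.25, 722.50),
  dept   = c("IT", "HR", "Finance", "Finance")
)
out_file <- file.path(tempdir(), "output.csv")   # temporary path, for illustration

write.csv(persons2014, out_file)                 # default: the row names are written too
with_rownames <- read.csv(out_file)
names(with_rownames)                             # "X" "id" "name" "salary" "dept"

write.csv(persons2014, out_file, row.names = FALSE)
without_rownames <- read.csv(out_file)
names(without_rownames)                          # "id" "name" "salary" "dept"
```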
install.packages("XML")
#load xml library
library("XML")
#load other required packages
library("methods")
#to access a particular element of the json data we can use the following:
#from the 2nd element of the json data we select id and iso2Code
d3 <- lapply(json_data[[2]], function(x) c(x["id"], x["iso2Code"]))
d3
'''
the data are still not very readable
[[1]]
[[1]]$`id`
[1] "AUS"
[[1]]$iso2Code
[1] "AU"
'''
#do.call() lets you call any R function, but instead of writing out the arguments one by one you pass them in a list
d3r <-do.call(rbind, d3) # build a matrix from all the data in d3
d3r
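The lapply() plus do.call(rbind, ...) pattern can be tried without downloading anything. In this sketch records is a hypothetical stand-in for json_data[[2]]; only the first record appears in the notes, the second one is invented for illustration.

```r
# Hypothetical stand-in for json_data[[2]]: a list of records, each a named list.
# Only the first record appears in the notes; the second is invented.
records <- list(
  list(id = "AUS", iso2Code = "AU", name = "Australia"),
  list(id = "AUT", iso2Code = "AT", name = "Austria")
)

# Keep only the two fields of interest from every record
d3 <- lapply(records, function(x) c(x["id"], x["iso2Code"]))

# do.call(rbind, d3) is equivalent to rbind(d3[[1]], d3[[2]], ...)
d3r <- do.call(rbind, d3)
d3r
dim(d3r)   # 2 rows (records) by 2 columns (id, iso2Code)
```

do.call() is useful here because the number of records is not known in advance, so the arguments to rbind() cannot be typed out by hand.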
install.packages("rvest")
library(rvest)
url2="https://en.wikipedia.org/wiki/List_of_World_Heritage_Sites_in_the_United_Kingdom_and_the_British_Overseas_Territories"
In the HTML source we look for the table class; hovering over it should highlight the table on the page.
Right-click > Copy > Copy XPath
Reading data from online HTML tables
library(XML)
library(RCurl)
#besides the table and its data, text from the website was also read into our data
#str() shows the structure of the data frame: the number of rows and columns and information about each column
str(stat)
#summary() shows min, max, mean and median for numeric variables, and how many rows fall into each category for factors
summary(stat)
Input / Output
Input – scan() – reading text files
getwd()
setwd("C:\\Users\\Pc\\Desktop\\R files\\pliki")
'''
1.txt:
123
45
6
2.txt:
123
4.2 5
6
3.txt:
abc
de f
g
4.txt:
abc
123 6
y
'''
#scan() - by default scan() expects numeric values, read from the keyboard when no file is given
#the argument what="" tells scan() that we will be providing character data
names=scan(what="")
joe fred bob john
sam sue robin
#if what= is a list containing examples of the expected data types, scan() returns a list with as many elements as there are types in it
names2
'''
$`a`
[1] 1 2 3
$b
[1] "dog" "cat" "duck"
$c
[1] 3 5 7
'''
scan("2.txt") #we receive a vector of doubles, as one of the entries was a double
#123.0 4.2 5.0 6.0
scan("3.txt", what="") # what="" indicates that we want to read character strings
#[1] "abc" "de" "f" "g"
#by default scan() assumes that the items are separated by whitespace
#we can use the optional sep argument for other situations
scan("4.txt", what="")
#[1] "abc" "123" "6" "y"
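The scan() calls above depend on files in the author's working directory. The sketch below recreates two of the sample files in a temporary directory (the paths are illustrative) and also shows the sep argument mentioned in the comments.

```r
# Recreate two of the sample files in a temporary directory (paths are illustrative)
f2 <- file.path(tempdir(), "2.txt")
f3 <- file.path(tempdir(), "3.txt")
writeLines(c("123", "4.2 5", "6"), f2)
writeLines(c("abc", "de f", "g"), f3)

nums  <- scan(f2, quiet = TRUE)              # numeric by default: 123.0 4.2 5.0 6.0
words <- scan(f3, what = "", quiet = TRUE)   # character mode: "abc" "de" "f" "g"

# sep changes the delimiter: with sep = "\n" every line becomes one item
whole_lines <- scan(f3, what = "", sep = "\n", quiet = TRUE)   # "abc" "de f" "g"
```

quiet = TRUE only suppresses the "Read N items" message; it does not change what is returned.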
Printing to screen
cat() vs print()
cat() is valid only for atomic types (logical, integer, real, complex, character) and
names.
This means that, in general, you cannot call cat() on lists holding non-atomic elements or on
arbitrary objects. In practice it simply converts its arguments to character strings and
concatenates them, so you can think of it as something like as.character() %>% paste().
print() is a generic function, so you can define a specific implementation for a certain
S3 class.
The purpose of print() is to show values much as they are entered in source code, so
quotes and escaped characters such as "\n" are shown. cat() is intended to send characters
straight to the console, so the effects of special characters are visible (e.g. the text
continues on the next line when a "\n" occurs in a string). For the same reason cat() does
not print the element numbering ([1]).
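The difference is easiest to see on a string that contains an escape sequence. A small sketch, where the string is arbitrary and capture.output() is used only to count the lines each function produces:

```r
s <- "first line\nsecond line"

print(s)        # shows the quotes and the escape sequence, prefixed with [1]
cat(s, "\n")    # writes the raw characters, so \n starts a new line

# capture.output() collects what would be printed, one element per output line
printed <- capture.output(print(s))
catted  <- capture.output(cat(s, "\n"))
length(printed)   # 1: print() keeps everything on one line
length(catted)    # 2: cat() really breaks the line
```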
Arithmetic Operators
Arithmetic operators add, divide, multiply, etc. the corresponding values of two vectors or matrices.
#relational (comparison) operators
4<5
10 > 100
4 == 5 #compares 2 values for equality
4 != 5 #not equal
4 <= 5 #less than or equal
4 >= 5 #greater than or equal
result <- !(1 > 2) # ! is NOT: it returns the opposite of the expression in (); 1 > 2 is FALSE, so the result is TRUE
result
==
Checks if each element of the first vector is equal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v == t)
it produces the following result:
[1] FALSE FALSE FALSE TRUE
!=
Checks if each element of the first vector is unequal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v!=t)
it produces the following result:
[1] TRUE TRUE TRUE FALSE
&
It is called the element-wise logical AND operator. It combines each element of the first vector with the
corresponding element of the second vector and gives TRUE if both elements are TRUE.
v <- c(3,1,TRUE,2+3i)
t <- c(4,1,FALSE,2+3i)
print(v&t)
it produces the following result:
[1] TRUE TRUE FALSE TRUE
!
It is called the logical NOT operator. It takes each element of the vector and gives the opposite logical value.
v <- c(3,0,TRUE,2+2i)
print(!v)
it produces the following result:
[1] FALSE TRUE FALSE FALSE
The logical operators && and || work on single values and return a single TRUE or FALSE. In versions of R before 4.3.0
they silently used only the first element of each vector; since R 4.3.0 an operand of length greater than one is an
error, so the first elements are selected explicitly below.
&&
Called the logical AND operator. Gives TRUE only if both operands are TRUE.
v <- c(3,0,TRUE,2+2i)
t <- c(1,3,TRUE,2+3i)
print(v[1] && t[1])
it produces the following result:
[1] TRUE
||
Called the logical OR operator. Gives TRUE if at least one of the operands is TRUE.
v <- c(0,0,TRUE,2+2i)
t <- c(0,3,TRUE,2+3i)
print(v[1] || t[1])
it produces the following result:
[1] FALSE
Miscellaneous operators
:
Colon operator. It creates a series of numbers in sequence for a vector.
v <- 2:8
print(v)
it produces the following result:
[1] 2 3 4 5 6 7 8
%in%
This operator is used to identify if an element belongs to a vector.
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
it produces the following result:
[1] TRUE
[1] FALSE
%*%
This operator performs matrix multiplication; here a matrix is multiplied by its transpose.
M = matrix( c(2,6,5,1,10,4), nrow = 2,ncol = 3,byrow = TRUE)
t = M %*% t(M)
print(t)
it produces the following result:
     [,1] [,2]
[1,]   65   82
[2,]   82  117
Sets Operations
rm(x); rm(y)
x<-c(1,2,5)
y<-c(5,1,8,9)
#printing 2 variables; several calls can be put on one line when separated by ;
x;y
#union of 2 sets
union(x, y)# 1 2 5 8 9
#intersection of 2 sets - the common part
intersect(x, y)#1 5
#difference of 2 sets: all elements of x that are not in y
setdiff(x, y) #2
#test for equality
setequal(x, y) # FALSE
#testing membership: c %in% y
2 %in% x #TRUE
#number of possible subsets of size k from a set of size n - choose(n,k)
choose(5,2) #10
Functions
Putting ? in front of a function name opens the help page for that function
?rnorm
?c
We can create user-defined functions in R. They are specific to what a user wants and once created they can be used
like the built-in functions. Below is an example of how a function is created and used.
The arguments to a function call can be supplied in the same sequence as defined in the function or they can be supplied
in a different sequence but assigned to the names of the arguments.
f(1:3, 0)#1 4 9
f(1:3, 2)#9 16 25
f(1:3, 1:3)#4 16 36 = (1+1)^2, (2+2)^2 (3+3)^2
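The definition of f is not shown in these notes. The sketch below assumes f squares the sum of its two arguments, which reproduces the three results above, and adds a call with named arguments in a different order.

```r
# Assumed definition, chosen so that it reproduces the three results above
f <- function(x, c) {
  (x + c)^2
}

f(1:3, 0)          # arguments matched by position: 1 4 9
f(c = 2, x = 1:3)  # arguments matched by name, in a different order: 9 16 25
f(1:3, 1:3)        # both arguments are vectors: 4 16 36
```

Because the arithmetic is vectorized, the same function works whether c is a single number or a vector of the same length as x.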
Arguments to functions are evaluated lazily, which means they are evaluated only when needed by the function body.
new.function(6)
[1] 36
[1] 6
Error in print(b) : argument "b" is missing, with no default
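The output above shows lazy evaluation at work: the first two values are printed before the missing argument causes an error. The function body is not shown in the notes, so the definition in this sketch is an assumption that is consistent with that output.

```r
# Assumed definition, consistent with the output above
new.function <- function(a, b) {
  print(a^2)
  print(a)
  print(b)   # b is evaluated only here, so the error appears only at this point
}

# tryCatch() is used so that the script keeps running after the error
result <- tryCatch(
  new.function(6),
  error = function(e) conditionMessage(e)
)
result
```

If b were evaluated eagerly, the call would fail before anything was printed.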
function()
{
#body code
}
#function for creating charts; we pass the data we want to use and the rows we want to select
myplot <- function(data, rows=1:10) #the parameters can take default values
{
Data <- data[rows,,drop=F]
#creating the chart
matplot(t(Data), type="b", pch=c(15:18), col = c(1:4,6))
#adding legend
legend("bottomleft", inset=0.01, legend=Players[rows], col = c(1:4,6), pch=c(15:18),
horiz=F)
}
myplot(Salary, 1:4)
Packages
Strings
Valid Strings
a <- 'Start and end with single quote'
print(a)
Invalid Strings
e <- 'Mixed quotes"
print(e)
Error: unexpected symbol in:
"print(e)
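sep sets the separator placed between the arguments being pasted together. collapse sets the separator used to join
the elements of the result into a single string. Neither affects the spaces inside an individual string.
a <- "Hello"
b <- 'How'
c <- "are you? "
print(paste(a,b,c)) #"Hello How are you? "
print(paste(a,b,c, sep = "-")) #"Hello-How-are you? "
print(paste(a,b,c, sep = "", collapse = "")) #"HelloHoware you? "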
[1] "23.1234568"
[1] "6.000000e+00" "1.314521e+01"
[1] "23.47000"
[1] "6"
[1] " 13.7"
[1] "Hello "
[1] " Hello "
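The seven results listed above appear without the calls that produced them. They are consistent with the base R format() function, so the following sketch assumes calls of this form; the arguments are a reconstruction, not the author's code.

```r
# Assumed calls; the notes show only the results
format(23.123456789, digits = 9)            # at most 9 significant digits
format(c(6, 13.14521), scientific = TRUE)   # scientific notation
format(23.47, nsmall = 5)                   # at least 5 digits after the decimal point
format(6)                                   # a number becomes a string
format(13.7, width = 6)                     # padded on the left to a width of 6
format("Hello", width = 8, justify = "l")   # left-justified in 8 characters
format("Hello", width = 8, justify = "c")   # centred in 8 characters
```

format() always returns character strings, which is why every result is shown in quotes.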
toupper(x)
tolower(x)
grep("error",
c("warining", "error", "Error", "alarm", "warning_error", "warning_eRRor"),
ignore.case=TRUE)
#2 3 5 6 - with ignore.case=TRUE the match is case insensitive
2. grepl() – returns TRUE if a string contains the pattern, otherwise returns FALSE.
Vector match:
paste("North", "Pole", sep="") #sep="" removes the space that separates the strings
#"NorthPole"
Regular expressions
grep()
grep("[xy]", c("Pawelx", "Kasiay", "Zigiz")) #1 2
#[xy] - matches any string that contains either x or y
gsub() function
x <- "xxxPawelxxx"
gsub("xxx", "zzz", x) #"zzzPawelzzz"
Syntax      Description
\\d         Digit: 0, 1, 2 ... 9
\\D         Not a digit
\\s         Whitespace
\\S         Not whitespace
\\w         Word character (letter, digit or underscore)
\\W         Not a word character
\\t         Tab
\\n         New line
^           Beginning of the string
$           End of the string
\           Escapes special characters, e.g. \\ is "\", \+ is "+"
|           Alternation, e.g. (e|d)n matches "en" and "dn"
.           Any character, except \n or line terminator
[ab]        a or b
[^ab]       Any character except a and b
[0-9]       Any digit
[A-Z]       Any uppercase letter A to Z
[a-z]       Any lowercase letter a to z
[A-Za-z]    Any uppercase or lowercase letter
i+          i at least one time
i*          i zero or more times
i?          i zero or one time
i{n}        i occurs n times in sequence
i{n1,n2}    i occurs n1 to n2 times in sequence
i{n1,n2}?   non-greedy version of the match above
i{n,}       i occurs >= n times
[:alnum:]   Alphanumeric characters: [:alpha:] and [:digit:]
[:alpha:]   Alphabetic characters: [:lower:] and [:upper:]
[:blank:]   Blank characters: e.g. space, tab
[:cntrl:]   Control characters
[:digit:]   Digits: 0 1 2 3 4 5 6 7 8 9
[:graph:]   Graphical characters: [:alnum:] and [:punct:]
[:lower:]   Lower-case letters in the current locale
[:print:]   Printable characters: [:alnum:], [:punct:] and space
[:punct:]   Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:space:]   Space characters: tab, newline, vertical tab, form feed, carriage return, space
[:upper:]   Upper-case letters in the current locale
[:xdigit:]  Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
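A few entries from the table in use. The input strings in this sketch are invented for illustration. Note that a POSIX class such as [:alpha:] has to be placed inside a bracket expression, so it is written [[:alpha:]] in a pattern.

```r
x <- c("order 66", "room 101b", "no digits")

grepl("\\d", x)                   # does the string contain a digit?
gsub("\\d+", "#", x)              # replace every run of digits with #
grepl("^[[:alpha:] ]+$", x)       # only letters and spaces from start to end
sub("^(\\w+)\\s.*$", "\\1", x)    # keep the first word only
```

The backslashes are doubled because R first interprets the string literal and only then passes the result to the regular expression engine.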
LOOPS
WHILE
n<-0
n
while(n<10)
{
print(n)
n <- n+1
}
FOR LOOP
#FOR LOOP
for(variable in sequence)
{
body of the loop
}
for(i in 1:5)
{
print(i)
}
IF STATEMENT
#IF statement
x<-2
if (x > 0)
{
print("Number greater than 0")
} else if (x < 0) #important: else if must be on the same line as the closing }
{
print("Number less than 0")
} else #important: else must be on the same line as the closing }
{
print("Number equal to 0")
}
rm(x)
rm(y)
x<-1:10
y<-ifelse(x%%2 == 0, 5, 12)# %% - modulo, the remainder of division
#if the remainder is 0 substitute 5, otherwise 12
y
Statistical Distributions
d - density or probability mass function (pmf)
p - cumulative distribution function (cdf)
q - quantile function (inverse of the cdf)
r - random number generation
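The four prefixes are combined with a distribution name, for example norm for the normal distribution. A short sketch, in which the seed and the parameters of the simulated sample are arbitrary choices made for illustration:

```r
dnorm(0)            # density of N(0, 1) at 0
pnorm(1.96)         # P(X <= 1.96), roughly 0.975
qnorm(0.975)        # quantile: the inverse of pnorm(), roughly 1.96

set.seed(42)        # arbitrary seed, only to make the draw reproducible
x <- rnorm(1000, mean = 3, sd = 0.25)   # parameters are illustrative
mean(x)             # close to 3
sd(x)               # close to 0.25
```

The same pattern applies to other distributions, e.g. dbinom/pbinom/qbinom/rbinom or dt/pt/qt/rt.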
MEASURE OF CENTER
#mean, median, skewness, kurtosis (skewness() and kurtosis() are not in base R; they come from an add-on package such as moments)
mean(x) #3.013375
median(x) #3.006536
#skewness
skewness(x) #0.02695227
#kurtosis
kurtosis(x) #2.834572
MEASURE OF VARIATION
#standard deviation
sd(x) #0.2492114
#CHI-SQUARE TEST
#H0 - the 2 nominal variables (rows and columns) have no association between them
'''
data: food.survey
X-squared = 0.13751, df = 2, p-value = 0.9336
'''
CORRELATION
#Parametric form - Pearson's correlation - should be used on data that are normally distributed
#Non-parametric - Spearman's rank and Kendall's tau - should be used for data that are not normally distributed
data(mtcars)
shapiro.test(mtcars$mpg)
'''
Shapiro-Wilk normality test
data: mtcars$mpg
W = 0.94756, p-value = 0.1229
p > 0.05 => we fail to reject H0; the data are consistent with a normal distribution
'''
shapiro.test(mtcars$wt)
'''
Shapiro-Wilk normality test
data: mtcars$wt
W = 0.94326, p-value = 0.09265
p > 0.05 => we fail to reject H0; the data are consistent with a normal distribution
'''
#Both variables are normally distributed => we are using Pearson's correlation
cor(mtcars$mpg, mtcars$wt) #default = Pearson's
#[1] -0.8676594
#when changing the variables order we are getting the same result
cor(mtcars$wt, mtcars$mpg, method="pearson")
#[1] -0.8676594
#if NAs are present we should specify use="complete.obs" so that only complete rows are used
cor(mtcars$wt, mtcars$mpg, method="pearson", use="complete.obs")
#[1] -0.8676594
'''
the p-value is less than 0.05, so we reject H0: our variables are correlated
'''
#Computing correlation
cor(iris$Petal.Length, iris$Petal.Width)
#[1] 0.9628654
#should we use Pearson's correlation for iris?
shapiro.test(iris$Petal.Length)
'''
Shapiro-Wilk normality test
data: iris$Petal.Length
W = 0.87627, p-value = 7.412e-10
p < 0.05 => we reject H0; the data are not normally distributed
'''
#we should use spearman or kendall method for computing correlation
cor(iris$Petal.Length, iris$Petal.Width, method="spearman")
#[1] 0.9376668
cor(iris$Petal.Length, iris$Petal.Width, method="kendall")
#[1] 0.8068907
#qqnorm() - if the data are normally distributed, the points should lie along a straight line
#qqnorm() plots our data so that we can judge whether or not they follow a straight line
#qqline() - works on the same assumption as qqnorm(), but additionally draws the straight line along which points from a normal distribution should lie
qqnorm(X$len)
qqline(X$len)
'''
data: X$len
W = 0.96743, p-value = 0.1091
p > 0.05, so there are no grounds to reject H0; the data are consistent with a normal distribution
'''
#to compute the Z value for the 95% confidence interval we can use qnorm()
#we use 0.975 because 1 - 0.95 = 0.05, and this 0.05 (alpha) is the total probability in the 2 tails
#each tail holds 0.5 * alpha = 0.5 * 0.05 = 0.025
#1 - 0.025 = 0.975
qnorm(0.975)
SE = S / sqrt(n)
SE
#2. Z value
Zval = qnorm(0.975)
Zval# 1.959964
#4. Confidence interval
srednia = mean(X$len)
CI <- srednia + c(-MOE, MOE)
CI #[1] 16.87783 20.74884
n=length(X$len)
n
#1. Computing the t value
tval <- qt(0.975, df=n-1)
tval #2.000995
#2. Margin of error
tMOE <- tval *SE
#3. Confidence interval
sredniat <-mean(X$len)
CIt <- sredniat + c(-tMOE, tMOE)
CIt #16.83731 20.78936
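Several intermediate steps (S, n and MOE) are not defined in the fragments above. The sketch below fills them in under the assumption that S is the sample standard deviation and that the margin of error is the critical value times the standard error; with the built-in ToothGrowth data this reproduces both intervals quoted in the notes.

```r
len <- ToothGrowth$len           # built-in dataset used throughout these notes

n  <- length(len)
S  <- sd(len)                    # sample standard deviation (assumed meaning of S)
SE <- S / sqrt(n)                # standard error of the mean

# Normal (Z) based interval
Zval <- qnorm(0.975)
MOE  <- Zval * SE                # margin of error (assumed definition)
CI   <- mean(len) + c(-MOE, MOE)

# Student t based interval, n - 1 degrees of freedom
tval <- qt(0.975, df = n - 1)
tMOE <- tval * SE
CIt  <- mean(len) + c(-tMOE, tMOE)

CI    # compare with 16.87783 20.74884 above
CIt   # compare with 16.83731 20.78936 above
```

The t-based interval is slightly wider because the t distribution has heavier tails than the normal distribution; it is also the interval that t.test() reports.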
t.test(X$len)
'''
One Sample t-test
data: X$len
t = 19.051, df = 59, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
16.83731 20.78936
sample estimates:
mean of x
18.81333
'''
#when calling t.test() we can also specify which confidence level to use
t.test(X$len, conf.level = 0.90)
'''
One Sample t-test
data: X$len
t = 19.051, df = 59, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
17.16309 20.46358
sample estimates:
mean of x
18.81333
'''
T Test
#examine whether the difference in means is significant
X <-ToothGrowth
srednia <- mean(X$len)
srednia #18.81333
#one-sample t-test: tests whether the mean is equal to a certain number
#H0: true value of mean = 18
t.test(X$len, mu=18) #mu is the value we believe the mean to be
'''
One Sample t-test
data: X$len
t = 0.82361, df = 59, p-value = 0.4135
alternative hypothesis: true mean is not equal to 18
95 percent confidence interval:
16.83731 20.78936
sample estimates:
mean of x
18.81333
'''
?t.test
t.test(OJ, VC, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
#var.equal - whether the variances of both groups are assumed to be equal
'''
Welch Two Sample t-test
data: OJ and VC
t = 1.9153, df = 55.309, p-value = 0.06063
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1710156 7.5710156
sample estimates:
mean of x mean of y
20.66333 16.96333
'''
t.test(OJ, VC, alternative = "greater")
'''
Welch Two Sample t-test
data: OJ and VC
t = 1.9153, df = 55.309, p-value = 0.03032
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.4682687 Inf
sample estimates:
mean of x mean of y
20.66333 16.96333
p-value < 0.05, so we can reject the hypothesis that the 2 means are equal; the mean of OJ > the mean of VC
'''
t.test(OJ, VC, paired = TRUE)
'''
Paired t-test
data: OJ and VC
t = 3.3026, df = 29, p-value = 0.00255
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.408659 5.991341
sample estimates:
mean of the differences
3.7
a small p-value indicates that H0 should be rejected
'''
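The vectors OJ and VC are not defined in the fragments above. The sketch below assumes they are the tooth lengths for the two supplement types in ToothGrowth, which matches the group means shown in the output, and runs the three variants of the test.

```r
X <- ToothGrowth
# Assumed definitions of OJ and VC: tooth length split by supplement type
OJ <- X$len[X$supp == "OJ"]
VC <- X$len[X$supp == "VC"]

welch   <- t.test(OJ, VC)                           # two-sided Welch test (the default)
greater <- t.test(OJ, VC, alternative = "greater")  # one-sided: mean(OJ) > mean(VC)
paired  <- t.test(OJ, VC, paired = TRUE)            # treats the i-th values as a pair

# The returned objects are lists, so individual results can be extracted
c(welch = welch$p.value, greater = greater$p.value, paired = paired$p.value)
```

The one-sided p-value is half of the two-sided one here, which is why the same data fail to reject H0 in the first test and reject it in the second.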
Linear regression
data("Orange")
head(Orange)
plot(Orange$age, Orange$circumference)
'''
Residuals:
Min 1Q Median 3Q Max
-46.310 -14.946 -0.076 19.697 45.111
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.399650 8.622660 2.018 0.0518 .
age 0.106770 0.008277 12.900 1.93e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
'''
#what to check: p-value < 0.05 => reject H0, there is a relationship
#R2 and adjusted R2 are different from 0
#confidence interval
Adding geom_smooth(method=lm, color='#2C3e50') to a ggplot by default also draws the
confidence interval in grey, in addition to the linear regression line.
new.dat=data.frame(age=1500)
head(new.dat)
'''
fit lwr upr
1 177.5551 164.8539 190.2564
'''
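The model call itself is missing from the fragments above. The sketch below assumes a simple regression of circumference on age, which matches the coefficient table, and uses predict() with a confidence interval for the new observation.

```r
data("Orange")

# Assumed model: circumference explained by age (matches the coefficient table above)
fit <- lm(circumference ~ age, data = Orange)
coef(fit)                                   # intercept and slope

# Prediction for a new tree age, with a 95% confidence interval for the mean
new.dat <- data.frame(age = 1500)
pred <- predict(fit, newdata = new.dat, interval = "confidence")
pred
```

The column name in new.dat has to match the predictor name used in the formula, otherwise predict() cannot find the variable.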
data(iris)
fit1 = lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=iris)
summary(fit1)
#2 Testing heteroscedasticity
#H0: hypothesis of constant error variance - the variance around the regression line is the same for all values
install.packages("car")
library(car)
ncvTest(fit)
Defining Class
Define class – setClass ( )
Create object – new ()
Reference member variable - @
Implement generic function – setMethod ()
Declare generic – setGeneric ()
print(num1) # a = 12
Example
#1. Defining the class employee
setClass("employee", representation(name="character", salary="numeric", union="logical"))
#2. Creating a new instance of the class - defining a new object of class employee
#new("ClassName", values for each of the slots defined for the class)
joe <- new("employee", name="Joe", salary = 55000, union=T)
joe
'''
An object of class "employee"
Slot "name":
[1] "Joe"
Slot "salary":
[1] 55000
Slot "union":
[1] TRUE
'''
#3. Checking the value of a slot of the class object - @ or slot()
joe@salary
slot(joe, "salary")
)
#7. Checking how the method works after overriding it for the employee object.
joe #Joe has a salary of 88000 and is in the union
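Steps 4 to 6 are missing from these notes. The sketch below shows one way to arrive at the final output: the slot update and the body of the show() method are assumptions, chosen so that printing joe gives the line quoted above.

```r
library(methods)   # S4 machinery; attached by default in interactive R

setClass("employee",
         representation(name = "character", salary = "numeric", union = "logical"))
joe <- new("employee", name = "Joe", salary = 55000, union = TRUE)

joe@salary <- 88000   # slots can be modified with @ (assumed step: the notes show 88000 later)

# Assumed implementation of show(), the method R calls when an S4 object is printed
setMethod("show", "employee", function(object) {
  inorout <- ifelse(object@union, "is", "is not")
  cat(object@name, "has a salary of", object@salary,
      "and", inorout, "in the union\n")
})

joe   # Joe has a salary of 88000 and is in the union
```

show() already exists as a generic, so setMethod() is enough here; setGeneric() is only needed when introducing a new generic function.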
NBA DATA
#Seasons
Seasons <- c("2005","2006","2007","2008","2009","2010","2011","2012","2013","2014")
#Players
Players <- c("KobeBryant","JoeJohnson","LeBronJames","CarmeloAnthony","DwightHoward",
"ChrisBosh","ChrisPaul","KevinDurant","DerrickRose","DwayneWade")
#Salaries
KobeBryant_Salary <- c(15946875,17718750,19490625,21262500,23034375,
24806250,25244493,27849149,30453805,23500000)
JoeJohnson_Salary <- c(12000000,12744189,13488377,14232567,14976754,
16324500,18038573,19752645,21466718,23180790)
LeBronJames_Salary <- c(4621800,5828090,13041250,14410581,15779912,
14500000,16022500,17545000,19067500,20644400)
CarmeloAnthony_Salary <- c(3713640,4694041,13041250,14410581,15779912,
17149243,18518574,19450000,22407474,22458000)
DwightHoward_Salary <- c(4493160,4806720,6061274,13758000,15202590,
16647180,18091770,19536360,20513178,21436271)
ChrisBosh_Salary <- c(3348000,4235220,12455000,14410581,15779912,
14500000,16022500,17545000,19067500,20644400)
ChrisPaul_Salary <- c(3144240,3380160,3615960,4574189,13520500,
14940153,16359805,17779458,18668431,20068563)
KevinDurant_Salary <- c(0,0,4171200,4484040,4796880,
6053663,15506632,16669630,17832627,18995624)
DerrickRose_Salary <- c(0,0,0,4822800,5184480,
5546160,6993708,16402500,17632688,18862875)
DwayneWade_Salary <- c(3031920,3841443,13041250,14410581,15779912,
14200000,15691000,17182000,18673000,15000000)
#Matrix
Salary <- rbind(KobeBryant_Salary, JoeJohnson_Salary, LeBronJames_Salary,
CarmeloAnthony_Salary, DwightHoward_Salary, ChrisBosh_Salary, ChrisPaul_Salary,
KevinDurant_Salary, DerrickRose_Salary, DwayneWade_Salary)
#rm() removes objects from the workspace; the vectors are no longer needed, as all the data are now in the matrix
rm(KobeBryant_Salary, JoeJohnson_Salary, CarmeloAnthony_Salary, DwightHoward_Salary,
ChrisBosh_Salary, LeBronJames_Salary, ChrisPaul_Salary, DerrickRose_Salary,
DwayneWade_Salary, KevinDurant_Salary)
#colnames(matrixName) <- ... is used to label the columns of a matrix
colnames(Salary) <- Seasons
#rownames() labels the matrix rows
rownames(Salary) <- Players
#Games
KobeBryant_G <- c(80,77,82,82,73,82,58,78,6,35)
JoeJohnson_G <- c(82,57,82,79,76,72,60,72,79,80)
LeBronJames_G <- c(79,78,75,81,76,79,62,76,77,69)
CarmeloAnthony_G <- c(80,65,77,66,69,77,55,67,77,40)
DwightHoward_G <- c(82,82,82,79,82,78,54,76,71,41)
ChrisBosh_G <- c(70,69,67,77,70,77,57,74,79,44)
ChrisPaul_G <- c(78,64,80,78,45,80,60,70,62,82)
KevinDurant_G <- c(35,35,80,74,82,78,66,81,81,27)
DerrickRose_G <- c(40,40,40,81,78,81,39,0,10,51)
DwayneWade_G <- c(75,51,51,79,77,76,49,69,54,62)
#Matrix
Games <- rbind(KobeBryant_G, JoeJohnson_G, LeBronJames_G, CarmeloAnthony_G,
DwightHoward_G, ChrisBosh_G, ChrisPaul_G, KevinDurant_G, DerrickRose_G, DwayneWade_G)
rm(KobeBryant_G, JoeJohnson_G, CarmeloAnthony_G, DwightHoward_G, ChrisBosh_G,
LeBronJames_G, ChrisPaul_G, DerrickRose_G, DwayneWade_G, KevinDurant_G)
colnames(Games) <- Seasons
rownames(Games) <- Players
#Minutes Played
KobeBryant_MP <- c(3277,3140,3192,2960,2835,2779,2232,3013,177,1207)
JoeJohnson_MP <- c(3340,2359,3343,3124,2886,2554,2127,2642,2575,2791)
LeBronJames_MP <- c(3361,3190,3027,3054,2966,3063,2326,2877,2902,2493)
CarmeloAnthony_MP <- c(2941,2486,2806,2277,2634,2751,1876,2482,2982,1428)
DwightHoward_MP <- c(3021,3023,3088,2821,2843,2935,2070,2722,2396,1223)
ChrisBosh_MP <- c(2751,2658,2425,2928,2526,2795,2007,2454,2531,1556)
ChrisPaul_MP <- c(2808,2353,3006,3002,1712,2880,2181,2335,2171,2857)
KevinDurant_MP <- c(1255,1255,2768,2885,3239,3038,2546,3119,3122,913)
DerrickRose_MP <- c(1168,1168,1168,3000,2871,3026,1375,0,311,1530)
DwayneWade_MP <- c(2892,1931,1954,3048,2792,2823,1625,2391,1775,1971)
#Matrix
MinutesPlayed <- rbind(KobeBryant_MP, JoeJohnson_MP, LeBronJames_MP, CarmeloAnthony_MP,
DwightHoward_MP, ChrisBosh_MP, ChrisPaul_MP, KevinDurant_MP, DerrickRose_MP,
DwayneWade_MP)
rm(KobeBryant_MP, JoeJohnson_MP, CarmeloAnthony_MP, DwightHoward_MP, ChrisBosh_MP,
LeBronJames_MP, ChrisPaul_MP, DerrickRose_MP, DwayneWade_MP, KevinDurant_MP)
colnames(MinutesPlayed) <- Seasons
rownames(MinutesPlayed) <- Players
#Field Goals
KobeBryant_FG <- c(978,813,775,800,716,740,574,738,31,266)
JoeJohnson_FG <- c(632,536,647,620,635,514,423,445,462,446)
LeBronJames_FG <- c(875,772,794,789,768,758,621,765,767,624)
CarmeloAnthony_FG <- c(756,691,728,535,688,684,441,669,743,358)
DwightHoward_FG <- c(468,526,583,560,510,619,416,470,473,251)
ChrisBosh_FG <- c(549,543,507,615,600,524,393,485,492,343)
ChrisPaul_FG <- c(407,381,630,631,314,430,425,412,406,568)
KevinDurant_FG <- c(306,306,587,661,794,711,643,731,849,238)
DerrickRose_FG <- c(208,208,208,574,672,711,302,0,58,338)
DwayneWade_FG <- c(699,472,439,854,719,692,416,569,415,509)
#Matrix
FieldGoals <- rbind(KobeBryant_FG, JoeJohnson_FG, LeBronJames_FG, CarmeloAnthony_FG,
DwightHoward_FG, ChrisBosh_FG, ChrisPaul_FG, KevinDurant_FG, DerrickRose_FG,
DwayneWade_FG)
rm(KobeBryant_FG, JoeJohnson_FG, LeBronJames_FG, CarmeloAnthony_FG, DwightHoward_FG,
ChrisBosh_FG, ChrisPaul_FG, KevinDurant_FG, DerrickRose_FG, DwayneWade_FG)
colnames(FieldGoals) <- Seasons
rownames(FieldGoals) <- Players
#Points
KobeBryant_PTS <- c(2832,2430,2323,2201,1970,2078,1616,2133,83,782)
JoeJohnson_PTS <- c(1653,1426,1779,1688,1619,1312,1129,1170,1245,1154)
LeBronJames_PTS <- c(2478,2132,2250,2304,2258,2111,1683,2036,2089,1743)
CarmeloAnthony_PTS <- c(2122,1881,1978,1504,1943,1970,1245,1920,2112,966)
DwightHoward_PTS <- c(1292,1443,1695,1624,1503,1784,1113,1296,1297,646)
ChrisBosh_PTS <- c(1572,1561,1496,1746,1678,1438,1025,1232,1281,928)
ChrisPaul_PTS <- c(1258,1104,1684,1781,841,1268,1189,1186,1185,1564)
KevinDurant_PTS <- c(903,903,1624,1871,2472,2161,1850,2280,2593,686)
DerrickRose_PTS <- c(597,597,597,1361,1619,2026,852,0,159,904)
DwayneWade_PTS <- c(2040,1397,1254,2386,2045,1941,1082,1463,1028,1331)
#Matrix
Points <- rbind(KobeBryant_PTS, JoeJohnson_PTS, LeBronJames_PTS, CarmeloAnthony_PTS,
DwightHoward_PTS, ChrisBosh_PTS, ChrisPaul_PTS, KevinDurant_PTS, DerrickRose_PTS,
DwayneWade_PTS)
rm(KobeBryant_PTS, JoeJohnson_PTS, LeBronJames_PTS, CarmeloAnthony_PTS,
DwightHoward_PTS, ChrisBosh_PTS, ChrisPaul_PTS, KevinDurant_PTS, DerrickRose_PTS,
DwayneWade_PTS)
colnames(Points) <- Seasons
rownames(Points) <- Players
#operations on matrices
FieldGoals / Games # element-wise: each value is divided by the value at the same
# position in the other matrix
round(FieldGoals / Games, 2)
round(MinutesPlayed / Games, 2)
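#A minimal sketch of the element-wise behaviour, using two small made-up matrices.
#The names fg and gm and all the numbers are hypothetical and are not part of the data set.

```r
# Two toy 2x2 matrices with the same dimensions (values are illustrative only)
fg <- matrix(c(10, 20, 30, 40), nrow = 2, byrow = TRUE)
gm <- matrix(c(2, 4, 5, 8), nrow = 2, byrow = TRUE)

# "/" works position by position: ratio[i, j] is fg[i, j] / gm[i, j]
ratio <- fg / gm
ratio

# round() is also applied to every element of the result
round(fg / 3, 2)
```

#Both matrices must have the same dimensions for this position-by-position division.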
#visualization in R using matplot
?matplot # plots the columns of one matrix against the columns of another
FieldGoals
#t() - function to transpose a matrix (rows become columns and columns become rows)
t(FieldGoals)
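#A small sketch of why t() is useful before matplot(): matplot() draws one line per column,
#so transposing makes each line correspond to one player. The matrix m and its values are
#made up for illustration only.

```r
# Toy matrix: 2 players (rows) x 3 seasons (columns); values are illustrative only
m <- rbind(PlayerA = c(10, 12, 15), PlayerB = c(8, 9, 14))
colnames(m) <- c("2005", "2006", "2007")

# t() swaps rows and columns, so seasons become rows and players become columns
tm <- t(m)
dim(tm)

# matplot() draws one line per column, so after t() each line is one player
matplot(tm, type = "b", pch = 15:16, col = 1:2,
        xlab = "Season index", ylab = "Value")
legend("topleft", legend = colnames(tm), pch = 15:16, col = 1:2)
```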
#Seasons
Seasons <- c("2005","2006","2007","2008","2009","2010","2011","2012","2013","2014")
#Players
Players <- c("KobeBryant","JoeJohnson","LeBronJames","CarmeloAnthony","DwightHoward",
"ChrisBosh","ChrisPaul","KevinDurant","DerrickRose","DwayneWade")
#Free Throws
KobeBryant_FT <- c(696,667,623,483,439,483,381,525,18,196)
JoeJohnson_FT <- c(261,235,316,299,220,195,158,132,159,141)
LeBronJames_FT <- c(601,489,549,594,593,503,387,403,439,375)
CarmeloAnthony_FT <- c(573,459,464,371,508,507,295,425,459,189)
DwightHoward_FT <- c(356,390,529,504,483,546,281,355,349,143)
ChrisBosh_FT <- c(474,463,472,504,470,384,229,241,223,179)
ChrisPaul_FT <- c(394,292,332,455,161,337,260,286,295,289)
KevinDurant_FT <- c(209,209,391,452,756,594,431,679,703,146)
DerrickRose_FT <- c(146,146,146,197,259,476,194,0,27,152)
DwayneWade_FT <- c(629,432,354,590,534,494,235,308,189,284)
#Matrix for free throws
FreeThrows <- rbind(KobeBryant_FT,
JoeJohnson_FT,
LeBronJames_FT,
CarmeloAnthony_FT,
DwightHoward_FT,
ChrisBosh_FT,
ChrisPaul_FT,
KevinDurant_FT,
DerrickRose_FT,
DwayneWade_FT)
FreeThrows
#Matrix
FreeThrowAttempts
Housing market
#------ HELPERS
#1. reading the data: Perl must be installed and the path to it must be provided
#at the beginning of the xls there is a description of the file; BOROUGH is one of the
#column names, so nothing that appears before the column headers is imported
#2. gsub() - replaces every match of a pattern in a string
x <- "xxxPawelxxx"
gsub("xxx", "zzz", x) # "zzzPawelzzz"
#3. as.numeric() - we use this function on a data frame when we want a column to be
#treated as numeric rather than e.g. factor or character
#-------
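#A short sketch of one thing to watch for when as.numeric() is applied to a column stored
#as a factor: the function returns the internal level codes, not the values that were
#written. The data frame df and its values are made up for illustration.

```r
# Toy data frame whose price column is stored as a factor (values are illustrative only)
df <- data.frame(price = factor(c("100", "2500", "30")))

# Calling as.numeric() directly on a factor returns the internal level codes
as.numeric(df$price)

# Converting to character first recovers the numbers that were actually written
df$price_n <- as.numeric(as.character(df$price))
df$price_n
```

#For a character column, as.numeric() can be applied directly.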
install.packages("gdata")
library("gdata")
getwd()
setwd("C:\\Users\\Pc\\Desktop\\R files\\pliki")
#removing the $ sign from the column and converting it to numeric; SALE.PRICE.N is the new
#column that we add to our data frame
#the pattern replaces every character other than a digit with ""
bk$SALE.PRICE.N <- as.numeric(gsub("[^[:digit:]]", "", bk$SALE.PRICE))
str(bk)
#$ SALE.PRICE.N : num 2214693 1654656 1069162 1374637 1649565 ...
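#A minimal sketch of the same clean-up on a few made-up price strings. The vector prices is
#hypothetical, and the actual formatting of the SALE.PRICE column may differ.

```r
# Made-up price strings with a currency sign and thousands separators
prices <- c("$1,250,000", "$0", "$399,000")

# [^[:digit:]] matches any single character that is not a digit;
# replacing each match with "" leaves only the digits
digits_only <- gsub("[^[:digit:]]", "", prices)
digits_only

# as.numeric() then turns the cleaned strings into numbers
prices_n <- as.numeric(digits_only)
prices_n
```

#Note that this pattern also strips a decimal point, so it is only safe for whole-number
#prices.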
#exploring the data to make sure nothing weird is going on with the sale prices
attach(bk)
#The database is attached to the R search path. This means that the database is
#searched by R when evaluating a variable, so objects in the database can be accessed by
#simply giving their names.
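#A small sketch of what attach() changes, with with() shown as an alternative that leaves
#the search path untouched. The data frame homes and its values are made up for
#illustration.

```r
# Toy data frame (values are illustrative only)
homes <- data.frame(price = c(100, 200, 300), sqft = c(10, 20, 30))

# After attach(), columns can be referenced by their bare names
attach(homes)
mean_attached <- mean(price)
detach(homes)  # remove the data frame from the search path again

# with() evaluates the expression inside the data frame without attaching it
mean_with <- with(homes, mean(price))

c(mean_attached, mean_with)
```

#attach() works on a copy, so later changes to the data frame are not visible through the
#bare names until it is detached and attached again.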
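#filtering the data to keep only actual sales (where sale price >0)
#note: the column names are lowercase from here on, which assumes they were lower-cased
#earlier, e.g. with names(bk) <- tolower(names(bk))
bk.sale <- bk[bk$sale.price.n!=0, ] #we created a new data frame bk.sale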
plot(bk.sale$gross.square.feet, bk.sale$sale.price.n)
plot(log(bk.sale$gross.square.feet), log(bk.sale$sale.price.n))
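#A minimal sketch of the filter-then-log-plot steps on made-up data. The data frame sales
#and its values are hypothetical; only the column names mirror the ones used above.

```r
# Toy data frame; a price of 0 stands for a record that was not a real sale
sales <- data.frame(sale.price.n = c(0, 250000, 900000, 0, 120000),
                    gross.square.feet = c(1200, 1500, 4000, 800, 1000))

# Logical condition in the row position, empty column position = keep all columns
real_sales <- sales[sales$sale.price.n != 0, ]
nrow(real_sales)

# Taking logs spreads out values that span several orders of magnitude
plot(log(real_sales$gross.square.feet), log(real_sales$sale.price.n),
     xlab = "log(gross square feet)", ylab = "log(sale price)")
```

#Rows with a value of 0 become -Inf after log(), which is one reason to filter them out
#before plotting.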
plot(log(bk.homes$gross.square.feet), log(bk.homes$sale.price.n))
#listing the homes sold for less than 100000, ordered by sale price
bk.homes[which(bk.homes$sale.price.n<100000),][
order(bk.homes[which(bk.homes$sale.price.n<100000),]$sale.price.n),]
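#The same subset-then-sort idea, split into steps on made-up data so that each part is
#easier to follow. The names homes, cheap_rows, cheap and cheap_sorted are hypothetical.

```r
# Toy data frame (values are illustrative only)
homes <- data.frame(id = 1:5,
                    sale.price.n = c(500000, 20000, 75000, 1000, 300000))

# which() returns the row positions where the condition is TRUE
cheap_rows <- which(homes$sale.price.n < 100000)
cheap <- homes[cheap_rows, ]

# order() returns the permutation that sorts the prices in ascending order
cheap_sorted <- cheap[order(cheap$sale.price.n), ]
cheap_sorted
```

#which() skips NA values, whereas indexing with the bare logical condition would return a
#row of NAs for each missing price.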
#removing outliers that seem like they weren't actual sales
plot(log(bk.homes$gross.square.feet),log(bk.homes$sale.price.n))
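#One possible way to flag and drop such outliers, sketched on made-up data. The cut-off of
#log(price) <= 5 is an assumption chosen for illustration, and the names homes, cutoff and
#homes_clean are hypothetical; the threshold should be picked after looking at the plot.

```r
# Toy data; the very low prices stand for transfers that were probably not real sales
homes <- data.frame(sale.price.n = c(450000, 10, 620000, 1, 380000),
                    gross.square.feet = c(1500, 1400, 2200, 1800, 1300))

# Assumed cut-off: treat log(price) <= 5 (a price of roughly 148 or less) as an outlier
cutoff <- 5
homes$outlier <- log(homes$sale.price.n) <= cutoff

# Keep only the rows that were not flagged
homes_clean <- homes[!homes$outlier, ]
homes_clean$sale.price.n
```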