Data Transformation Cheatsheet

Data Transformation with dplyr Cheat Sheet

dplyr functions work with pipes and expect tidy data. In tidy data:
    Each variable is in its own column
    Each observation, or case, is in its own row
With pipes, x %>% f(y) becomes f(x, y).

Manipulate Cases

Extract Cases

Row functions return a subset of rows as a new table. Use a variant that ends in _ for non-standard evaluation friendly code.

filter(.data, ...)
    Extract rows that meet logical criteria. Also filter_().
    filter(iris, Sepal.Length > 7)

distinct(.data, ..., .keep_all = FALSE)
    Remove rows with duplicate values. Also distinct_().
    distinct(iris, Species)

Manipulate Variables

Extract Variables

Column functions return a set of columns as a new table. Use a variant that ends in _ for non-standard evaluation friendly code.

select(.data, ...)
    Extract columns by name. Also select_if().
    select(iris, Sepal.Length, Species)

Use these helpers with select(), e.g. select(iris, starts_with("Sepal")):
    contains(match)
    ends_with(match)
    matches(match)
    num_range(prefix, range)
    one_of(...)
    starts_with(match)
    :, e.g. mpg:cyl
    -, e.g. -Species
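
The pipe rule and the extraction functions above can be tried together. The following is a minimal sketch, assuming dplyr is installed and using the built-in iris data; the variable names (piped, direct, long_sepals, species) are illustrative only and do not come from the sheet.

```r
library(dplyr)

# x %>% f(y) is the same call as f(x, y)
piped  <- iris %>% filter(Sepal.Length > 7)
direct <- filter(iris, Sepal.Length > 7)
same_result <- identical(piped, direct)

# Chain a row function and a column function in one pipeline
long_sepals <- iris %>%
  filter(Sepal.Length > 7) %>%
  select(starts_with("Sepal"), Species)

# One row per distinct value of Species
species <- distinct(iris, Species)
```

Writing the pipeline top to bottom keeps each step on its own line, which is usually easier to read than nesting select(filter(...)).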
Extract Cases (continued)

sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame())
    Randomly select fraction of rows.
    sample_frac(iris, 0.5, replace = TRUE)

sample_n(tbl, size, replace = FALSE, weight = NULL, .env = parent.frame())
    Randomly select size rows.
    sample_n(iris, 10, replace = TRUE)

slice(.data, ...)
    Select rows by position. Also slice_().
    slice(iris, 10:15)

Summarise Cases

These apply summary functions to columns to create a new table. Summary functions take vectors as input and return one value (see back).

summarise(.data, ...)
    Compute table of summaries. Also summarise_().
    summarise(mtcars, avg = mean(mpg))

count(x, ..., wt = NULL, sort = FALSE)
    Count number of rows in each group defined by the variables in ... Also tally().
    count(iris, Species)

Make New Variables

These apply vectorized functions to columns. Vectorized funs take vectors as input and return vectors of the same length as output (see back).

mutate(.data, ...)
    Compute new column(s).
    mutate(mtcars, gpm = 1/mpg)

transmute(.data, ...)
    Compute new column(s), drop others.
    transmute(mtcars, gpm = 1/mpg)
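
The difference between a summary function (many values in, one value out) and a vectorized function (one output per input) shows up in the shape of the result. A minimal sketch, assuming dplyr is installed and using the built-in mtcars and iris data; the result names are illustrative.

```r
library(dplyr)

# Summary function: the whole mpg column collapses to a single value
avg <- summarise(mtcars, avg = mean(mpg))

# Vectorized function: one output value per input row, other columns kept
with_gpm <- mutate(mtcars, gpm = 1 / mpg)

# transmute() keeps only the newly computed column(s)
only_gpm <- transmute(mtcars, gpm = 1 / mpg)

# count() returns one row per group with the row count in column n
per_species <- count(iris, Species)

dim(avg)       # one row, one column
dim(with_gpm)  # same number of rows as mtcars, one extra column
```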
Make New Variables (continued)

mutate_all(.tbl, .funs, ...)
    Apply funs to every column. Use with funs().
    mutate_all(faithful, funs(log(.), log2(.)))

mutate_at(.tbl, .cols, .funs, ...)
    Apply funs to specific columns. Use with funs(), vars() and the helper functions for select().
    mutate_at(iris, vars(-Species), funs(log(.)))

mutate_if(.tbl, .predicate, .funs, ...)
    Apply funs to all columns of one type. Use with funs().
    mutate_if(iris, is.numeric, funs(log(.)))

add_column(.data, ..., .before = NULL, .after = NULL)
    Add new column(s).
    add_column(mtcars, new = 1:32)

rename(.data, ...)
    Rename columns.
    rename(iris, Length = Sepal.Length)

Summarise Cases: Variations
    summarise_all() - Apply funs to every column.
    summarise_at() - Apply funs to specific columns.
    summarise_if() - Apply funs to all cols of one type.

Extract Cases (continued)

top_n(x, n, wt)
    Select and order top n entries (by group if grouped data).
    top_n(iris, 5, Sepal.Width)

Logical and boolean operators to use with filter()
    <    <=    is.na()     %in%    |    xor()
    >    >=    !is.na()    !       &
See ?base::Logic and ?Comparison for help.

Group Cases

Use group_by() to create a "grouped" copy of a table. dplyr functions will manipulate each "group" separately and then combine the results.

    mtcars %>%
      group_by(cyl) %>%
      summarise(avg = mean(mpg))

group_by(.data, ..., add = FALSE)
    Returns copy of table grouped by ...
    g_iris <- group_by(iris, Species)

ungroup(x, ...)
    Returns ungrouped copy of table.
    ungroup(g_iris)

Arrange Cases

arrange(.data, ...)
    Order rows by values of a column (low to high), use with desc() to order from high to low.
    arrange(mtcars, mpg)
    arrange(mtcars, desc(mpg))

Add Cases

add_row(.data, ..., .before = NULL, .after = NULL)
    Add one or more rows to a table.
    add_row(faithful, eruptions = 1, waiting = 1)
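
Grouping, summarising and arranging are often chained. The sketch below assumes dplyr is installed and uses the built-in mtcars and iris data; by_cyl is an illustrative name, and is_grouped_df() is used only to show the effect of group_by() and ungroup().

```r
library(dplyr)

# One summary row per cylinder count, ordered from high to low average mpg
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg = mean(mpg)) %>%
  arrange(desc(avg))

# group_by() returns a grouped copy; ungroup() removes the grouping again
g_iris <- group_by(iris, Species)
is_grouped_df(g_iris)           # TRUE
is_grouped_df(ungroup(g_iris))  # FALSE
```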
RStudio is a trademark of RStudio, Inc. CC BY RStudio info@rstudio.com 844-448-1212 rstudio.com Learn more with browseVignettes(package = c("dplyr", "tibble")) dplyr 0.5.0 tibble 1.2.0 Updated: 2017-01
Vectorized Functions
to use with mutate()

mutate() and transmute() apply vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.

Offsets
    dplyr::lag() - Offset elements by 1
    dplyr::lead() - Offset elements by -1

Cumulative Aggregates
    dplyr::cumall() - Cumulative all()
    dplyr::cumany() - Cumulative any()
    cummax() - Cumulative max()
    dplyr::cummean() - Cumulative mean()
    cummin() - Cumulative min()
    cumprod() - Cumulative prod()
    cumsum() - Cumulative sum()

Rankings
    dplyr::cume_dist() - Proportion of all values <=
    dplyr::dense_rank() - rank with ties = min, no gaps
    dplyr::min_rank() - rank with ties = min
    dplyr::ntile() - bins into n bins
    dplyr::percent_rank() - min_rank scaled to [0,1]
    dplyr::row_number() - rank with ties = "first"

Math
    +, -, *, /, ^, %/%, %% - arithmetic ops
    log(), log2(), log10() - logs
    <, <=, >, >=, !=, == - logical comparisons

Misc
    dplyr::between() - x >= left & x <= right
    dplyr::case_when() - multi-case if_else()
    dplyr::coalesce() - first non-NA values by element across a set of vectors
    dplyr::if_else() - element-wise if() + else()
    dplyr::na_if() - replace specific values with NA
    pmax() - element-wise max()
    pmin() - element-wise min()
    dplyr::recode() - Vectorized switch()
    dplyr::recode_factor() - Vectorized switch() for factors

Summary Functions
to use with summarise()

summarise() applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output.

Counts
    dplyr::n() - number of values/rows
    dplyr::n_distinct() - # of uniques
    sum(!is.na()) - # of non-NAs

Location
    mean() - mean, also mean(!is.na())
    median() - median

Logicals
    mean() - Proportion of TRUEs
    sum() - # of TRUEs

Position/Order
    dplyr::first() - first value
    dplyr::last() - last value
    dplyr::nth() - value in nth location of vector

Rank
    quantile() - nth quantile
    min() - minimum value
    max() - maximum value

Spread
    IQR() - Inter-Quartile Range
    mad() - median absolute deviation
    sd() - standard deviation
    var() - variance

Row names

Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.

rownames_to_column()
    Move row names into col.
    a <- rownames_to_column(iris, var = "C")

column_to_rownames()
    Move col in row names.
    column_to_rownames(a, var = "C")

Also has_rownames(), remove_rownames()

Combine Tables

Combine Variables

Example tables:
    x              y
    A  B  C        A  B  D
    a  t  1        a  t  3
    b  u  2        b  u  2
    c  v  3        d  w  1

Use bind_cols() to paste tables beside each other as they are.

bind_cols(...)
    Returns tables placed side by side as a single table. BE SURE THAT ROWS ALIGN.

Use a "Mutating Join" to join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables.

left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
    Join matching values from y to x.
        A  B  C  D
        a  t  1  3
        b  u  2  2
        c  v  3  NA

right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
    Join matching values from x to y.
        A  B  C   D
        a  t  1   3
        b  u  2   2
        d  w  NA  1

inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
    Join data. Retain only rows with matches.
        A  B  C  D
        a  t  1  3
        b  u  2  2

full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
    Join data. Retain all values, all rows.
        A  B  C   D
        a  t  1   3
        b  u  2   2
        c  v  3   NA
        d  w  NA  1

Use by = c("col1", "col2") to specify the column(s) to match on.
    left_join(x, y, by = "A")
        A  B.x  C  B.y  D
        a  t    1  t    3
        b  u    2  u    2
        c  v    3  NA   NA

Use a named vector, by = c("col1" = "col2"), to match on columns with different names in each data set.
    left_join(x, y, by = c("C" = "D"))
        A.x  B.x  C  A.y  B.y
        a    t    1  d    w
        b    u    2  b    u
        c    v    3  a    t

Use suffix to specify suffix to give to duplicate column names.
    left_join(x, y, by = c("C" = "D"), suffix = c("1", "2"))
        A1  B1  C  A2  B2
        a   t   1  d   w
        b   u   2  b   u
        c   v   3  a   t

Combine Cases

Example tables:
    x              z
    A  B  C        A  B  C
    a  t  1        c  v  3
    b  u  2        d  w  4
    c  v  3

Use bind_rows() to paste tables below each other as they are.

bind_rows(..., .id = NULL)
    Returns tables one on top of the other as a single table. Set .id to a column name to add a column of the original table names (as pictured).
        DF  A  B  C
        x   a  t  1
        x   b  u  2
        x   c  v  3
        z   c  v  3
        z   d  w  4

intersect(x, y, ...)
    Rows that appear in both x and z.
        A  B  C
        c  v  3

setdiff(x, y, ...)
    Rows that appear in x but not z.
        A  B  C
        a  t  1
        b  u  2

union(x, y, ...)
    Rows that appear in x or z. (Duplicates removed). union_all() retains duplicates.
        A  B  C
        a  t  1
        b  u  2
        c  v  3
        d  w  4

Use setequal() to test whether two data sets contain the exact same rows (in any order).

Extract Rows

Use a "Filtering Join" to filter one table against the rows of another. The results below use x and y from Combine Variables.

semi_join(x, y, by = NULL, ...)
    Return rows of x that have a match in y. USEFUL TO SEE WHAT WILL BE JOINED.
        A  B  C
        a  t  1
        b  u  2

anti_join(x, y, by = NULL, ...)
    Return rows of x that do not have a match in y. USEFUL TO SEE WHAT WILL NOT BE JOINED.
        A  B  C
        c  v  3
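
The mutating and filtering joins can be compared side by side on the x and y example tables from this page. This is a minimal sketch, assuming dplyr is installed; x and y are rebuilt by hand to mirror the pictured tables, and by_cols and the result names are illustrative. The join columns are passed explicitly rather than relying on the by = NULL default.

```r
library(dplyr)

# Rebuild the example tables pictured in the sheet
x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3,
                stringsAsFactors = FALSE)
y <- data.frame(A = c("a", "b", "d"), B = c("t", "u", "w"), D = c(3, 2, 1),
                stringsAsFactors = FALSE)

by_cols <- c("A", "B")

# Mutating joins add columns from y to x
left  <- left_join(x, y, by = by_cols)   # keeps every row of x
inner <- inner_join(x, y, by = by_cols)  # keeps only rows with a match
full  <- full_join(x, y, by = by_cols)   # keeps every row of x and y

# Filtering joins keep or drop rows of x without adding columns
matched   <- semi_join(x, y, by = by_cols)  # rows of x that will be joined
unmatched <- anti_join(x, y, by = by_cols)  # rows of x that will not be joined
```

Running anti_join() before a mutating join is a cheap way to spot keys that will produce NA values in the joined columns.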
