Tips and tricks in R captured in the last 6 months of work at Mazama Science. Most of this is not rocket science, but some of it took research to figure out and I’d like to keep it in my mind.
Basic Data Structures
Variables
foo <- "Hello World!"
Note the strange assignment operator “<-”. The equal sign works as well, but by convention is only used with function parameters.
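A minimal sketch of why that convention exists, assuming a fresh session. The variable names (bar, m1, m2, y) are made up for illustration: at the top level the two operators behave the same, but inside a function call `=` names an argument while `<-` assigns a variable as a side effect.

```r
# At the top level, `<-` and `=` both assign.
foo <- "Hello World!"
bar = "Hello World!"
identical(foo, bar)          # TRUE

# Inside a call, `=` binds the argument named `x`; no variable is created.
m1 <- mean(x = c(1, 2, 3))

# Inside a call, `<-` assigns `y` in the calling environment and then
# passes its value along as the first argument.
m2 <- mean(y <- c(1, 2, 3))
exists("y")                  # TRUE

m1 == m2                     # TRUE
```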
Vectors
Basic unit of data in R. Pretty much the same as a Python list or an array in Perl (without the semantic stupidity). Vectors are ordered and must contain the same type of data. Elements in a vector can be accessed by index location, but using a negative index means “everything but that element”. Oh yeah, and R is not zero-indexed. Trying to access the [0] element yields no joy: you get an empty vector back rather than an error.
> foo <- c(1,2,3,4)
> foo[3] == 3
# TRUE
> foo[-1] == c(2,3,4)
# TRUE TRUE TRUE
> foo[-4] == c(1,2,3)
# TRUE TRUE TRUE
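A short sketch of a few vector behaviours worth remembering. The name `mixed` is just an illustration. Comparisons are element-wise, index 0 quietly returns nothing, and mixing types coerces everything to one type.

```r
foo <- c(1, 2, 3, 4)

# Comparison is element-wise, so comparing two vectors gives a vector back.
foo[-1] == c(2, 3, 4)            # TRUE TRUE TRUE

# all() or identical() collapses the comparison to a single TRUE/FALSE.
all(foo[-1] == c(2, 3, 4))       # TRUE
identical(foo[-1], c(2, 3, 4))   # TRUE

# Index 0 is not an error; it returns an empty vector.
length(foo[0])                   # 0

# Mixing types coerces every element to the most general type.
mixed <- c(1, "two", TRUE)
class(mixed)                     # "character"
```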
Lists
R lists are like Python dictionaries, or hashes in Perl, but with the important distinction that they retain their ordering and can be indexed by position as well as by name. Lists can also store multiple data types.
1. Create a list using the list() function:
> contact_info <- list("first_name" = "Roger",
"middle_initial" = "T",
"last_name" = "Shrubber",
"address" = "123 Jump St.",
"age" = as.integer(100))
2. Access 1st element by index position (NOTE that this returns a list):
> contact_info[1]
$first_name = 'Roger'
3. Access list element by name (this also returns a list):
> contact_info["first_name"]
$first_name = 'Roger'
4. Access first 3 elements in the list:
> contact_info[1:3]
$first_name
'Roger'
$middle_initial
'T'
$last_name
'Shrubber'
5. Access just the values, without their names (keys)
unlist(contact_info[1:3], use.names = FALSE)
'Roger' 'T' 'Shrubber'
6. Access just the value of an element using the $<name> syntax
contact_info$address
'123 Jump St.'
Lists are very useful as lookup tables and as catch-all containers for random crap. Lists are slower than vectors though, so should be used with discretion.
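Here is a rough sketch of the lookup-table idea. The `county_codes` list and its values are invented for the example. The main thing to notice is the difference between single brackets, which return a sub-list, and double brackets, which return the value itself.

```r
# Hypothetical lookup table mapping county names to country codes.
county_codes <- list("Devon" = "ENG", "Surrey" = "ENG", "Fife" = "SCT")

# Single brackets return a sub-list; double brackets return the value.
class(county_codes["Devon"])      # "list"
class(county_codes[["Devon"]])    # "character"

# Use [[ ]] when the key is held in a variable ($ needs a literal name).
key <- "Fife"
county_codes[[key]]               # "SCT"

# A missing key in a list returns NULL rather than raising an error.
is.null(county_codes[["Kent"]])   # TRUE

# The list keeps the order in which its elements were defined.
names(county_codes)               # "Devon" "Surrey" "Fife"
```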
Matrices
R implements a 2-dimensional array using the matrix() function. Higher dimension arrays are possible as well, but since I haven’t used them yet, I’ll stick with what I know.
1. Create a 3 x 3 matrix using a sequence from [1,9]:
> threesome <- matrix(seq(1,9),
nrow = 3,
ncol = 3,
byrow = TRUE)
# yields
1 2 3
4 5 6
7 8 9
2. Access row 1 of matrix:
> threesome[1,]
# yields
1 2 3
3. Access column 1 of matrix:
threesome[,1]
# yields
1 4 7
4. Access a "window" of values. [row_start:row_end, col_start:col_end]
threesome[2:3, 1:2]
# yields
4 5
7 8
# NOTE: Setting "byrow = FALSE" when you create the matrix yields this:
1 4 7
2 5 8
3 6 9
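A small sketch to tie the matrix examples together. The names `by_row` and `by_col` are mine. It shows that the two fill orders are transposes of each other, and that pulling out a single row drops down to a plain vector unless you ask it not to.

```r
by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
by_col <- matrix(1:9, nrow = 3, ncol = 3)   # byrow = FALSE is the default

# The two fill orders are transposes of each other.
identical(by_row, t(by_col))     # TRUE

dim(by_row)                      # 3 3

# Selecting a single row drops down to a plain vector ...
is.matrix(by_row[1, ])           # FALSE
# ... unless drop = FALSE is passed to keep the 1 x 3 matrix shape.
dim(by_row[1, , drop = FALSE])   # 1 3

# The window [2:3, 1:2] is rows 2 to 3 of columns 1 to 2.
window <- by_row[2:3, 1:2]
window[1, ]                      # 4 5
window[2, ]                      # 7 8
```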
Dataframes
Dataframes are probably the reason you started using R in the first place. They are the type of data structure that you seem to encounter the most when working with packages.
first_names <- c("Fred", "George", "Ronald", "Harry")
last_names <- c("Weasely", "Weasely", "Weasely", "Potter")
addresses <- c("27 Foobar Way", "27 Foobar Way", "27 Foobar Way", "4 Privet Dr")
counties <- c("Devon", "Devon", "Devon", "Surrey")
ages <- as.integer(c(17, 17, 15, 15))
potterTrivia <- data.frame("firstName" = first_names,
"lastName" = last_names,
"address" = addresses,
"county" = counties,
"age" = ages,
stringsAsFactors = FALSE)
# yields
firstName lastName address county age
<chr> <chr> <chr> <chr> <int>
1 Fred Weasely 27 Foobar Way Devon 17
2 George Weasely 27 Foobar Way Devon 17
3 Ronald Weasely 27 Foobar Way Devon 15
4 Harry Potter 4 Privet Dr Surrey 15
Like everything else, R dataframes are indexed and data inside them can be accessed by their index position.
- By column:
potterTrivia[1]
# yields
firstName
<chr>
Fred
George
Ronald
Harry
- By row:
potterTrivia[1,]
# yields
firstName lastName address county age
<chr> <chr> <chr> <chr> <int>
1 Fred Weasely 27 Foobar Way Devon 17
- By range of rows:
potterTrivia[1:3,]
# yields
firstName lastName address county age
<chr> <chr> <chr> <chr> <int>
1 Fred Weasely 27 Foobar Way Devon 17
2 George Weasely 27 Foobar Way Devon 17
3 Ronald Weasely 27 Foobar Way Devon 15
- By range of rows and columns
potterTrivia[1:3, 1:2]
# yields
firstName lastName
<chr> <chr>
1 Fred Weasely
2 George Weasely
3 Ronald Weasely
The column names of a dataframe can be viewed with the names() function, and $column_name can be used to extract a single column from the dataframe as a vector.
names(potterTrivia)
# yields
'firstName' 'lastName' 'address' 'county' 'age'
potterTrivia$firstName
# yields
'Fred' 'George' 'Ronald' 'Harry'
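A quick sketch of the different ways to pull out a column, using a cut-down dataframe (`df`) that I made up so the snippet runs on its own. Single brackets give back a dataframe, while $ and double brackets give back the underlying vector.

```r
# Hypothetical two-row dataframe for illustration.
df <- data.frame(firstName = c("Fred", "Harry"),
                 age = as.integer(c(17, 15)),
                 stringsAsFactors = FALSE)

# Single brackets keep the dataframe wrapper.
is.data.frame(df["firstName"])               # TRUE

# $ and [[ ]] both return the column as a plain vector.
is.character(df$firstName)                   # TRUE
identical(df$firstName, df[["firstName"]])   # TRUE

# Row/column indexing with one column also drops to a vector.
identical(df$firstName, df[, "firstName"])   # TRUE

# names() can also be assigned to, which renames columns.
names(df)[2] <- "years"
names(df)                                    # "firstName" "years"
```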
In many ways, a dataframe acts and feels like a database table. So it should come as no surprise that Pandas, which was developed to provide R-like functionality in Python, is described by some as “an in-memory nosql database that has sql-like constructs”. I think similar things could be said about the R dataframe.
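To make the database-table comparison a bit more concrete, here is a rough sketch of a few SQL-ish operations in base R. The dataframe is rebuilt (minus the address column) so the snippet stands alone, the SQL in the comments is only an analogy, and the result names (devon_over_15, ordered, avg_age) are mine.

```r
# Rebuild a smaller version of the dataframe so the sketch runs on its own.
potterTrivia <- data.frame(
  firstName = c("Fred", "George", "Ronald", "Harry"),
  lastName  = c("Weasely", "Weasely", "Weasely", "Potter"),
  county    = c("Devon", "Devon", "Devon", "Surrey"),
  age       = as.integer(c(17, 17, 15, 15)),
  stringsAsFactors = FALSE
)

# SELECT firstName, age FROM potterTrivia WHERE county = 'Devon' AND age > 15
keep <- potterTrivia$county == "Devon" & potterTrivia$age > 15
devon_over_15 <- potterTrivia[keep, c("firstName", "age")]
devon_over_15$firstName        # "Fred" "George"

# SELECT * FROM potterTrivia ORDER BY age, firstName
ordered <- potterTrivia[order(potterTrivia$age, potterTrivia$firstName), ]
ordered$firstName              # "Harry" "Ronald" "Fred" "George"

# SELECT county, COUNT(*) FROM potterTrivia GROUP BY county
table(potterTrivia$county)

# SELECT county, AVG(age) FROM potterTrivia GROUP BY county
avg_age <- aggregate(age ~ county, data = potterTrivia, FUN = mean)
avg_age
```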
Written with StackEdit.