Thursday, February 6, 2020

R Basics

Tips and tricks in R captured in the last 6 months of work at Mazama Science. Most of this is not rocket science, but some of it took research to figure out and I’d like to keep it in my mind.

Basic Data Structures

Variables

foo <- "Hello World!"

Note the strange assignment operator “<-”. The equal sign works as well, but by convention is only used with function parameters.
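
A quick sketch of that convention. The names bar and total are made up for illustration and are not part of the original notes:

```r
# `<-` is the conventional assignment operator
foo <- "Hello World!"

# `=` also assigns at the top level, but many style guides discourage it
bar = "Hello again!"

# Inside a function call, `=` names an argument instead of assigning
total <- sum(c(1, 2, NA), na.rm = TRUE)

total            # 3
exists("na.rm")  # FALSE: no variable called na.rm was created
```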

Vectors

The basic unit of data in R. Pretty much the same as a Python list or an array in Perl (without the semantic stupidity), except that every element must be of the same type. Vectors are ordered, and elements can be accessed by index position; a negative index means “everything but that element”. Oh yeah, and R is not zero-indexed: indexing starts at 1. Trying to access the [0] element yields no joy (you get back an empty vector instead of an error).

> foo <- c(1,2,3,4)

> foo[3] == 3
# TRUE

> foo[-1] == c(2,3,4)
# TRUE TRUE TRUE

> foo[-4] == c(1,2,3)
# TRUE TRUE TRUE
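
A few more vector behaviours worth having on hand, as a small sketch. The mixed vector is an invented example:

```r
foo <- c(1, 2, 3, 4)

# Comparisons are element-wise, so use identical() or all() for a single answer
identical(foo[-1], c(2, 3, 4))   # TRUE
all(foo[-4] == c(1, 2, 3))       # TRUE

# Several elements can be dropped at once
foo[-c(1, 4)]                    # 2 3

# Index 0 is not an error; it just returns an empty vector
length(foo[0])                   # 0

# Mixing types silently coerces everything to one common type
mixed <- c(1, "two", TRUE)
class(mixed)                     # "character"
```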

Lists

R lists are like Python dictionaries, or hashes in Perl, but with the important distinction that they are ordered and can be indexed by position as well as by name. Lists can also store multiple data types.

1. Create a list using the list() function:
> contact_info <- list("first_name" = "Roger",
                     "middle_initial" = "T",
                     "last_name" = "Shrubber",
                     "address" = "123 Jump St.",
                     "age" = as.integer(100))

2. Access 1st element by index position (NOTE that this returns a list):
> contact_info[1]

$first_name = 'Roger'

3. Access list element by name (this also returns a list):
> contact_info["first_name"]

$first_name = 'Roger'

4. Access first 3 elements in the list:
> contact_info[1:3]

$first_name
'Roger'
$middle_initial
'T'
$last_name
'Shrubber'

5. Access just the values, without their names (keys):
> unlist(contact_info[1:3], use.names = FALSE)

'Roger' 'T' 'Shrubber'

6. Access just the value of an element using the $<name> syntax:
> contact_info$address

'123 Jump St.'

Lists are very useful as lookup tables and as catch-all containers for random crap. Lists are slower than vectors though, so should be used with discretion.
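
The difference between single and double brackets, and the lookup-table use, can be sketched like this. The state_names table and the code variable are hypothetical, purely for illustration:

```r
contact_info <- list(first_name = "Roger", age = 100L)

# Single brackets return a (sub)list; double brackets return the element itself
class(contact_info["first_name"])     # "list"
class(contact_info[["first_name"]])   # "character"

# A list as a lookup table (hypothetical data)
state_names <- list(WA = "Washington", OR = "Oregon")
code <- "OR"
state_names[[code]]                   # "Oregon"

# Looking up a name that is not in the list returns NULL rather than an error
is.null(state_names[["CA"]])          # TRUE
```

Double brackets are the ones to reach for when the key is held in a variable, since `$` only takes a literal name.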

Matrices

R implements a 2-dimensional array using the matrix() function. Higher dimension arrays are possible as well, but since I haven’t used them yet, I’ll stick with what I know.

1. Create a 3 x 3 matrix using a sequence from [1,9]:
> threesome <- matrix(seq(1,9),
                    nrow = 3,
                    ncol = 3,
                    byrow = TRUE)
# yields
1	2	3
4	5	6
7	8	9

2. Access row 1 of matrix:
> threesome[1,]

# yields
1 2 3

3. Access column 1 of matrix:
threesome[,1]

# yields
1 4 7

4. Access a "window" of values using [row_start:row_end, col_start:col_end]:
threesome[2:3, 1:2]

# yields
4	5
7	8

# NOTE: Setting "byrow = FALSE" when you create the matrix yields this:
1	4	7
2	5	8
3	6	9
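
A short sketch of a few other things that come up with matrices. The name m is arbitrary, and drop = FALSE is a standard argument to the `[` operator:

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

dim(m)       # 3 3
m[2, 3]      # 6 (row 2, column 3)

# Selecting a single row drops the result to a plain vector by default
is.matrix(m[1, ])                  # FALSE
# drop = FALSE keeps it as a 1 x 3 matrix
dim(m[1, , drop = FALSE])          # 1 3

# Transposing a row-filled matrix gives the column-filled one
identical(t(m), matrix(1:9, nrow = 3, ncol = 3))   # TRUE
```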

Dataframes

Dataframes are probably the reason you started using R in the first place. They are the type of data structure that you seem to encounter the most when working with packages.

first_names <- c("Fred", "George", "Ronald", "Harry")
last_names <- c("Weasely", "Weasely", "Weasely", "Potter")
addresses <- c("27 Foobar Way", "27 Foobar Way", "27 Foobar Way", "4 Privet Dr")
counties <- c("Devon", "Devon", "Devon", "Surrey")
ages <- as.integer(c(17, 17, 15, 15))

potterTrivia <- data.frame("firstName" = first_names,
                           "lastName" = last_names,
                           "address" = addresses,
                           "county" = counties,
                           "age" = ages, 
                           stringsAsFactors = FALSE)

# yields
  firstName lastName address       county age
  <chr>     <chr>    <chr>         <chr>  <int>
1 Fred      Weasely  27 Foobar Way Devon  17
2 George    Weasely  27 Foobar Way Devon  17
3 Ronald    Weasely  27 Foobar Way Devon  15
4 Harry     Potter   4 Privet Dr   Surrey 15

Like everything else in R, dataframes are indexed, and the data inside them can be accessed by index position.

  • By column:
potterTrivia[1]

# yields
firstName
<chr>
Fred
George
Ronald
Harry

  • By row:
potterTrivia[1,]

# yields
  firstName lastName address       county age
  <chr>     <chr>    <chr>         <chr>  <int>
1 Fred      Weasely  27 Foobar Way Devon  17

  • By range of rows:
potterTrivia[1:3,]

# yields
  firstName lastName address       county age
  <chr>     <chr>    <chr>         <chr>  <int>
1 Fred      Weasely  27 Foobar Way Devon  17
2 George    Weasely  27 Foobar Way Devon  17
3 Ronald    Weasely  27 Foobar Way Devon  15

  • By range of rows and columns:
potterTrivia[1:3, 1:2]

# yields
  firstName lastName
  <chr>     <chr>
1 Fred      Weasely
2 George    Weasely
3 Ronald    Weasely
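
One thing that is easy to trip over is what type each indexing form hands back. A minimal sketch, using a cut-down two-row dataframe (df is a made-up name) rather than the full table:

```r
df <- data.frame(firstName = c("Fred", "Harry"),
                 age = c(17L, 15L),
                 stringsAsFactors = FALSE)

# Single brackets with one index keep the dataframe wrapper
class(df[1])                   # "data.frame"

# Double brackets, or [row, column] with a single column, return a vector
class(df[[1]])                 # "character"
class(df[, 1])                 # "character"

# drop = FALSE keeps a single column as a dataframe
class(df[, 1, drop = FALSE])   # "data.frame"

# A single row is still a dataframe
dim(df[1, ])                   # 1 2
```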

A dataframe's column names can be listed with the names() function, and a single column can be extracted as a vector with the $column_name syntax.

names(potterTrivia)

# yields
'firstName' 'lastName' 'address' 'county' 'age'

potterTrivia$firstName

# yields
'Fred' 'George' 'Ronald' 'Harry'

In many ways, a dataframe acts and feels like a database table. So it should come as no surprise that Pandas, which was developed to provide R-like functionality in Python, is described by some as, “an in-memory nosql database that has sql-like constructs”. I think similar things could be said about the R dataframe.
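
To make the database-table comparison a bit more concrete, here is a rough sketch of three SQL-style operations in base R. The SQL in the comments is only an analogy, and the result names (of_age, by_age, mean_age) are invented for this example:

```r
potterTrivia <- data.frame(
  firstName = c("Fred", "George", "Ronald", "Harry"),
  lastName  = c("Weasely", "Weasely", "Weasely", "Potter"),
  county    = c("Devon", "Devon", "Devon", "Surrey"),
  age       = c(17L, 17L, 15L, 15L),
  stringsAsFactors = FALSE
)

# SELECT firstName, age FROM potterTrivia WHERE age >= 17
of_age <- potterTrivia[potterTrivia$age >= 17, c("firstName", "age")]

# SELECT * FROM potterTrivia ORDER BY age, firstName
by_age <- potterTrivia[order(potterTrivia$age, potterTrivia$firstName), ]

# SELECT county, AVG(age) FROM potterTrivia GROUP BY county
mean_age <- aggregate(age ~ county, data = potterTrivia, FUN = mean)
```

Row filtering is just logical indexing in the row position, so anything that produces a TRUE/FALSE vector of the right length can act as the WHERE clause.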

