Today we will begin using R to analyze data. If you haven’t yet, you should install R and RStudio as described in R4DS 1.4.
To get to know R and RStudio, first go to the console window (likely in the lower left). Type 1+1 and hit enter.
Now go to the source panel (likely the top left). Type 1+1. Highlight the line and press Ctrl+Enter (Cmd+Return on a Mac). You will see the code executed in the console panel.
The source panel is where we can write an R script, which is reproducible code to produce results. We will almost always work from the source panel rather than the console.
R is open-source statistical software. While some functions are available in the base software itself (known as base R), we will often use functions stored in packages, which extend base R with additional functionality. One reason R is so powerful is that a large community of users has contributed open-source packages to this project.
One package we will use in almost every exercise is tidyverse. To install this package, type install.packages('tidyverse').
To load packages into R, we use the library function. To load tidyverse, put this at the top of your .R source code.
library(tidyverse)
Highlight this line and press Ctrl+Enter (Cmd+Return on a Mac). Now all of the tidyverse functions are available to you in R.
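If you are not sure whether a package is already installed, you can check before installing it again. The lines below are a minimal sketch of one way to do that; the variable name pkg is only illustrative.

```r
# Name of the package we want to check (illustrative variable name)
pkg <- "tidyverse"

# requireNamespace() returns TRUE if the package can be found and FALSE
# otherwise. It does not attach the package the way library() does.
if (requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is installed; load it with library(", pkg, ")")
} else {
  message(pkg, " is not installed; run install.packages(\"", pkg, "\") once")
}
```

Installing is something you do once per computer, while library() needs to be run in every new R session.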
We will load our first dataset as a .csv file (Comma Separated Values). Load this directly from the course website by typing this in your source editor, highlighting it, and pressing Ctrl+Enter (Cmd+Return on a Mac).
median_income <- read_csv("https://info3370.github.io/sp23/assets/data/median_income.csv")
The line above did a few things: it used the read_csv function (part of the tidyverse) to read the data, and it used the assignment operator <- to store the result in an object named median_income.
If you have succeeded, your environment panel on the top right will now show the object median_income. Your R environment can hold many kinds of objects: data frames like this one, plots, numeric vectors, etc. For instance, try creating some other objects.
favorite_number <- 2
hello <- "Hi [your name]"
Now if you type hello and hit enter, your R console will greet you. You can learn the class of an object with the class function.
class(hello)
class(favorite_number)
class(median_income)
median_income is a special kind of data.frame called a tibble. Its class is complicated! But tibbles will become familiar as we work with them all semester.
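When class() returns several names, as it does for a tibble, it can be easier to ask a yes-or-no question about the object instead. The sketch below uses a small made-up data frame called toy (not the course data) so that it runs without any packages.

```r
favorite_number <- 2
hello <- "Hi there"

class(favorite_number)   # "numeric"
class(hello)             # "character"

# A small made-up data frame (hypothetical values, not the course data)
toy <- data.frame(year = c(2000, 2001), income = c(100, 110))
class(toy)               # "data.frame"

# inherits() asks whether an object belongs to a given class,
# even when class() lists more than one name
inherits(toy, "data.frame")   # TRUE
is.data.frame(toy)            # TRUE
```

Because a tibble is a kind of data frame, inherits(median_income, "data.frame") should also return TRUE.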
When we load some data, the first thing we want to know is what it contains. Try these functions to see what they tell you about the data.
nrow(median_income)
ncol(median_income)
head(median_income)
tail(median_income)
colnames(median_income)
summary(median_income)
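To see what these functions return on data small enough to check by eye, here is a sketch with a made-up data frame. The name toy_income and its values are invented for illustration and are not the course data.

```r
# A made-up data frame with three rows and two columns
toy_income <- data.frame(
  year   = c(2019, 2020, 2021),
  income = c(100, 105, 103)
)

nrow(toy_income)       # 3
ncol(toy_income)       # 2
colnames(toy_income)   # "year" "income"
head(toy_income, 2)    # the first two rows
tail(toy_income, 1)    # the last row

# str() is another base R function that gives a compact overview:
# one line per column with its type and first few values
str(toy_income)
```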
Each function has an associated help file. These are great when you get stuck! For instance, type ?nrow to learn more about the nrow function.
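There are a couple of other ways to look things up from the console. This short sketch uses only base R functions.

```r
# help() with the function name in quotes opens the same page as ?nrow
help("nrow")

# args() shows a function's arguments without opening the full help page
args(nrow)
```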
Visualization is one of the most important data science tasks. We will begin by visualizing how the median U.S. household income changed over the period 1968–2022.
We will use ggplot2 for visualization all semester. This package is part of the tidyverse.
A ggplot always begins with a call to the ggplot() function. The data argument tells ggplot which data frame to use. Inside aes(), we set the x-variable to be year and the y-variable to be income. These refer to variables in the median_income data frame.
ggplot(data = median_income,
       aes(x = year, y = income))
Having prepared our blank plot canvas, we are ready to plot the data!
We will start by plotting the data points. Put a + at the end of your ggplot() call so R knows that more is coming, and then type geom_point() on the next line.
ggplot(data = median_income,
       aes(x = year, y = income)) +
  geom_point()
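Because each + adds a layer, you can also store a plot in an object and build on it step by step. The sketch below uses a made-up data frame (toy_income, with invented values) so that it runs without downloading anything. It adds geom_line(), another ggplot2 layer, only to illustrate stacking.

```r
library(ggplot2)

# Made-up data for illustration only (not the course data)
toy_income <- data.frame(
  year   = 2018:2022,
  income = c(100, 102, 101, 105, 107)
)

# Store the blank canvas in an object
p <- ggplot(data = toy_income,
            aes(x = year, y = income))

# Add one layer, then two layers, to the same canvas
p_points <- p + geom_point()
p_both   <- p + geom_point() + geom_line()

# Typing the object's name draws the plot
p_both
```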
We will continue layering in more code to make the plot better. You can modify the x- and y-axis titles by adding xlab() and ylab().
ggplot(data = median_income,
       aes(x = year, y = income)) +
  geom_point() +
  xlab("Year") +
  ylab("Median Household Income")
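ggplot2 also provides labs(), which sets several labels in one call. This sketch shows it as an alternative to separate xlab() and ylab() calls. The data frame toy_income and the title text are made up for illustration.

```r
library(ggplot2)

# Made-up data for illustration only (not the course data)
toy_income <- data.frame(
  year   = 2018:2022,
  income = c(100, 102, 101, 105, 107)
)

# labs() sets the axis titles and an overall title together
p_labeled <- ggplot(data = toy_income,
                    aes(x = year, y = income)) +
  geom_point() +
  labs(x = "Year",
       y = "Median Household Income",
       title = "A made-up example")

p_labeled
```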
You can customize many aspects of a ggplot. For example, you can change the overall look by adding a theme such as theme_bw() or theme_classic(). The theme documentation contains many possibilities. You can also change the color of the points by changing geom_point() to geom_point(color = "blue").
ggplot(data = median_income,
       aes(x = year, y = income)) +
  geom_point(color = "blue", size = 2, shape = 3) +
  xlab("Year") +
  ylab("Median Household Income") +
  theme_bw()
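A complete theme can be adjusted further with theme(). As a sketch, the code below makes the text larger and removes the minor grid lines. The data frame toy_income is made up, and these particular settings are just one possible choice.

```r
library(ggplot2)

# Made-up data for illustration only (not the course data)
toy_income <- data.frame(
  year   = 2018:2022,
  income = c(100, 102, 101, 105, 107)
)

p_themed <- ggplot(data = toy_income,
                   aes(x = year, y = income)) +
  geom_point(color = "blue") +
  # base_size sets the base font size for the whole theme
  theme_bw(base_size = 14) +
  # theme() overrides individual elements; element_blank() removes one
  theme(panel.grid.minor = element_blank())

p_themed
```

Put theme() after theme_bw(), because a complete theme added later would overwrite the earlier adjustments.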