Today we will begin using R to analyze data. If you haven’t yet, install R and RStudio as described in Section 1.4 of R for Data Science (R4DS).

R as a calculator

To get to know R and RStudio, first go to the console window (likely in the lower left). Type 1+1 and press Enter. The console prints the result: [1] 2.

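Now go to the source panel (likely the top left). If no source panel is visible, open a new script with File > New File > R Script. Type 1+1 there, highlight it, and press Ctrl+Enter (Cmd+Return on a Mac). Pressing Enter alone would only replace the highlighted text with a new line. You will see the code executed in the console panel.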

The source panel is where we write an R script: a saved file of code that anyone can rerun to reproduce the same results. We will almost always work from the source panel rather than the console.
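Beyond 1+1, a few more expressions show how R behaves as a calculator. This is a small sketch to paste into the source panel and run line by line; the numbers are arbitrary.

```r
# Arithmetic follows the usual order of operations
2 + 3 * 4      # multiplication happens before addition
(2 + 3) * 4    # parentheses change the order
2^10           # exponentiation
7 / 2          # ordinary division returns a decimal
7 %/% 2        # integer division drops the remainder
7 %% 2         # the remainder on its own
sqrt(16)       # functions take their input in parentheses
```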

Prepare the environment

R is open-source statistical software. Some functions are available in the software itself (known as base R), but we will often use functions stored in packages, which extend base R with additional functionality. One reason R is so powerful is that a large community of users has contributed open-source packages.

One package we will use in almost every exercise is tidyverse. To install this package, type install.packages('tidyverse') in the console. You only need to install a package once, not every time you open R.
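To avoid reinstalling a package that is already present, one common pattern is to check for it first. The sketch below is one way to do that, not a course requirement. The CRAN mirror URL is our assumption for illustration; the plain install.packages('tidyverse') call works just as well from RStudio.

```r
# TRUE when tidyverse is not yet installed on this computer
needs_install <- !requireNamespace("tidyverse", quietly = TRUE)

# Install only when missing; the repos URL is an assumed CRAN mirror
if (needs_install) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}
```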

To load packages into R, we use the library() function. To load tidyverse, put this line at the top of your .R script.

library(tidyverse)

Highlight the line and press Ctrl+Enter (Cmd+Return on a Mac). Now all of the tidyverse functions are available to you in R.

Load data

We will load our first dataset as a .csv file (Comma Separated Values). Load it directly from the course website by typing this in your source editor, highlighting it, and pressing Ctrl+Enter (Cmd+Return on a Mac).

median_income <- read_csv("https://info3370.github.io/sp23/assets/data/median_income.csv")

The line above did two things:

  1. Used the read_csv() function (part of the tidyverse) to read the data from the URL.
  2. Used the assignment operator <- to store the data in a new object called median_income.
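The same two steps can be tried without an internet connection. The sketch below is a minimal illustration that assumes readr 2.0 or later. The object names csv_text and toy_income and the numbers in them are made up, and stand in for the real file.

```r
library(readr)  # read_csv() lives in readr, which library(tidyverse) also loads

# Hypothetical CSV text standing in for a file; the values are made up
csv_text <- "year,income\n2000,100\n2001,110\n2002,105\n"

# Step 1: read_csv() parses the text; I() marks it as literal data, not a path
# Step 2: <- stores the result in a new object called toy_income
toy_income <- read_csv(I(csv_text), show_col_types = FALSE)

# Typing the object's name prints it
toy_income
```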

If this worked, the environment panel (likely the top right) will now show the object median_income. Your R environment can hold many kinds of objects: data frames like this one, plots, numeric vectors, and so on. For instance, try creating some other objects.

favorite_number <- 2
hello <- "Hi [your name]"

Now if you type hello in the console and press Enter, R will greet you. You can learn the class of an object with the class() function.

class(hello)
class(favorite_number)
class(median_income)

median_income is a special kind of data.frame called a tibble. Its class is complicated: class() returns several names at once, and data.frame is one of them. Tibbles will become familiar as we work with them all semester.
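To see what a multi-part class means in practice, here is a small sketch using a hand-made tibble. The object toy_income and its values are hypothetical; the same checks should work on median_income once it is loaded.

```r
library(tibble)  # part of the tidyverse

# A tiny hand-made tibble; the values are made up for illustration
toy_income <- tibble(year = c(2000, 2001), income = c(100, 110))

class(toy_income)               # several class names, ending in "data.frame"
is.data.frame(toy_income)       # TRUE: a tibble is still a data frame
inherits(toy_income, "tbl_df")  # TRUE: "tbl_df" is the class that marks a tibble
```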

Look at the data

After loading a dataset, the first thing we want to know is what it contains. Try these functions to see what they tell you about the data.

nrow(median_income)
ncol(median_income)
head(median_income)
tail(median_income)
colnames(median_income)
summary(median_income)
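A few other functions give a similar overview and may be worth trying alongside these. The sketch uses a made-up stand-in called toy_income so that it runs on its own; in practice you would pass median_income instead.

```r
library(tibble)  # glimpse() is also available after library(tidyverse)

# Made-up stand-in with the same column names as median_income
toy_income <- tibble(year = c(2000, 2001, 2002), income = c(100, 110, 105))

dim(toy_income)      # number of rows and number of columns in one call
glimpse(toy_income)  # one line per column: name, type, and first few values
str(toy_income)      # base R's compact view of the same structure
```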

Each function has an associated help file. These are great when you get stuck! For instance, type ?nrow to learn more about the nrow function.

Visualize the data

Visualization is one of the most important data science tasks. We will begin by visualizing how the median U.S. household income changed over the period 1968–2022.

We will use ggplot2 for visualization all semester. This package is part of the tidyverse.

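A ggplot always begins with a call to the ggplot() function. The data argument names the dataset, and aes() maps its columns to the x- and y-axes. Running this code draws the axes but no data yet.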

ggplot(data = median_income,
       aes(x = year, y = income))

Having prepared our blank plot canvas, we are ready to plot the data! We will start by plotting the data points. Put a + at the end of your ggplot() call so R knows that more is coming, and then type geom_point() on the next line.

ggplot(data = median_income,
       aes(x = year, y = income)) +
  geom_point()
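The + can be repeated to stack more than one layer. As a sketch of that idea, the code below adds geom_line() underneath the points. The data frame toy_income and the name layered_plot are hypothetical stand-ins with made-up values; swap in median_income to try it on the real data.

```r
library(ggplot2)

# Made-up stand-in with the same column names as median_income
toy_income <- data.frame(year = 2000:2003,
                         income = c(100, 110, 105, 120))

# Each + adds another layer on top of the canvas
layered_plot <- ggplot(data = toy_income,
                       aes(x = year, y = income)) +
  geom_line() +   # a line connecting the observations in order of year
  geom_point()    # points drawn on top of the line

# Typing the name of a stored plot draws it
layered_plot
```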

Improve the axis labels

We will continue layering in more code to make the plot better. You can change the titles of the x- and y-axes by adding xlab() and ylab().

ggplot(data = median_income,
       aes(x = year, y = income)) +
  geom_point() +
  xlab("Year") +
  ylab("Median Household Income")
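An alternative worth knowing is labs(), which sets several labels in one call. The sketch below also formats the y-axis ticks as dollars using the scales package, which is installed alongside ggplot2. The data, the object names, and the title text are made up for illustration.

```r
library(ggplot2)

# Made-up stand-in with the same column names as median_income
toy_income <- data.frame(year = 2000:2003,
                         income = c(50000, 52000, 51000, 54000))

labeled_plot <- ggplot(data = toy_income,
                       aes(x = year, y = income)) +
  geom_point() +
  # labs() sets the axis titles and a plot title together
  labs(x = "Year",
       y = "Median Household Income",
       title = "Example title") +
  # label_dollar() formats tick labels with a dollar sign and commas
  scale_y_continuous(labels = scales::label_dollar())

labeled_plot
```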

Try some things on your own

You can customize many aspects of a ggplot. Try changing the color, size, and shape values below to see what each one controls.

ggplot(data = median_income,
       aes(x = year, y = income)) +
  geom_point(color = "blue", size = 2, shape = 3) +
  xlab("Year") +
  ylab("Median Household Income") +
  theme_bw()
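Once a plot looks the way you want, you may want to keep it. A minimal sketch, assuming a made-up stand-in dataset: store the plot in an object, then write it to a file with ggsave(). The file name and the size are arbitrary, and the temporary folder is used only so the example runs anywhere; you would normally save into your own project folder.

```r
library(ggplot2)

# Made-up stand-in with the same column names as median_income
toy_income <- data.frame(year = 2000:2003,
                         income = c(100, 110, 105, 120))

# Store the plot in an object instead of drawing it right away
income_plot <- ggplot(data = toy_income,
                      aes(x = year, y = income)) +
  geom_point(color = "blue", size = 2, shape = 3) +
  theme_bw()

# Hypothetical output path inside R's temporary folder
out_file <- file.path(tempdir(), "median_income_plot.png")

# ggsave() writes the plot to disk; width and height are in inches
ggsave(out_file, plot = income_plot, width = 6, height = 4)
```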