Summarizing Categorical Data

STAT 20: Introduction to Probability and Statistics

Agenda

Time for Lab 1.1 Arbuthnot (10 minutes)
Using R (20 minutes)
Your turn (8 minutes)
Break (5 minutes)
Concept Questions + Discussion (25 minutes)
Your first plot (5 minutes)
Lab 1.2: Arbuthnot

Lab 1.1 Arbuthnot

10:00

Loading Packages

R has a vast ecosystem of packages that add new functions. Any installed package can be loaded with the library() function.

Our two main packages:

tidyverse
stat20data

Load them with:

library(tidyverse)
library(stat20data)

Loading data from a package

You will rarely create data by hand. Instead, you will either be

Loading it in from a separate file.
Loading it from within an R package (most of ours are in stat20data).

To load data from a package:

Load that package with library().
Print the data to the console by typing its name and pressing Enter, or open it in the viewer with View(<df name>).

library(stat20data)
penguins

# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Functions on data frames

Three functions from the tidyverse

The tidyverse (through its dplyr package) provides several functions for manipulating data frames:

select() : subset columns

arrange() : sort rows

mutate() : create a new column from existing column(s)

Selecting columns

select(penguins, species, island)

# A tibble: 333 × 2
   species island   
   <fct>   <fct>    
 1 Adelie  Torgersen
 2 Adelie  Torgersen
 3 Adelie  Torgersen
 4 Adelie  Torgersen
 5 Adelie  Torgersen
 6 Adelie  Torgersen
 7 Adelie  Torgersen
 8 Adelie  Torgersen
 9 Adelie  Torgersen
10 Adelie  Torgersen
# ℹ 323 more rows

Arranging the rows of a data frame

arrange(penguins, bill_length_mm)

# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Dream               32.1          15.5               188        3050
 2 Adelie  Dream               33.1          16.1               178        2900
 3 Adelie  Torgersen           33.5          19                 190        3600
 4 Adelie  Dream               34            17.1               185        3400
 5 Adelie  Torgersen           34.4          18.4               184        3325
 6 Adelie  Biscoe              34.5          18.1               187        2900
 7 Adelie  Torgersen           34.6          21.1               198        4400
 8 Adelie  Torgersen           34.6          17.2               189        3200
 9 Adelie  Biscoe              35            17.9               190        3450
10 Adelie  Biscoe              35            17.9               192        3725
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>

You can sort in descending order by wrapping the variable name in desc().
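
A minimal sketch of the desc() pattern. It uses R's built-in mtcars data set (an assumption made so the example runs with only dplyr installed); the name by_mpg_desc is hypothetical.

```r
library(dplyr)

# Sort the rows of mtcars by miles per gallon, largest value first.
by_mpg_desc <- arrange(mtcars, desc(mpg))
head(by_mpg_desc)

# The same pattern applied to the penguins data would be:
# arrange(penguins, desc(bill_length_mm))
```

Without desc(), arrange() sorts from smallest to largest, as in the output above.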

Mutating a new column

mutate(penguins, bill_index = bill_depth_mm + bill_length_mm)

# A tibble: 333 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 323 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_index <dbl>

Remember that you can nest functions.

Nesting functions

select(mutate(penguins, bill_index = bill_depth_mm + bill_length_mm), bill_index)

# A tibble: 333 × 1
   bill_index
        <dbl>
 1       57.8
 2       56.9
 3       58.3
 4       56  
 5       59.9
 6       56.7
 7       58.8
 8       58.7
 9       59.8
10       55.7
# ℹ 323 more rows
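
Nested calls are evaluated from the inside out: mutate() runs first, and select() then works on its result. A small sketch comparing the nested form with a step-by-step form that saves the intermediate data frame. It uses the built-in mtcars data so it runs with only dplyr installed; the names disp_per_cyl, cars_with_ratio, nested, and stepwise are hypothetical.

```r
library(dplyr)

# Nested form: the inner mutate() is evaluated first, then select().
nested <- select(mutate(mtcars, disp_per_cyl = disp / cyl), disp_per_cyl)

# Step-by-step form: save the intermediate result, then select from it.
cars_with_ratio <- mutate(mtcars, disp_per_cyl = disp / cyl)
stepwise <- select(cars_with_ratio, disp_per_cyl)

# Check that the two approaches produce the same data frame.
identical(nested, stepwise)
```

The step-by-step form is longer, but each line can be run and inspected on its own.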

Your turn

R has a built-in data set called mtcars with information on cars that appeared in Motor Trend magazine. It is already loaded and can be accessed as mtcars.

Create a slimmer data frame that only contains the columns hp and wt and save it to mtcars_slim.
Create a new column called power_to_weight that is the ratio of hp to wt. Save the three-column data frame back over mtcars_slim.
Sort the data frame in descending order by the power-to-weight ratio.

Hint: look up help files!

08:00

Break

05:00

Concept Questions

The table below displays data from a survey on a class of students.

What proportion of the class was in the marching band?

00:30

What proportion of those in the marching band were juniors?

00:30

What proportion of the class were sophomores who were not in the marching band?

00:30

What were the dimensions of the raw data from which this table was constructed?

00:30

How would you characterize the association between these two variables?

00:30
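
The proportion questions above differ in their numerators and denominators. A minimal sketch with made-up counts (hypothetical, not the survey table from the slide) that contrasts a marginal, a conditional, and a joint proportion in base R:

```r
# Hypothetical counts for illustration only; NOT the table from the slide.
counts <- matrix(
  c(10, 40,
    20, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(year = c("sophomore", "junior"),
                  band = c("yes", "no"))
)
n <- sum(counts)                      # total number of students
n_band <- sum(counts[, "yes"])        # number of students in the band

# Marginal: share of the whole class that is in the band.
prop_band <- n_band / n

# Conditional: among band members only, the share who are juniors.
prop_junior_given_band <- counts["junior", "yes"] / n_band

# Joint: share of the whole class that is both sophomore and not in the band.
prop_soph_and_no_band <- counts["sophomore", "no"] / n

c(prop_band, prop_junior_given_band, prop_soph_and_no_band)
```

The conditional proportion divides by the band total rather than the class total, which is the only difference from the other two.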

Political affiliation and college degree status of 500 survey participants.

Which group is the largest?

01:00

What does this plot show?

00:30

Your first plot

A template for a line plot:

ggplot(DATAFRAME, aes(x = XVARIABLE, y = YVARIABLE)) +
  geom_line()

Where:

DATAFRAME is the name of your data frame
XVARIABLE is the name of the variable in that data frame that you want on the x-axis
YVARIABLE is the name of the variable in that data frame that you want on the y-axis
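
One way the template might be filled in, as a sketch. It assumes the economics data frame that ships with ggplot2 (not a data set used elsewhere in these slides), with date on the x-axis and unemploy on the y-axis; the object name p is hypothetical.

```r
library(ggplot2)

# DATAFRAME = economics, XVARIABLE = date, YVARIABLE = unemploy
p <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# Printing the plot object draws it.
p
```

Swapping in your own data frame and variable names is the only change needed to reuse the template.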

Lab 1.2: Arbuthnot

20:00
