Summarizing Categorical Data

STAT 20: Introduction to Probability and Statistics

Agenda

  • Time for Lab 1.1 Arbuthnot (10 minutes)
  • Using R (20 minutes)
  • Your turn (8 minutes)
  • Break (5 minutes)
  • Concept Questions + Discussion (25 minutes)
  • Your first plot (5 minutes)
  • Lab 1.2: Arbuthnot

Lab 1.1 Arbuthnot

10:00

Loading Packages

R has a vast ecosystem of packages that add new functions. Any installed package can be loaded with the library() function.

Our two main packages:

  • tidyverse
  • stat20data

Load them with:

library(tidyverse)
library(stat20data)

Loading data from a package

Most data you will not be creating by hand. You will either be

  1. Loading it in from a separate file.

  2. Loading it from within an R package (most of ours are in stat20data).
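
For the first case, here is a minimal sketch of reading a file, assuming the data is stored as a CSV. The file here is a temporary one written on the spot so the example runs on its own; the name penguins_small and the three rows (copied from the penguins printout in these slides) are only for illustration. read_csv() comes from readr, which is also loaded by library(tidyverse).

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained.
# In practice the file would already exist on your computer.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("species,island,bill_length_mm",
    "Adelie,Torgersen,39.1",
    "Adelie,Torgersen,39.5",
    "Adelie,Torgersen,40.3"),
  csv_path
)

# read_csv() returns a tibble; show_col_types = FALSE silences the
# message that lists the guessed column types.
penguins_small <- read_csv(csv_path, show_col_types = FALSE)
penguins_small
```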

To load data from a package:

  1. Load that package with library().
  2. Print the data to the console by typing its name and pressing Enter, or open it in the viewer with View(<df name>).

library(stat20data)
penguins
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Functions on data frames

3 functions from the tidyverse

The tidyverse package contains several functions used to manipulate data frames:

  • select() : subset columns
  • arrange() : sort rows
  • mutate() : create a new column from existing column(s)

Selecting columns

select(penguins, species, island)
# A tibble: 333 × 2
   species island   
   <fct>   <fct>    
 1 Adelie  Torgersen
 2 Adelie  Torgersen
 3 Adelie  Torgersen
 4 Adelie  Torgersen
 5 Adelie  Torgersen
 6 Adelie  Torgersen
 7 Adelie  Torgersen
 8 Adelie  Torgersen
 9 Adelie  Torgersen
10 Adelie  Torgersen
# ℹ 323 more rows

Arranging the rows of a data frame

arrange(penguins, bill_length_mm)
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Dream               32.1          15.5               188        3050
 2 Adelie  Dream               33.1          16.1               178        2900
 3 Adelie  Torgersen           33.5          19                 190        3600
 4 Adelie  Dream               34            17.1               185        3400
 5 Adelie  Torgersen           34.4          18.4               184        3325
 6 Adelie  Biscoe              34.5          18.1               187        2900
 7 Adelie  Torgersen           34.6          21.1               198        4400
 8 Adelie  Torgersen           34.6          17.2               189        3200
 9 Adelie  Biscoe              35            17.9               190        3450
10 Adelie  Biscoe              35            17.9               192        3725
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>

You can sort in descending order by wrapping the variable name in desc().
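
A small sketch of desc() in action. To keep it runnable without stat20data, it uses a four-row data frame (penguins_head, a made-up name) holding the first four bill lengths printed above; on the full data the call would be arrange(penguins, desc(bill_length_mm)).

```r
library(dplyr)

# First four rows of penguins as printed above, two columns only.
penguins_head <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Adelie"),
  bill_length_mm = c(39.1, 39.5, 40.3, 36.7)
)

# arrange() sorts ascending by default; desc() reverses the order
# for the variable it wraps.
sorted <- arrange(penguins_head, desc(bill_length_mm))
sorted
```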

Mutating a new column

mutate(penguins, bill_index = bill_depth_mm + bill_length_mm)
# A tibble: 333 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 323 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_index <dbl>

Remember that you can nest functions.

Nesting functions

select(mutate(penguins, bill_index = bill_depth_mm + bill_length_mm), bill_index)
# A tibble: 333 × 1
   bill_index
        <dbl>
 1       57.8
 2       56.9
 3       58.3
 4       56  
 5       59.9
 6       56.7
 7       58.8
 8       58.7
 9       59.8
10       55.7
# ℹ 323 more rows
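
One way to read a nested call is from the inside out. The sketch below compares the nested form with a step-by-step version that saves the intermediate data frame first. It assumes a three-row stand-in for penguins (penguins_head, with values taken from the printout above); the names with_index, nested, and stepwise are made up for this example.

```r
library(dplyr)

# Three rows copied from the penguins printout, bill columns only.
penguins_head <- tibble(
  bill_length_mm = c(39.1, 39.5, 40.3),
  bill_depth_mm  = c(18.7, 17.4, 18.0)
)

# Nested: the inner mutate() runs first and its result feeds select().
nested <- select(
  mutate(penguins_head, bill_index = bill_depth_mm + bill_length_mm),
  bill_index
)

# Stepwise: save the intermediate data frame, then select from it.
with_index <- mutate(penguins_head, bill_index = bill_depth_mm + bill_length_mm)
stepwise <- select(with_index, bill_index)

# Check whether the two approaches give the same data frame.
identical(nested, stepwise)
```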

Your turn

R has a built-in data set called mtcars with information on cars that appeared in Motor Trend magazine. It is already loaded and can be accessed as mtcars.

  1. Create a slimmer data frame that only contains the columns hp and wt and save it to mtcars_slim.

  2. Create a new column called power_to_weight that is the ratio of hp to wt. Save the three-column data frame back over mtcars_slim.

  3. Sort the data frame in descending order by the power-to-weight ratio.

Hint: look up help files!

08:00


Break

05:00

Concept Questions

The table below displays data from a survey of a class of students.

What proportion of the class was in the marching band?

00:30

What proportion of those in the marching band were juniors?

00:30

What proportion were sophomores who were not in the marching band?

00:30

What were the dimensions of the raw data from which this table was constructed?

00:30

How would you characterize the association between these two variables?

00:30

Political affiliation and college degree status of 500 survey participants.

Which group is the largest?

01:00

What does this plot show?

00:30

Your first plot

A template for a line plot:


ggplot(DATAFRAME, aes(x = XVARIABLE, y = YVARIABLE)) +
  geom_line()


Where:

  • DATAFRAME is the name of your data frame
  • XVARIABLE is the name of the variable in that data frame that you want on the x-axis
  • YVARIABLE is the name of the variable in that data frame that you want on the y-axis
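
A filled-in version of the template might look like the sketch below. The data frame toy and its columns day and count are hypothetical, with values made up only to show where each name goes; in the lab you would substitute your own data frame and variables.

```r
library(ggplot2)

# Hypothetical values, invented only to fill in the template.
toy <- data.frame(
  day   = c(1, 2, 3, 4, 5),
  count = c(3, 5, 4, 7, 6)
)

# DATAFRAME -> toy, XVARIABLE -> day, YVARIABLE -> count
p <- ggplot(toy, aes(x = day, y = count)) +
  geom_line()

# Printing the plot object draws it.
p
```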

Lab 1.2: Arbuthnot

20:00