Applied Data Science Training

In the SPEC Lab, we invest a lot of energy in training students in applied data science. In particular, we focus on data management and data visualization using R, an open-source language and environment for statistical computing.

These trainings are complementary to coursework in statistics and econometrics. While statistics and econometrics courses generally focus on theory and mathematics, we teach the nuts and bolts of statistical computing. We developed these trainings in collaboration with our Pipeline Partners to prepare students not only for academic social science research, but also for data science careers in government, nonprofits, and the private sector. 

 

One of our favorite free resources for learning econometrics is the Econometrics Academy, founded by Dr. Ani Katchova. Also, Gary King has put all of the lectures for his introductory quantitative methods course on YouTube (targeted at first-year graduate students). For R resources complementary to these materials, we are big fans of this online textbook and this list of resources by topic.

The modules below are designed to be completed in order, but some skipping around is definitely possible. Each module contains:

1. A module guide
2. Lecture videos
3. A walkthrough exercise, which is designed to be completed in small groups but can be completed alone
4. A homework assignment designed to allow students to demonstrate individual mastery

These materials are a constant work in progress and we welcome feedback at uscspeclab@gmail.com or benjamin.a.graham@usc.edu. Funding for the creation of these materials was provided by the National Science Foundation, the Dornsife College of Letters, Arts, and Sciences at the University of Southern California, and individual SPEC Lab Donors. To support our continued work to improve and expand these materials, please donate here.

training data

Download all datasets necessary for these trainings here. We recommend placing these datasets in a single folder and then setting your R working directory to this folder (see our Introduction to R materials for more information on working directories).
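
As a minimal sketch of this setup, the snippet below creates a folder and points R at it. The folder name is hypothetical, and a temporary directory is used only so the code runs on any machine; in practice, replace it with the folder where you saved the downloaded datasets.

```r
# Hypothetical folder for illustration; in practice use the path to your
# downloaded datasets, e.g. "~/Documents/spec_training_data"
data_dir <- file.path(tempdir(), "spec_training_data")
dir.create(data_dir, showWarnings = FALSE)

# Switch the working directory; setwd() invisibly returns the previous one,
# so it can be restored later with setwd(old_dir)
old_dir <- setwd(data_dir)

getwd()        # confirm where R is now looking for files
list.files()   # list the datasets R can see in that folder
```

Once the working directory is set, files in that folder can be read by file name alone, without typing the full path each time.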

organization for collaboration

This is an optional module that covers SPEC Lab guidance for organizing collaborative research projects in the context of large research teams.

introduction to R

This module motivates these trainings conceptually and introduces the R statistical computing software. Topics include vectors, dataframes, variables, and arithmetic in R.
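
To give a flavor of these topics, here is a small base R sketch of vectors, arithmetic, and dataframes. The country labels and values are made up for illustration and do not come from the training datasets.

```r
# Two vectors with made-up values
country      <- c("A", "B", "C")
gdp_billions <- c(120, 450, 80)

# Arithmetic is vectorized: the operation applies to every element
gdp_millions <- gdp_billions * 1000

# A dataframe stores vectors of equal length as columns (variables)
df <- data.frame(country = country, gdp_billions = gdp_billions)

# Create a new variable from an existing one
df$gdp_millions <- df$gdp_billions * 1000

str(df)                 # inspect the structure of the dataframe
mean(df$gdp_billions)   # summarize a single variable
```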

data management I

Exploring and Manipulating Data

This module introduces the tidyverse, a collection of R packages for working with data, and covers how to subset data and create new variables, as well as how to group and arrange observations and summarize information. Functions covered include select(), filter(), mutate(), summarise(), group_by(), and arrange().
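
The sketch below chains the six functions named above into one pipeline. It loads dplyr (the tidyverse package that provides them) and uses a made-up country-year dataset; the variable names are hypothetical and are not taken from the training data.

```r
library(dplyr)

# Made-up country-year data for illustration only
econ <- tibble(
  country = c("A", "A", "B", "B", "C", "C"),
  year    = c(2000, 2001, 2000, 2001, 2000, 2001),
  gdp     = c(100, 110, 200, 190, 50, 65),
  pop     = c(10, 10, 20, 21, 5, 5),
  notes   = c("x", "x", "y", "y", "z", "z")
)

country_summary <- econ %>%
  select(country, year, gdp, pop) %>%        # keep only the columns we need
  filter(year >= 2000) %>%                   # keep rows that meet a condition
  mutate(gdp_pc = gdp / pop) %>%             # create a new variable
  group_by(country) %>%                      # later steps run within country
  summarise(mean_gdp_pc = mean(gdp_pc)) %>%  # collapse to one row per country
  arrange(desc(mean_gdp_pc))                 # sort from highest to lowest

country_summary
```

Each step takes a dataframe and returns a dataframe, which is what makes it possible to read the pipeline from top to bottom as a sequence of verbs.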

data management II

Reshaping and Merging Data

This module covers how to restructure datasets to prepare them to be merged together, and then how to complete that merging process. Functions covered include: pivot_wider() and pivot_longer().
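
A minimal sketch of this workflow follows, using made-up data. The pivot functions come from tidyr. The merge step uses dplyr's left_join(), which is our assumption for illustration, since the description above does not name a specific merge function.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one row per country, one column per year
gdp_wide <- tibble(
  country = c("A", "B"),
  `2000`  = c(100, 200),
  `2001`  = c(110, 190)
)

# Reshape to long format: one row per country-year
gdp_long <- gdp_wide %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "gdp") %>%
  mutate(year = as.integer(year))   # column names arrive as text

# A second made-up dataset that is already in country-year format
pop_long <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(2000L, 2001L, 2000L, 2001L),
  pop     = c(10, 10, 20, 21)
)

# Merge on the shared keys (left_join() is an assumed choice here)
merged <- left_join(gdp_long, pop_long, by = c("country", "year"))

# Reshape back to wide format if a wide layout is needed
gdp_wide_again <- merged %>%
  select(country, year, gdp) %>%
  pivot_wider(names_from = year, values_from = gdp)
```

Reshaping first matters because both datasets need the same unit of observation (here, the country-year) before their rows can be matched.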

data management IIA

Append IDs

This module covers how to use the append_ids function created by the SPEC Lab to append Gleditsch-Ward country ID numbers to datasets on the basis of country names. This module is narrowly geared toward the management of the type of country-year datasets common in the quantitative study of comparative politics and international relations and may not be useful to all students.

data visualization I

Descriptive Data Visualization

This module introduces the ggplot2 package and teaches students how to create basic descriptive visualizations, including histograms, scatter plots, line plots, bar plots, and box plots. It also covers how to split a plot into panels with facet_wrap().
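
As a rough illustration, the sketch below draws a faceted histogram and a scatter plot. It uses the mpg dataset that ships with ggplot2 rather than the training data, so the variables shown are stand-ins.

```r
library(ggplot2)

# Histogram of highway fuel economy, one panel per vehicle class
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ class) +
  labs(x = "Highway miles per gallon", y = "Number of cars")

# Scatter plot of engine size against highway fuel economy
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine displacement (liters)", y = "Highway miles per gallon")

print(p_hist)
print(p_scatter)
```

The other plot types follow the same pattern with a different geom, for example geom_line(), geom_bar(), or geom_boxplot().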

data management III

Data Management for Visualization

This module covers types of data management frequently necessary for making beautiful figures. We continue working with the group_by(), summarise(), and mutate() functions that were first introduced in data management I.
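
One common pattern, sketched here under our own assumptions, is to collapse the data to one row per group before plotting. The example uses the mpg dataset bundled with ggplot2 as a stand-in for the training data.

```r
library(dplyr)
library(ggplot2)

# Collapse to one row per vehicle class, then order classes by their mean
class_means <- mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy), n_cars = n()) %>%
  mutate(class = reorder(class, mean_hwy))   # factor ordered by mean_hwy

# Bars now appear sorted by value rather than alphabetically
p_bar <- ggplot(class_means, aes(x = class, y = mean_hwy)) +
  geom_col() +
  coord_flip() +
  labs(x = "Vehicle class", y = "Mean highway miles per gallon")

print(p_bar)
```

Doing the summarizing explicitly, rather than inside the plotting call, keeps the numbers behind the figure available for inspection.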
 

data visualization II

Visualizing Statistical Relationships

This module continues to work in ggplot2, with an emphasis on using variation in colors, shapes, line types, and labels to enable cross-group comparisons.
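
The sketch below shows one way to map a grouping variable to color, shape, and line type at the same time. It again relies on the mpg dataset bundled with ggplot2, so the variables are illustrative only.

```r
library(ggplot2)

# Map drive type (drv) to color and shape for points,
# and to line type for the fitted lines
p_groups <- ggplot(mpg, aes(x = displ, y = hwy, color = drv, shape = drv)) +
  geom_point(alpha = 0.7) +
  geom_smooth(aes(linetype = drv), method = "lm", formula = y ~ x,
              se = FALSE) +
  labs(
    title    = "Fuel economy by engine size and drive type",
    x        = "Engine displacement (liters)",
    y        = "Highway miles per gallon",
    color    = "Drive type",
    shape    = "Drive type",
    linetype = "Drive type"
  )

print(p_groups)
```

Giving the three legends the same title lets ggplot2 merge them into a single legend, which tends to make cross-group comparisons easier to read.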

regression I

Math-Free Regression Intuition

This module provides a conceptual, math-free introduction to linear regression.

regression II

T-tests and Linear Regression in R

This module covers how to run t-tests and estimate linear regressions in R.
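
For orientation, here is a minimal sketch of a t-test and a linear regression in base R. It uses the built-in mtcars dataset rather than the training data, and the choice of variables is ours for illustration.

```r
# mtcars ships with base R; copy it so the original stays untouched
car_data <- mtcars
car_data$transmission <- factor(car_data$am, levels = c(0, 1),
                                labels = c("automatic", "manual"))

# Two-sample t-test: does mean fuel economy differ by transmission type?
# t.test() runs Welch's unequal-variance test by default
tt <- t.test(mpg ~ transmission, data = car_data)
tt$p.value

# Linear regression of fuel economy on weight and transmission type
fit <- lm(mpg ~ wt + transmission, data = car_data)
summary(fit)   # coefficients, standard errors, R-squared
coef(fit)      # just the coefficient estimates
```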

regression III

Regression Tables and Extracting Information for Regression Visualization

This module teaches students how to make regression tables in R using the stargazer package.
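
A short sketch of both ideas in the module title follows. The stargazer() call prints a plain-text table for two models fit to the built-in mtcars dataset. The second half, which collects coefficients and confidence intervals into a dataframe for plotting, is one possible approach that we assume for illustration and may differ from the module's own method.

```r
library(stargazer)

# Two nested models on the built-in mtcars data
fit_1 <- lm(mpg ~ wt, data = mtcars)
fit_2 <- lm(mpg ~ wt + hp, data = mtcars)

# Side-by-side regression table; type can also be "latex" or "html"
stargazer(fit_1, fit_2,
          type             = "text",
          title            = "Determinants of fuel economy",
          dep.var.labels   = "Miles per gallon",
          covariate.labels = c("Weight (1000 lbs)", "Horsepower"))

# Collect estimates and 95% confidence intervals into a dataframe
# that could be passed to a plotting function
coef_table <- as.data.frame(summary(fit_2)$coefficients)
coef_table$term <- rownames(coef_table)
ci <- confint(fit_2)
coef_table$conf_low  <- ci[, 1]
coef_table$conf_high <- ci[, 2]

coef_table
```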

© 2023 by USC SPEC.
