Applied Data Science Training

At the SPEC Lab, we devote significant effort to training students in applied data science—with a particular focus on data management and data visualization using R, an open-source statistical programming language widely used in both academia and industry.

Our training modules are designed to complement formal coursework in statistics and econometrics. Whereas traditional courses emphasize theory and mathematics, our approach focuses on the practical skills of statistical computing. These modules were developed in collaboration with our private sector and government partners, ensuring that students are prepared not only for academic research but also for careers in data science across government, nonprofit, and private sectors.

Our training modules are designed to be taken in sequence, but students are welcome to skip around. Each module includes:

  1. A module guide

  2. Lecture videos (via YouTube)

  3. A walkthrough exercise (for solo or guided completion)

  4. A group-work exercise (intended for small groups but also suitable for solo work)

  5. A homework assignment to assess individual mastery

We also provide answer keys for the exercises and homework, as well as the R scripts featured in the lecture videos.

Lastly, we’ve compiled a Google Drive folder with additional online resources. It’s a great place to start exploring.

These materials are a constant work in progress, and we welcome feedback at benjamin.a.graham@usc.edu. Funding for the creation of these materials was provided by the National Science Foundation, the Dornsife College of Letters, Arts, and Sciences at the University of Southern California, and individual SPEC Lab donors.

training data

Download all datasets necessary for these trainings here. We recommend placing these datasets in a single folder and then setting your R working directory to this folder (see our Introduction to R materials for more information on working directories).

organization for collaboration

This is an optional module that covers SPEC Lab guidance for organizing collaborative research projects in the context of large research teams.

introduction to R

This module motivates these trainings conceptually and introduces the R statistical computing software. Topics include vectors, dataframes, variables, and arithmetic in R.
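
As a rough illustration of the topics listed above, here is a minimal base R sketch. The variable names and values are made up for this example and are not taken from the module materials.

```r
# Hypothetical example data; names and values are illustrative only.
country             <- c("A", "B", "C")   # a character vector
gdp_billions        <- c(500, 1200, 300)  # a numeric vector
population_millions <- c(50, 60, 10)      # another numeric vector

# Combine the vectors into a data frame; each vector becomes a variable (column).
economies <- data.frame(country, gdp_billions, population_millions)

# Arithmetic on vectors is element-wise, so this creates one value per row.
# (billions / millions) * 1000 gives GDP per capita in single currency units.
economies$gdp_per_capita <-
  economies$gdp_billions * 1000 / economies$population_millions

print(economies)
mean(economies$gdp_per_capita)
```

The `$` operator used here both reads an existing variable from a data frame and creates a new one.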

data management I

Exploring and Manipulating Data

This module introduces the tidyverse package and covers how to subset data and create new variables, as well as how to group and arrange observations and summarize information. Functions covered include: select(), filter(), mutate(), summarise(), group_by(), arrange(). 
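
To give a sense of how the functions listed above fit together, here is a minimal sketch using the dplyr package (part of the tidyverse). The `trade` data frame and its column names are hypothetical and are not one of the training datasets.

```r
library(dplyr)

# Hypothetical country-year data; values are illustrative only.
trade <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  year    = c(2000, 2001, 2000, 2001, 2000, 2001),
  exports = c(10, 12, 40, 44, 5, 4),
  imports = c(8, 15, 30, 33, 6, 7),
  notes   = c("", "", "", "", "", "")
)

trade_summary <- trade %>%
  select(country, year, exports, imports) %>%  # keep only these columns
  filter(year >= 2000) %>%                     # keep rows meeting a condition
  mutate(balance = exports - imports) %>%      # create a new variable
  group_by(country) %>%                        # group observations by country
  summarise(mean_balance = mean(balance)) %>%  # one summary row per group
  arrange(desc(mean_balance))                  # sort, largest balance first

print(trade_summary)
```

Each step takes a data frame and returns a data frame, which is why the steps can be chained with the pipe.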

data management II

Reshaping and Merging Data

This module covers how to reshape datasets so that they can be merged, and then how to carry out the merge. Functions covered include: pivot_wider() and pivot_longer().
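
The reshaping step can be sketched as follows with the tidyr package (part of the tidyverse). The `wide` data frame and its column names are hypothetical, chosen only to show a round trip between wide and long layouts.

```r
library(tidyr)

# Hypothetical wide data: one row per country, one column per year.
wide <- data.frame(
  country  = c("A", "B"),
  gdp_2000 = c(1.0, 2.0),
  gdp_2001 = c(1.5, 2.5)
)

# Wide to long: one row per country-year.
long <- pivot_longer(
  wide,
  cols         = starts_with("gdp_"),
  names_to     = "year",
  names_prefix = "gdp_",   # strip the prefix so year holds "2000", "2001"
  values_to    = "gdp"
)

# Long back to wide: one column per year again.
wide_again <- pivot_wider(
  long,
  names_from   = year,
  values_from  = gdp,
  names_prefix = "gdp_"
)

print(long)
print(wide_again)
```

A long, country-year layout is typically the shape in which two datasets share the key columns needed for a merge.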

data management IIA

Append IDs

This module covers how to use the append_ids function, created by the SPEC Lab, to append Gleditsch-Ward country ID numbers to datasets on the basis of country names. It is narrowly geared toward managing the country-year datasets common in the quantitative study of comparative politics and international relations, and may not be useful to all students.

data visualization I

Descriptive Data Visualization

This module introduces the ggplot2 package and teaches students how to create basic descriptive visualizations, including histograms, scatter plots, line plots, bar plots, and box plots. It also covers facet_wrap().
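
As a minimal sketch of this kind of plot, the example below uses `mtcars`, a dataset that ships with R. The choice of dataset and variables is ours for illustration and is not drawn from the module materials.

```r
library(ggplot2)

# A histogram of a single variable.
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(x = "Miles per gallon", y = "Number of cars")

# A scatter plot, split into one panel per cylinder count with facet_wrap().
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

print(p_hist)
print(p_scatter)
```

Swapping `geom_point()` for `geom_line()`, `geom_bar()`, or `geom_boxplot()` gives the other plot types named above, provided the mapped variables suit that geometry.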

data management III

Data Management for Visualization

This module covers the types of data management frequently needed to make beautiful figures. We continue working with the group_by(), summarise(), and mutate() functions that were first introduced in data management I.

data visualization II

Visualizing Statistical Relationships

This module continues to work in ggplot2, with an emphasis on using variation in colors, shapes, line types, and labels to enable cross-group comparisons.
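
One way to picture cross-group comparison is the sketch below, which again uses the built-in `mtcars` dataset. The dataset and the choice to group by cylinder count are assumptions made for illustration only.

```r
library(ggplot2)

# Treat cylinder count as a grouping variable.
cars <- mtcars
cars$cyl_group <- factor(cars$cyl)

p_groups <- ggplot(cars, aes(x = wt, y = mpg,
                             color = cyl_group, shape = cyl_group)) +
  geom_point(size = 2) +
  # One fitted line per group, distinguished by line type as well as color.
  geom_smooth(aes(linetype = cyl_group),
              method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders", shape = "Cylinders", linetype = "Cylinders"
  )

print(p_groups)
```

Giving the color, shape, and line type legends the same title lets ggplot2 merge them into a single legend.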

regression I

Math-Free Regression Intuition

This module provides a conceptual, math-free introduction to linear regression.

regression II

Math-Free Regression Intuition II

This module continues the conceptual, math-free introduction to linear regression.

data visualization III

Visualizing Regression Results with dotwhisker

This module provides guidance on how to create dot-and-whisker plots of regression results with the dotwhisker package.
