Applied Data Science Training
In the SPEC Lab we invest a lot of energy in training students in applied data science. In particular, we focus on data management and data visualization using R, an open-source language and environment for statistical computing.
Download all datasets necessary for these trainings here. We recommend placing these datasets in a single folder and then setting your R working directory to this folder (see our Introduction to R materials for more information on working directories).
This is an optional module that covers SPEC Lab guidance for organizing collaborative research projects in the context of large research teams.
introduction to R
This module motivates these trainings conceptually and introduces the R statistical computing software. Topics include vectors, dataframes, variables, and arithmetic in R.
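As a rough illustration of the topics listed above, the following minimal sketch uses only base R. The object names and values (scores, students) are made up for this example and do not come from the module's datasets.

```r
# A numeric vector and vectorized arithmetic (values are invented)
scores <- c(80, 90, 100)
scores_scaled <- scores / 10   # arithmetic is applied element by element
mean_score <- mean(scores)     # summary of the whole vector

# A small data frame: each column is a variable, each row an observation
students <- data.frame(
  name  = c("Ana", "Ben", "Cy"),
  score = scores
)

# Create a new variable from an existing one
students$passed <- students$score >= 85

print(students)
```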
data management I
Exploring and Manipulating Data
This module introduces the tidyverse package and covers how to subset data and create new variables, as well as how to group and arrange observations and summarize information. Functions covered include: select(), filter(), mutate(), summarise(), group_by(), arrange().
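The functions listed above can be chained into a single pipeline. The sketch below is only an illustration: the trade data frame and its column names are hypothetical, and we load dplyr (one of the tidyverse packages) directly rather than the full tidyverse.

```r
library(dplyr)

# Hypothetical country-year data, invented for illustration
trade <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2001, 2000, 2001),
  exports = c(10, 12, 30, 20),
  imports = c(8, 15, 25, 20),
  notes   = c("w", "x", "y", "z")
)

balance_summary <- trade %>%
  select(country, year, exports, imports) %>%  # keep only the needed columns
  filter(year >= 2000) %>%                     # subset rows
  mutate(balance = exports - imports) %>%      # create a new variable
  group_by(country) %>%                        # group observations
  summarise(mean_balance = mean(balance)) %>%  # one row per group
  arrange(desc(mean_balance))                  # sort, largest first

print(balance_summary)
```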
organization for collaboration
These trainings are complementary to coursework in statistics and econometrics. While statistics and econometrics courses generally focus on theory and mathematics, we teach the nuts and bolts of statistical computing. We developed these trainings in collaboration with our Pipeline Partners to prepare students not only for academic social science research, but also for data science careers in government, nonprofits, and the private sector.
One of our favorite free resources for learning econometrics is the Econometrics Academy, founded by Dr. Ani Katchova. Gary King has also posted his introductory quantitative methods course (targeted at first-year graduate students). For R resources complementary to these materials, we are big fans of this online textbook and this list of resources by topic. We also like Andrew Heiss's data visualization course.
The modules below are designed to be completed in order, but some skipping around is definitely possible. Each module contains:
1) A module guide.
2) Lecture videos (on YouTube).
3) A walkthrough exercise, which can be completed with an instructor or alone.
4) A group-work exercise, which is designed to be completed in small groups but can be completed alone.
5) A homework assignment designed to allow students to demonstrate individual mastery.
We also post answer keys for the group-work and homework questions, and the R scripts from the lecture videos.
These materials are a constant work in progress and we welcome feedback at firstname.lastname@example.org. Funding for the creation of these materials was provided by the National Science Foundation, the Dornsife College of Letters, Arts, and Sciences at the University of Southern California, and individual SPEC Lab Donors. To support our continued work to improve and expand these materials, please donate here.
data management II
Reshaping and Merging Data
This module covers how to restructure datasets to prepare them to be merged together, and then how to complete that merging process. Functions covered include: pivot_wider() and pivot_longer().
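A minimal sketch of reshaping followed by merging might look like the following. The data frames are invented for illustration, and the use of left_join() from dplyr for the merge step is our assumption here; the description above names only the pivot functions.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per country, one column per year
gdp_wide <- tibble(
  country = c("A", "B"),
  `2000`  = c(1.0, 2.0),
  `2001`  = c(1.5, 2.5)
)

# Reshape to long format: one row per country-year
gdp_long <- gdp_wide %>%
  pivot_longer(cols = c(`2000`, `2001`),
               names_to = "year", values_to = "gdp") %>%
  mutate(year = as.integer(year))  # column names arrive as character

# Hypothetical second dataset, already in country-year format
pop <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(2000L, 2001L, 2000L, 2001L),
  pop     = c(10, 11, 20, 21)
)

# Merge on the shared keys (left_join() is an assumed choice of join)
merged <- left_join(gdp_long, pop, by = c("country", "year"))

# pivot_wider() reverses the reshape
gdp_back <- gdp_long %>%
  pivot_wider(names_from = year, values_from = gdp)
```

Reshaping first matters because both datasets need the same unit of observation (here, country-year) before their rows can be matched.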
data management IIA
This module covers how to use the append_ids function created by the SPEC Lab to append Gleditsch-Ward country ID numbers to datasets on the basis of country names. This module is narrowly geared toward the management of the type of country-year datasets common in the quantitative study of comparative politics and international relations and may not be useful to all students.
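To give a sense of the underlying task, the sketch below joins a small hand-written lookup table to a dataset by country name. It does not call append_ids and does not reflect how that function is implemented; the lookup table, the gwno column name, and the sample data are all assumptions made for illustration, and any codes should be checked against the official Gleditsch-Ward list.

```r
library(dplyr)

# Tiny illustrative lookup table (hypothetical; verify codes before real use)
gw_lookup <- tibble(
  country = c("United States", "Canada", "Mexico"),
  gwno    = c(2L, 20L, 70L)
)

# Hypothetical country-year data identified only by country name
dat <- tibble(
  country = c("Canada", "Mexico", "Atlantis"),
  year    = 2000L,
  value   = c(1, 2, 3)
)

# Append the ID by matching on the name
with_ids <- left_join(dat, gw_lookup, by = "country")

# Names that failed to match get NA and need manual attention
unmatched <- filter(with_ids, is.na(gwno))
```

Exact name matching is brittle because spellings vary across sources, which is part of why a dedicated helper is useful for this kind of data.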
data visualization I
Descriptive Data Visualization
This module introduces the ggplot2 package and teaches students how to create basic descriptive visualizations, including histograms, scatter plots, line plots, bar plots, and box plots. It also covers facet_wrap().
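As a minimal sketch of these plot types, the example below uses the mpg dataset that ships with ggplot2. The choice of dataset and variables is ours and is not taken from the module materials.

```r
library(ggplot2)

# Histogram of one continuous variable
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20)

# Scatter plot, split into one small panel per vehicle class
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)

# Box plot of a continuous variable across categories
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Bar plot of counts per category
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()
```

Printing any of these objects (for example, print(p_scatter)) draws the figure.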
data management III
Data Management for Visualization
This module covers types of data management frequently necessary for making beautiful figures. We continue working with the group_by(), summarise(), and mutate() functions that were first introduced in data management I.
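One common case is computing group totals or within-group shares before plotting. The sketch below assumes a hypothetical votes data frame; the column names and values are invented for illustration.

```r
library(dplyr)

# Hypothetical vote counts by region and party
votes <- tibble(
  region  = c("North", "North", "South", "South"),
  party   = c("X", "Y", "X", "Y"),
  n_votes = c(60, 40, 30, 70)
)

# summarise() collapses each group to one row (e.g., for a bar plot of totals)
region_totals <- votes %>%
  group_by(region) %>%
  summarise(total_votes = sum(n_votes))

# mutate() after group_by() keeps every row and adds a within-group share
vote_shares <- votes %>%
  group_by(region) %>%
  mutate(share = n_votes / sum(n_votes)) %>%
  ungroup()
```

The difference matters for plotting: summarise() changes the unit of observation, while a grouped mutate() preserves it.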
data visualization II
Visualizing Statistical Relationships
This module continues to work in ggplot2, with an emphasis on using variation in colors, shapes, line types, and labels to enable cross-group comparisons.
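A minimal sketch of a cross-group comparison, assuming a small invented dataset (the growth data frame and its columns are hypothetical), could map the grouping variable to color, shape, and line type at once:

```r
library(ggplot2)

# Hypothetical yearly values for two groups
growth <- data.frame(
  year  = rep(2000:2003, times = 2),
  group = rep(c("Group A", "Group B"), each = 4),
  value = c(1.0, 1.4, 1.9, 2.1, 2.0, 1.8, 1.7, 1.5)
)

p_groups <- ggplot(growth, aes(x = year, y = value,
                               color = group, shape = group,
                               linetype = group)) +
  geom_line() +
  geom_point(size = 2) +
  labs(
    title    = "Value over time by group",
    x        = "Year",
    y        = "Value",
    color    = "Group",
    shape    = "Group",
    linetype = "Group"
  )
```

Giving all three aesthetics the same legend title lets ggplot2 merge them into a single legend, and using more than one visual channel keeps the groups distinguishable in grayscale printouts.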
Math-Free Regression Intuition
This module provides a conceptual, math-free introduction to linear regression.
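Although the module itself is math-free, the idea can be tried out in a few lines of R. The sketch below simulates data with a known relationship and fits a line with lm(); the variable names, sample size, and true slope of 3 are all assumptions chosen for illustration.

```r
set.seed(42)  # make the simulated data reproducible

# Simulate data in which y rises by about 3 for each one-unit increase in x
n <- 200
x <- runif(n, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(n, mean = 0, sd = 1)
sim <- data.frame(x = x, y = y)

# Fit a straight line through the cloud of points
fit <- lm(y ~ x, data = sim)

# The estimated slope should land close to the true value used above
slope <- unname(coef(fit)["x"])
print(slope)
```

Because the data are simulated, the fitted slope can be compared against the value built into the simulation, which is a useful way to build intuition for what regression recovers.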