How Data Sampling Affects Differentially Private Stochastic Gradient Descent
Machine learning (ML) models are increasingly used to automate decision-making in companies, public services, and healthcare. Yet researchers have shown that an adversary with access to a model can infer private information about individuals in its training dataset. The most successful approach to preventing such attacks is to train models with differential privacy (DP) guarantees.
Differential privacy ensures that the output of the training algorithm does not change much whether or not any single person's data is included, which yields a strong privacy guarantee. In ML, DP is implemented by modifying the standard training algorithm, stochastic gradient descent (SGD), into an algorithm called DP-SGD. A key step in DP-SGD is the random sampling of mini-batches. Previous research has shown that sampling is detrimental to utility in non-ML algorithms: those algorithms achieve better utility at the same privacy level when no sampling is applied. These results raise the question of whether sampling is detrimental in DP-SGD as well, and whether it reduces the accuracy of the trained models.
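For reference, the standard (ε, δ)-DP definition makes "does not change much" precise: a randomized training algorithm M is (ε, δ)-differentially private if, for every pair of datasets D and D' that differ in one person's data and every set S of possible outputs,

    \Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S] + \delta,

where smaller ε and δ correspond to stronger privacy. DP-SGD achieves such a guarantee by clipping per-example gradients and adding calibrated noise at every training step, and the sampling of mini-batches enters the privacy analysis through this per-step accounting.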
The goal of this project is to empirically analyze the DP-SGD algorithm in order to understand the impact of sampling on accuracy. You will evaluate this impact by implementing neural networks in PyTorch and training them with Opacus, an open-source implementation of DP-SGD. If your results provide evidence that sampling benefits accuracy, you will investigate potential reasons why, building on previous work on non-ML algorithms. If you find that sampling does not benefit accuracy, you will attempt to design an algorithm that provides DP guarantees without sampling and achieves better accuracy for the same privacy level.
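As a starting point, the sketch below shows how a small PyTorch model could be trained with Opacus. It is only an illustration, not the project's required setup: the toy data, the architecture, and the hyperparameter values are assumptions made for the example, and the poisson_sampling flag (whose availability may depend on the Opacus version) is one possible knob for comparing training with and without Poisson subsampling.

# Minimal sketch: training a toy classifier with Opacus (illustrative assumptions throughout).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy data and model, placeholders for the real datasets/architectures used in the project.
X = torch.randn(1024, 20)
y = torch.randint(0, 2, (1024,))
train_loader = DataLoader(TensorDataset(X, y), batch_size=64)

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

privacy_engine = PrivacyEngine()
# make_private wraps the model, optimizer, and loader so that each optimizer step
# clips per-example gradients and adds Gaussian noise (DP-SGD).
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.0,   # noise scale (assumed value)
    max_grad_norm=1.0,      # per-example clipping bound (assumed value)
    poisson_sampling=True,  # the sampling choice under study; set to False to compare
)

for epoch in range(5):
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()

# Privacy budget spent so far, for a chosen delta.
print("epsilon:", privacy_engine.get_epsilon(delta=1e-5))

Running the same loop with and without the sampling flag, at matched noise and clipping settings, is one way to structure the accuracy comparison the project asks for.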