We consider the problem of repeatedly choosing policy parameters in order to maximize social welfare, the weighted sum of utilities. The outcomes of earlier choices inform later choices. In contrast to multi-armed bandit models, utility is not observed directly, but must be inferred indirectly as equivalent variation. In contrast to standard optimal tax theory, response functions need to be learned through policy choices. We propose an algorithm based on optimal tax theory, Gaussian process priors, random Fourier features, and Thompson sampling to (approximately) maximize social welfare over time.
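The combination of ingredients named above can be illustrated with a minimal sketch, under stated assumptions: a one-dimensional policy parameter, a hypothetical (unobserved) welfare function standing in for the inferred equivalent variation, an RBF-kernel Gaussian process approximated by random Fourier features, and Thompson sampling over a grid of candidate policies. All names and parameter values here are illustrative, not the paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth welfare as a function of a scalar policy
# parameter t (e.g. a tax rate); unknown to the learner, queried noisily.
def welfare(t):
    return -(t - 0.6) ** 2

D = 100                    # number of random Fourier features (assumption)
lengthscale, noise = 0.2, 0.05
omega = rng.normal(0.0, 1.0 / lengthscale, size=D)  # RBF spectral frequencies
phase = rng.uniform(0.0, 2 * np.pi, size=D)

def features(t):
    # Random Fourier feature map: a linear model in phi(t)
    # approximates a GP with an RBF kernel.
    return np.sqrt(2.0 / D) * np.cos(np.outer(np.atleast_1d(t), omega) + phase)

grid = np.linspace(0.0, 1.0, 201)  # candidate policy parameters
Phi_grid = features(grid)

X, y = [], []
for round_ in range(50):
    if X:
        # Bayesian linear regression posterior with prior N(0, I)
        Phi = features(np.array(X))
        A = Phi.T @ Phi / noise**2 + np.eye(D)          # posterior precision
        mean = np.linalg.solve(A, Phi.T @ np.array(y)) / noise**2
        cov = np.linalg.inv(A)
        theta = rng.multivariate_normal(mean, cov)      # Thompson sample
    else:
        theta = rng.normal(size=D)                      # draw from the prior
    # Act greedily with respect to the sampled welfare function,
    # then observe a noisy welfare signal and update the data.
    t = grid[np.argmax(Phi_grid @ theta)]
    X.append(t)
    y.append(welfare(t) + noise * rng.normal())
```

The Thompson-sampling step (drawing one posterior function and maximizing it) is what balances exploring unfamiliar policy parameters against exploiting those currently believed to yield high welfare.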