Simulation-based inference and normalizing flows have recently demonstrated excellent performance in gravitational-wave parameter estimation. These methods can provide accurate results within seconds, whereas classical methods based on stochastic samplers may take days or even weeks. However, because such methods are typically based on deep neural networks, they cannot reliably handle out-of-distribution data, which may arise when the predicted signal and noise models do not precisely match the observations. Here we present two innovations to address this challenge. First, we introduce a probabilistic noise model to augment the training data, which makes the inference network substantially more robust to distribution shifts in the experimental noise. Second, we apply importance sampling to independently verify and correct the inference results. This compensates for network inaccuracies and flags failure cases through low sample efficiencies. We expect these methods to be key components for integrating deep learning techniques into production pipelines for gravitational-wave analysis.
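The importance-sampling check can be illustrated with a toy example. The following is a minimal sketch in Python with NumPy, not the method's actual implementation. It assumes a one-dimensional Gaussian stand-in for the unnormalized posterior and Gaussian proposals standing in for the trained flow, and all function names are hypothetical. Samples θ_i drawn from the proposal q(θ|d) are reweighted by w_i = p(d|θ_i) p(θ_i) / q(θ_i|d). The sample efficiency is computed here with one common definition, ε = (Σ w_i)² / (n Σ w_i²). This value approaches 1 when the proposal matches the target and falls toward 0 when it does not. The mean of the unnormalized weights also estimates the evidence.

```python
import numpy as np


def log_normal_pdf(x, mean, std):
    """Log density of a 1-D Gaussian N(mean, std**2)."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))


def importance_sample(log_target, sample_proposal, log_proposal, n, rng):
    """Reweight proposal samples toward an unnormalized target density.

    Returns samples, normalized weights, sample efficiency, and log evidence.
    """
    theta = sample_proposal(n, rng)
    log_w = log_target(theta) - log_proposal(theta)
    # Subtract the maximum log weight before exponentiating, for stability.
    log_w_max = np.max(log_w)
    w = np.exp(log_w - log_w_max)
    # Sample efficiency: effective sample size divided by n, in (0, 1].
    efficiency = w.sum() ** 2 / (n * np.sum(w ** 2))
    # Evidence estimate: mean of the unnormalized weights (in log space).
    log_evidence = log_w_max + np.log(np.mean(w))
    return theta, w / w.sum(), efficiency, log_evidence


# Toy target (assumption): unnormalized posterior 2 * N(0, 1), so the evidence is 2.
def log_target(t):
    return np.log(2.0) + log_normal_pdf(t, 0.0, 1.0)


rng = np.random.default_rng(0)
n = 20_000

# Well-matched proposal (hypothetical stand-in for an accurate network).
_, w_good, eff_good, logz_good = importance_sample(
    log_target,
    lambda k, r: r.normal(0.0, 1.1, size=k),
    lambda t: log_normal_pdf(t, 0.0, 1.1),
    n,
    rng,
)

# Shifted proposal (hypothetical stand-in for a network facing a distribution shift).
_, w_bad, eff_bad, logz_bad = importance_sample(
    log_target,
    lambda k, r: r.normal(3.0, 1.0, size=k),
    lambda t: log_normal_pdf(t, 3.0, 1.0),
    n,
    rng,
)

print(f"matched proposal: efficiency={eff_good:.3f}, log Z={logz_good:.3f}")
print(f"shifted proposal: efficiency={eff_bad:.5f}, log Z={logz_bad:.3f}")
```

In this sketch, the matched proposal yields an efficiency near 1 and a log-evidence estimate close to log 2. The shifted proposal yields an efficiency near 0. A low value of this kind is the signal that would flag an unreliable network result. Any specific efficiency threshold for raising such a flag would be a further assumption that this sketch does not make.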