Surface ozone is an air pollutant that contributes to hundreds of thousands of premature deaths annually. Accurate short-term ozone forecasts could support policy measures that reduce the risk to health, such as air quality warnings. However, forecasting ozone is difficult, because surface ozone concentrations are controlled by a number of physical and chemical processes that act on varying timescales. Accounting for these temporal dependencies appropriately is likely to yield more accurate ozone forecasts. We therefore deploy a state-of-the-art transformer-based model, the Temporal Fusion Transformer, trained on observational station data from three European countries. In four-day test forecasts of daily maximum 8-hour ozone, the approach is highly skilful (MAE = 4.6 ppb, R² = 0.82) and generalises well to two European countries unseen during training (MAE = 4.9 ppb, R² = 0.79). The model outperforms standard machine learning models on our data and compares favourably with the published performance of other deep learning architectures tested on different data. We show that the model attends to physical variables known to control ozone concentrations, and that the attention mechanism allows it to draw on the relevant days of past ozone concentrations when making forecasts.