Beyond Collaborative Filtering: Using Decoders for Personalized Music Recommendation
Abstract
Music recommendation systems face the dual challenge of capturing both immediate context and long-term preferences in users' listening patterns. While recent transformer architectures have shown promise for sequential modeling, their application to music recommendation remains underexplored. We adapt a generalized sequential model architecture for music recommendation, introducing modifications that reflect how music preferences combine temporal patterns with stable long-term tastes. By removing the causal masking constraint typically used in sequential models, we better capture how users' music choices reflect their overall preferences rather than strictly sequential patterns. This modification achieves an approximately 28% improvement in F1 score across evaluation cutoffs compared to a neural item-item baseline. Through ablation studies, we show that using positional encoding and removing the causal mask during training yields the best personalized recommendations, according to an offline evaluation based on playback probability on a held-out listening dataset. Our findings demonstrate that transformer-based architectures can effectively model music preferences while being computationally efficient enough for potential deployment in large-scale recommendation systems.
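To make the core architectural modification concrete, the sketch below shows a minimal PyTorch transformer over track-ID sequences that keeps positional embeddings but omits the causal attention mask, so every position attends to the full listening history. All names, dimensions, and the layer count here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SequenceRecommender(nn.Module):
    """Minimal sketch of a bidirectional sequence model for track
    recommendation. Hypothetical: sizes and structure are illustrative,
    not the paper's reported architecture."""

    def __init__(self, num_tracks: int, d_model: int = 128, max_len: int = 200):
        super().__init__()
        self.track_emb = nn.Embedding(num_tracks, d_model)
        # Positional encoding is kept; the ablation finds it helps.
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, num_tracks)

    def forward(self, track_ids: torch.Tensor) -> torch.Tensor:
        # track_ids: (batch, seq_len) indices from a listening history.
        positions = torch.arange(track_ids.size(1), device=track_ids.device)
        x = self.track_emb(track_ids) + self.pos_emb(positions)
        # No causal mask (mask=None by default): attention sees the whole
        # sequence, modeling overall taste rather than strict order.
        h = self.encoder(x)
        return self.out(h)  # (batch, seq_len, num_tracks) track logits
```

With a causal mask, each position could only attend to earlier plays; dropping it lets the representation pool over the entire history, which is the paper's stated rationale for treating listening behavior as preference-driven rather than strictly sequential.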