Hierarchical Tokenization of Multimodal Music Data for LLM-Compatible Semantic Representations
Abstract
Recent advances in generative retrieval allow large language models (LLMs) to recommend items by generating their identifiers token by token, rather than performing nearest-neighbor search over embeddings. This approach requires each item, such as a music track, to be represented by a compact, semantically meaningful token sequence that LLMs can generate. We propose a multimodal music tokenizer (3MToken) that transforms rich metadata from a music database, including audio, credits, semantic tags, song and artist descriptions, musical characteristics, release dates, and consumption patterns, into discrete tokens using a Residual-Quantized Variational Autoencoder (RQ-VAE). Our method learns hierarchical representations, capturing coarse features at early quantization levels and refining them at later levels to preserve fine-grained information. We train and evaluate our model on a large-scale dataset of 1.6 million tracks, where it achieves +40.0\%, +43.4\%, and +15.8\% improvements in Precision@k, Recall@k, and Hit@k, respectively, over the baselines.
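
To illustrate the residual-quantization idea underlying such a tokenizer, the sketch below shows a minimal PyTorch residual quantizer: each level quantizes the residual left over from the previous level, so early codes capture coarse structure and later codes refine it. The codebook sizes, dimensions, and class name are hypothetical and chosen for illustration; this is not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ResidualQuantizer(nn.Module):
    """Minimal residual quantizer sketch (illustrative, not the paper's model):
    each level quantizes the residual from the previous level, yielding one
    discrete token per level, coarse first and progressively finer."""

    def __init__(self, num_levels=4, codebook_size=256, dim=64):
        super().__init__()
        # One learnable codebook per quantization level (hypothetical sizes).
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_levels)]
        )

    def forward(self, z):
        # z: (batch, dim) latent produced by an upstream encoder.
        residual = z
        quantized = torch.zeros_like(z)
        tokens = []
        for codebook in self.codebooks:
            # Pick the nearest code to the current residual.
            dists = torch.cdist(residual, codebook.weight)  # (batch, codebook_size)
            idx = dists.argmin(dim=-1)                      # discrete token at this level
            code = codebook(idx)
            quantized = quantized + code
            residual = residual - code                      # later levels refine what remains
            tokens.append(idx)
        # Token sequence (one id per level) and the reconstructed latent.
        return torch.stack(tokens, dim=-1), quantized
```

In a full RQ-VAE, an encoder would map the fused multimodal features to `z`, a decoder would be trained to reconstruct them from the summed codes, and a straight-through estimator with commitment losses would handle the non-differentiable argmin; those components are omitted here for brevity.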