In this demo we propose a compositional tool that generates musical sequences based on the prosody of speech recorded by the user. The tool allows any user, regardless of musical training, to use their own speech to generate musical melodies while hearing the direct connection between their recorded speech and the resulting music. This is achieved with a pipeline combining speech-based signal processing [1,2], musical heuristics, and a set of transformer models [3,4] trained for new musical tasks. Importantly, the pipeline is designed to work with any kind of speech input and does not require a paired dataset for training the transformer models.
Our approach consists of the following steps:
The demo is self-explanatory: the audience can interact with the system either by making a live recording through a web-based recording interface or by uploading a pre-recorded speech sample. The system then visualizes the formant contours extracted from the speech sample, the set of note constraints obtained from the speech, and the sequence of musical notes generated by the transformers. The audience can also listen to the input speech sample, the initial note sequences, and the musical sequences generated by the transformer models, and interactively mix their levels (volume).
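The abstract does not spell out how note constraints are derived from the formant contours, so the following is only a minimal sketch of one plausible heuristic, not the authors' method. It assumes a frame-wise frequency contour in Hz, quantizes each frame to the nearest equal-tempered MIDI note using the standard A4 = 440 Hz reference, and keeps only notes held for a minimum number of frames. The names `hz_to_midi`, `contour_to_notes`, and `min_frames` are hypothetical and introduced here for illustration.

```python
import math

A4_HZ = 440.0   # assumed reference tuning (A4)
A4_MIDI = 69    # MIDI note number of A4


def hz_to_midi(freq_hz: float) -> int:
    """Return the nearest equal-tempered MIDI note number for a frequency in Hz."""
    return round(A4_MIDI + 12 * math.log2(freq_hz / A4_HZ))


def contour_to_notes(contour_hz, min_frames=3):
    """Collapse a frame-wise contour into (midi_note, n_frames) runs.

    Frames without an estimate (None or <= 0) break a run.
    Runs shorter than min_frames are dropped (hypothetical heuristic).
    """
    runs = []
    current, length = None, 0
    for f in contour_hz:
        note = hz_to_midi(f) if f and f > 0 else None
        if note == current:
            length += 1
        else:
            # Close the previous run before starting a new one.
            if current is not None and length >= min_frames:
                runs.append((current, length))
            current, length = note, 1
    # Close the final run.
    if current is not None and length >= min_frames:
        runs.append((current, length))
    return runs


# Illustrative contour: values are made up, 0 marks a frame with no estimate.
example_contour = [220, 221, 219, 0, 440, 440, 440, 445, 262]
example_notes = contour_to_notes(example_contour)
```

In a sketch like this, the resulting note runs would play the role of the constraints handed to a downstream model; the threshold `min_frames` trades off responsiveness to fast speech movements against spurious short notes.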
[1] Rabiner & Huang. Fundamentals of Speech Recognition.
[2] Dumpala et al. Sine-wave speech as pre-processing for downstream tasks. Symp. FRSM 2020.
[3] Vaswani et al. Attention is all you need. NeurIPS 2017.
[4] Huang et al. Music Transformer. ICLR 2018.