Workshop: AI for Accelerated Materials Design (AI4Mat)

Group SELFIES: A Robust Fragment-Based Molecular String Representation

Austin Cheng · Andy Cai · Santiago Miret · Gustavo Malkomes · Mariano Phielipp · Alan Aspuru-Guzik

Keywords: [ generative chemistry ] [ molecular string representations ] [ cheminformatics ] [ genetic algorithms ]


The design of functional molecules depends on the representation used: a flexible and informative representation can improve downstream generation tasks. String representations such as SMILES and SELFIES serve as the basis for chemical language models, and the robustness of SELFIES makes it naturally suited to molecular optimization with genetic algorithms. However, SMILES and SELFIES are atom-level representations, whereas several recent approaches exploit the inductive bias of molecular fragments. In this work, we present Group SELFIES, which introduces group tokens that represent functional groups or entire substructures while maintaining robustness. Group tokens give control over which structures are preserved during optimization. Experiments indicate that Group SELFIES improves distribution learning and raises the quality of molecules generated by simply sampling random Group SELFIES strings. The code is available at \url{}.
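
To make the two central ideas concrete (robustness, meaning every token string decodes to something valid, and group tokens, meaning a whole substructure is carried by a single token), here is a minimal toy sketch in Python. It is not the Group SELFIES grammar or the authors' implementation: the alphabet, the token names such as `[:benzene]`, the attachment-point counts, and the functions `decode` and `random_chain` are all hypothetical and chosen only for illustration. The toy builds a linear chain and simply stops once the last unit has no open bonding site, so no input can produce an over-bonded structure.

```python
import random

# Toy atom tokens: token -> (symbol, number of bonding sites). Illustrative only.
ATOMS = {"[C]": ("C", 4), "[N]": ("N", 3), "[O]": ("O", 2), "[F]": ("F", 1)}

# Hypothetical group tokens: token -> (fragment label, attachment points).
# A group is inserted as one intact unit, never atom by atom.
GROUPS = {"[:benzene]": ("c1ccccc1", 2), "[:carboxyl]": ("C(=O)O", 1)}

ALPHABET = list(ATOMS) + list(GROUPS)


def decode(tokens):
    """Decode a token list into a linear chain of units.

    Never raises: unknown tokens are skipped, and decoding stops as soon
    as the last unit in the chain has no open bonding site left.
    """
    chain = []
    free = None  # open sites on the last unit; None until a unit exists
    for tok in tokens:
        if tok in ATOMS:
            unit, capacity = ATOMS[tok]
        elif tok in GROUPS:
            unit, capacity = GROUPS[tok]
        else:
            continue  # unknown token: ignore rather than fail
        if free is not None:
            if free == 0:
                break  # chain is capped; remaining tokens are ignored
            capacity -= 1  # one site is used to bond to the previous unit
        chain.append(unit)
        free = capacity
    return chain


def random_chain(length, seed):
    """Decode a uniformly random token string of the given length."""
    rng = random.Random(seed)
    return decode(rng.choices(ALPHABET, k=length))


if __name__ == "__main__":
    print(decode(["[C]", "[:benzene]", "[O]", "[F]", "[N]"]))
    print(random_chain(8, seed=0))
```

In this toy, a random token string always yields a chain in which every unit respects its bonding capacity, and any group that appears does so as a complete fragment. That is the property the abstract refers to when it says group tokens control which structures are preserved while robustness is maintained; the real representation handles branching, rings, and chirality, which this sketch does not attempt.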
