


Abstract:

Training large language models relies heavily on the quality and composition of data, yet optimizing data selection and utilization remains a significant challenge in the field. In this talk, I will outline several key ideas for enhancing training efficiency through better data use and cover findings from my lab on selecting high-quality datasets and optimizing data composition. I will also introduce a simple yet powerful pre-training approach that conditions on metadata associated with the training data. This approach is remarkably straightforward to implement, incurs minimal computational overhead, and yields significant efficiency gains.
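To make the conditioning idea concrete, below is a minimal sketch of one way metadata conditioning could be wired into pre-training data preparation. The specifics are assumptions, not details from the talk: the metadata is taken to be a source URL prepended to each document as plain text, the loss is masked on the metadata tokens, and the tokenizer and field names are purely illustrative stand-ins.

```python
# Sketch of metadata-conditioned pre-training data preparation (illustrative only).
# Assumptions: metadata is a URL string prepended to the document, and the
# next-token loss is masked on metadata positions so it serves only as context.

IGNORE_INDEX = -100  # common convention for positions excluded from the loss


def toy_tokenize(text: str) -> list[int]:
    """Stand-in tokenizer: maps whitespace-split tokens to hash-based ids."""
    return [hash(tok) % 50_000 for tok in text.split()]


def build_example(document: str, metadata: str) -> dict:
    """Prepend metadata to the document and mask it out of the training labels."""
    meta_ids = toy_tokenize(metadata)
    doc_ids = toy_tokenize(document)

    input_ids = meta_ids + doc_ids
    # Loss is computed only on document positions; the metadata prefix acts
    # purely as conditioning context for the model.
    labels = [IGNORE_INDEX] * len(meta_ids) + doc_ids
    return {"input_ids": input_ids, "labels": labels}


if __name__ == "__main__":
    ex = build_example(
        document="Large language models are trained on web-scale text.",
        metadata="URL: en.wikipedia.org",
    )
    print(len(ex["input_ids"]), ex["labels"][:4])
```

The overhead here is a few extra context tokens per document, which is consistent with the abstract's claim that the approach is simple to implement and adds minimal computational cost.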
