Joint Training Optimization: Calibrating Paired LLM Personas via Self-Play
Jingtian Wu · Yann Hicke
Abstract
Large language models (LLMs) can simulate human-like personas, yet most optimization pipelines still train only one side of an interaction. As a result, each role can be optimized independently, but their interactive behavior and co-adaptive dynamics remain underexplored. We introduce Joint Training Optimization (JTO), a self-play reinforcement learning (RL) framework that jointly trains two models simulating standardized human-like personas under a shared calibration objective, aligning their interaction-level outcomes $E(\tau)$ with a desired behavioral target specified by a control variable $\alpha$. As a preliminary evaluation, we apply JTO in a minimal clinical simulation, training a doctor--patient pair to calibrate patient disclosure behavior without any pre-existing datasets or human annotations. The results show a learnable signal, with several training iterations outperforming the baseline calibration, though trends remain noisy given limited sample sizes. The framework ultimately aims to produce a tutor-like persona that functions as a standardized evaluator for distinguishing real-world learners across ability levels ($\alpha$), and a learner-like persona that simulates human participants with varying abilities ($\alpha$) to estimate real-world task difficulty under the Item Response Theory (IRT) framework.
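To make the shared calibration objective concrete, the following is a minimal Python sketch, not taken from the paper: it assumes a quadratic penalty on the gap between the interaction-level outcome $E(\tau)$ and a target determined by $\alpha$, with stub policies standing in for the doctor and patient LLMs. All names (`target_outcome`, `calibration_reward`, `rollout`) and the identity target mapping are hypothetical illustrations of the idea, not the authors' implementation.

```python
import random

def target_outcome(alpha: float) -> float:
    """Map the control variable alpha in [0, 1] to a desired outcome level.
    The identity map is an assumption; JTO only requires some target(alpha)."""
    return alpha

def calibration_reward(outcome: float, alpha: float) -> float:
    """Shared reward for both personas: negative squared calibration error."""
    return -(outcome - target_outcome(alpha)) ** 2

def rollout(doctor_policy, patient_policy, alpha: float) -> float:
    """Placeholder self-play rollout returning a scalar outcome E(tau),
    e.g. a disclosure score extracted from the simulated dialogue."""
    dialogue = doctor_policy(alpha)            # doctor turn(s)
    outcome = patient_policy(dialogue, alpha)  # patient disclosure behavior
    return outcome

# Toy usage: both personas are stubs; in JTO they would be LLM policies
# updated jointly with RL on the shared calibration reward.
doctor = lambda alpha: f"question tuned for alpha={alpha:.2f}"
patient = lambda dialogue, alpha: min(1.0, max(0.0, alpha + random.gauss(0, 0.1)))

alpha = 0.7
outcomes = [rollout(doctor, patient, alpha) for _ in range(8)]
mean_outcome = sum(outcomes) / len(outcomes)
print("E(tau) ~", round(mean_outcome, 3),
      "reward:", round(calibration_reward(mean_outcome, alpha), 3))
```

In this reading, both personas receive the same reward, so self-play pushes the pair toward interactions whose aggregate outcome matches the target behavior indexed by $\alpha$.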