Tracing the Development of Syntax and Semantics in a Model Trained on Child-Directed Speech and Visual Input
Abstract
In contrast to most large language models, children learn language with astounding data efficiency from their naturalistic environments. These environments are grounded in sensory perception, which critically enables learning from the alignment between visual and auditory input, among other cues. In the current work, we identify three signatures of human language acquisition that highlight this capacity for efficient, constructive learning: the identification of semantic hierarchies, the ability to bootstrap meaning on the basis of polysemy, and the generalization of syntactic frames. Using a test set drawn from the CHILDES Providence corpus, we create probes for each of these signatures of efficient learning and evaluate them on variants of the BabyLLaVA model (trained on SAYCam): the baseline model, one finetuned on younger children's data from the Providence corpus, one finetuned on older children's data from the same corpus, and one finetuned on equal samples of younger and older children's data as a more rigorous baseline. Preliminary findings suggest that finetuning on the Providence corpus improves performance on most probes, though we also observe interactions between finetuning data and probe type. This suggests that, while some signatures of efficient human language learning are present in a VLM trained on naturalistic data, not all types of data contribute equally to their emergence.