Do Large Language Models Defend Their Beliefs Consistently?
Abstract
When large language models (LLMs) are challenged on their responses, they may defer to the user or uphold their original answer. Some models may be more deferential, while others may more stubbornly defend their beliefs. The 'appropriate' level of belief defense may depend on the task and on user preferences, but it is nonetheless desirable that a model behave consistently in this respect. In particular, on average, when a model is highly confident in its answer, it should not defer more often than when its confidence is lower; and this should hold independently of the model's overall tendency towards deference. We call a model that behaves in this way belief-consistent, and we carry out the first detailed study of belief consistency in modern LLMs.
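As a minimal formal sketch of this condition (the notation here is illustrative and is not fixed by the abstract), let $D(c)$ denote a model's average rate of deference when its confidence in its answer is $c$. Belief consistency then requires
\[
c_1 \ge c_2 \;\Longrightarrow\; D(c_1) \le D(c_2),
\]
i.e., the deference rate is non-increasing in confidence, irrespective of the model's overall (marginal) deference rate.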