Agents interacting under partial observability require access to past observations via a memory mechanism in order to approximate the true state of the environment. Recent work suggests that using language as an abstraction provides benefits for creating a representation of past events. History Compression via Language Models (HELM) leverages a pretrained Language Model (LM) for representing the past. It relies on a randomized attention mechanism to translate environment observations to token embeddings. In this work, we show that the representations resulting from this attention mechanism can collapse under certain conditions. This causes blindness of the agent to certain subtleties in the environment. We propose a solution to this problem consisting of two parts. First, we improve upon HELM by substituting the attention mechanism with a feature-wise centering-and-scaling operation. Second, we take a step toward semantic history compression by encoding the observations with a pretrained multimodal model such as CLIP, which further improves performance. With these improvements, our model is able to solve the challenging MiniGrid-Memory environment. Surprisingly, however, our experiments suggest that this is not due to the semantic enrichment of the representation presented to the LM but only due to the discriminative power provided by CLIP.
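
As a rough illustration of what a feature-wise centering-and-scaling operation can look like, the sketch below standardizes each feature of a small batch of embeddings to zero mean and unit variance. It is one possible reading of the operation, not the implementation used in this work: the function name `center_and_scale`, the `eps` constant, and the toy embeddings are assumptions made for the example.

```python
from statistics import fmean, pstdev


def center_and_scale(batch, eps=1e-5):
    """Standardize each feature (column) of a batch of embeddings.

    Hypothetical helper: subtracts the per-feature mean and divides by the
    per-feature (population) standard deviation computed over the batch.
    """
    n_features = len(batch[0])
    # Gather the values of each feature across all embeddings in the batch.
    columns = [[row[j] for row in batch] for j in range(n_features)]
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    # eps guards against division by zero for constant features.
    return [
        [(row[j] - means[j]) / (stds[j] + eps) for j in range(n_features)]
        for row in batch
    ]


# Toy observation embeddings (assumed values): 3 observations, 2 features
# whose scales differ by a factor of ten.
embeddings = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = center_and_scale(embeddings)
```

After the transform, both features share the same scale, so differences between observations are preserved in every feature dimension rather than being dominated by the one with the largest raw magnitude.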