Humans are capable of learning visual concepts by jointly understanding vision and language. Imagine that someone with no prior knowledge of colors is presented with images of red and green objects, paired with textual descriptions. They can easily identify the difference in the objects' visual appearance (in this case, color) and align it with the corresponding words. This intuition motivates the use of image-text pairs to facilitate automated visual concept learning and language acquisition. In this talk, I will present recent progress on neuro-symbolic models for visual concept learning and reasoning. These models learn visual concepts and their association with symbolic representations of language, only by looking at images and reading paired natural language texts.