In this paper, we present a method to disentangle appearance and structural information in the latent space of StyleGAN. We train an autoencoder whose encoder extracts appearance and structural features from an input latent code and then reconstructs the original input using the decoder. To train this network, We propose a video-based latent contrastive learning framework. With the observation that the appearance of a face does not change within a short video, the encoder learns to pull appearance representations of various video frames together while pushing appearance representations of different faces apart. Similarly, the structural representations of augmented versions of the same frame are pulled together, while the representation across different frames are pushed apart. As face video datasets lack sufficient number of unique identities, we propose a method to synthetically generate videos. This allows our disentangling network to observe a larger variation of appearances, expressions, and poses during training. We evaluate our approach on the tasks of expression transfer in images and motion transfer in videos.