In silico approaches to predicting protein structures have been revolutionized by AlphaFold2, whereas approaches to predicting interfaces between proteins remain relatively underdeveloped, owing to the high complexity of protein complex data. In short, proteins are represented as 1D sequences that fold into 3D structures and interact to form assemblies in order to function. We believe such intricate scenarios are better modeled with additional indicative information reflecting their multi-modal nature and multi-scale functionality. We thus hypothesize that inter-protein contact prediction can be improved by augmenting input features with multi-modal representations and by synergizing the objective with auxiliary predictive tasks. (i) We first progressively add three protein modalities to the models: protein sequences, sequences with evolutionary information, and structure-aware intra-protein contact maps, and observe that utilizing all three modalities delivers the best prediction precision. Fine-grained analysis reveals that evolutionary and structural information benefit predictions on difficult and rigid protein complexes, respectively, where complex categories are assessed by the resemblance of their residue contacts to those in bound structures. (ii) We next introduce three auxiliary tasks via multi-task learning or pre-training: inter-protein distance prediction, angle prediction, and protein-protein interaction (PPI) prediction. Although PPI prediction is reported to benefit from inter-contact prediction (as a causal interpretation), the reverse does not hold, and the same is true of the other two auxiliary tasks across all complex categories. This again reflects the high complexity of protein assembly data, for which designing synergistic auxiliary tasks is nontrivial.