The current study evaluated and compared single-modality and joint fusion deep learning approaches for automatic binary classification of diabetic retinopathy (DR), using seven convolutional neural network models (VGG19, ResNet50V2, DenseNet121, InceptionV3, InceptionResNetV2, Xception, and MobileNetV2) over two datasets: APTOS 2019 blindness detection and Messidor-2. The empirical evaluation used (1) six performance metrics (accuracy, sensitivity, specificity, precision, F1-score, and area under the curve), (2) the Scott-Knott Effect Size Difference (SK ESD) statistical test to rank and cluster the models by accuracy, and (3) the Borda count voting method to rank the best models, those in the first SK ESD cluster, by sensitivity, specificity, precision, F1-score, and area under the curve. Results showed that, among the single-modality approaches, DenseNet121 and InceptionV3 were the top-performing and least sensitive models, with accuracies of 90.63% and 75.25%, respectively. The joint fusion strategy outperformed the single-modality techniques on both datasets, regardless of the modality used, owing to the additional information the preprocessed modality contributes alongside the Fundus images. The Fundus modality was the most favorable modality for DR diagnosis across the seven models. Furthermore, the joint fusion VGG19 model performed best, with accuracies of 97.49% and 91.20% on APTOS 2019 and Messidor-2, respectively, since VGG19 was fine-tuned, unlike the remaining six models. Compared with two state-of-the-art models, the Attention Fusion network and the Cascaded Framework, joint fusion VGG19 ranks below the Attention Fusion network by 5.6% and outperforms the Cascaded Framework by 8% on the Messidor dataset.
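The five threshold-based metrics above all derive from the binary confusion matrix (AUC additionally requires the full score distribution). A minimal sketch with illustrative counts, not the study's actual results:

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the threshold-based metrics used in the study from
    confusion-matrix counts (DR-positive taken as the positive class)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)      # recall on the DR-positive class
    specificity = tn / (tn + fp)      # recall on the healthy class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1}

# Illustrative counts only (hypothetical):
m = binary_metrics(tp=90, fp=10, tn=85, fn=15)
```

In practice these counts would come from each model's predictions on the APTOS 2019 or Messidor-2 test split.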
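The Borda count step aggregates one ranking per metric into a single order: on each metric, a model earns points inversely proportional to its rank, and total points decide the final ranking. A minimal sketch, with hypothetical metric values standing in for the first SK ESD cluster:

```python
def borda_rank(scores):
    """scores: {metric: {model: value}}, higher value = better.
    Returns models sorted by total Borda points, best first."""
    models = list(next(iter(scores.values())))
    points = {m: 0 for m in models}
    for metric_scores in scores.values():
        ranked = sorted(metric_scores, key=metric_scores.get, reverse=True)
        n = len(ranked)
        for rank, model in enumerate(ranked):
            points[model] += n - 1 - rank   # best model gets n-1 points
    return sorted(points, key=points.get, reverse=True)

# Hypothetical per-metric values (not the paper's numbers):
scores = {
    "sensitivity": {"VGG19": 0.97, "DenseNet121": 0.95, "InceptionV3": 0.93},
    "specificity": {"VGG19": 0.96, "DenseNet121": 0.97, "InceptionV3": 0.94},
    "f1":          {"VGG19": 0.97, "DenseNet121": 0.96, "InceptionV3": 0.92},
}
final_order = borda_rank(scores)  # → ["VGG19", "DenseNet121", "InceptionV3"]
```

In the study this aggregation is applied over sensitivity, specificity, precision, F1-score, and AUC for the models the SK ESD test places in its top cluster.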