In this paper, we propose a multi-modal mesh surface representation that fuses texture and geometric data. Our approach defines an inverse mapping between geometric descriptors computed on the mesh surface and the corresponding 2D texture image of the mesh, allowing the construction of fused, geometrically augmented images. This new fused modality enables feature representations to be learned from 3D data in a highly efficient manner by employing standard convolutional neural networks in a transfer-learning setting. In contrast to existing methods, the proposed approach is both computationally and memory efficient, preserves intrinsic geometric information, and learns highly discriminative feature representations by fusing shape and texture information at the data level. We demonstrate its efficacy on the task of facial expression classification, where it achieves performance competitive with state-of-the-art methods.
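To make the idea of a geometrically augmented image concrete, the following is a minimal sketch of data-level fusion under simplifying assumptions: each mesh vertex carries a UV texture coordinate and a scalar geometric descriptor (e.g. a curvature value), and the descriptor is splatted into UV space at the nearest pixel and stacked with the RGB texture as an extra channel. The function name, nearest-pixel splatting, and single-descriptor setup are illustrative choices, not the paper's exact pipeline, which would typically rasterize descriptors over triangles with interpolation.

```python
import numpy as np

def fuse_geometry_with_texture(texture, uv, descriptor):
    """Splat per-vertex geometric descriptor values into the texture's
    UV space and stack the result as a fourth image channel.

    texture:    (H, W, 3) float array, the 2D texture image of the mesh
    uv:         (N, 2) per-vertex texture coordinates in [0, 1]
    descriptor: (N,) per-vertex geometric descriptor (e.g. curvature)
    """
    h, w, _ = texture.shape
    geo = np.zeros((h, w), dtype=texture.dtype)
    # Map UV coordinates to pixel indices (nearest neighbour; a full
    # implementation would rasterize across triangles and interpolate,
    # and could stack several descriptor channels, not just one).
    cols = np.clip((uv[:, 0] * (w - 1)).round().astype(int), 0, w - 1)
    rows = np.clip(((1.0 - uv[:, 1]) * (h - 1)).round().astype(int), 0, h - 1)
    geo[rows, cols] = descriptor
    return np.concatenate([texture, geo[..., None]], axis=-1)

# Toy example: a 4x4 texture and two vertices carrying curvature-like values.
tex = np.zeros((4, 4, 3))
uv = np.array([[0.0, 1.0], [1.0, 0.0]])
desc = np.array([0.5, -0.5])
fused = fuse_geometry_with_texture(tex, uv, desc)
print(fused.shape)  # (4, 4, 4)
```

The fused (H, W, C) image can then be fed to a standard 2D CNN pretrained on RGB images, which is what makes the transfer-learning formulation efficient: no mesh-specific convolution operators are required.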