Non-invasive physiological sensors allow user-specific data to be collected in realistic environments. In this paper, we use physiological data to investigate the effectiveness of Convolutional Neural Network (CNN) based feature embeddings combined with a Transformer architecture for human activity recognition. A 1D-CNN representation is used for the heart rate signal, and a 2D-CNN is applied to the short-time Fourier transform (STFT) of the accelerometer data. After fusion, the combined features are fed into a transformer. Experiments are performed on the harAGE dataset. The findings demonstrate the discriminative ability of the fused features in a transformer-based architecture, and the method outperforms the harAGE baseline by 3.7% absolute.
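
To make the time-frequency representation of the accelerometer data concrete, the following is a minimal sketch of how one accelerometer axis could be turned into a magnitude spectrogram, the kind of 2D input a 2D-CNN would consume. It is an illustration under stated assumptions and not the implementation used in this work: the function name `stft_magnitude`, the frame length of 16 samples, the hop of 8 samples, the Hann window, and the synthetic sinusoidal signal are all hypothetical choices made here for the example.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=16, hop=8):
    """Return one list of magnitudes (bins 0..frame_len//2) per frame.

    frame_len, hop, and the Hann window are assumed values for illustration.
    """
    # Periodic Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # one-sided spectrum of a real-valued signal
    spectrogram = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(n_bins):
            # Direct DFT of the windowed frame at frequency bin k.
            coeff = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            mags.append(abs(coeff))
        spectrogram.append(mags)
    return spectrogram


# Synthetic single-axis accelerometer trace (hypothetical data):
# a sinusoid completing 4 cycles per 16-sample frame, 64 samples long.
accel_x = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
spec = stft_magnitude(accel_x)  # frames x frequency bins
```

Each row of `spec` is one time frame and each column one frequency bin, so the result can be treated as a single-channel image. In practice a library FFT routine would replace the direct DFT loop, which is written out here only to keep the sketch dependency-free.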