Recognizing human activities is a vital research challenge because of its essential role in improving human-machine collaboration in Internet of Things environments. The existing deep learning (DL) literature has focused on recognizing the human activities (HAs) of a single subject, with schemes differing in recognition method and sensing strategy. However, little attention has been devoted to situations in which multiple individuals interact to perform joint activities, a challenge termed human-to-human interaction (H2HI) recognition. This study addresses the H2HI problem with a novel device-free DL model, named H2HI-NET, that models HA representations from the Channel State Information (CSI) of Wireless Fidelity (Wi-Fi) devices. In H2HI-NET, a bi-directional temporal learning module captures temporal representations from both historical and future information, while a residual spatial learning module combines residual learning with transformer-network capabilities to efficiently extract complex spatial features of HAs. Experimental evaluations demonstrate the efficiency of H2HI-NET, which achieves 96.39% accuracy, outperforming state-of-the-art studies.
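The two modules named in the abstract — a bi-directional temporal learner and a residual spatial learner built on a transformer encoder — can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the authors' architecture: the layer sizes, the BiLSTM choice for the bi-directional module, the pooling step, and the class count are all hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the modules described in the abstract; sizes and
# wiring are assumptions, not the published H2HI-NET architecture.

class BiTemporalModule(nn.Module):
    """Bi-directional LSTM reading a CSI sequence forward and backward,
    so each step sees both historical and future context."""
    def __init__(self, in_dim, hidden):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, x):          # x: (batch, time, in_dim)
        out, _ = self.lstm(x)      # out: (batch, time, 2 * hidden)
        return out

class ResidualSpatialModule(nn.Module):
    """Transformer encoder layer wrapped in a residual (skip) connection,
    echoing the 'residual learning + transformer' combination."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                  batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(self.encoder(x)) + x   # residual shortcut

class H2HINetSketch(nn.Module):
    def __init__(self, in_dim=64, hidden=64, n_classes=12):
        super().__init__()
        self.temporal = BiTemporalModule(in_dim, hidden)
        self.spatial = ResidualSpatialModule(2 * hidden)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, csi):                 # csi: (batch, time, subcarriers)
        h = self.temporal(csi)              # temporal representation
        h = self.spatial(h)                 # spatial refinement
        return self.head(h.mean(dim=1))     # pool over time, classify

model = H2HINetSketch()
logits = model(torch.randn(2, 50, 64))      # two 50-step CSI windows
print(logits.shape)
```

In this sketch the temporal module runs first so the spatial/transformer stage attends over context-aware features; the published model may order or fuse the modules differently.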