Considerable progress has been made in improving the accuracy of cognitive workload estimation using various sensor technologies. However, the overall performance of existing algorithms and methods remains suboptimal in real-world applications. Some studies in the literature report that a single modality is sufficient to estimate cognitive workload, but these studies are limited to controlled settings, which differ significantly from the real world, where data can be corrupted, interrupted, or delayed. In such situations, multiple modalities are needed. Multimodal fusion approaches have been successful in other domains, such as wireless sensor networks, in addressing single-sensor weaknesses and improving the quality and accuracy of information. These approaches are inherently more reliable when a data source is lost. In the cognitive workload literature, sensing modalities such as electroencephalography (EEG), electrocardiography (ECG), and eye tracking have shown success in estimating aspects of cognitive workload. Multimodal approaches that combine data from several sensors can be more robust for real-time measurement of cognitive workload. In this article, we review published studies on multimodal data fusion for cognitive workload estimation and synthesize their main findings. We also identify opportunities for designing better multimodal fusion systems for cognitive workload modeling.