Multi-modal information extraction and fusion with convolutional neural networks for classification of scaled images

  • Dinesh Kumar

    Student thesis: Doctoral Thesis


    Developing computational algorithms to model the biological vision system has challenged researchers in the computer vision field for several decades. As a result, state-of-the-art Deep Learning (DL) algorithms such as the Convolutional Neural Network (CNN) have emerged for image classification and recognition tasks with promising results. CNNs, however, remain view-specific, producing good results when the variation between test and training data is small. Making CNNs learn invariant features so that they can recognise objects whose appearance changes under transformations such as scaling remains a technical challenge. Recent bio-inspired studies of the visual system suggest three new paradigms. Firstly, our visual system uses both local and global features in its recognition function. Secondly, cells tuned to detecting global features respond to visual stimuli before cells tuned to local features, leading to quicker response times in recognising objects. Thirdly, information from modalities that handle local features, global features and color is integrated in the brain for performing recognition tasks. While CNNs rely on an aggregation of local features into global features for recognition, these research outcomes motivate extracting global features and combining them with established local features to improve the efficiency of CNN models and their application to transformation invariance problems.
    The main goals of the current research are the investigation and development of relevant models for classification of scaled images using both local and global features with CNNs. To improve the performance of the current CNN model on classification of scaled images, this work has investigated different techniques: (i) exploration of (global) high-level, low-resolution CNN feature map augmentation, (ii) examination of fusion of CNN features with global features from non-trainable global feature descriptors, (iii) use of color histograms as global features, (iv) examination of fusion of CNN features with spatial features using large kernels in a multi-scale filter pyramid setting, (v) examination of a brain-inspired distributed multi-modal information extraction and integration model, and (vi) development of a zoom-in convolution algorithm.
    For improving classification of scaled images, this thesis has proposed two specific techniques. The first technique exploits the automatic feature extraction in CNN convolution layers and proposes augmentation of (global) high-level, low-resolution feature maps as a cheap and effective way to improve classification of scaled images. The second technique proposes an architecture, supported by physiological evidence, that allows multi-modal information extraction and fusion of DL models for combining global features and CNN local features. This architecture allows parallel extraction and processing of CNN and global features from input image data. To extract global image features, both non-trainable and trainable feature extraction methods are investigated. Global feature descriptors, namely Histogram of Oriented Gradients (HOG) and color information, are used as non-trainable methods. A technique using multi-scale filter banks containing large kernels is used as a trainable method to cover more spatial areas of the image. The idea of using large kernels and multi-scale filter banks is extended to develop a new lightweight zoom-in convolution technique that allows the model to capture more spatial areas in relation to the center of the image, assuming the object of interest is generally centered in the image. This technique, called DeepZoom, inspects multi-scale slices of an image, beginning with a set of center pixels and progressively extending the area of each slice until the final slice covers the entire image. To fuse global, local and color features, a simple feature map concatenation technique is compared with a brain-inspired distributed information integration model. Four datasets, each containing images of different sizes, are used to validate the models.
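    The slicing step described for DeepZoom can be illustrated with a minimal sketch. This is not the thesis implementation: the function name center_slices, the linear growth schedule for slice sizes, and the nested-list image representation are all assumptions made for illustration, and the convolution applied to each slice is omitted.

```python
def center_slices(image, num_slices):
    """Return nested, centered crops of a 2-D image given as a list of rows.

    The first crop surrounds the center pixels and the last crop is the
    whole image. The linear growth of crop sizes is an assumption made
    for this sketch, not a detail taken from the thesis.
    """
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    height, width = len(image), len(image[0])
    slices = []
    for k in range(1, num_slices + 1):
        # Crop sides grow linearly with k; at k == num_slices they equal
        # the full image dimensions.
        h = max(1, round(height * k / num_slices))
        w = max(1, round(width * k / num_slices))
        # Center the crop inside the image.
        top = (height - h) // 2
        left = (width - w) // 2
        slices.append([row[left:left + w] for row in image[top:top + h]])
    return slices


# Hypothetical 8x8 "image" whose pixel values encode their position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
slices = center_slices(image, 4)
sizes = [(len(s), len(s[0])) for s in slices]
print(sizes)  # [(2, 2), (4, 4), (6, 6), (8, 8)]
```

    In a full model, each slice would presumably be passed through its own filters or resized to a common resolution before the resulting features are fused; how that is done is not specified here.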
    Experiments on a) (global) high-level, low-resolution feature map augmentation, b) fusion of CNN local features with global features from various non-trainable global feature descriptor methods, c) fusion of CNN local features with spatial features obtained using large kernels, and d) adjusting the convolution technique in DL models have shown that, compared to CNN-only models, the developed models i) obtained similar if not better training and test accuracies and ii) obtained higher classification accuracies for scaled test images. Whilst the global feature extraction or manipulation methods differed, in general the results are promising for classification of scaled images. In all cases, the developed models are evaluated against established results from benchmark CNNs. Finally, this thesis presents skin cancer classification as an application where handling scale is important. It shows the application of the developed DL models to detection of skin cancer using skin lesion images on mobile phones. By investigating the different models, the thesis presents a suitable DL model for classification of skin lesion images in real time and provides an implementation on mobile devices as an early warning diagnosis tool for skin cancer.
    The thesis concludes with a summary of research outcomes against each identified research question. Several questions arising from the research are also identified as directions for future work.
    Date of Award: 2020
    Original language: English
    Supervisor: Dharmendra Sharma AM PhD (Supervisor), Dat Tran (Supervisor) & Roland Goecke (Supervisor)
