Multi-modal Information Extraction and Fusion with Convolutional Neural Networks for Classification of Scaled Images

  • Dinesh Kumar

    Student thesis: Doctoral Thesis


    Developing computational algorithms to model the biological vision system has challenged researchers
    in the computer vision field for several decades. As a result, state-of-the-art Deep Learning (DL) algorithms
    such as the Convolutional Neural Network (CNN) have emerged for image classification and
    recognition tasks with promising results. CNNs, however, remain view-specific, producing good results
    when the variation between test and train data is small. Making CNNs learn invariant features to effectively
    recognise objects that undergo appearance changes as a result of transformations such as scaling
    remains a technical challenge. Recent bio-inspired studies of the visual system suggest three
    new paradigms. Firstly, our visual system uses both local features and global features in its recognition
    function. Secondly, cells tuned to detecting global features respond to visual stimuli before cells tuned
    to local features, leading to quicker response times in recognising objects. Thirdly, information from
    modalities that handle local features, global features and color is integrated in the brain for performing
    recognition tasks. While CNNs rely on an aggregation of local features into global features for recognition,
    these research outcomes motivate extracting global features and combining them with established local
    features to improve the efficiency of CNN models and their application to transformation invariance problems.
    The main goals of the current research include an investigation and development of relevant models
    for classification of scaled images using both local and global features with CNNs. To improve the
    performance of the current CNN model on classification of scaled images, this work has investigated
    several techniques: (i) exploration of (global) high-level, low-resolution CNN feature map
    augmentation, (ii) examination of fusion of CNN features with global features from non-trainable
    global feature descriptors, (iii) color histogram as global features, (iv) examination of fusion of CNN
    features with spatial features using large kernels in a multi-scale filter pyramid setting, (v) examination
    of a brain-inspired distributed multi-modal information extraction and integration model, and (vi) development
    of a zoom-in convolution algorithm.
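    As a rough illustration of items (ii) and (iii), the sketch below computes a per-channel color
    histogram as a global feature and fuses it with a local feature vector by concatenation. It is
    a minimal sketch under stated assumptions, not the thesis implementation: the function names,
    the bin count, the normalisation and the stand-in CNN feature vector are all hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def color_histogram(pixels: Sequence[Pixel], bins: int = 4) -> List[float]:
    """Per-channel histogram, normalised so each channel's bins sum to 1.

    Assumption: equal-width bins over 0..255; the thesis may bin differently.
    """
    if not pixels:
        raise ValueError("image has no pixels")
    hist = [0.0] * (3 * bins)
    width = 256 / bins  # intensity range covered by one bin
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            index = min(int(value / width), bins - 1)
            hist[channel * bins + index] += 1.0
    return [count / len(pixels) for count in hist]


def fuse_by_concatenation(local: Sequence[float],
                          global_: Sequence[float]) -> List[float]:
    """Fuse two feature vectors by placing them end to end."""
    return list(local) + list(global_)


# Hypothetical inputs: a 2x2 image flattened to four pixels, and a stand-in
# for a flattened CNN feature vector (not produced by a real network here).
pixels = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 255, 0)]
cnn_features = [0.1, 0.7, 0.2]
fused = fuse_by_concatenation(cnn_features, color_histogram(pixels, bins=2))
```

    In a real model the fused vector would feed the classifier layers; here it only shows that the
    global descriptor is independent of where in the image each color appears.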
    For improving classification of scaled images, this thesis has proposed two specific techniques. The
    first technique exploits the automatic feature extraction in CNN convolution layers and proposes augmentation
    of (global) high-level low-resolution feature maps as a cheap and effective way to improve
    classification of scaled images. The second technique proposes an architecture supported by physiological
    evidence that allows multi-modal information extraction and fusion of DL models for combining global
    features and CNN local features. This architecture allows parallel extraction and processing of
    CNN and global features from input image data. To extract global image features, both non-trainable and
    trainable feature extraction methods are investigated. Two global feature descriptors, Histogram of Oriented
    Gradients (HOG) and color information, are used as non-trainable methods. A technique using multi-scale filter
    banks containing large kernels is used as a trainable method to cover more spatial areas of the image. The
    idea of using large kernels and multi-scale filter banks is extended to develop a new lightweight zoom-in
    convolution technique that allows the model to capture more spatial areas in relation to the center of the
    image, assuming the object of interest is generally located at the center of the image. This technique,
    called DeepZoom, inspects multi-scale slices of an image, beginning with a set of center pixels and progressively
    extending the area of each slice until the final slice covers the entire image. To fuse global,
    local and color features, a simple feature map concatenation technique is compared with a brain-inspired
    distributed information integration model. Four datasets, each consisting of images of different sizes, are
    used to validate the models.
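    The center-outward slicing described for the zoom-in convolution can be sketched as follows.
    This is only an illustration of the slicing idea under stated assumptions: the linear growth
    schedule, the function names and the toy 8x8 image are hypothetical, and the convolution applied
    to each slice in the thesis is not reproduced here.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def center_slice_boxes(height: int, width: int, num_slices: int) -> List[Box]:
    """Boxes centred on the image that grow until the last covers it all."""
    if num_slices < 1 or height < 1 or width < 1:
        raise ValueError("height, width and num_slices must be positive")
    boxes = []
    for k in range(1, num_slices + 1):
        # Assumed schedule: side lengths grow linearly with the slice index.
        h = max(1, round(height * k / num_slices))
        w = max(1, round(width * k / num_slices))
        top = (height - h) // 2
        left = (width - w) // 2
        boxes.append((top, left, top + h, left + w))
    return boxes


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Return the rows and columns of the image that fall inside the box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# Toy 8x8 single-channel image whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
slices = [crop(image, box) for box in center_slice_boxes(8, 8, 4)]
```

    The first slice holds only the central pixels and the last slice equals the full image, matching
    the progression described above; how the slices are then resized or combined is left open.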
    Experiments on a) (global) high-level low-resolution feature map augmentation, b) fusion of CNN local
    features with global features from various non-trainable global feature descriptor methods, c) fusion
    of CNN local features with spatial features from large kernels, and d) adjusting the convolution
    technique in DL models have shown that, compared to CNN-only models, the developed models i)
    obtained similar if not better training and test accuracies and ii) obtained higher classification
    accuracies for scaled test images. Whilst global feature extraction or manipulation methods differed,
    in general the results are promising for classification of scaled images. In all cases, the developed
    models are evaluated against established results from benchmark CNNs.
    Finally, this thesis presents skin cancer classification as an application where handling scale is important.
    It shows the application of the developed DL models to detection of skin cancer using skin lesion images
    on mobile phones. By investigating the different models, the thesis presents a suitable DL model for
    classification of skin lesion images in real time and provides an implementation on mobile devices as an
    early warning diagnosis tool for skin cancer.
    The thesis concludes with a summary of research outcomes against each identified research question.
    Several questions emanating from the thesis research are also identified as future work to extend
    the research presented.
    Date of Award: 2020
    Original language: English
    Supervisor: Sharma D (Supervisor), Dat Tran (Supervisor) & Roland Goecke (Supervisor)