Static analysis for Android malware detection using document vectors

  • Utkarsh Raghav

    Student thesis: Master's Thesis


    The prevalence of smart mobile devices has led to an upsurge in malware that targets mobile platforms. Android, the dominant operating system in the sector, has been a favourite target for malicious actors. Current machine learning and deep learning approaches to Android malware detection rely on various feature engineering techniques. Identifying dependable features for these approaches requires expertise in both Android malware and the platform itself. Most of these engineered features are first extracted by applying static and dynamic analysis, which lets researchers obtain various types of information from Android application packages (APKs), such as required permissions, opcode sequences and control flow graphs. This information is used, as is or in vectorised form, to train supervised learning models. Researchers have also applied Natural Language Processing (NLP) techniques to the features extracted from APKs. This study focused on developing a novel method that uses static analysis and the NLP technique of document embeddings to automatically create feature vectors describing the data in the Android manifest and Dalvik executable files inside an APK. We designed a system that takes Android APK files as input documents and generates feature embeddings, removing the need for manual identification and extraction of features. To evaluate the effectiveness of these automatically generated features experimentally, we used the embeddings to train five different supervised learning models for Android malware detection. The experiments used APKs from two well-known datasets, DREBIN and AndroZoo. We trained and validated our models with 4000 files (training set) and kept 700 separate files (test set) that were not used during training or validation. We then used the trained models to predict the classes of the unseen file embeddings from the test set. The automatically generated features allowed robust detection models to be trained. The models performed best with Android manifest file embeddings concatenated with Dalvik executable file embeddings, with some models consistently achieving Precision, Recall and Accuracy values above 99% during development and over 97% against unseen file embeddings. The prediction accuracy of the detection model trained on our automatically generated features was equivalent to the 94% accuracy achieved by DREBIN, one of the most cited research works. We also provided a simple method that directly uses the files present in an Android APK to create feature embeddings, without scouring Android application files to identify reliable features. The resulting system can be further improved against newly emerging threats, and trained better, simply by gathering more samples.
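    The feature construction and evaluation described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the hashing-based `toy_embedding` is only a stand-in for a learned document embedding, and the function names, the token lists and the dimension `DIM` are all hypothetical. The sketch shows the two steps that the abstract states explicitly, namely concatenating a manifest embedding with a Dalvik executable embedding, and scoring predictions with Precision, Recall and Accuracy.

```python
import hashlib
import math

DIM = 8  # illustrative embedding size per file type (an assumption, not from the thesis)


def toy_embedding(tokens, dim=DIM):
    """Hash tokens into a fixed-length, L2-normalised vector.

    Stand-in only: the thesis uses learned document embeddings,
    not this hashing trick.
    """
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        idx = digest[0] % dim                       # bucket chosen by the hash
        sign = 1.0 if digest[1] % 2 == 0 else -1.0  # signed update
        vec[idx] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def apk_features(manifest_tokens, dex_tokens):
    """Concatenate the manifest embedding with the Dalvik executable embedding."""
    return toy_embedding(manifest_tokens) + toy_embedding(dex_tokens)


def precision_recall_accuracy(y_true, y_pred, positive=1):
    """Compute Precision, Recall and Accuracy for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Hypothetical tokens standing in for manifest entries and Dalvik opcodes.
features = apk_features(
    ["uses-permission", "android.permission.SEND_SMS"],
    ["invoke-virtual", "const-string", "return-void"],
)
metrics = precision_recall_accuracy([1, 1, 0, 0], [1, 0, 0, 0])
```

    In a real pipeline, the concatenated vectors would be passed to a supervised classifier, and the metrics would be computed on the held-out test set rather than on toy labels.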
    Date of Award: 2023
    Original language: English
    Supervisor: Elisa Martinez-Marroquin (Supervisor), Wanli Ma (Supervisor) & Yohannes Kinfu (Supervisor)
