Latest commit 13316d6f90 by Leon A. (2026-02-06 19:23:37 +01:00): Revise abstract for clarity on malware detection study. Updated the abstract to provide a detailed overview of the study on malware classification and detection using CNNs, addressing concept drift and its impact on performance.
results	add temporal	2026-01-20 12:50:42 +01:00
.gitignore	update gitignore	2026-01-20 12:51:41 +01:00
imgconversion.py	skip image conversion when directory exists	2026-02-04 17:34:52 +01:00
main.py	renames	2026-02-06 16:21:50 +01:00
pyproject.toml	add temporal	2026-01-20 12:50:42 +01:00
README.md	Revise abstract for clarity on malware detection study	2026-02-06 19:23:37 +01:00
reproduction.py	add reproduction	2026-01-17 16:39:52 +01:00
temporal.py	add temporal	2026-01-20 12:50:42 +01:00
uv.lock	add temporal	2026-01-20 12:50:42 +01:00

Visualization-Based Automated Malware Classification: A Replication Study Transcended Towards Malware Detection

Abstract: Reliably and robustly detecting and classifying malware is a cornerstone of security, and AI-based approaches are becoming increasingly prevalent. One proposed concept is the classification of malware through Convolutional Neural Networks (CNNs) operating on two-dimensional visualizations of binaries. Because this approach has shown outstanding results in distinguishing between malware families, it stands to reason to also apply it to malware detection, that is, the binary classification of samples as malicious or benign. However, in contrast to malware family classification, detection is subject to concept drift, which may degrade performance over time.

Hence, to explore the transferability of visualization-based approaches from malware classification to binary malware detection, we first reproduce a state-of-the-art CNN architecture for classifying malware families. After validating the reproduction, we extend the approach to a binary classification task on another established dataset of malware and benign software that spans two years. On that basis, we evaluate how strongly the model is affected by temporal dependencies between training and test samples in two directions: training on old and testing on new samples, and vice versa.
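The two evaluation directions can be illustrated with a small split by date. This is only a minimal sketch: the `Sample` record, its `first_seen` field, the `temporal_split` helper, and the cutoff date are all hypothetical and are not taken from `temporal.py`.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample record; the real dataset fields may differ."""
    name: str
    first_seen: date
    is_malware: bool


def temporal_split(samples, cutoff, direction="forward"):
    """Split samples at `cutoff` and return (train, test).

    forward:  train on samples before the cutoff, test on the rest.
    backward: train on samples from the cutoff onward, test on older ones.
    """
    older = [s for s in samples if s.first_seen < cutoff]
    newer = [s for s in samples if s.first_seen >= cutoff]
    if direction == "forward":
        return older, newer
    if direction == "backward":
        return newer, older
    raise ValueError(f"unknown direction: {direction!r}")


# Illustrative data only, not drawn from the study's datasets.
samples = [
    Sample("a", date(2020, 1, 10), True),
    Sample("b", date(2020, 8, 1), False),
    Sample("c", date(2021, 3, 5), True),
]
cutoff = date(2021, 1, 1)
forward_train, forward_test = temporal_split(samples, cutoff, "forward")
backward_train, backward_test = temporal_split(samples, cutoff, "backward")
```

Keeping the cutoff as the only boundary means no sample can appear in both the training and the test set, which is the property a temporal evaluation depends on.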

We find that while the classification model can be reproduced and at first glance also appears suitable for detection, concept drift causes a significant degradation in malware detection performance. Notably, concept drift also affects performance in the backward direction, when older samples are classified by a model trained on more recent ones. Consequently, the visualization-based classification approach cannot be transferred to malware detection tasks without further adjustment.
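For readers unfamiliar with the two-dimensional visualization of binaries mentioned in the abstract, the basic idea can be sketched as follows: each byte becomes one grayscale pixel, and the byte stream is wrapped into rows of a fixed width. The function name, the default width of 256, and the zero padding of the last row are assumptions made for illustration; the conversion in `imgconversion.py` may work differently.

```python
import math


def bytes_to_grayscale_rows(data: bytes, width: int = 256) -> list[list[int]]:
    """Arrange raw bytes into rows of `width` pixels with values 0-255."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # Assumption: pad the final row with zero bytes so every row is full.
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# Five bytes wrapped at width 4 give two rows; the second is zero-padded.
rows = bytes_to_grayscale_rows(b"MZ\x90\x00\x03", width=4)
# rows == [[77, 90, 144, 0], [3, 0, 0, 0]]
```

The resulting grid can then be saved as an image or fed to a CNN as a single-channel input.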

Quick-Start

To get this running, you only need one tool: uv.

Make sure uv is installed on your machine. uv locks the specific Python version and installs all required modules automatically when you run the script.

Running the Experiments

Once you have uv, just run:

uv run main.py

This launches an interactive selector that asks which of the four experiments you want to reproduce:

reproduction
detection baseline
forward
backward
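A selector of this kind could look roughly like the sketch below. It is not the code from `main.py`: the function name `choose_experiment`, the numbered prompt, and the injectable `read` and `write` callables are assumptions made for illustration. Only the four experiment names come from the list above.

```python
EXPERIMENTS = ("reproduction", "detection baseline", "forward", "backward")


def choose_experiment(read=input, write=print) -> str:
    """Show a numbered menu and return the name of the chosen experiment."""
    choices = {str(i): name for i, name in enumerate(EXPERIMENTS, start=1)}
    for key, name in choices.items():
        write(f"{key}) {name}")
    while True:
        answer = read(f"Select an experiment [1-{len(EXPERIMENTS)}]: ").strip()
        if answer in choices:
            return choices[answer]
        write(f"Please enter a number between 1 and {len(EXPERIMENTS)}.")


# Scripted demo: the invalid answer "9" is rejected, then "3" is accepted.
answers = iter(["9", "3"])
picked = choose_experiment(read=lambda prompt: next(answers),
                           write=lambda line: None)
# picked == "forward"
```

Passing `read` and `write` as parameters keeps the prompt loop testable without a terminal; calling `choose_experiment()` with no arguments uses the real `input` and `print`.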

If the datasets are not set up yet, the script detects this and guides you through downloading the raw data and converting it into the format the models require.
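One simple way to detect missing datasets is to check whether the expected directories exist and contain files. The sketch below assumes exactly that; the directory names `dataset_a` and `dataset_b` and the helper `missing_datasets` are hypothetical placeholders, and the actual check in the repository may use other paths or criteria.

```python
import tempfile
from pathlib import Path

# Hypothetical directory names; the repository's real layout may differ.
DATASET_DIRS = ("dataset_a", "dataset_b")


def missing_datasets(root, names=DATASET_DIRS) -> list[str]:
    """Return the dataset directories under `root` that are absent or empty."""
    missing = []
    for name in names:
        path = Path(root) / name
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


# Demo in a temporary directory: only dataset_a is prepared.
with tempfile.TemporaryDirectory() as tmp:
    prepared = Path(tmp) / "dataset_a"
    prepared.mkdir()
    (prepared / "sample.png").write_bytes(b"")
    still_missing = missing_datasets(tmp)
# still_missing == ["dataset_b"]
```

A script could run such a check at startup and offer the download and conversion steps only for the names it returns.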

Datasets

Here are the links to the two datasets used in this study: