The state-of-the-art evaluation of an Intrusion Detection System (IDS) relies on benchmark datasets composed of the regular system’s and potential attackers’ behavior. These datasets are collected once and independently of the IDS under analysis. This paper questions this practice by introducing a methodology to elicit particularly challenging samples for benchmarking a given IDS. In detail, we propose (1) six fitness functions that quantify the suitability of individual samples and are particularly tailored to safety-critical cyber-physical systems, (2) a scenario-based methodology for attacks on networks that systematically deduces optimal samples complementing previous datasets, and (3) a corresponding extension of the standard IDS evaluation methodology. We applied our methodology to two network-based IDSs defending an advanced driver assistance system. Our results indicate that different IDSs exhibit strongly differing characteristics in their edge-case classifications and that the original evaluation datasets do not include such challenging behavior. In the worst case, this leads to a critical undetected attack, as we document for one IDS. Our findings highlight the need to tailor benchmark datasets to the individual IDS in a final evaluation step. In particular, the manual investigation of selected edge-case samples by domain experts is vital for assessing the IDSs.
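To make the core idea concrete, the following is a minimal sketch, not the paper’s implementation, of how a fitness function could rank candidate samples by their proximity to an IDS’s decision boundary so that the most challenging ones augment an existing benchmark. The names (ids_score, THRESHOLD, select_edge_cases) and the boundary-proximity fitness are illustrative assumptions and stand in for the six fitness functions proposed in the paper.

```python
# Hedged sketch: select edge-case samples for IDS benchmarking.
# All names and the specific fitness criterion are hypothetical.
import numpy as np

THRESHOLD = 0.5  # assumed decision threshold of the IDS under analysis


def ids_score(sample: np.ndarray) -> float:
    """Stand-in for the IDS under analysis; returns an anomaly score in [0, 1]."""
    return float(np.clip(sample.mean(), 0.0, 1.0))


def fitness(sample: np.ndarray) -> float:
    """Higher fitness = closer to the decision boundary = more challenging sample."""
    return 1.0 - abs(ids_score(sample) - THRESHOLD)


def select_edge_cases(samples: list[np.ndarray], k: int = 10) -> list[np.ndarray]:
    """Pick the k samples the IDS is least certain about, to augment a benchmark."""
    return sorted(samples, key=fitness, reverse=True)[:k]


# Usage: rank randomly generated candidate samples and keep the most challenging.
rng = np.random.default_rng(0)
candidates = [rng.random(8) for _ in range(1000)]
edge_cases = select_edge_cases(candidates, k=10)
```

In this reading, the selected edge cases would then be handed to domain experts for the manual investigation step the paper calls for, rather than being labeled automatically.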