By Liviu Arsene, Senior E-Threat Analyst at Bitdefender
Despite the prevalence of the technologies, a degree of confusion remains around the difference between machine learning (ML) and artificial intelligence (AI). The distinction lies in the fact that machine learning is the practical implementation of artificial intelligence – the use of algorithms to analyse volumes of quantitative and qualitative data, establishing findings and making statistical inferences based on the analysed data.
From a cybersecurity perspective, this process is focused on accurately and efficiently identifying zero-day, unknown threats at the earliest possible opportunity – and at an earlier stage than traditional static or behavioural analysis would permit. But machine learning algorithms are not infallible, and should not be treated as such. That said, they can certainly offer a significant boost to security tools, by enabling them to operate proactively as well as reactively when undertaking functions such as anti-malware, anti-spam, anti-fraud and anti-phishing detection.
The components of machine learning outlined below allow us to evaluate the technology’s capability from a cybersecurity perspective.
Data Models
A key difference between machine learning-based security tools and traditional solutions is the use of data models to identify known malicious files. These data models are very small in terms of file size (usually less than 1 kilobyte), and are used to identify commonalities shared by specific malware strains. Because these criteria are stored in this format – a mathematical equation that needs to be ‘resolved’, as opposed to a long list of hashes that needs updating – each individual model can determine whether large quantities of unknown files are malicious or safe, based on how they “resolve” the model.
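As a loose illustration of that difference, the Python sketch below contrasts a hash-list lookup with a tiny linear model whose learned weights occupy just a few bytes; the feature names, weights and threshold are invented for the example rather than taken from any real product.

```python
import hashlib

# Traditional approach: a (potentially huge) list of known-bad hashes.
KNOWN_BAD_SHA256 = {
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",  # placeholder entry
}

def signature_verdict(file_bytes: bytes) -> bool:
    """True only if the file's hash matches an already-catalogued sample."""
    return hashlib.sha256(file_bytes).hexdigest() in KNOWN_BAD_SHA256

# Machine-learning approach: a tiny model expressed as a few weights.
# The three features and their weights are purely illustrative.
WEIGHTS = [2.1, 1.4, -0.8]   # e.g. entropy, suspicious-import ratio, is-signed
BIAS = -1.5

def model_verdict(features: list[float], threshold: float = 0.0) -> bool:
    """'Resolve' the model: a positive score flags the file as malicious."""
    score = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return score > threshold

# The same few bytes of weights can score any number of unseen files,
# whereas the hash list only recognises files it has seen before.
print(model_verdict([0.92, 0.60, 0.0]))   # high entropy, unsigned -> likely flagged
```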
The data models used by machine learning tools can be utilised in the cloud or locally, which offers a welcome boost to protection against unknown threats and cyber attacks. Locally implemented machine learning models that have been trained to safeguard the infrastructure of a specific organisation are particularly well placed to do so.
A technique that can be employed to boost resilience is to overlap complementary machine learning models for detection. For example, if some models focus on specific threats while others are more generic, two or more models might well flag the same malicious file, increasing the chances of identifying and mitigating unknown malware.
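A minimal sketch of that overlap might look as follows, with two hypothetical detectors – one specialised, one generic – whose verdicts are simply combined; the rules inside each stand-in model are invented for illustration.

```python
from typing import Callable, Sequence

# Each detector is just "features in, malicious? out" for this sketch.
Detector = Callable[[Sequence[float]], bool]

def ransomware_model(features: Sequence[float]) -> bool:
    # Hypothetical model specialised in one threat family.
    return features[0] > 0.8          # e.g. aggressive file-encryption behaviour

def generic_model(features: Sequence[float]) -> bool:
    # Hypothetical broader model trained on many malware families.
    return sum(features) / len(features) > 0.5

def overlapping_verdict(features: Sequence[float], detectors: Sequence[Detector]) -> bool:
    """Flag the file if any of the overlapping models resolves it as malicious."""
    return any(detector(features) for detector in detectors)

sample = [0.85, 0.2, 0.4]
print(overlapping_verdict(sample, [ransomware_model, generic_model]))  # True: the specialist catches it
```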
Algorithms
Correctly implementing machine learning models can be a complex process. A variety of algorithms are often used, as some yield more useful results than others, depending on their intended purpose. For instance, perceptrons, binary decision trees, restricted Boltzmann machines, genetic algorithms, support vector machines, artificial neural networks and even custom algorithms can be combined or used individually to identify particular types of malware or malware families.
In terms of accuracy and intended use – anti-malware detection versus anti-phishing, for instance – some algorithms may be better suited to identifying malware than to detecting fraudulent or command-and-control domains. The key point here is to experiment with different algorithms for each specific task, in order to identify the most effective one.
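A rough sketch of that experimentation, assuming scikit-learn and a synthetic dataset in place of real malware telemetry, could look like this; the candidate algorithms and scoring metric are illustrative choices rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for labelled samples (1 = malicious, 0 = clean).
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

candidates = {
    "perceptron": Perceptron(max_iter=1000),
    "binary decision tree": DecisionTreeClassifier(max_depth=8),
    "support vector machine": SVC(),
    "neural network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500),
}

# Compare cross-validated F1 scores and keep the best performer for this task.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```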
Features and feature extraction techniques
Feature extraction techniques are instrumental to the construction of effective models. Unpacking routines, pre-execution emulation and packer reputation can help extract thousands of features per file, which are then used to build multidimensional matrices. These matrices represent the distance between each point or feature, and are used to compute the actual model (a mathematical function) and the feature set that help statistically determine whether a file is clean or infected.
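As a highly simplified sketch, the snippet below derives a handful of static features from a file's raw bytes and arranges them into a fixed-length vector; a production engine would extract thousands of features via unpacking, emulation and packer reputation rather than the toy properties shown here.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution (0.0 to 8.0 bits per byte)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def extract_features(data: bytes) -> list[float]:
    """Turn raw file bytes into a small, fixed-length feature vector."""
    return [
        float(len(data)),                                   # file size
        byte_entropy(data),                                 # packed/encrypted code tends to be high-entropy
        float(data.count(b"http://") + data.count(b"https://")),  # embedded URLs
        float(data.startswith(b"MZ")),                      # looks like a Windows executable
    ]

# Use this script itself as a stand-in sample.
with open(__file__, "rb") as f:
    print(extract_features(f.read()))
```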
Tunable Machine Learning and Advanced Threats
Performance and detection rate must be given equal weighting when building security solutions. It is important to avoid false positives, because the incorrect identification of a clean file as infected can disrupt processes and cause downtime, which has a negative effect on business performance.
Cybercriminals will sometimes launch custom attacks in an attempt to exploit weaknesses in the architecture of a particular organisation. With this in mind, IT administrators can determine how aggressive or permissive their machine learning solutions need to be. This puts the decision as to whether to accept occasional false positives in order to avoid data breaches in the hands of individual companies.
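One simple way to express that aggressiveness is the decision threshold applied to a model's output score. The sketch below, using a synthetic dataset and a logistic regression stand-in, shows how lowering the threshold raises the detection rate at the cost of more false positives; in practice the threshold would be tuned per organisation rather than hard-coded.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labelled samples (1 = malicious, 0 = clean), mostly clean.
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]           # probability each sample is malicious

for threshold in (0.3, 0.5, 0.7):                    # lower threshold = more aggressive
    flagged = scores >= threshold
    detection_rate = np.sum(flagged & (y_test == 1)) / max(np.sum(y_test == 1), 1)
    false_positive_rate = np.sum(flagged & (y_test == 0)) / max(np.sum(y_test == 0), 1)
    print(f"threshold {threshold}: detection rate {detection_rate:.2%}, "
          f"false-positive rate {false_positive_rate:.2%}")
```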
Fileless attacks that rely on scripts – PowerShell, Visual Basic, etc. – to execute malware or exfiltrate data can be almost impossible to detect without employing machine learning. This is because scripts are usually considered benign – they’re often used by IT admins to automate everyday tasks – but that is not to say they cannot be abused: malicious actors can daisy-chain a series of commands to compromise vulnerable endpoint devices.
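As a toy illustration of the kinds of script-level signals such analysis might weigh up, the snippet below scores a PowerShell command line against a few suspicious indicators; the patterns and weights are invented, and a real engine would feed far richer features into a trained model.

```python
import re

# Invented indicators loosely inspired by common PowerShell abuse patterns.
INDICATORS = {
    r"-enc(odedcommand)?\b": 2.0,               # encoded payloads
    r"downloadstring|invoke-webrequest": 1.5,   # download cradles
    r"invoke-expression|\biex\b": 1.5,          # dynamic execution
    r"-nop\b|-noprofile\b": 0.5,
    r"hidden": 0.5,
}

def script_risk_score(command_line: str) -> float:
    """Sum the weights of matched indicators (case-insensitive)."""
    text = command_line.lower()
    return sum(weight for pattern, weight in INDICATORS.items() if re.search(pattern, text))

benign = "powershell Get-ChildItem C:\\Logs | Measure-Object"
suspicious = "powershell -nop -w hidden -enc SQBFAFgAIAAoAE4AZQB3AC0A"

print(script_risk_score(benign))      # low score: routine admin task
print(script_risk_score(suspicious))  # higher score: worth deeper analysis
```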
A layered approach
A layered security model is focused on minimising the attack vectors within an organisation that are vulnerable to exploitation by malicious actors. Different layers within the security stack guard against specific types of threats, such as file-based or fileless malware, spam and online fraud. But due to the ever-advancing sophistication of threats that rely on encryption, obfuscation and polymorphism, traditional detection methods are ineffective in dealing with the sheer number of threats that need to be detected and processed. Incorporating machine learning algorithms into each security layer allows security solutions to boost both efficiency and efficacy in detection.
Whilst machine learning is unparalleled in its ability to spot new and unknown malware, one disadvantage is that older threats may slip under the radar. This is because machine learning models are often only trained against new malware, whilst older and more commonly known threats can be overlooked. But by combining traditional security layers with machine learning, this potential weakness is easily counteracted.
Because no single machine learning algorithm can defend against every threat, specialisation is the most effective option from a security perspective. Behavioural heuristics, combined with signature databases of known malware samples, can significantly improve performance. This enables machine learning algorithms to focus only on new, unknown and sophisticated pieces of malware that could go unnoticed by traditional security tools – reinforcing the security capabilities of any organisation.
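A schematic sketch of that division of labour might look like the following, where known samples are caught by a signature lookup, obvious behaviour is caught by simple heuristics, and only the remainder reaches a placeholder machine learning model; every rule, hash and weight here is illustrative.

```python
import hashlib
from typing import Sequence

# Layer 1: signature database of known-bad hashes (placeholder entry).
KNOWN_BAD_SHA256 = {"9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"}

# Layer 2: simple behavioural heuristics (illustrative rules only).
def heuristic_layer(behaviour: dict) -> bool:
    return behaviour.get("modifies_boot_record", False) or \
           behaviour.get("files_encrypted_per_minute", 0) > 100

# Layer 3: machine learning stand-in for anything the first two layers cannot decide.
def ml_layer(features: Sequence[float]) -> bool:
    weights, bias = [1.8, 1.2, -0.6], -1.0
    return bias + sum(w * x for w, x in zip(weights, features)) > 0

def scan(file_bytes: bytes, behaviour: dict, features: Sequence[float]) -> str:
    if hashlib.sha256(file_bytes).hexdigest() in KNOWN_BAD_SHA256:
        return "blocked by signature layer"
    if heuristic_layer(behaviour):
        return "blocked by behavioural layer"
    if ml_layer(features):
        return "blocked by machine learning layer"
    return "allowed"

# An unknown sample with quiet behaviour but suspicious features falls through to the ML layer.
print(scan(b"unknown sample", {"files_encrypted_per_minute": 3}, [0.9, 0.7, 0.1]))
```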