Author Topic: [Paper] Inner Workings of Antivirus Scanners  (Read 1914 times)


Offline Deque

  • P.I.N.N.
  • Global Moderator
  • Overlord
  • *
  • Posts: 1203
  • Cookies: 518
  • Programmer, Malware Analyst
    • View Profile
[Paper] Inner Workings of Antivirus Scanners
« on: February 06, 2015, 06:35:46 pm »
Inner Workings of Antivirus Scanners

Fred Cohen proved in 1984 that the detection of viruses is an undecidable problem [see http://web.eecs.umich.edu/~aprakash/eecs588/handouts/cohen-viruses.html]. In practice that means there is no algorithm that can perfectly decide whether a program is malicious or not.

Antivirus software has existed since 1987. The first antivirus program was written for the Atari ST platform by G DATA Software AG. In the same year McAfee and NOD32 appeared.

These very early antivirus scanners used only two simple techniques, both of which are still in use today:
the first was blacklisting, the second detection based on simple signatures.

------------------------------------------------------------------------------

Blacklisting

Blacklisting usually works by hashing the file. The antivirus scanner has a database that contains the file hashes, and usually also the file sizes, of known malicious files. This "list" of hashes is the blacklist.
If a suspicious file is found, the scanner calculates its hash sum and looks up the blacklist for a match. The file size is often used as a filter: only the hashes with the matching file size are taken into consideration, which boosts performance.
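The lookup can be sketched in a few lines of Python (a minimal sketch with made-up sample data; real engines use large, optimised databases):

```python
import hashlib

def make_blacklist(known_bad: list[bytes]) -> dict[int, set[str]]:
    """Build a size-indexed blacklist of SHA-256 hashes."""
    blacklist: dict[int, set[str]] = {}
    for data in known_bad:
        blacklist.setdefault(len(data), set()).add(hashlib.sha256(data).hexdigest())
    return blacklist

def is_blacklisted(blacklist: dict[int, set[str]], data: bytes) -> bool:
    # The file size acts as a cheap filter: hash comparison only happens
    # against entries with a matching size.
    candidates = blacklist.get(len(data))
    if not candidates:
        return False
    return hashlib.sha256(data).hexdigest() in candidates
```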

String Scanning with Simple Signatures

[excerpt from my thesis, p.18]
Quote
String scanning compares byte sequences in a database with the actual byte sequences of a file. If a subsequence in the file matches one of the sequences in the database, the file is declared malicious. Such a byte sequence to identify a malware is also called signature or string. The signature needs to be unique enough, so that it is unlikely to exist in clean programs.

------------------------------------------------------------------------------

Both techniques, blacklisting and string scanning, are only able to recognise known malware. Regular updates of the antivirus database are therefore mandatory for the scanner to work properly.

These techniques were enough to handle malware in 1987 and the following years, but today there are so many new threats every day that blacklisting and string scanning alone cannot keep up. AV-Test Institute estimates that there are "390,000 new malicious programs every day" [see http://www.av-test.org/en/statistics/malware/] and this number is on the rise. As a comparison: in 1994 AV-Test had 28,613 unique malware samples in its entire database [see http://antivirus.about.com/od/whatisavirus/a/A-Brief-History-Of-Malware-The-First-25-Years_2.htm].

Antivirus Scanners Today

The engine of an antivirus program today has several layers. Each layer employs a different detection technique, with increasing complexity and decreasing performance.

Layer 1

The first layer is the fastest. It performs the blacklist and whitelist lookup.

Whitelists are similar to blacklists, but they keep track of non-malicious samples. These are often system-critical files, which must not be detected at any cost, or harmless files that look so much like malware that they have to be excluded this way.

Blacklist entries are usually created by automation. Malware analysts cannot analyse every new file, so certain processes blacklist files automatically once certain conditions are met. E.g. it is common practice to blacklist files that are detected by a certain number of other trusted antivirus companies.
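Such an automation rule might look like this (a sketch; the vendor names and the threshold of 3 are made up for illustration):

```python
def should_auto_blacklist(detections: dict[str, bool],
                          trusted: set[str],
                          threshold: int = 3) -> bool:
    """Auto-blacklist when enough trusted vendors already detect the sample."""
    hits = sum(1 for vendor, detected in detections.items()
               if detected and vendor in trusted)
    return hits >= threshold
```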

If the file is found in that layer (whether on the blacklist or the whitelist), no further layers are executed.

Layer 2

If the file was found in neither list, the engine applies a file type scanner to determine further steps and for filtering. The file type scanner is already a possible attack vector for malware: if the scanner is not able to determine the right file type, the subsequent detection routines will not work either.

The antivirus scanner also keeps a list of trusted vendors.
Files signed by a trusted vendor might be excluded from detection at this point.

Layer 3

Based on the file type, the antivirus engine performs string scanning. However, the signatures are not that simple anymore. They have been enhanced with wildcards, regular expressions, mismatches, and location specific scanning.

Wildcards like ? and * can be part of a signature. The ? means that the value of that nibble does not matter. The * denotes that any number of bytes (zero to n) might be in between two values.

Mismatches allow a certain number of bytes to be different than given in the signature, and the signature will still be considered a match.

The following box demonstrates some examples for wildcards and mismatches:

Code: [Select]
Wildcards
---------
signature:           0E 1F 0E 07 E8 ?? ?? E8 ?? ?? 3A C6 73
byte sequence found: 0E 1F 0E 07 E8 31 C4 E8 00 FF 3A C6 73
--> match
byte sequence found: 0E 00 0E 07 E8 A5 C4 E8 22 CF 3A C6 73
--> mismatch            --
                       
Mismatches allowed: 2
---------------------
signature:           73 72 63 64 6F 63 69 64 3A 20

byte sequence found: 73 72 63 02 6F 63 69 64 00 20
--> match                     --             --
byte sequence found: C6 A2 63 64 6F 63 69 64 3A 00
--> mismatch         -- --                      --
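A naive matcher for such signatures can be sketched in Python. This is a minimal illustration, not how real engines implement it: the signature is given as hex byte strings, `??` (or a single `?` nibble) is a wildcard, and a mismatch budget is allowed as in the box above.

```python
def matches_at(data: bytes, offset: int, sig: list[str],
               max_mismatches: int = 0) -> bool:
    """Check one position; '??' matches any byte, a '?' nibble any nibble."""
    mismatches = 0
    for i, pat in enumerate(sig):
        byte = data[offset + i]
        hi, lo = pat[0], pat[1]
        ok = ((hi == "?" or int(hi, 16) == byte >> 4) and
              (lo == "?" or int(lo, 16) == byte & 0x0F))
        if not ok:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True

def scan(data: bytes, sig: list[str], max_mismatches: int = 0) -> bool:
    """Slide the signature over the whole byte sequence."""
    n = len(sig)
    return any(matches_at(data, off, sig, max_mismatches)
               for off in range(len(data) - n + 1))
```

Running it against the sequences from the box reproduces the match/mismatch results shown there.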

Regular expression-like enhancements allow for even more flexibility.
Byte ranges might be given for wildcards, e.g., {2-20} means there are 2 to 20 bytes of any value in this location of the signature.
OR expressions can be used to allow several values for one byte, e.g., you might want to allow '.EXE' in all lower- and upper-case combinations as 2E [45|65][78|58][45|65]. This pattern would allow sequences like '.eXE', '.exe', '.eXe'.
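Both enhancements map directly onto regular expressions over raw bytes; a sketch using Python's re module (the E8 ... C3 byte values are made up for the range example):

```python
import re

# OR expression: 2E [45|65][78|58][45|65] -- '.EXE' in any case combination.
exe_sig = re.compile(
    rb"\x2E"          # '.'
    rb"[\x45\x65]"    # 'E' or 'e'
    rb"[\x58\x78]"    # 'X' or 'x'
    rb"[\x45\x65]"    # 'E' or 'e'
)

# Byte range: a {2-20} gap of arbitrary bytes between two fixed values.
# re.DOTALL makes '.' match every byte value, including 0x0A.
gap_sig = re.compile(rb"\xE8.{2,20}\xC3", re.DOTALL)
```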

Last but not least, a location is given along with the signature. The possible locations depend on the file type. Common locations for any file type are the whole file (bad performance, should only be done with small files), the first 10 KB, and the last 10 KB. For PE files there are additional locations like the entry point, section no. X, the last section, the overlay, the file headers, and the resources. It might not be surprising that the resource section is sometimes used for signatures that match icons used by the malware (given that the icon is unique to that malware or malware author).

More complicated signature expressions, or even combinations of several signatures, are possible and might be applied as separate lower layers of the engine.
E.g. a signature combination will use the first signature as a filter. This first signature is not unique enough to identify the malware; it is just there to keep performance good by passing only relevant files on to the second signature. That second signature might be more expensive to search for, but unique enough for identification.
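The filter idea can be sketched like this (the filter value and the check are placeholders):

```python
from typing import Callable

def two_stage_scan(data: bytes, filter_sig: bytes,
                   full_check: Callable[[bytes], bool]) -> bool:
    """Cheap filter signature first; the expensive check runs only on
    files that pass the filter."""
    if filter_sig not in data:  # fast substring test rules out most files
        return False
    return full_check(data)     # expensive, but unique enough to identify
```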

Layer 4

If no matching signature was found in layer 3, the engine resorts to the most expensive detection technique: algorithmic scanning. Early antivirus scanners used hardcoded procedures for these cases. Antivirus engines today have their own portable programming language to write detections in. These detection scripts are saved in the database of the antivirus software and updated just like signatures and hashes.

Algorithmic scanning relies heavily on filtering, because executing the detection scripts is computationally expensive. Every detection script must provide filters, which may include file properties (e.g. size ranges), the file type, signatures, or others. Only if these preconditions are met is the detection script executed.

Layer 5

At this point no detection was found for the file itself. Unpacking is applied and the unpacked files are fed back into the engine as input. Unpacking includes, e.g., extracting archive files (ZIP, RAR, ...), installation scripts (Nullsoft), and packed files (e.g. produced by crypters). If the engine detects any of the unpacked files, the original file counts as detected.
The unpacking module of the engine has to be careful with archive bombs and similar problematic files.

This layer may also be swapped with layer 4 for better performance.

Heuristic Analysis

This can be part of a layer or an entirely separate module.

[excerpt from my master thesis, p. 22]
Quote
Heuristic methods find a solution for a problem using incomplete knowledge.
Heuristic analysis describes all malware detection methods that use ‘a rule-based
approach to diagnosing a potentially-offending file’. These detection
methods are not optimal, they create false positives and false negatives; but they
are able to recognise unknown threats and detect variants of known malware.
Heuristic analysis is done in two steps: data gathering and data analysis.

The data gathering step collects information about the file that is relevant for the heuristic scanner. This information is either a hint that the file is malicious or a hint that the file is harmless. E.g. blacklisted strings in a file, like obscene words, are a hint that the file is malicious. An entry point in the last section of a PE file is a hint of a virus infection. High entropy of a section is a hint of a packed or encrypted file, which is in turn a hint of a malicious file.

The data analysis step assigns weights/points to the information that was found in the data gathering step. Artificial neural networks or machine learning may be used to get a result based on these weights. In the simplest case the scanner just calculates the sum of all weights and detects the file as malicious if the sum exceeds a threshold.
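Both steps can be sketched together: entropy as one gathered hint, a weighted sum as the analysis (the rules, weights, and threshold are hypothetical examples):

```python
import math
from collections import Counter

def entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0..8); high values hint at packing."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

# Hypothetical heuristic rules: (description, predicate, weight)
RULES = [
    ("high-entropy body",  lambda d: entropy(d) > 7.0, 4),
    ("blacklisted string", lambda d: b"obscene" in d,  2),
    ("suspicious marker",  lambda d: b"!infect!" in d, 3),
]

def heuristic_scan(data: bytes, threshold: int = 5) -> bool:
    # Simplest analysis step: weighted sum against a threshold.
    score = sum(weight for _, pred, weight in RULES if pred(data))
    return score >= threshold
```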

Generic Detection

Generic detection covers several or all known variants of a malware family using a signature or algorithmic detection.

As I said before, certain processes blacklist files automatically. To avoid enormous growth of the blacklist, similar files are grouped by clustering. Malware analysts then create a signature or detection algorithm that covers all of these files at once, effectively getting rid of the blacklist entries.

Summary

Every antivirus engine works a bit differently, but the core concepts are the same: the simple, fast techniques come first; the computationally expensive ones come later, and only after some filtering.
« Last Edit: February 06, 2015, 06:37:32 pm by Deque »