Data Encoding

What is Data Encoding?

  • Data encoding refers to all forms of content modification for the purpose of hiding intent.
  • Malware uses encoding techniques to mask its malicious activities.
  • The malware author uses simple ciphers, basic encoding functions as well as cryptographic ciphers to make identification and reverse-engineering difficult.

Caesar Cipher

  • The Caesar cipher is a simple cipher formed by shifting the letters of the alphabet three characters to the right.
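
As a minimal sketch of the shift described above (the function name caesar_shift and the sample message are illustrative choices, and wrapping from Z back to A is assumed):

```python
import string

def caesar_shift(text: str, shift: int = 3) -> str:
    """Shift each ASCII letter `shift` places to the right, wrapping at Z."""
    shift %= 26
    upper = string.ascii_uppercase
    lower = string.ascii_lowercase
    # Build a translation table: A->D, B->E, ..., X->A for shift=3.
    table = str.maketrans(
        upper + lower,
        upper[shift:] + upper[:shift] + lower[shift:] + lower[:shift],
    )
    # Characters outside A-Z / a-z (spaces, digits) pass through unchanged.
    return text.translate(table)

encoded = caesar_shift("ATTACK AT NOON")   # shift right by 3
decoded = caesar_shift(encoded, 26 - 3)    # shifting by 23 undoes a shift of 3
print(encoded)
print(decoded)
```

Decoding is simply a shift in the opposite direction, which is why the cipher offers no real protection once the scheme is recognized.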

XOR

  • The XOR cipher is a simple cipher that is similar to the Caesar cipher. XOR means exclusive OR and is a logical operation that can be used to modify bits.
  • An XOR cipher uses a static byte value and modifies each byte of plaintext by performing a logical XOR operation with that value.
  • The XOR cipher is convenient to use because it is both simple—requiring only a single machine-code instruction—and reversible.
  • A reversible cipher uses the same function to encode and decode.
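
The reversibility can be illustrated with a short sketch. The helper name xor_single_byte and the sample bytes are assumptions made for this example; the key 0x12 is chosen only to match the value used later in these notes.

```python
def xor_single_byte(data: bytes, key: int) -> bytes:
    """XOR every byte of `data` with the same single-byte key."""
    return bytes(b ^ key for b in data)

plaintext = b"MZ\x90\x00"
encoded = xor_single_byte(plaintext, 0x12)
# Applying the same function with the same key restores the original bytes.
decoded = xor_single_byte(encoded, 0x12)

print(encoded.hex())
print(decoded == plaintext)
```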

Brute-Forcing XOR Encoding

  • A single-byte XOR-encoded executable can be identified by brute-forcing the key.
  • Since there are only 256 possible values for each byte in the file, it is easy and quick for a computer to try all 255 possible single-byte keys (a key of 0x00 would leave the file unchanged) XORed with the file header, and compare the output with the header you would expect for an executable file.
  • PE files begin with the letters MZ, and the hex values for M and Z are 4d and 5a, respectively, so a correct key decodes the first two bytes of the file to 4d 5a.
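
A brute-force search of this kind might look like the sketch below. The function name find_xor_key is hypothetical, and matching only the two MZ bytes is a simplification: with real samples a longer expected header would reduce false positives.

```python
from typing import Optional

def find_xor_key(data: bytes, expected: bytes = b"MZ") -> Optional[int]:
    """Try every non-zero single-byte key against the start of `data`."""
    header = data[:len(expected)]
    for key in range(1, 256):  # 255 candidate keys; 0x00 changes nothing
        if bytes(b ^ key for b in header) == expected:
            return key
    return None

# Simulate an encoded file by XORing a fake PE header with 0x12.
sample = bytes(b ^ 0x12 for b in b"MZ\x90\x00\x03")
key = find_xor_key(sample)
print(hex(key) if key is not None else "no single-byte key found")
```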

Brute-Forcing Many Files

  • Brute-forcing can also be used proactively. For example, if you want to search many files to check for XOR-encoded PE files, you could create 255 signatures for all of the XOR combinations, focusing on elements of the file that you think might be present.
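
One way to sketch the signature idea: pick a string that is likely to be present and precompute its 255 XOR-encoded forms. The marker below (the start of the DOS stub message found in many PE files) and the names build_signatures and scan are assumptions made for this example.

```python
MARKER = b"This program"  # assumed marker: start of the usual DOS stub message

def build_signatures(marker: bytes = MARKER) -> dict:
    """Map each non-zero key to the marker as it looks after XOR with that key."""
    return {key: bytes(b ^ key for b in marker) for key in range(1, 256)}

def scan(data: bytes, signatures: dict) -> list:
    """Return (key, offset) pairs for every signature found in `data`."""
    return [(key, data.find(sig)) for key, sig in signatures.items() if sig in data]

signatures = build_signatures()
# Simulated file content: four padding bytes, then the marker encoded with 0x41.
blob = b"\x00" * 4 + bytes(b ^ 0x41 for b in MARKER)
print(scan(blob, signatures))
```

The same dictionary can be reused across many files, which is what makes the approach practical for bulk scanning.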

NULL-Preserving Single-Byte XOR Encoding

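  • When a PE header is XOR-encoded with the key 0x12, most of the bytes in the initial part of the header become 0x12, because the header contains many NULL bytes and a NULL byte XORed with the key yields the key itself. This demonstrates a particular weakness of single-byte encoding: the key is exposed in the encoded data.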
  • Malware authors have actually developed a clever way to mitigate this issue by using a NULL-preserving single-byte XOR encoding scheme. Unlike the regular XOR encoding scheme, the NULL-preserving single-byte XOR scheme has two exceptions:
    • If the plaintext character is NULL or the key itself, then the byte is skipped.
    • If the plaintext character is neither NULL nor the key, then it is encoded via an XOR with the key.
      (So if the key is 0x12, then any 0x00 or 0x12 will not be transformed, but any other byte will be transformed via an XOR with 0x12)
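
The two exceptions can be expressed compactly. This is a minimal sketch with a hypothetical function name, not code taken from any particular sample.

```python
def null_preserving_xor(data: bytes, key: int) -> bytes:
    """XOR each byte with `key`, but leave 0x00 and the key byte untouched."""
    return bytes(b if b in (0x00, key) else b ^ key for b in data)

original = b"MZ\x00\x00\x12\x90"
encoded = null_preserving_xor(original, 0x12)
# The scheme is still reversible: an encoded byte is never 0x00 or the key,
# so the second pass XORs exactly the bytes that were XORed the first time.
decoded = null_preserving_xor(encoded, 0x12)

print(encoded.hex())
print(decoded == original)
```

Because NULL bytes stay NULL, the encoded header no longer shows long runs of the key value.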

Identifying XOR Loops in IDA Pro

  • In disassembly, XOR loops can be identified as small loops with an XOR instruction in the middle. The easiest way to find an XOR loop in IDA Pro is to search for all instances of the XOR instruction.
  • The XOR instruction can be used for different purposes. One of the uses of XOR is to clear the contents of a register. XOR instructions can be found in three forms:
    • XOR of a register with itself
    • XOR of a register (or memory reference) with a constant
    • XOR of one register (or memory reference) with a different register (or memory reference)
  • XOR of a register with itself is mainly used to clear the register, not to encode data.
  • XOR of a register (or memory reference) with a constant, for example “xor eax, 0x12”, is the most common form of XOR encoding.
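
The three forms can be told apart with a simple text filter. The sketch below assumes the search results have been exported as plain disassembly lines without trailing comments. It does not use IDA Pro's scripting API, and the function name and category labels are hypothetical.

```python
import re

# Matches "xor <dst>, <src>" but not other mnemonics such as "xorps".
XOR_RE = re.compile(r"^\s*xor\s+([^,]+?)\s*,\s*(.+?)\s*$", re.IGNORECASE)
# Constants written as 0x12, 12h, or plain decimal.
CONST_RE = re.compile(r"^(0x[0-9a-f]+|[0-9][0-9a-f]*h|[0-9]+)$", re.IGNORECASE)

def classify_xor(line: str):
    """Return 'clear', 'constant', 'register-or-memory', or None if not an XOR."""
    match = XOR_RE.match(line)
    if match is None:
        return None
    dst, src = match.group(1).lower(), match.group(2).lower()
    if dst == src:
        return "clear"               # register XORed with itself
    if CONST_RE.match(src):
        return "constant"            # candidate single-byte XOR encoding
    return "register-or-memory"      # candidate encoding with a variable key

for text in ("xor eax, eax", "xor eax, 0x12", "xor byte ptr [ecx], 12h", "xor eax, ebx"):
    print(text, "->", classify_xor(text))
```

Filtering out the "clear" category first leaves a much shorter list of instructions to inspect by hand.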

Base64

  • Base64 encoding is used to represent binary data in an ASCII string format. Base64 encoding is commonly found in malware, so you’ll need to know how to recognize it.
  • Base64 encoding converts binary data into a limited character set of 64 characters. There are a number of schemes or alphabets for different types of Base64 encoding. They all use 64 primary characters and usually an additional character to indicate padding, which is often =.
  • The most common character set is MIME's Base64, which uses A-Z, a-z, and 0-9 for the first 62 values, and + and / for the last two values. As a result of squeezing data into a smaller set of characters, Base64-encoded data ends up being longer than the original data. For every 3 bytes of binary data, there are at least 4 bytes of Base64-encoded data.
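
The size increase can be checked with Python's standard base64 module. The helper padded_base64_length is a name introduced here for illustration and assumes padded output.

```python
import base64

def padded_base64_length(n_bytes: int) -> int:
    """Length of padded Base64 output: 4 characters per 3-byte group, rounded up."""
    return 4 * ((n_bytes + 2) // 3)

for n in (1, 2, 3, 4, 10):
    encoded = base64.b64encode(bytes(n))
    # The predicted length and the actual encoded length should agree.
    print(n, len(encoded), padded_base64_length(n))
```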

Transforming Data to Base64

  • The process of translating raw data to Base64 is fairly standard and uses 24-bit (3 byte) chunks. The first character is placed in the most significant position, the second in the middle 8 bits, and the third in the least significant 8 bits. Next, bits are read in blocks of six, starting with the most significant.
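
The bit shuffling for a single 3-byte chunk can be sketched as follows. The function name encode_chunk is hypothetical, and padding of shorter final chunks is deliberately left out.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_chunk(chunk: bytes) -> str:
    """Encode exactly 3 bytes into 4 Base64 characters."""
    if len(chunk) != 3:
        raise ValueError("this sketch handles full 3-byte chunks only")
    # First byte in the most significant 8 bits, third in the least significant.
    value = (chunk[0] << 16) | (chunk[1] << 8) | chunk[2]
    # Read four 6-bit blocks, starting with the most significant.
    return "".join(ALPHABET[(value >> shift) & 0x3F] for shift in (18, 12, 6, 0))

print(encode_chunk(b"ATT"))
print(base64.b64encode(b"ATT").decode("ascii"))  # standard library result for comparison
```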

Identifying and Decoding Base64

  • A Base64 string appears as a random selection of characters, with the character set composed of the alphanumeric characters plus two other characters. One or two padding (=) characters may be present at the end of an encoded string; if padded, the length of the encoded object will be divisible by four.
  • Decoding becomes more difficult when the malware uses a custom substitution cipher. The only item that needs to be changed is the indexing string, and the result will have all the same desirable characteristics as standard Base64.
  • One simple way to create a new indexing string is to relocate some of the characters to the front of the string.
  • Malware uses this technique to make its output appear to be Base64, even though it cannot be decoded using the common Base64 functions.
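
Once the custom indexing string has been recovered, the data can be decoded by mapping it back to the standard alphabet. In the sketch below the CUSTOM string (digits relocated to the front) and the function names are hypothetical examples, not an alphabet from a real sample.

```python
import base64
import string

STANDARD = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
# Hypothetical custom indexing string: the digits relocated to the front.
CUSTOM = string.digits + string.ascii_uppercase + string.ascii_lowercase + "+/"

def custom_b64encode(data: bytes, alphabet: str = CUSTOM) -> str:
    """Standard Base64, then substitute each character via the custom alphabet."""
    standard = base64.b64encode(data).decode("ascii")
    return standard.translate(str.maketrans(STANDARD, alphabet))

def custom_b64decode(text: str, alphabet: str = CUSTOM) -> bytes:
    """Map custom characters back to the standard alphabet, then decode."""
    return base64.b64decode(text.translate(str.maketrans(alphabet, STANDARD)))

encoded = custom_b64encode(b"ATT")
print(encoded)                          # looks like Base64, but uses the custom index
print(custom_b64decode(encoded))        # recovers the original bytes
print(base64.b64decode(encoded))        # a standard decoder yields different bytes
```

The padding character is not part of either alphabet, so it passes through the substitution unchanged.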

Common Cryptographic Algorithms

Searching for Cryptographic Constants

  • Malware often uses simple encoding schemes such as Base64 because they are easy and often sufficient.
  • However, as cryptographic practices and libraries evolve, malware might use standard cryptography such as SSL. In such cases, the trick is to identify not only the algorithm but also the key.

Recognizing Strings and Imports

  • One way to identify standard cryptographic algorithms is by recognizing strings that refer to the use of cryptography. This occurs when cryptographic libraries such as OpenSSL are statically compiled into malware.
  • Another way to look for standard cryptography is to identify imports that reference cryptographic functions, for example "CryptAcquireContextA", "CryptHashData", "CryptDecrypt", and "CryptEncrypt".
  • A third basic method of detecting cryptography is to use a tool that searches for commonly used cryptographic constants, or to look for high-entropy content:
    1. IDA Pro’s FindCrypt2 plugin
    2. Krypto ANALyzer (KANAL)
    3. Searching for high-entropy content
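
The entropy search in the last item can be sketched with a Shannon entropy calculation over byte values. The function name is hypothetical, and what counts as "high" is left to the analyst; the maximum for byte data is 8 bits per byte.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how often each byte value occurs
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(shannon_entropy(bytes(1024)))            # a run of identical bytes
print(shannon_entropy(bytes(range(256)) * 4))  # every byte value equally likely
print(shannon_entropy(b"This program cannot be run in DOS mode"))
```

Encrypted or compressed regions tend to score close to the maximum, while plain text and padded headers score noticeably lower, so scanning a file in fixed-size windows can point to where encoded content lives.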

Decoding

  • Finding encoding functions to isolate them is an important part of the analysis process, but typically we want to decode the hidden content. There are two fundamental ways to duplicate the encoding or decoding functions in malware:
    • Reprogram the functions
    • Use the functions as they exist in the malware itself

Self-Decoding

  • The most economical way to decrypt data is to let the program itself perform the decryption in the course of its normal activities. We call this process self-decoding.

Manual Programming of Decoding Functions

Using Instrumentation for Generic Decryption

  • The buffer (lpBuffer) to be encrypted or decrypted
  • The length (nNumberOfBytesToWrite) of the buffer