Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heuristic to reduce the number of reported accesses #1

Open
kristopher-pellizzi opened this issue May 18, 2021 · 2 comments
Open

Heuristic to reduce the number of reported accesses #1

kristopher-pellizzi opened this issue May 18, 2021 · 2 comments
Labels
limitation Something that has drawbacks, but is necessary/desirable

Comments

@kristopher-pellizzi
Copy link
Owner

Commit 3a31a3a implements a heuristic useful to reduce the number of reported uninitialized memory accesses.
This has been thought to try and recognize uninitialized accesses due to the usage of optimized versions of string operations.

The main idea is to avoid reporting an uninitialized memory access when these conditions are simultaneously true:

  • Access size is higher or equal to 16 bytes
  • The instruction is not a syscall (e.g. a call to write() may be incorrectly recognized as an access due to some string operation)
  • The accessed memory area contains at least 1 (initialized) null byte (i.e. '\0')
  • One of the following conditions is verified:
  1. There's more than 1 uninitialized interval
  2. There's 1 interval, and it begins at index 0 and ends before the position of the null byte
  3. There's 1 interval, starting after the null byte

Condition 1 means that the string pointer has an alignment different from that required by the SIMD extension instruction used and the initialized portion ends before the end of the access. In this case, the layout should be something like this: UNINITIALIZED - INITIALIZED - UNINITIALIZED. This usually happens when we are doing something on some short string.

Condition 2 means that again the string pointer has an alignment different from the one required by the instruction. However, this time there's only 1 uninitialized interval, meaning that probably we have again a short string, which has the terminator in its last byte. However, in memory, there's something adjacent to the string, which has been initialized.

Condition 3 means that we are probably managing the end of a long string, whose size is not a multiple of the access size. In this case, the memory area is usually initialized from index 0 up to the null byte (included), and after that there is at least 1 byte not initialized.

Note that indexes are meant to be relative to the access boundaries.
In practice, we are assuming that everytime there is an uninitialized read where some bytes are initialized, while some other are not, we are handling a string (as it is composed by a sequence of bytes).

These instructions, however, may be used by the compiler to optimize operations on arrays of numeric data. However, in that case, the number is usually either fully initialized or not initialized, thus not falling in any of our conditions. It may be the case, however, that the developer managed the numeric data byte by byte (probably performing some casts). In that case, some false negatives are expected.

Another source of false negatives may be the usage of memory management functions (e.g. memcpy).
In glibc-3.31, memcpy is implemented by using 8 byte integer moves, so it can't generate false negatives due to our heuristic. However, it merely depends on the implementation of the function, and therefore it is expected to generate some false negatives with some implementations which may use the SIMD extension instructions as well.

@kristopher-pellizzi kristopher-pellizzi added the limitation Something that has drawbacks, but is necessary/desirable label May 18, 2021
@kristopher-pellizzi
Copy link
Owner Author

After a few tests and a long reasoning, it seems that this heuristic leads to too many false negatives.
Indeed, there are many functions in the libc using SIMD extension instructions where the uninitialized reads are relevant (e.g. memcpy and all mem-prefixed functions).

Moreover, it is not unusual that compilers optimize code and transform an array iteration loop into a smaller loop using SIMD extension instructions, in order to process, for instance, 4 integer elements within a single loop body execution.
Usually, arrays are initialized contiguously, and therefore, if there is any uninitialized element in the processed array, the tool easily falls into Condition 3 of the heuristic, as any element may contain a null byte and everything from the beginning of the array up to that 0 byte may be initialized.
This kind of uninitialized read accesses may be very important, because the array can have a big size, and therefore may overlap many interesting things.

Another reason why it's better to remove this heuristic is that there are too many different situations in which a string may be found. For instance, on the stack there may be many contiguous small strings. In this case many strings will be found in the same 32 bytes during a strcpy, but the function signature has the string address as a first argument, and is therefore implemented in such a way that only the portion relative to the correct string is copied.
A completely different situation may be found on the heap, where all blocks are aligned in memory, and between 2 contiguous blocks there's the block header. However, it is possible that the same block is reused many times to hold different strings, so also in this situation we can't rely on the fact that after the null byte everything is uninitialized.

Since we are working with (possibly stripped) binaries, there's no trivial way to distinguish a string (i.e. an array of chars) from an array of any other numeric type (int, long, float, double...), so it is no easy task to find an heuristic good enough to remove some of the not relevant uninitialized reads due to execution of string operations without increasing false negatives so much.

Some naive alternatives may be:

  • Use a white list of functions where the uninitialized reads generated by SIMD instructions should be ignored (e.g. strpy, strlen, strcpy...) and try to detect when these are called.
  • Completely trust libc, and ignore (all or only SIMD generated) uninitialized read accesses. In this case, the main drawback is that if the functions are re-implemented in a custom library or in the application itself, this solution won't be able to ignore the accesses anymore

For now, I'll rollback to last commit to remove the heuristic.

kristopher-pellizzi added a commit that referenced this issue May 20, 2021
This reverts commit 3a31a3a.
More infos in the comments of issue #1
@kristopher-pellizzi
Copy link
Owner Author

Heuristic removed by commit 8e6c2aa

@kristopher-pellizzi kristopher-pellizzi added the invalid This doesn't seem right label May 20, 2021
@kristopher-pellizzi kristopher-pellizzi removed the invalid This doesn't seem right label May 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
limitation Something that has drawbacks, but is necessary/desirable
Projects
None yet
Development

No branches or pull requests

1 participant