mirror of
http://github.com/valkey-io/valkey
synced 2024-11-22 00:52:38 +00:00
04d76d8b02
This PR utilizes the IO threads to execute commands in batches, allowing us to prefetch the dictionary data in advance. After making the IO threads asynchronous and offloading more work to them in the first 2 PRs, the `lookupKey` function becomes a main bottle-neck and it takes about 50% of the main-thread time (Tested with SET command). This is because the Valkey dictionary is a straightforward but inefficient chained hash implementation. While traversing the hash linked lists, every access to either a dictEntry structure, pointer to key, or a value object requires, with high probability, an expensive external memory access. ### Memory Access Amortization Memory Access Amortization (MAA) is a technique designed to optimize the performance of dynamic data structures by reducing the impact of memory access latency. It is applicable when multiple operations need to be executed concurrently. The principle behind it is that for certain dynamic data structures, executing operations in a batch is more efficient than executing each one separately. Rather than executing operations sequentially, this approach interleaves the execution of all operations. This is done in such a way that whenever a memory access is required during an operation, the program prefetches the necessary memory and transitions to another operation. This ensures that when one operation is blocked awaiting memory access, other memory accesses are executed in parallel, thereby reducing the average access latency. We applied this method in the development of `dictPrefetch`, which takes as parameters a vector of keys and dictionaries. It ensures that all memory addresses required to execute dictionary operations for these keys are loaded into the L1-L3 caches when executing commands. Essentially, `dictPrefetch` is an interleaved execution of dictFind for all the keys. **Implementation details** When the main thread iterates over the `clients-pending-io-read`, for clients with ready-to-execute commands (i.e., clients for which the IO thread has parsed the commands), a batch of up to 16 commands is created. Initially, the command's argv, which were allocated by the IO thread, is prefetched to the main thread's L1 cache. Subsequently, all the dict entries and values required for the commands are prefetched from the dictionary before the command execution. Only then will the commands be executed. --------- Signed-off-by: Uri Yagelnik <uriy@amazon.com>
23 lines
835 B
Python
Executable File
23 lines
835 B
Python
Executable File
#!/usr/bin/env python3
|
|
|
|
# Outputs the generated part of src/fmtargs.h
|
|
MAX_ARGS = 200
|
|
|
|
import os
|
|
print("/* Everything below this line is automatically generated by")
|
|
print(" * %s. Do not manually edit. */\n" % os.path.basename(__file__))
|
|
|
|
print('#define ARG_N(' + ', '.join(['_' + str(i) for i in range(1, MAX_ARGS + 1, 1)]) + ', N, ...) N')
|
|
|
|
print('\n#define RSEQ_N() ' + ', '.join([str(i) for i in range(MAX_ARGS, -1, -1)]))
|
|
|
|
print('\n#define COMPACT_FMT_2(fmt, value) fmt')
|
|
for i in range(4, MAX_ARGS + 1, 2):
|
|
print('#define COMPACT_FMT_{}(fmt, value, ...) fmt COMPACT_FMT_{}(__VA_ARGS__)'.format(i, i - 2))
|
|
|
|
print('\n#define COMPACT_VALUES_2(fmt, value) value')
|
|
for i in range(4, MAX_ARGS + 1, 2):
|
|
print('#define COMPACT_VALUES_{}(fmt, value, ...) value, COMPACT_VALUES_{}(__VA_ARGS__)'.format(i, i - 2))
|
|
|
|
print("\n#endif")
|