valkey

mirror of http://github.com/valkey-io/valkey synced 2024-11-22 18:54:58 +00:00

Author	SHA1	Message	Date
Yanqi Lv	e2b7932b34	Shrink dict when deleting dictEntry (#12850) When we insert entries into dict, it may autonomously expand if needed. However, when we delete entries from dict, it doesn't shrink to the proper size. If there are few entries in a very large dict, it may cause huge waste of memory and inefficiency when iterating. The main keyspace dicts (keys and expires) are shrunk by cron (`tryResizeHashTables` calls `htNeedsResize` and `dictResize`), and some data structures such as zset and hash also do that (call `htNeedsResize`) right after a loop of calls to `dictDelete`, but many other dicts are completely missing that call (they can only expand). In this PR, we provide the ability to automatically shrink the dict when deleting. The conditions triggering the shrinking are the same as `htNeedsResize` used to have, i.e. we expand when we're over 100% utilization, and shrink when we're below 10% utilization. Additionally: * Add `dictPauseAutoResize` so that flows that do mass deletions will only trigger shrinkage at the end. * Rename `dictResize` to `dictShrinkToFit` (same logic as it used to have, but better name describing it) * Rename `_dictExpand` to `_dictResize` (same logic as it used to have, but better name describing it) related to discussion https://github.com/redis/redis/pull/12819#discussion_r1409293878 --------- Co-authored-by: Oran Agra <oran@redislabs.com> Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>	2024-01-15 08:20:53 +02:00
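The thresholds named in the commit above (expand over 100% utilization, shrink below 10%) can be sketched roughly as follows. This is only an illustration, not the code in `dict.c`: `resize_action`, `decide_resize`, `MIN_BUCKETS` and the `paused` argument are hypothetical names, and the real implementation also has to account for rehashing that is already in progress.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; not the identifiers used in dict.c. */
typedef enum { RESIZE_NONE, RESIZE_EXPAND, RESIZE_SHRINK } resize_action;

#define MIN_BUCKETS 4          /* assumed floor: never shrink a tiny table */
#define SHRINK_FILL_PERCENT 10 /* shrink when utilization drops below 10% */

/* Decide what to do for a table with `buckets` buckets holding `used`
 * entries. `paused` models a caller that suspended automatic resizing,
 * e.g. during a mass deletion. */
resize_action decide_resize(size_t used, size_t buckets, int paused) {
    assert(buckets > 0);
    if (paused) return RESIZE_NONE;
    if (used > buckets) return RESIZE_EXPAND;            /* over 100% */
    if (buckets > MIN_BUCKETS &&
        used * 100 < buckets * SHRINK_FILL_PERCENT)      /* under 10% */
        return RESIZE_SHRINK;
    return RESIZE_NONE;
}
```

With these assumptions, a table of 100 buckets holding 9 entries is a shrink candidate, while the same table holding 10 entries is left alone.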
zhaozhao.zz	bb2b6e2927	fix scripts access wrong slot if they disagree with pre-declared keys (#12906 ) Regarding how to obtain the hash slot of a key, there is an optimization in `getKeySlot()`, it is used to avoid redundant hash calculations for keys: when the current client is in the process of executing a command, it can directly use the slot of the current client because the slot to access has already been calculated in advance in `processCommand()`. However, scripts are a special case where, in default mode or with `allow-cross-slot-keys` enabled, they are allowed to access keys beyond the pre-declared range. This means that the keys they operate on may not belong to the slot of the pre-declared keys. Currently, when the commands in a script are executed, the slot of the original client (i.e., the current client) is not correctly updated, leading to subsequent access to the wrong slot. This PR fixes the above issue. When checking the cluster constraints in a script, the slot to be accessed by the current command is set for the original client (i.e., the current client). This ensures that `getKeySlot()` gets the correct slot cache. Additionally, the following modifications are made: 1. The 'sort' and 'sort_ro' commands use `getKeySlot()` instead of `c->slot` because the client could be an engine client in a script and can lead to potential bug. 2. `getKeySlot()` is also used in pubsub to obtain the slot for the channel, standardizing the way slots are retrieved.	2024-01-15 09:57:12 +08:00
bentotten	b3aaa0a136	When one shard, sole primary node marks potentially failed replica as FAIL instead of PFAIL (#12824 ) Fixes issue where a single primary cannot mark a replica as failed in a single-shard cluster.	2024-01-11 15:48:19 -08:00
Binbin	b351a04b1e	Add announced-endpoints test to all_tests and fix tls related tests (#12927 ) The test was introduced in #10745, but we forgot to add it to the test_helper.tcl, so our CI did not actually run it. This PR adds it and ensures it passes CI tests.	2024-01-09 18:18:59 -08:00
Madelyn Olson	8bb9a2895e	Address some failures with new tests for improving debug report (#12915 ) Fix a daily test failure because alpine doesn't support stack traces and add in an extra assertion related to making sure the stack trace was printed twice.	2024-01-08 17:56:06 -08:00
Madelyn Olson	068051e378	Handle recursive serverAsserts and provide more information for recursive segfaults (#12857 ) This change is trying to make two failure modes a bit easier to deep dive: 1. If a serverPanic or serverAssert occurs during the info (or module) printing, it will recursively panic, which is a lot of fun as it will just keep recursively printing. It will eventually stack overflow, but will generate a lot of text in the process. 2. When a segfault happens during the segfault handler, no information is communicated other than it happened. This can be problematic because `info` may help diagnose the real issue, but without fixing the recursive crash it might be hard to get at that info.	2024-01-02 18:20:22 -08:00
Chen Tianjie	8527959598	Replace slots_to_channels radix tree with slot specific dictionaries for shard channels. (#12804 ) We have achieved replacing `slots_to_keys` radix tree with key->slot linked list (#9356), and then replacing the list with slot specific dictionaries for keys (#11695). Shard channels behave just like keys in many ways, and we also need a slots->channels mapping. Currently this is still done by using a radix tree. So we should split `server.pubsubshard_channels` into 16384 dicts and drop the radix tree, just like what we did to DBs. Some benefits (basically the benefits of what we've done to DBs): 1. Optimize counting channels in a slot. This is currently used only in removing channels in a slot. But this is potentially more useful: sometimes we need to know how many channels there are in a specific slot when doing slot migration. Counting is now implemented by traversing the radix tree, and with this PR it will be as simple as calling `dictSize`, from O(n) to O(1). 2. The radix tree in the cluster has been removed. The shard channel names no longer require additional storage, which can save memory. 3. Potentially useful in slot migration, as shard channels are logically split by slots, thus making it easier to migrate, remove or add as a whole. 4. Avoid rehashing a big dict when there is a large number of channels. Drawbacks: 1. Takes more memory than using radix tree when there are relatively few shard channels. What this PR does: 1. in cluster mode, split `server.pubsubshard_channels` into 16384 dicts, in standalone mode, still use only one dict. 2. drop the `slots_to_channels` radix tree. 3. to save memory (to solve the drawback above), all 16384 dicts are created lazily, which means only when a channel is about to be inserted to the dict will the dict be initialized, and when all channels are deleted, the dict would delete itself. 5. use `server.shard_channel_count` to keep track of the number of all shard channels. 
--------- Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2023-12-27 17:40:45 +08:00
sundb	bef5715374	Fix oom-score-adj test due to no permission (#12887 ) Fix #12792 On ubuntu 23(lunar), non-root users will not be allowed to change the oom_score_adj of a process to a value that is too low. Since terminal's default oom_score_adj is 200, if we run the test on terminal, we won't be able to set the oom_score_adj of the redis process to 9 or 22, which is too low. Reproduction on ubuntu 23(lunar) terminal: ```sh $ cat /proc/`pgrep redis-server`/oom_score_adj 200 $ echo 100 > /proc/`pgrep redis-server`/oom_score_adj # success without error $ echo 99 > /proc/`pgrep redis-server`/oom_score_adj echo: write error: Permission denied ``` As from the output above, we can only set the minimum oom score of redis processes to 100. By modifying the test, make oom_score_adj only increase upwards and not decrease. --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2023-12-27 08:42:46 +02:00
Slava Koyfman	20214b26a4	Don't disconnect all clients in ACL LOAD (#12171 ) Previous implementation would disconnect _all_ clients when running `ACL LOAD`, which wasn't very useful. This change brings the behavior in line with that of `ACL SETUSER`, `ACL DELUSER`, in that only clients whose user is deleted or clients subscribed to channels which they no longer have access to will be disconnected. --------- Co-authored-by: Oran Agra <oran@redislabs.com> Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>	2023-12-24 11:56:44 +02:00
Binbin	09e0d338f5	redis-cli adds -4 / -6 options to determine IPV4 / IPV6 priority in DNS lookup (#11315) In this PR, we added -4 and -6 options to redis-cli to determine IPV4 / IPV6 priority in DNS lookup. This was mentioned in https://github.com/redis/redis/pull/11151#issuecomment-1231570651 For now it's only used in CLUSTER MEET. The options also make it possible to reliably test DNS lookup in CI; using them, we can add some localhost tests for #11151. The commit was cherry-picked from #11151, back then we decided to split the PR. Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2023-12-24 10:40:34 +02:00
Wen Hui	5dc631d880	Add missing test cases for hash commands (#12851) We don't have a test for HGETALL against a key that does not exist, so this adds one to the test suite, along with wrong-type cases for other commands that were missing them.	2023-12-17 14:02:53 +02:00
Chen Tianjie	e95a5d4831	Support by/get options for sort(_ro) in cluster mode when pattern implies slot. (#12728 ) The by/get options of sort/sort_ro command used to be forbidden in cluster mode, since we are not sure which slot the pattern may be in. As the optimization done in #12536, patterns now can be mapped to slots, we should allow by/get options in cluster mode when the pattern maps to the same slot as the key.	2023-12-13 21:16:36 +02:00
Binbin	3c0fd25201	Redact ACL username information and mark *-key-file-pass configs as sensitive (#12860) In #11489, we consider the ACL username to be sensitive information, consider ACL GETUSER a sensitive command, and remove it from the redis-cli historyfile. This PR redacts username information in ACL GETUSER and ACL DELUSER from SLOWLOG, and also removes ACL DELUSER from the redis-cli historyfile. This PR also marks tls-key-file-pass and tls-client-key-file-pass as sensitive configs, which redacts them from SLOWLOG and removes them from the redis-cli historyfile.	2023-12-13 15:28:13 +02:00
Chen Tianjie	f9cc25c1dd	Add metric to INFO CLIENTS: pubsub_clients. (#12849 ) In INFO CLIENTS section, we already have blocked_clients and tracking_clients. We should add a new metric showing the number of pubsub connections, which helps performance monitoring and trouble shooting.	2023-12-13 13:44:13 +08:00
Binbin	c85a9b7896	Fix delKeysInSlot server events are not executed inside an execution unit (#12745 ) This is a follow-up fix to #12733. We need to apply the same changes to delKeysInSlot. Refer to #12733 for more details. This PR contains some other minor cleanups / improvements to the test suite and docs. It uses the postnotifications test module in a cluster mode test which revealed a leak in the test module (fixed).	2023-12-11 20:15:19 +02:00
Chen Tianjie	991aff1c0f	Optimize KEYS when pattern includes hashtag and implies a single slot. (#12754) In #12536 we made a similar optimization for SCAN with hashtags in patterns. When we can make sure all keys matching the pattern will be in the same slot, we can limit the iteration to run on only that one slot.	2023-12-05 16:21:50 +02:00
sundb	91309f7981	Fix compilation warning in KeySpace_ServerEventCallback and add CFLAGS=-Werror flag for module CI (#12786 ) Warning: ``` postnotifications.c:216:77: warning: format specifies type 'long' but the argument has type 'uint64_t' (aka 'unsigned long long') [-Wformat] RedisModule_Log(ctx, "warning", "Got an unexpected subevent '%ld'", subevent); ~~~ ^~~~~~~~ %llu ``` CI: https://github.com/redis/redis/actions/runs/6937308713/job/18871124342#step:6:115 ## Other Add `CFLAGS=-Werror` flag for module CI. --------- Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2023-11-30 17:41:00 +02:00
zhaozhao.zz	3431b1f156	format cpu config as redis style (#7351 ) The following four configurations are renamed to align with Redis style: 1. server_cpulist renamed to server-cpulist 2. bio_cpulist renamed to bio-cpulist 3. aof_rewrite_cpulist renamed to aof-rewrite-cpulist 4. bgsave_cpulist renamed to bgsave-cpulist The original names are retained as aliases to ensure compatibility with old configuration files. We recommend users to gradually transition to using the new configuration names to maintain consistency in style.	2023-11-29 13:40:06 +08:00
zhaozhao.zz	a1c5171c1d	Fix resize hash tables stuck on the last non-empty slot (#12802) Introduced in #11695. The tryResizeHashTables function gets stuck on the last non-empty slot while iterating through dictionaries. It does not restart from the beginning. The reason for this issue is a problem with the usage of dbIteratorNextDict: `/* Returns next dictionary from the iterator, or NULL if iteration is complete. */ dict *dbIteratorNextDict(dbIterator *dbit) { if (dbit->next_slot == -1) return NULL; dbit->slot = dbit->next_slot; dbit->next_slot = dbGetNextNonEmptySlot(dbit->db, dbit->slot, dbit->keyType); return dbGetDictFromIterator(dbit); }` When iterating to the last non-empty slot, next_slot is set to -1, causing it to loop indefinitely on that slot. We need to modify the code to ensure that after iterating to the last non-empty slot, it returns to the first non-empty slot. BTW, function tryResizeHashTables is actually iterating over slots that have keys. However, in its implementation, it leverages the dbIterator (which is a key iterator) to obtain slot and dictionary information. This approach works, but it is not very intuitive. This PR also improves readability by changing the iteration to directly iterate over slots.	2023-11-28 18:50:16 +08:00
Binbin	d6f19539d2	Un-register notification and server event when RedisModule_OnLoad fails (#12809 ) When we register notification or server event in RedisModule_OnLoad, but RedisModule_OnLoad eventually fails, triggering notification or server event will cause the server to crash. If the loading fails on a later stage of moduleLoad, we do call moduleUnload which handles all un-registration, but when it fails on the RedisModule_OnLoad call, we only un-register several specific things and these were missing: - moduleUnsubscribeNotifications - moduleUnregisterFilters - moduleUnsubscribeAllServerEvents Refactored the code to reuse the code from moduleUnload. Fixes #12808.	2023-11-27 17:26:33 +02:00
meiravgri	2e854bccc6	Fix async safety in signal handlers (#12658 ) see discussion from after https://github.com/redis/redis/pull/12453 was merged ---- This PR replaces signals that are not considered async-signal-safe (AS-safe) with safe calls. #### 1. serverLog() and serverLogFromHandler() `serverLog` uses unsafe calls. It was decided that we will avoid `serverLog` calls by the signal handlers when: * The signal is not fatal, such as SIGALRM. In these cases, we prefer using `serverLogFromHandler` which is the safe version of `serverLog`. Note they have different prompts: `serverLog`: `62220:M 26 Oct 2023 14:39:04.526 # <msg>` `serverLogFromHandler`: `62220:signal-handler (1698331136) <msg>` * The code was added recently. Calls to `serverLog` by the signal handler have been there ever since Redis exists and it hasn't caused problems so far. To avoid regression, from now we should use `serverLogFromHandler` #### 2. `snprintf` `fgets` and `strtoul`(base = 16) --------> `_safe_snprintf`, `fgets_async_signal_safe`, `string_to_hex` The safe version of `snprintf` was taken from [here](`8cfc4ca5e7/src/mc_util.c (L754)`) #### 3. fopen(), fgets(), fclose() --------> open(), read(), close() #### 4. opendir(), readdir(), closedir() --------> open(), syscall(SYS_getdents64), close() #### 5. Threads_mngr sync mechanisms * waiting for the thread to generate stack trace: semaphore --------> busy-wait * `globals_rw_lock` was removed: as we are not using malloc and the semaphore anymore we don't need to protect `ThreadsManager_cleanups`. #### 6. Stacktraces buffer The initial problem was that we were not able to safely call malloc within the signal handler. To solve that we created a buffer on the stack of `writeStacktraces` and saved it in a global pointer, assuming that under normal circumstances, the function `writeStacktraces` would complete before any thread attempted to write to it. 
However, if threads lag behind, they might access this global pointer after it no longer belongs to the `writeStacktraces` stack, potentially corrupting memory. To address this, various solutions were discussed [here](https://github.com/redis/redis/pull/12658#discussion_r1390442896) Eventually, we decided to create a pipe at server startup that will remain valid as long as the process is alive. We chose this solution due to its minimal memory usage, and since `write()` and `read()` are atomic operations. It ensures that stack traces from different threads won't mix. The stacktraces collection process is now as follows: * Cleaning the pipe to eliminate writes of late threads from previous runs. * Each thread writes to the pipe its stacktrace * Waiting for all the threads to mark completion or until a timeout (2 sec) is reached * Reading from the pipe to print the stacktraces. #### 7. Changes that were considered and eventually were dropped * replace watchdog timer with a POSIX timer: according to [settimer man](https://linux.die.net/man/2/setitimer) > POSIX.1-2008 marks getitimer() and setitimer() obsolete, recommending the use of the POSIX timers API ([timer_gettime](https://linux.die.net/man/2/timer_gettime)(2), [timer_settime](https://linux.die.net/man/2/timer_settime)(2), etc.) instead. However, although it is supposed to conform to POSIX std, POSIX timers API is not supported on Mac. You can take a look here at the Linux implementation: [here](`c7562ee135`) To avoid messing up the code, and uncertainty regarding compatibility, it was decided to drop it for now. * avoid using sds (uses malloc) in logConfigDebugInfo It was considered to print config info instead of using sds, however apparently, `logConfigDebugInfo` does more than just print the sds, so it was decided this fix is out of this issue scope. #### 8. fix Signal mask check The check `signum & sig_mask` intended to indicate whether the signal is blocked by the thread was incorrect. 
Actually, the bit position in the signal mask corresponds to the signal number. We fixed this by changing the condition to: `sig_mask & (1L << (sig_num - 1))` #### 9. Unrelated changes both `fork.tcl `and `util.tcl` implemented a function called `count_log_message` expecting different parameters. This caused confusion when trying to run daily tests with additional test parameters to run a specific test. The `count_log_message` in `fork.tcl` was removed and the calls were replaced with calls to `count_log_message` located in `util.tcl` --------- Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com> Co-authored-by: Oran Agra <oran@redislabs.com>	2023-11-23 13:22:20 +02:00
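The signal mask fix in item 8 above can be illustrated with a small sketch. It assumes a mask laid out so that bit `n - 1` stands for signal `n`, as the commit message describes; `is_blocked_buggy` and `is_blocked` are hypothetical names, and the sketch uses `1UL` rather than the `1L` quoted in the commit only to keep the shift unsigned.

```c
#include <assert.h>

/* Buggy form: ANDs the signal number itself with the mask, so the result
 * depends on the binary digits of the number, not on its bit position. */
int is_blocked_buggy(unsigned long sig_mask, int sig_num) {
    return ((unsigned long)sig_num & sig_mask) != 0;
}

/* Fixed form: signal n is represented by bit (n - 1) of the mask. */
int is_blocked(unsigned long sig_mask, int sig_num) {
    assert(sig_num >= 1 && sig_num <= (int)(sizeof(unsigned long) * 8));
    return (sig_mask & (1UL << (sig_num - 1))) != 0;
}
```

For signal 14, a mask of `0x2000` (only bit 13 set) is reported as blocked by the fixed check and missed by the buggy one, whereas a mask of `0x2` (only signal 2 blocked) produces a false positive in the buggy check.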
Wen Hui	5a1f4b9aec	Adding missing SWAPDB related test cases. (#12769) We have some test cases for SWAPDB with watched keys, but we are missing separate basic SWAPDB test cases, the unhappy path, and FLUSHDB after SWAPDB. So added the test cases in keyspace.tcl.	2023-11-19 12:44:48 +02:00
Binbin	3d9c427f8c	Fix timing issue in CLUSTER SLAVE / REPLICAS consistent test (#12774 ) CI reports that this test failed, the reason is because during the command processing, the node processed PING/PONG, resulting in ping_sent or pong_received mismatch. Change to use MULTI to avoid timing issue. The test was introduced in #12224.	2023-11-19 11:09:33 +02:00
Binbin	fe36306340	Fix DB iterator not resetting pauserehash causing dict being unable to rehash (#12757) When using the DB iterator, it will use dictInitSafeIterator to init an old safe dict iterator. When dbIteratorNext is used, it will jump to the next slot db dict when we are done with a dict. During this process, we do not have any calls to dictResumeRehashing, which causes the dict's pauserehash to always be > 0. As a result, dictRehashMilliseconds returns immediately, which leaves the slot dict in a state where rehash cannot be completed. In the "expire scan should skip dictionaries with lot's of empty buckets" test, adding a `keys *` can reproduce the problem stably. `keys *` will call dbIteratorNext to trigger a traversal of all slot dicts. Added dbReleaseIterator and dbIteratorInitNextSafeIterator methods to call dictResetIterator. Issue was introduced in #11695.	2023-11-14 14:28:46 +02:00
Harkrishn Patro	9ca8490315	Increase timeout for expiry cluster tests (#12752 ) Test recently added fails on timeout in valgrind in GH actions. Locally with valgrind the test finishes within 1.5 sec(s). Couldn't find any issue due to lack of reproducibility. Increasing the timeout and adding an additional log to the test to understand how many keys were left at the end.	2023-11-11 12:01:04 +02:00
Meir Shpilraien (Spielrein)	0ffb9d2ea9	Before evicted and before expired server events are not executed inside an execution unit. (#12733) Redis 7.2 (#9406) introduced a new modules event, `RedisModuleEvent_Key`. This new event allows the module to read the key data just before it is removed from the database (either deleted, expired, evicted, or overwritten). When the key was removed from the database by either active expire or eviction, the new event was not called as part of an execution unit. This can cause an issue if the module registers a post notification job inside the event. This job will not be executed atomically with the expiration/eviction operation and will not be replicated inside a Multi/Exec. Moreover, the post notification job will be executed right after the event, where it is still not safe to perform any write operation. This violates the promise that a post notification job will be called atomically with the operation that triggered it and only when it is safe to write. This PR fixes the issue by wrapping each expiration/eviction of a key with an execution unit. This makes sure the entire operation will run atomically and all the post notification jobs will be executed at the end where it is safe to write. Tests were modified to verify the fix.	2023-11-08 09:28:22 +02:00
Yossi Gottlieb	6223355cf3	Use cross-platform-actions for FreeBSD support. (#12732 ) This change overcomes many stability issues experienced with the vmactions action. We need to limit VMs to 8GB for better stability, as the 13GB default seems to hang them occasionally. Shell code has been simplified since this action seem to use `bash -e` which will abort on non-zero exit codes anyway.	2023-11-06 18:07:14 +02:00
Roshan Khatri	15a048d4f0	re-enable defrag tests in cluster mode. (#12710) Reverts the skipping of defrag tests in cluster mode (done in #12672). Instead, it skips only some defrag tests that are relevant for cluster mode. The tests now run well after investigating and making the changes in #12674 and #12694. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-11-02 13:55:48 +02:00
Viktor Söderqvist	8878817d89	Optimize SCAN with MATCH when pattern implies cluster slot (#12536 ) Optimize the performance of SCAN commands when a match pattern can only contain keys from a single slot in cluster mode. This can happen when the pattern contains a hash tag before any wildcard matchers or when the key contains no matchers.	2023-11-01 00:06:49 -07:00
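The rule in the commit above (a hash tag before any wildcard, or no matchers at all) can be sketched as a check that finds which bytes of the pattern every matching key would have to hash. This is an illustration under assumptions, not the server's implementation: `pattern_hash_span` is a hypothetical helper, it conservatively gives up on escapes and on matchers inside the braces, and it stops short of computing the CRC16-based slot.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if every key matching `pattern` must hash the same
 * bytes, store that byte range in *start / *len and return 1. Return 0 when
 * matching keys could land in different slots. */
int pattern_hash_span(const char *pattern, size_t *start, size_t *len) {
    assert(pattern && start && len);
    size_t n = strlen(pattern);
    long open = -1; /* index of the first '{'; -1 if none; -2 after "{}" */
    for (size_t i = 0; i < n; i++) {
        char c = pattern[i];
        if (c == '*' || c == '?' || c == '[' || c == '\\') {
            return 0; /* matcher or escape seen before a complete tag */
        } else if (open == -1 && c == '{') {
            open = (long)i;
        } else if (open >= 0 && c == '}') {
            if ((size_t)open + 1 == i) {
                open = -2; /* empty tag "{}": the whole key is hashed */
            } else {
                *start = (size_t)open + 1; /* bytes between the braces */
                *len = i - *start;
                return 1;
            }
        }
    }
    *start = 0; /* no matchers at all: the pattern is one literal key */
    *len = n;
    return 1;
}
```

For example, `{user1}:*` yields the span `user1`, while `*{user1}` is rejected because the wildcard comes first.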
Harkrishn Patro	3fac869f02	Fix test, disable expiration until empty buckets are formed (#12689 ) Test failure on freebsd CI: ``` *** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl scan didn't handle slot skipping logic. ``` Observation: expiry of keys might happen before the empty buckets are formed and won't help with the expiry skip logic validation. Solution: Disable expiration until the empty buckets are formed.	2023-10-24 11:29:40 +03:00
Harkrishn Patro	26eb4ce397	Fix defrag test (#12674 ) Fixing issues started after #11695 when the defrag tests are being executed in cluster mode too. For some reason, it looks like the defragmentation is over too quickly, before the test is able to detect that it's running. so now instead of waiting to see that it's active, we wait to see that it did some work ``` [err]: Active defrag big list: cluster in tests/unit/memefficiency.tcl defrag not started. [err]: Active defrag big keys: cluster in tests/unit/memefficiency.tcl defrag didn't stop. ```	2023-10-22 11:56:45 +03:00
Harkrishn Patro	becd50d0da	Disable flaky defrag tests affecting daily run (#12672 ) Temporarily disabling few of the defrag tests in cluster mode to make the daily run stable: Active defrag eval scripts Active defrag big keys Active defrag big list Active defrag edge case	2023-10-19 21:12:58 +03:00
Harkrishn Patro	f3bf8485d8	Fix resize hash table dictionary iterator (#12660 ) Dictionary iterator logic in the `tryResizeHashTables` method is picking the next (incorrect) dictionary while the cursor is at a given slot. This could lead to some dictionary/slot getting skipped from resizing. Also stabilize the test. problem introduced recently in #11695	2023-10-19 13:58:32 +03:00
Vitaly	0270abda82	Replace cluster metadata with slot specific dictionaries (#11695 ) This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data. ## Important changes * Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms. * getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time. * Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree. 
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has an interesting side effect: you'll now be able to start scanning a specific slot by simply providing the slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth noting that the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue. * Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple times, hence we are relying on the cached slot id in the client during the command executions. All operations that access random keys should either pass in the known slot or recompute the slot. * Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading. * DB size - besides the `DBSIZE` API, we need to know the size of the DB in many places. In order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep the DBSIZE operation O(1). This is also kept for O(1) expires computation as well. ## Performance This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. 
For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict. RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load. ## Interface changes * Removed `overhead.hashtable.slot-to-keys` from `MEMORY STATS` * Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored. * New RDB version to support the new op code for SLOT information. --------- Co-authored-by: Vitaly Arbuzov <arvit@amazon.com> Co-authored-by: Harkrishn Patro <harkrisp@amazon.com> Co-authored-by: Roshan Khatri <rvkhatri@amazon.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Oran Agra <oran@redislabs.com>	2023-10-14 23:58:26 -07:00
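The binary index tree mentioned in the commit above (used both for random key selection and for skipping empty slots) can be sketched as a Fenwick tree over per-slot key counts. The names `slot_index`, `slot_index_add` and `slot_index_find` are hypothetical, and this sketch swaps in the standard single-pass Fenwick descent, which is O(log(slot count)), instead of the binary search over prefix sums that the commit describes as O(log^2(slot count)).

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLOTS 16384 /* number of cluster slots; must be a power of two */

/* 1-based binary indexed (Fenwick) tree of per-slot key counts.
 * Must start zeroed, e.g. declared static. */
typedef struct {
    unsigned long long tree[NUM_SLOTS + 1];
} slot_index;

/* Add `delta` keys (may be negative) to 0-based `slot`. */
void slot_index_add(slot_index *ix, int slot, long long delta) {
    assert(slot >= 0 && slot < NUM_SLOTS);
    for (int i = slot + 1; i <= NUM_SLOTS; i += i & -i)
        ix->tree[i] += (unsigned long long)delta;
}

/* Return the 0-based slot holding the key with 1-based index `target`,
 * counting keys in slot order. Caller guarantees 1 <= target <= total. */
int slot_index_find(const slot_index *ix, unsigned long long target) {
    assert(target >= 1);
    int pos = 0; /* number of leading slots known to lie before the target */
    for (int step = NUM_SLOTS; step > 0; step >>= 1) {
        int next = pos + step;
        if (next <= NUM_SLOTS && ix->tree[next] < target) {
            pos = next;
            target -= ix->tree[next];
        }
    }
    return pos;
}
```

Following the commit's own example, with 10 keys in slot 0 and more keys in a later slot, looking up the 11th key returns that later slot directly, without visiting the empty slots in between.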
Oran Agra	f0c1c730d4	test suite: clean server pids after server crashed (#12639) When a server in the test suite crashes and is restarted by restart_server, we didn't clean its pid from the list. We can see that when the corrupt-dump-fuzzer hangs, it has a long list of servers to clean, but in fact they're all already dead.	2023-10-13 16:28:52 +03:00
Harkrishn Patro	b784c5375e	Unsubscribe all clients from replica for shard channel if the master ownership changes (#12577 ) Unsubscribe all clients from replica for shard channel if the master ownership changes	2023-10-12 20:48:27 -07:00
zhaozhao.zz	77a65e82b2	support XREAD[GROUP] with BLOCK option in scripts (#12596 ) In #11568 we removed the NOSCRIPT flag from commands and keep the BLOCKING flag. Aiming to allow them in scripts and let them implicitly behave in the non-blocking way. In that sense, the old behavior was to allow LPOP and reject BLPOP, and the new behavior, is to allow BLPOP too, and fail it only in case it ends up blocking. So likewise, so far we allowed XREAD and rejected XREAD BLOCK, and we will now allow that too, and only reject it if it ends up blocking.	2023-10-12 10:54:50 +03:00
Oran Agra	b810384c62	dump server logs on hang in corrupt dump fuzzer test Recently there have been some incidents of hung tests in the CI. When we try to reproduce them, we get an assertion, not a hang. Maybe the server logs will reveal some info.	2023-10-08 16:19:31 +03:00
YaacovHazan	2cf50ddbad	Fix 'load corrupted rdb with no CRC' test (#12629 ) After the change in #12626 (`2e0f6724e`), the is_alive proc gets pid and not server config. This PR aligns it in 'load corrupted rdb with no CRC' test.	2023-10-03 11:09:25 +03:00
meiravgri	4ba9e18ef0	fix crash in crash-report and other improvements (#12623) ## Crash fix ### Current behavior We might crash if we fail to collect some of the threads' output, for example if it exceeds the timeout. The threads mngr API guarantees that the output array length will be `tids_len`, however, some indices can be NULL, in case it fails to collect some of the threads' outputs. When we use the threads mngr to collect the threads' stacktraces, we rely on this and skip NULL entries. Since the output array was allocated with malloc, instead of NULL, it contained garbage, so we got a segmentation fault when trying to read this garbage (in debug.c:writeStacktraces()). ### fix Allocate the global output array with zcalloc. ### To reproduce the bug, you'll have to change the code: in threadsmngr:ThreadsManager_runOnThreads(): make sure the g_output_array allocation is initialized with garbage and not 0s (add `memset(g_output_array, 2, sizeof(void*) * tids_len);` below the allocation). Force one of the threads to write to the array: add a global var: `static redisAtomic size_t return_now = 0;` add to `invoke_callback()` before writing to the output array: ``` size_t i_return; atomicGetIncr(return_now, i_return, 1); if(i_return == 1) return; ``` compile, start the server with `--enable-debug-command local` and run `redis-cli debug assert` The assertion triggers the stacktrace collection. Expect to get 2 prints of the stack trace - since we get the segmentation fault after we return from the threads mngr, it can be safely triggered again. ## Added global variables r/w lock in ThreadsManager To avoid a situation where the main thread runs `ThreadsManager_cleanups` while threads are still invoking the signal handler, we use a r/w lock. For cleanups, we will acquire the write lock. The threads will acquire the read lock to enable them to write simultaneously. If we fail to acquire the read lock, it means cleanups are in progress and we return immediately. 
After acquiring the lock we can safely check that the global output array wasn't nullified and proceed to write to it. This way we ensure the threads are not modifying the global variables or trying to write to the output array after they were zeroed/nullified/destroyed (the semaphore). ## other minor logging change 1. removed logging if the semaphore times out because the threads can still write to the output array after this check. Instead, we print the total number of printed stacktraces compared to the expected number (len_tids). 2. use noinline attribute to make sure the uplevel number of ignored stack trace entries stays correct. 3. improve testing Co-authored-by: Oran Agra <oran@redislabs.com>	2023-10-02 20:02:02 +03:00
YaacovHazan	2e0f6724e0	Stabilization and improvements around aof tests (#12626 ) In some tests, the code manually searches for a log message, and it uses tail -1 with a delay of 1 second, which can miss the expected line. Also, because the aof tests use start_server_aof and not start_server, the test name isn't logged in the server log. To fix the above, I made the following changes: - Change start_server_aof to wrap start_server. This adds the created aof server to the servers list, and makes srv() and wait_for_log_messages() available to the tests. - Introduce a new option for start_server, 'wait_ready', which lets the caller start the test code without waiting for the server to be ready. Useful for tests on a server that is expected to exit on startup. - Create a new start_server_aof_ex. The new proc also accepts options as an argument and makes use of the new 'short_life' option for tests that are expected to exit on startup because of some error in the aof file(s). Because of the above, I had to change many lines and replace every usage of the local srv variable (a server config) with srv().	2023-10-02 08:20:53 +03:00
guybe7	c2a4b78491	WAITAOF: Update fsynced_reploff_pending even if there's nothing to fsync (#12622 ) The problem is that WAITAOF could hang if commands were propagated only to replicas. This can happen if a module uses RM_Call with the REDISMODULE_ARGV_NO_AOF flag. In that case, master_repl_offset would increase, but there would be nothing to fsync, so in the absence of other traffic, fsynced_reploff_pending would stay static, and WAITAOF could hang. This commit updates fsynced_reploff_pending to the latest offset in flushAppendOnlyFile in case there's nothing to fsync, i.e. if it's behind because of the case described above, it will be refreshed and release the WAITAOF. Other changes: Fix a race in wait.tcl (client getting blocked vs. the fsync thread)	2023-09-28 17:19:20 +03:00
guybe7	bfa3931a04	WAITAOF: Update fsynced_reploff_pending just before starting the initial AOFRW fork (#12620 ) If we set `fsynced_reploff_pending` in `startAppendOnly`, and the fork doesn't start immediately (e.g. there's another fork active at the time), any subsequent commands will increment `server.master_repl_offset`, but will not cause a fsync (given they were executed before the fork started, they just ended up in the RDB part of it) Therefore, any WAITAOF will wait on the new master_repl_offset, but it will time out because no fsync will be executed. Release notes: ``` WAITAOF could timeout in the absence of write traffic in case a new AOF is created and an AOFRW can't immediately start. This can happen by the appendonly config is changed at runtime, but also after FLUSHALL, and replica full sync. ```	2023-09-28 17:05:53 +03:00
Binbin	9fe63bdc80	Dump server logs when corrupt fuzzer reports crash (#12612 ) Recently we found some signal crashes, but were unable to reproduce them. It is a good idea to dump the server logs when a failure happens.	2023-09-27 09:08:18 +03:00
meiravgri	cc2be63997	Print stack trace from all threads in crash report (#12453 ) In this PR we are adding the functionality to collect all the process's threads' backtraces. ## Changes made in this PR ### introduce threads mngr API The threads mngr API which has 2 abilities: * `ThreadsManager_init() `- register to SIGUSR2. called on the server start-up. * ` ThreadsManager_runOnThreads()` - receives a list of a pid_t and a callback, tells every thread in the list to invoke the callback, and returns the output collected by each invocation. Elaborating atomicvar API * `atomicIncrGet(var,newvalue_var,count) `-- Increment and get the atomic counter new value * `atomicFlagGetSet` -- Get and set the atomic counter value to 1 ### Always set SIGALRM handler SIGALRM handler prints the process's stacktrace to the log file. Up until now, it was set only if the `server.watchdog_period` > 0. This can be also useful if debugging is needed. However, in situations where the server can't get requests, (a deadlock, for example) we weren't able to change the signal handler. To make it available at run time we set SIGALRM handler on server startup. The signal handler name was changed to a more general `sigalrmSignalHandler`. ### Print all the process' threads' stacktraces `logStackTrace()` now calls `writeStacktraces()`, instead of logging the current thread stacktrace. `writeStacktraces()`: * On Linux systems we use the threads manager API to collect the backtraces of all the process' threads. To get the `tids` list (threads ids) we read the `/proc/<redis-server-pid>/tasks` file which includes a list of directories. Each directory name corresponds to one tid (including the main thread). For each thread, we also need to check if it can get the signal from the threads manager (meaning it is not blocking/ignoring that signal). We send the threads manager this tids list and `collect_stacktrace_data()` callback, which collects the thread's backtrace addresses, its name, and tid. 
* On other systems, the behavior remained as it was (writing only the current thread stacktrace to the log file). ## compatibility notes 1. The threads mngr API is only supported in linux. 2. glibc earlier than 2.3: we use `syscall(SYS_gettid)` and `syscall(SYS_tgkill...)` because their dedicated alternatives (`gettid()` and `tgkill`) were added in glibc 2.3. ## Output example Each thread backtrace will have the following format: `<tid> <thread_name> [additional_info]` * tid: as read from the `/proc/<redis-server-pid>/tasks` file * thread_name: the thread name as it is registered in the OS * additional_info: Sometimes we want to add specific information about one of the threads. Currently, it is only used to mark the thread that handles the backtraces collection by adding "". In case of crash - this also indicates which thread caused the crash. The handling thread won't necessarily appear first. ``` ------ STACK TRACE ------ EIP: /lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc] 67089 redis-server linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb9437790] /lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc] redis-server :6379(+0x75e0c)[0xaaaac2fe5e0c] redis-server :6379(aeProcessEvents+0x18c)[0xaaaac2fe6c00] redis-server :6379(aeMain+0x24)[0xaaaac2fe7038] redis-server :6379(main+0xe0c)[0xaaaac3001afc] /lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xffffb91d73fc] /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffffb91d74cc] redis-server :6379(_start+0x30)[0xaaaac2fe0370] 67093 bio_lazy_free /lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc] /lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc] redis-server :6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8] /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8] /lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c] 67091 bio_close_file /lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc] 
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc] redis-server :6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8] /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8] /lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c] 67092 bio_aof /lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc] /lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc] redis-server :6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8] /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8] /lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c] 67089:signal-handler (1693824528) -------- ```	2023-09-24 09:47:23 +03:00
Binbin	96e9dec419	Bump codespell from 2.2.4 to 2.2.5 (#12557 ) and adjustments.	2023-09-08 16:10:17 +03:00
alonre24	044e29dd34	redis-benchmark - add the support for binary strings (#9414 ) Recently, the option of sending an argument from stdin using the `-x` flag was added to redis-benchmark (this option is available in redis-cli as well). However, using the `-x` option for sending blobs that contain null characters doesn't work as expected - the argument is truncated at the first occurrence of `\x00` (unlike in redis-cli). This PR aims to fix this issue and add support for every binary string input, by sending the argument lengths to `redisFormatCommandArgv` when processing the redis-benchmark command, so we won't treat the arguments as C-strings. Additionally, we add simple test coverage for `-x` (without binary strings), remove an excessive server started in tests, and make sure to select db 0 so that `r` and the benchmark work on the same db. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-09-02 15:37:04 +03:00
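To illustrate why passing explicit argument lengths matters for binary input, here is a minimal sketch that formats one argument as a RESP bulk string (`$<len>\r\n<bytes>\r\n`) from a pointer plus a length, so embedded null bytes survive. The helper name `format_bulk` is hypothetical and is not part of redis-benchmark or hiredis; it only mirrors the idea described in the commit above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write one RESP bulk string ("$<len>\r\n<bytes>\r\n") into out.
 * The payload is copied by explicit length, never via strlen(), so a
 * '\0' byte inside the payload does not cut it short.
 * Returns the number of bytes written, or 0 if out is too small. */
size_t format_bulk(char *out, size_t cap, const char *data, size_t len) {
    assert(out != NULL && data != NULL);
    int hdr = snprintf(out, cap, "$%zu\r\n", len);
    if (hdr < 0 || (size_t)hdr + len + 2 > cap) return 0;
    memcpy(out + hdr, data, len);        /* raw bytes, may contain '\0' */
    memcpy(out + hdr + len, "\r\n", 2);  /* bulk string terminator */
    return (size_t)hdr + len + 2;
}
```

Treating the same buffer as a C-string would report a length that stops at the first null byte, which is the truncation the commit describes.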
Binbin	4ba144a4eb	Add logreqres:skip flag to new INFO obuf limit test (#12537 ) The new test added in #12476 causes reply-schemas-validator to fail. When doing `catch {r get key}`, the req-res output is: ``` 3 get 3 key 12 __argv_end__ $100000 aaaaaaaaaaaaaaaaaaaa...4 info 5 stats 12 __argv_end__ =1670 txt:# Stats ... ``` We can see that in the line after `$100000` there is a 4 at the end, which breaks the req-res-log-validator script since the format is wrong. My guess is that after the client reconnects (after hitting the output buffer limit), we do not add newlines but append the args directly. Since obuf-limits.tcl does the same thing and already has the logreqres:skip flag, this PR follows it.	2023-09-01 14:15:11 +03:00
Chen Tianjie	b26e8e3213	Optimize ZRANGE offset location from linear search to skiplist jump. (#12450 ) ZRANGE BYSCORE/BYLEX with the [LIMIT offset count] option was using every level in the skiplist to jump to the first/last node in range, but only used level[0] in the skiplist to locate the node at offset, resulting in sub-optimal performance when using LIMIT: ``` while (ln && offset--) { if (reverse) { ln = ln->backward; } else { ln = ln->level[0].forward; } } ``` It could be slow when offset is very big. We can get the total rank of the offset location and use the skiplist to jump to it. It is an improvement from O(offset) to O(log rank). Below shows how this is implemented (if the offset is positive): Use the skiplist to search for the first element in the range, record its rank `rank_0`, so we can have the rank of the target node `rank_t`. Meanwhile we record the last node we visited which has zsl->level-1 levels and its rank `rank_1`. Then we start from the zsl->level-1 node, use the skiplist to go forward `rank_t-rank_1` nodes to reach the target node. It is very similar when the offset is reversed. Note that if `rank_t` is very close to `rank_0`, we just start from the first element in range and go node by node; this is for the case when the zsl->level-1 node is too far away and it is quicker to reach the target node by node. Here is a test using a randomly generated zset including 10000 elements (with different positive scores), doing a benchmark which compares how fast the `ZRANGE` command is executed before and after the optimization. The start score is set to 0 and the count is set to 1 to make sure that most of the time is spent on locating the offset. 
``` memtier_benchmark -h 127.0.0.1 -p 6379 --command="zrange test 0 +inf byscore limit <offset> 1" ``` \| offset \| QPS(unstable) \| QPS(optimized) \| \|--------\|--------\|--------\| \| 10 \| 73386.02 \| 74819.82 \| \| 1000 \| 48084.96 \| 73177.73 \| \| 2000 \| 31156.79 \| 72805.83 \| \| 5000 \| 10954.83 \| 71218.21 \| With the result above, we can see that the original code is greatly slowed down when offset gets bigger, and with the optimization the speed is almost not affected. Similar results are generated when testing reversed offset: ``` memtier_benchmark -h 127.0.0.1 -p 6379 --command="zrange test +inf 0 byscore rev limit <offset> 1" ``` \| offset \| QPS(unstable) \| QPS(optimized) \| \|--------\|--------\|--------\| \| 10 \| 74505.14 \| 71653.67 \| \| 1000 \| 46829.25 \| 72842.75 \| \| 2000 \| 28985.48 \| 73669.01 \| \| 5000 \| 11066.22 \| 73963.45 \| And the same conclusion is drawn from the tests of ZRANGE BYLEX.	2023-08-31 14:42:08 +03:00
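The difference between walking level[0] node by node and jumping by rank can be sketched with a toy skiplist whose links carry span counters. This is not the code from the commit above: the `node` layout, `build`, `get_by_rank` and `walk_level0` are hypothetical names, and the list is built deterministically (the node at rank r gets one extra level per trailing zero bit of r) purely so the example is reproducible.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_LEVEL 8

typedef struct node {
    double score;
    struct {
        struct node *forward;
        unsigned long span;   /* number of level-0 nodes this link skips over */
    } level[MAX_LEVEL];
} node;

/* Build a toy skiplist over scores 1..n. nodes[0] is the header.
 * Returns NULL on allocation failure; the caller frees the array. */
node *build(unsigned long n) {
    node *nodes = calloc(n + 1, sizeof(node));
    if (!nodes) return NULL;
    unsigned long last[MAX_LEVEL] = {0};  /* last linked rank per level */
    for (unsigned long r = 1; r <= n; r++) {
        nodes[r].score = (double)r;
        int height = 1;
        while (height < MAX_LEVEL && (r >> (height - 1)) % 2 == 0) height++;
        for (int l = 0; l < height; l++) {
            nodes[last[l]].level[l].forward = &nodes[r];
            nodes[last[l]].level[l].span = r - last[l];
            last[l] = r;
        }
    }
    return nodes;
}

/* Jump to the node with the given 1-based rank, using the highest level
 * whose span does not overshoot. *steps counts the links followed. */
node *get_by_rank(node *header, unsigned long rank, unsigned long *steps) {
    assert(header != NULL && steps != NULL);
    node *x = header;
    unsigned long traversed = 0;
    *steps = 0;
    for (int l = MAX_LEVEL - 1; l >= 0; l--) {
        while (x->level[l].forward && traversed + x->level[l].span <= rank) {
            traversed += x->level[l].span;
            x = x->level[l].forward;
            (*steps)++;
        }
        if (traversed == rank) return x;
    }
    return NULL;  /* rank is beyond the end of the list */
}

/* The linear alternative: follow level[0] links offset times. */
node *walk_level0(node *start, unsigned long offset, unsigned long *steps) {
    node *x = start;
    *steps = 0;
    while (x && offset--) {
        x = x->level[0].forward;
        (*steps)++;
    }
    return x;
}
```

Both functions reach the same node; the rank-based lookup follows far fewer links as the offset grows, which matches the trend in the benchmark tables above.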
Binbin	9ce8c54d74	Update sort_ro reply_schema to mention the null reply (#12534 ) Also added a test to cover this case, so this can cover the reply schemas check.	2023-08-31 06:36:35 +03:00


2172 Commits