Fix LRU crash when getting too many random lua scripts (#1310)

### Problem
Valkey stores scripts in a dictionary (lua_scripts) keyed by their SHA1
hashes, but it needs a way to know which scripts are least recently
used. It uses an LRU list (lua_scripts_lru_list) to keep track of
scripts in usage order. When the list reaches a maximum length, Valkey
evicts the oldest scripts to free memory in both the list and
dictionary. The problem here is that the sds from the LRU list can be
pointing to already freed/moved memory by active defrag that the sds in
the dictionary used to point to. It results in assertion error at [this
line](https://github.com/valkey-io/valkey/blob/unstable/src/eval.c#L519)

### Solution
If we duplicate the sds when adding it to the LRU list, we can create an
independent copy of the script identifier (sha). This duplication
ensures that the sha string in the LRU list remains stable and
unaffected by any defragmentation that could alter or free the original
sds. In addition, dictUnlink doesn't require exact pointer
match([ref](https://github.com/valkey-io/valkey/blob/unstable/src/eval.c#L71-L78))
so this change makes sense to unlink the right dictEntry with the copy
of the sds.

### Reproduce
To reproduce it with tcl test:
1. Disable je_get_defrag_hint in defrag.c to trigger defrag often
2. Execute test script
```
start_server {tags {"auth external:skip"}} {

    test {Regression for script LRU crash} {
        r config set activedefrag yes
        r config set active-defrag-ignore-bytes 1
        r config set active-defrag-threshold-lower 0
        r config set active-defrag-threshold-upper 1
        r config set active-defrag-cycle-min 99
        r config set active-defrag-cycle-max 99

        for {set i 0} {$i < 100000} {incr i} {
            r eval "return $i" 0
        }
        after 5000;
    }
}
```


### Crash info
Crash report:
```
=== REDIS BUG REPORT START: Cut & paste starting from here ===
14044:M 12 Nov 2024 14:51:27.054 # === ASSERTION FAILED ===
14044:M 12 Nov 2024 14:51:27.054 # ==> eval.c:556 'de' is not true

------ STACK TRACE ------

Backtrace:
/usr/bin/redis-server 127.0.0.1:6379 [cluster](luaDeleteFunction+0x148)[0x723708]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](luaCreateFunction+0x26c)[0x724450]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](evalCommand+0x2bc)[0x7254dc]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](call+0x574)[0x5b8d14]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processCommand+0xc84)[0x5b9b10]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processCommandAndResetClient+0x11c)[0x6db63c]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processInputBuffer+0x1b0)[0x6dffd4]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x6bd968]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x659634]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](amzTLSEventHandler+0x194)[0x6588d8]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x750c88]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](aeProcessEvents+0x228)[0x757fa8]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](redisMain+0x478)[0x7786b8]
/lib64/libc.so.6(__libc_start_main+0xe4)[0xffffa7763da4]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x5ad3b0]
```
Defrag info:
```
mem_fragmentation_ratio:1.18
mem_fragmentation_bytes:47229992
active_defrag_hits:20561
active_defrag_misses:5878518
active_defrag_key_hits:77
active_defrag_key_misses:212
total_active_defrag_time:29009
```

### Test:
Run the test script to push 100,000 scripts to ensure the LRU list keeps
500 maximum length without any crash.
```
27489:M 14 Nov 2024 20:56:41.583 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.583 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
[ok]: Regression for script LRU crash (6811 ms)
[1/1 done]: unit/test (7 seconds)
```

---------

Signed-off-by: Seungmin Lee <sungming@amazon.com>
Signed-off-by: Seungmin Lee <155032684+sungming2@users.noreply.github.com>
Co-authored-by: Seungmin Lee <sungming@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
This commit is contained in:
Seungmin Lee 2024-11-18 18:06:35 -08:00 committed by GitHub
parent f9d0b87622
commit 3d0c834203
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

View File

@ -199,10 +199,12 @@ void scriptingInit(int setup) {
}
/* Initialize a dictionary we use to map SHAs to scripts.
* Initialize a list we use for lua script evictions, it shares the
* sha with the dictionary, so free fn is not set. */
* Initialize a list we use for lua script evictions.
* Note that we duplicate the sha when adding to the lru list due to defrag,
* and we need to free them respectively. */
lctx.lua_scripts = dictCreate(&shaScriptObjectDictType);
lctx.lua_scripts_lru_list = listCreate();
listSetFreeMethod(lctx.lua_scripts_lru_list, (void (*)(void *))sdsfree);
lctx.lua_scripts_mem = 0;
luaRegisterServerAPI(lua);
@ -518,9 +520,6 @@ void luaDeleteFunction(client *c, sds sha) {
dictEntry *de = dictUnlink(lctx.lua_scripts, sha);
serverAssertWithInfo(c ? c : lctx.lua_client, NULL, de);
luaScript *l = dictGetVal(de);
/* We only delete `EVAL` scripts, which must exist in the LRU list. */
serverAssert(l->node);
listDelNode(lctx.lua_scripts_lru_list, l->node);
lctx.lua_scripts_mem -= sdsAllocSize(sha) + getStringObjectSdsUsedMemory(l->body);
dictFreeUnlinkedEntry(lctx.lua_scripts, de);
}
@ -549,11 +548,12 @@ listNode *luaScriptsLRUAdd(client *c, sds sha, int evalsha) {
listNode *ln = listFirst(lctx.lua_scripts_lru_list);
sds oldest = listNodeValue(ln);
luaDeleteFunction(c, oldest);
listDelNode(lctx.lua_scripts_lru_list, ln);
server.stat_evictedscripts++;
}
/* Add current. */
listAddNodeTail(lctx.lua_scripts_lru_list, sha);
listAddNodeTail(lctx.lua_scripts_lru_list, sdsdup(sha));
return listLast(lctx.lua_scripts_lru_list);
}