2435 Commits

Author SHA1 Message Date
Roman Gershman
420046aac8
fix: properly serialize meta buffer in SendStringArrInternal (#3455)
Fixes #3449 that was introduced by #3425

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-06 10:43:05 +03:00
Roman Gershman
e482eefcbb
chore: disable serialization_max_chunk_size in regtests (#3445)
Intended to stabilize regression tests before releasing our next version.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-05 07:38:18 +00:00
Shahar Mike
38fba1d398
fix: cluster_mgr.py to use CLUSTER MYID (#3444) 2024-08-05 07:29:31 +00:00
Borys
faea4eef45
test: fix test_disconnect_replica (#3442) 2024-08-05 10:07:27 +03:00
Roman Gershman
6da445fcfe
feat: DEBUG REPLICA PAUSE now pauses fullsync (#3441)
Previously, PAUSE only paused the reconnection reconciler flow;
now it also stops the ongoing full sync replication, if one exists.

In addition, this PR applies some clean-ups and removes redundant code

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-05 09:42:57 +03:00
Kostas Kyrimis
3f08a60148
chore: reset serialization_max_chunk_size to 0 (#3432)
* reset serialization_max_chunk_size to 0
* reword flag information

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-08-05 09:36:23 +03:00
Roman Gershman
9eacedf58e
chore: simplify master replication cancelation interface (#3439)
* chore: simplify master replication cancelation interface

Previously, CancelReplication did too many things; moreover,
we had StopReplication, which did the same.

This PR moves CancelReplication under ReplicaInfo struct,
and reduces code duplication around this change.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>

* Update src/server/dflycmd.cc

Co-authored-by: Shahar Mike <chakaz@users.noreply.github.com>
Signed-off-by: Roman Gershman <romange@gmail.com>

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Signed-off-by: Roman Gershman <romange@gmail.com>
Co-authored-by: Shahar Mike <chakaz@users.noreply.github.com>
2024-08-04 19:00:52 +00:00
Vladislav
55d39b66ff
chore: fix memcached pipeline test (#3438) 2024-08-04 15:41:17 +03:00
Roman Gershman
8f7c36e4b3
chore: reorganize EngineShard::Heartbeat (#3437)
* chore: reorganize EngineShard::Heartbeat

1. Simplify CacheStats by using accessors directly provided by DbSlice
2. Separate eviction for tiering as tiering can be done on replica.
---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-04 15:00:43 +03:00
Roman Gershman
cfd2273fb0
chore: improve replication locks (#3436)
* chore: improve replication locks

Allow non-exclusive, read-only access to Dfly::ReplicaInfo structure.
The most important change is in DflyCmd::CancelReplication, which previously
locked the ReplicaInfo mutex and then went on to lock the global mutex.
This is dangerous because most operations lock them in the opposite order.

Also rename ambiguous GetReplicaInfo accessors to clearer names.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>

* chore: comments

* chore: comments
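
The lock-ordering hazard described in this commit can be sketched as follows. This is a minimal illustration, not the code from the PR: `ReplicaState`, `global_mu`, `CancelInOrder` and `CancelWithScopedLock` are hypothetical names standing in for the ReplicaInfo mutex and the global mutex. The point is only that every code path must acquire the two mutexes in the same order, or use a deadlock-avoiding multi-lock.

```cpp
#include <mutex>

// Hypothetical stand-in for a per-replica structure guarded by its own mutex.
struct ReplicaState {
  std::mutex mu;
  bool cancelled = false;
};

// Hypothetical stand-in for the global replication mutex.
std::mutex global_mu;

// Consistent order: global mutex first, then the per-replica mutex.
// If one path locked replica.mu first and another locked global_mu first,
// the two could block each other forever.
void CancelInOrder(ReplicaState& replica) {
  std::lock_guard<std::mutex> global_lock(global_mu);
  std::lock_guard<std::mutex> replica_lock(replica.mu);
  replica.cancelled = true;
}

// Alternative: std::scoped_lock locks both mutexes using a
// deadlock-avoidance algorithm, regardless of argument order.
void CancelWithScopedLock(ReplicaState& replica) {
  std::scoped_lock lock(global_mu, replica.mu);
  replica.cancelled = true;
}
```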

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-04 10:55:50 +00:00
Vladislav
2ef475865f
test(cluster): Migration replication test (#3417) 2024-08-04 12:45:02 +03:00
Shahar Mike
2aa0b70035
feat(server): Support replica-announce-ip/port (#3421)
* feat: Support `replica-announce-ip`/`port`

Before this PR, we only supported `cluster_announce_ip`.
It's basically the same feature, but used for cluster announcements
instead of replication.

This PR adds support for `replica-announce-ip` and
`replica-announce-port`, which can be set via new flags `--announce_ip=`
and `--announce_port=`. These flags apply to both cluster and replica
announcements.

Tested via running Sentinel, and making sure it is able to connect to
announced ip+port, while it can't connect to announced false /
unavailable ip+port.

Note: this PR deprecates `--cluster_announce_ip`, but continues to
support it. We will remove it in a future version.

Fixes #3380

* fix failing test

* destructure
2024-08-04 12:35:14 +03:00
Roman Gershman
c9ed3f7b2b
chore: retire TEST_EnableHeartBeat (#3435)
Unit tests now run the same Heartbeat fiber as in prod.
The whole feature was redundant: with just a few explicit settings of maxmemory_limit,
I succeeded in making all unit tests pass.

In addition, this change allows passing a global handler that is called by heartbeat from a single thread.
This is not used yet - preparation for the next PR to break hung up replication connections on a master.

Finally, this change has some non-functional clean-ups and warning fixes to improve code quality.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-03 20:17:23 +03:00
Vladislav
82298b8122
fix(server): Implement SCRIPT GC command (#3431)
* fix(server): Implement SCRIPT GC command
2024-08-02 23:49:51 +03:00
Roman Gershman
f652f10743
chore: optimize SendStringArrInternal even more (#3425)
Before - sending 200K items requires more than 12K send calls.
Now - requires less than 2K calls. Latency also went down though not by x6.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-02 14:53:20 +03:00
Roman Gershman
8622c27ce1
chore: expose metric that shows how many task submitters are blocked (#3427)
* chore: expose metric that shows how many task submitters are blocked

This should help us in identifying deadlocks quickly.

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-01 21:27:15 +03:00
Borys
e2b6cfb384
chore: skip cluster tests if redis-server wasn't found (#3416)
* chore: skip cluster tests if redis-server wasn't found
2024-08-01 13:04:02 +00:00
Roman Gershman
a0918de2d3
feat: Support non-root paths for json.merge (#3419)
* feat: Support non-root paths for json.merge

Pass path argument and rewrite the JSON.MERGE code similar to OpToggle
or other mutating functions. Currently works only with --experimental_flat_json=false.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>

* chore: comments

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-01 08:33:36 +00:00
Roman Gershman
0ad310717d
chore: Tiered fixes (#3401)
1. Add background offloading stats
2. remove direct_fd override - helio is already updated with default=false, so it's not needed anymore.
3. remove redundant tiered_storage_memory_margin flag

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-01 11:03:13 +03:00
Roman Gershman
71b861572a
chore: remove verbose printing of tests (#3420)
Motivation: to avoid 80MB logs into stdout like this one:
https://github.com/dragonflydb/dragonfly/actions/runs/10174852001/job/28141278813?pr=3401

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-01 10:52:50 +03:00
Vladislav
e273015c0b
fix(connection): Count memcached pipelined commands (#3413) 2024-08-01 10:14:36 +03:00
Kostas Kyrimis
7e911c100a
fix: json.merge exception crash (#3409)
json.merge would throw an exception when the json object did not contain the element to replace because RecursiveMerge functions used &dest->at(k_v.key()) which threw the exception. Remove RecursiveMerge completely and use the one implemented in jsoncons lib.

* add test
* replace RecursiveMerge with mergepatch::apply_merge_patch
* add exception handling for that flow
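
The failure mode described above (a lookup with `at()` throwing when the destination lacks the key) can be reproduced with a flat `std::map`. This sketch is only an illustration of that failure mode under simplified assumptions: `Object`, `MergeWithAt` and `MergeWithInsert` are hypothetical names, the merge is not recursive, and it does not use jsoncons or implement full merge-patch semantics.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical flat "object": string keys mapped to string values.
using Object = std::map<std::string, std::string>;

// Mirrors the problematic pattern: at() throws std::out_of_range
// when the destination does not already contain the key.
void MergeWithAt(Object& dest, const Object& patch) {
  for (const auto& [key, value] : patch) {
    dest.at(key) = value;
  }
}

// Safer variant: insert the key if it is missing, overwrite it otherwise.
void MergeWithInsert(Object& dest, const Object& patch) {
  for (const auto& [key, value] : patch) {
    dest.insert_or_assign(key, value);
  }
}
```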

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-07-31 17:03:36 +03:00
Borys
558a22d5b8
fix: crash with NS in multi/exec #3410 (#3415) 2024-07-31 10:06:32 +00:00
Kostas Kyrimis
1aa0720843
chore: increase timeout of regression tests (#3412)
The recent change that set serialization_max_chunk_size to 1 for extreme testing increased the running time of the tests, causing them to sometimes time out.

* increase timeout on reg tests from 40 to 50

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-07-31 07:28:44 +00:00
Kostas Kyrimis
aa02070e3d
chore: add db_slice lock to protect segments from preemptions (#3406)
DashTable::Traverse is error prone when the callback passed preempts because the segment might change. This is problematic and we need atomicity while traversing segments with preemption. The fix is to add Traverse in DbSlice and protect the traversal via ThreadLocalMutex.

* add ConditionFlag to DbSlice
* add Traverse in DbSlice and protect it with the ConditionFlag
* remove condition flag from snapshot
* remove condition flag from streamer

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-07-30 15:02:54 +03:00
Vladislav
f536f8afbd
chore: cancel slot migrations on shutdown (#3405) 2024-07-30 12:47:58 +03:00
Roman Gershman
e464990643
feat: stabilize non-coordinated omission mode (#3407)
* feat: stabilize non-coordinated omission mode

1. Our latency/RPS computations were off because we started measuring before drivers
   started running. Now, Run/Start phases are separated, so the start time is measured more precisely
   (after the start phase)
2. Introduced progress per connection - one of my discoveries is that driver
   connections progress at a different pace when running in coordinated omission mode.
   This can reach x5 speed differences. Now we measure and output fastest/slowest progress.
3. Coordinated omission is great when the Server Under Test is able to sustain the required RPS.
   But if the actual RPS is lower than the one being sent, then the final latencies will be infinitely high.
   We fix it by introducing a self-adjusting sleep interval, so if the actual RPS is lower
   we will increase the interval to be closer to the actual RPS.

Show p99 latency and maximum pending requests per connection.
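
One way to picture the self-adjusting sleep interval from item 3 is the sketch below. It is an assumption-laden illustration rather than the dfly_bench implementation: `AdjustIntervalUs` and its parameters are hypothetical, and the rule used here (never send faster than requested, but slow down to the observed rate when the server lags) is just one plausible reading of the description.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: returns the next per-request sleep interval in microseconds.
//   target_interval_us - interval implied by the requested RPS
//   requests_done      - requests completed so far on this connection
//   elapsed_us         - time spent completing them
uint64_t AdjustIntervalUs(uint64_t target_interval_us, uint64_t requests_done,
                          uint64_t elapsed_us) {
  if (requests_done == 0) {
    return target_interval_us;  // nothing measured yet
  }
  // Average time one request actually takes at the observed throughput.
  uint64_t actual_interval_us = elapsed_us / requests_done;
  // If the server is slower than requested, stretch the interval towards
  // the observed rate; otherwise keep the requested pace.
  return std::max(target_interval_us, actual_interval_us);
}
```

With a 1000 us target, a connection that completed 100 requests in 500000 us would be slowed to a 5000 us interval, so pending requests stop piling up without bound.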

Co-authored-by: Shahar Mike <chakaz@users.noreply.github.com>
Signed-off-by: Roman Gershman <romange@gmail.com>

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Signed-off-by: Roman Gershman <romange@gmail.com>
Co-authored-by: Shahar Mike <chakaz@users.noreply.github.com>
2024-07-30 11:55:43 +03:00
Shahar Mike
89a48a7aa8
chore: Support setting the value of replica-priority (#3400)
* chore: Support setting the value of `replica-priority`

This PR adds a small refactor to the way we set and get config names
which have dashes (`-`) and underscores (`_`).

Until now, words were separated by underscores because this is how our
flags library (absl) works. However, this is incompatible with Valkey,
which uses dashes as a word separator.

Once merged, we will support both underscores and dashes in config
names, but will only return the name with dashes. **This is a behavior
change**.

We're doing this in order to be compatible with `replica-priority` and
possibly other config names that Valkey uses.

* Flag restore

* normalize to '_'
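
The dash/underscore handling described in this commit can be sketched roughly as below. The function names `CanonicalFlagName` and `ExternalConfigName` are hypothetical and not taken from the PR; the sketch only assumes that names are stored with underscores internally (as the flags library expects) and reported with dashes.

```cpp
#include <algorithm>
#include <string>

// Hypothetical: map a user-supplied config name to the internal flag name.
// Accepts both "replica-priority" and "replica_priority".
std::string CanonicalFlagName(std::string name) {
  std::replace(name.begin(), name.end(), '-', '_');
  return name;
}

// Hypothetical: map an internal flag name to the dash-separated name
// that is returned to clients.
std::string ExternalConfigName(std::string name) {
  std::replace(name.begin(), name.end(), '_', '-');
  return name;
}
```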
2024-07-29 23:02:49 +03:00
Shahar Mike
7100168bab
chore: Don't print password to log on replica AUTH failure (#3403) 2024-07-29 22:36:39 +03:00
Roman Gershman
776bd79381
fix: reenable macos builds (#3399)
* fix: reenable macos builds

Also, add a debug function that prints local state if deadlocks occur.

* fix: free cold memory for non-cache mode as well

* chore: disable FreeMemWithEvictionStep again

Because it heavily affects the performance when performing evictions.

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-07-28 22:40:51 +03:00
Vladislav
1a8c12225b
chore(tiering): Move cool entry warmup to DbSlice (#3397)
Signed-off-by: Vladislav Oleshko <vlad@dragonflydb.io>
2024-07-28 17:30:41 +03:00
Shahar Mike
20bda84317
Revert "chore: set serialization_max_chunk_size to 1 byte (#3379)" (#3398)
This reverts commit 2867d54a05.
2024-07-28 06:48:46 +00:00
Stepan Bagritsevich
28cfde0a27
fix: Fix unsupported object type rejson-rl in RedisInsight (#3384)
* fix: Fix unsupported object type rejson-rl in RedisInsight

Signed-off-by: Stepan Bagritsevich <bagr.stepan@gmail.com>

* fix(generic_family): fix case for the TYPE option in SCAN command

Signed-off-by: Stepan Bagritsevich <bagr.stepan@gmail.com>

* feat(generic_family_test): Add test for the Redis GUI

Signed-off-by: Stepan Bagritsevich <bagr.stepan@gmail.com>

* refactor: address comments

Signed-off-by: Stepan Bagritsevich <bagr.stepan@gmail.com>

* refactor: address comments 2

Signed-off-by: Stepan Bagritsevich <bagr.stepan@gmail.com>

* refactor: change variable name from obj_type_as_string to obj_type

Signed-off-by: Stepan Bagritsevich <bagr.stepan@gmail.com>

---------

Signed-off-by: Stepan Bagritsevich <bagr.stepan@gmail.com>
2024-07-27 19:05:00 +02:00
Roman Gershman
6b67f44e29
chore: tiering - make Modify work with cool storage (#3395)
1. Fully support tiered_experimental_cooling for all operations
2. Offset cool storage usage when computing memory pressure situations in Heartbeat.
3. Introduce realtime entry counting per db_slice and provide DCHECK to verify it vs the old approach.
   Later we will switch to realtime entry and free memory computations when computing bytes per object,
   and remove the old approach in CacheStats().
4. Show hit rate during the run of dfly_bench loadtest.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-07-27 14:31:29 +03:00
Kostas Kyrimis
9d16bd6f6e
fix(acl): remove none from acl categories (#3392)
None does not exist in Valkey and its entry was missing from the indexes we use to map categories to commands leading to an out of bounds access and causing a segfault.

* remove none from acl categories

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-07-26 09:27:29 +03:00
Roman Gershman
0a26a06065
chore: tiered fixes (#3393)
1. Use intrusive::list for CoolQueue.
2. Make sure that we ignore cool memory usage when computing average object size to
   prevent evictions during dashtable growth attempts.
3. Remove items from the cool storage before evicting them from the dash table.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-07-25 23:38:44 +03:00
Kostas Kyrimis
2867d54a05
chore: set serialization_max_chunk_size to 1 byte (#3379)
Update the flag for extreme testing. We should remove this before the release.

* set serialization_max_chunk_size to 1 byte

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-07-25 23:10:44 +03:00
Kostas Kyrimis
6d9e370e2d
fix: test_big_value_serialization_memory_limit shutdown timeout (#3390)
The problem is that the test test_big_value_serialization_memory_limit will try to shutdown dragonfly at the end with a timeout of 15 seconds. Dragonfly during shutdown takes a snapshot which might take more than 15 seconds and the test fails.

* call flushall before we exit the test

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-07-25 21:38:09 +03:00
Kostas Kyrimis
79d7f57b67
fix: disable inline transactions when db_slice has registered callbacks (#3391)
Inline transactions do not acquire any locks and therefore they should not preempt. This is no longer true when db_slice has registered callbacks.

* disable inline transactions when db_slice has registered callbacks

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-07-25 16:06:35 +00:00
Kostas Kyrimis
a95cf2e857
chore: do not preempt on db_slice::RegisterOnChange (#3388)
For big value serialization it is required to support preemption when db_slice::RegisterOnChange is called to avoid UB when a code path is iterating over the change_cb_ and preempts because it serializes a big value. As this is problematic and can lead to data inconsistencies I replace the std::vector with std::list and bound the iteration of change_cb_ on paths that preempt.

* replace std::vector with std::list for change_cb_
* bound iteration of change_cb_ on paths that preempt
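
Why a `std::list` with a bounded iteration helps can be shown with a small sketch. `ChangeRegistry`, `Register`, `Notify` and `LastId` are hypothetical names, and registering a callback from inside another callback stands in for a fiber preempting mid-iteration; the actual change_cb_ handling in db_slice may differ. The sketch relies on two facts: appending to a `std::list` never invalidates existing iterators (appending to a `std::vector` may reallocate), and an upper bound captured before iterating keeps late registrations out of the current pass.

```cpp
#include <cstdint>
#include <functional>
#include <list>
#include <utility>

using ChangeCallback = std::function<void(int)>;

// Hypothetical registry of change callbacks, each tagged with an increasing id.
class ChangeRegistry {
 public:
  uint64_t Register(ChangeCallback cb) {
    callbacks_.emplace_back(next_id_, std::move(cb));
    return next_id_++;
  }

  // Id of the most recently registered callback (0 if none).
  uint64_t LastId() const { return next_id_ - 1; }

  // Invokes callbacks with id <= upper_bound. Callbacks registered while
  // this loop runs have larger ids and are skipped. std::list keeps `it`
  // valid even if a callback appends to callbacks_.
  void Notify(uint64_t upper_bound, int value) {
    for (auto it = callbacks_.begin(); it != callbacks_.end(); ++it) {
      if (it->first > upper_bound) break;
      it->second(value);
    }
  }

 private:
  uint64_t next_id_ = 1;
  std::list<std::pair<uint64_t, ChangeCallback>> callbacks_;
};
```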

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-07-25 16:08:02 +03:00
Kostas Kyrimis
4b851be57a
fix: remove fiber guard from non atomic section (#3381)
We might preempt when we serialize a big value, and the code in journal was protected by an atomic guard, which triggered a check failure.

* remove fiber guard from non atomic section
* move LocalBlockingCounter to common

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-07-25 16:06:35 +03:00
Roman Gershman
e2d65a0900
chore: reenable evictions upon insertion to avoid OOM rejections (#3387)
* chore: reenable evictions upon insertion to avoid OOM rejections

Before: when running dragonfly with --cache_mode we could get OOM rejections
even though the eviction policy allowed to evict items to free memory.
Ideally, dragonfly in cache mode should not respond with the OOM error.

This PR reuses the same Eviction step we have in the Heartbeat and conditionally applies it
during the insertion. In my test the OOM errors went from 500K to 0 and the server
still respected memory limit.

Also, remove the old heuristic that has never been used.

Test:

./dfly_bench --key_prefix=bar: -d 1024 --ratio=1:0 --qps=200 -n 3000
./dragonfly --dbfilename=  --proactor_threads=2 --maxmemory=600M --cache_mode

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-07-25 15:28:57 +03:00
Shahar Mike
fb4222d01e
fix: Fix test_take_over_seeder (#3385)
* fix: Fix `test_take_over_seeder`

There are a few issues with the test:

1. Not using the admin port, which could cause pause to deadlock
2. Not waiting for some of the `task`s (although that won't cause a
   failure)

But also in the product code:

1. We used to `std::move()` the same pointer multiple times
2. We assigned to the same status object from multiple threads

Hopefully this fixes the test. It used to fail every ~100 attempts on my
machine, now it's been >1,000 and they all passed.

* add comments

* remove shard_ptr param
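
The first product-code issue, moving the same pointer more than once, can be illustrated with the sketch below. `DistributeBuggy` and `DistributeFixed` are hypothetical functions written for this example, not code from the PR. After the first `std::move`, a `std::shared_ptr` is empty, so every later recipient gets a null pointer.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Buggy pattern: the pointer is moved on every iteration, so only the
// first recipient gets a valid pointer; the rest receive nullptr.
std::vector<std::shared_ptr<int>> DistributeBuggy(std::shared_ptr<int> ptr, int n) {
  std::vector<std::shared_ptr<int>> out;
  for (int i = 0; i < n; ++i) {
    out.push_back(std::move(ptr));
  }
  return out;
}

// Fixed pattern: copy the shared_ptr so every recipient shares ownership.
std::vector<std::shared_ptr<int>> DistributeFixed(std::shared_ptr<int> ptr, int n) {
  std::vector<std::shared_ptr<int>> out;
  for (int i = 0; i < n; ++i) {
    out.push_back(ptr);
  }
  return out;
}
```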
2024-07-25 08:00:05 +00:00
Roman Gershman
181d356341
chore: update cached stats inside PollExecution (#3376)
* chore: update cached stats inside PollExecution
2024-07-25 10:46:03 +03:00
Roman Gershman
8a9c9adbc5
chore: introduce a cool queue that gradually retires cool items (#3377)
* chore: introduce a cool queue that gradually retires cool items

This PR introduces a new state in which the offloaded value is not freed from memory but instead stays
in the cool queue.

Upon Read we convert the cool value back to the hot table and delete it from storage.
When we are low on memory we retire the oldest cool values until we are above the threshold.

The PR does not fully finish the feature but it is workable enough to start (load)testing.
Missing:
a) Handle Modify operations
b) Retire cool items in more cases where we are low on memory. Specifically, refrain from evictions as long as cool items exist.
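
The cool-queue idea can be sketched as a small data structure. This is a simplified model under stated assumptions, not the tiered-storage code: the class layout, `PushCool`, `Warm` and `RetireUntil` are hypothetical names, values are plain strings, and the "threshold" is modeled as a byte budget for cool values rather than a free-memory check.

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical model: offloaded values keep an in-memory copy in a queue,
// newest at the front, oldest at the back.
class CoolQueue {
 public:
  // Called after a value was offloaded; its memory copy stays "cool".
  void PushCool(const std::string& key, std::string value) {
    if (index_.count(key) > 0) Warm(key);  // drop a stale entry for the same key
    bytes_ += value.size();
    items_.emplace_front(key, std::move(value));
    index_[key] = items_.begin();
  }

  // On read: take the value out of the queue so the caller can put it
  // back into the hot table. Returns nullopt if the key is not cool.
  std::optional<std::string> Warm(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return std::nullopt;
    std::string value = std::move(it->second->second);
    bytes_ -= value.size();
    items_.erase(it->second);
    index_.erase(it);
    return value;
  }

  // On memory pressure: retire the oldest cool values until the queue
  // fits into `budget` bytes. Returns the number of retired values.
  size_t RetireUntil(size_t budget) {
    size_t retired = 0;
    while (bytes_ > budget && !items_.empty()) {
      const Item& oldest = items_.back();
      bytes_ -= oldest.second.size();
      index_.erase(oldest.first);
      items_.pop_back();
      ++retired;
    }
    return retired;
  }

  size_t bytes() const { return bytes_; }

 private:
  using Item = std::pair<std::string, std::string>;
  std::list<Item> items_;
  std::unordered_map<std::string, std::list<Item>::iterator> index_;
  size_t bytes_ = 0;
};
```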

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-07-25 09:09:40 +03:00
Roman Gershman
02b72c9042
chore: dfly_bench - print ongoing error counts (#3382) 2024-07-24 22:13:11 +03:00
Kostas Kyrimis
52b29b302c
update: replication_acks_interval flag to 1000 (#3378)
* update replication_acks_interval flag to 1000

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-07-24 13:28:56 +00:00
Kostas Kyrimis
929222a7df
chore: add mem test for big values and default the flag (#3369)
* default serialization_max_chunk_size to 10 mb
* add test for big values
* small rename of enum to conform style guide

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-07-24 16:07:27 +03:00
Roman Gershman
03b3f86aed
chore: Track db_slice table memory instantly (#3375)
We update table_memory upon each deletion and insertion of an element.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-07-24 14:13:08 +03:00
Vladislav
f73c7d0e42
fix(transaction): Properly store block cancel status (#3371) 2024-07-24 14:05:00 +03:00