Bug#1093243: Upgrade to 6.1.123 kernel causes mariadb hangs

Discussion:

Salvatore Bonaccorso

2025-01-23 21:00:01 UTC

Hi Xan,

Post by Xan Charbonnet
I rented a Linode and have been trying to load it down with sysbench
activity while doing a mariabackup and a mysqldump, also while spinning up
the CPU with zstd benchmarks. So far I've had no luck triggering the fault.
https://www.dwarmstrong.org/kernel/
(except that I used make -j24 to build in parallel and used make
localmodconfig to compile only the modules I need)
6.1.123 (equivalent to linux-image-6.1.0-29-amd64)
6.1.122
6.1.121
6.1.120
So far they have all exhibited the behavior. Next up is 6.1.119 which is
equivalent to linux-image-6.1.0-28-amd64. My expectation is that the fault
will not appear for this kernel.
https://www.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.1.120
I have to work on some other things, and it'll take a while to prove the
negative (that is, to know that the failure isn't happening). I'll post
back with the 6.1.119 results when I have them.
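The version-by-version testing above is a search for the first release that shows the hang. Each "good" verdict takes a long time to establish, so a binary search over the candidate releases needs only about log2(n) test runs. Below is a minimal C sketch. It assumes the fault, once introduced, persists in every later release. The version list and the demo_is_bad predicate are hypothetical stand-ins for the real load test, which is manual and takes hours:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Predicate: returns true if the given release shows the fault.
 * In reality this is a long manual load test, not a function call. */
typedef bool (*is_bad_fn)(const char *version);

/* Index of the first bad release in versions[0..n). The list must be ordered
 * oldest to newest, with every release after the first bad one also bad.
 * Returns n if no release is bad. */
static size_t first_bad_version(const char *const *versions, size_t n,
                                is_bad_fn is_bad)
{
    size_t lo = 0, hi = n;            /* first bad index lies in [lo, hi] */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (is_bad(versions[mid]))
            hi = mid;                 /* mid is bad: answer is mid or earlier */
        else
            lo = mid + 1;             /* mid is good: answer is after mid */
    }
    return lo;
}

/* Hypothetical demo predicate: treats "6.1.N" as bad when N >= demo_threshold,
 * and counts how many "test runs" the search asked for. */
static int demo_threshold = 120;
static unsigned demo_calls = 0;

static bool demo_is_bad(const char *version)
{
    int major, minor, patch;
    demo_calls++;
    if (sscanf(version, "%d.%d.%d", &major, &minor, &patch) != 3)
        return false;                 /* unparseable version: treat as good */
    return patch >= demo_threshold;
}
```

For example, take the list {"6.1.118", ..., "6.1.123"} with demo_threshold = 120. The search returns index 2 ("6.1.120") after three predicate calls instead of testing all six releases.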

Additionally please try with 6.1.120 and revert this commit

3ab9326f93ec ("io_uring: wake up optimisations")

(which landed in 6.1.120).

If that solves the problem, maybe we are missing some prerequisites in the 6.1.y
series here?

Regards,
Salvatore

Xan Charbonnet

2025-01-24 02:20:01 UTC

Post by Salvatore Bonaccorso
Additionally please try with 6.1.120 and revert this commit
3ab9326f93ec ("io_uring: wake up optimisations")
(which landed in 6.1.120).
If that solves the problem, maybe we are missing some prerequisites in the 6.1.y
series here?

I hope I did all this right. I found this:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3181e22fb79910c7071e84a43af93ac89e8a7106

and attempted to undo that change in the vanilla 6.1.124 source by
making the following change to io_uring/io_uring.c:

585,594d584
< static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
< 	__releases(ctx->completion_lock)
< {
< 	io_commit_cqring(ctx);
< 	spin_unlock(&ctx->completion_lock);
< 	io_commit_cqring_flush(ctx);
< 	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
< 		__io_cqring_wake(ctx);
< }
<
1352c1342
< 	__io_cq_unlock_post_flush(ctx);
---
> 	__io_cq_unlock_post(ctx);

I rebooted into the resulting kernel and am happy to report that the
problem did NOT occur!
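The helper removed by the diff does four things in order. It publishes the completion-queue tail, drops the completion lock, runs the post-commit flush, and then wakes waiters unless the ring was set up with IORING_SETUP_DEFER_TASKRUN. The sketch below is a toy, single-threaded C model of that control flow, not kernel code. toy_ctx and its fields are hypothetical stand-ins. The model ignores real locking, memory ordering, and what io_commit_cqring_flush() actually flushes. It also says nothing about the root cause of the hang:

```c
#include <stdbool.h>

/* Same bit as IORING_SETUP_DEFER_TASKRUN in <linux/io_uring.h> (1U << 13);
 * redefined so this sketch builds without kernel headers. */
#define TOY_SETUP_DEFER_TASKRUN (1U << 13)

/* Hypothetical stand-in for the few io_ring_ctx fields the helper touches. */
struct toy_ctx {
    unsigned int flags;        /* ring setup flags */
    unsigned int cached_tail;  /* CQEs filled in but not yet published */
    unsigned int shared_tail;  /* tail index that user space reads */
    bool locked;               /* models ctx->completion_lock being held */
    unsigned int flushes;      /* post-unlock flushes performed */
    unsigned int wakeups;      /* waiter wakeups issued */
};

/* Mirrors the order of operations in the removed __io_cq_unlock_post_flush(). */
static void toy_cq_unlock_post_flush(struct toy_ctx *ctx)
{
    ctx->shared_tail = ctx->cached_tail;    /* io_commit_cqring(): publish tail */
    ctx->locked = false;                    /* spin_unlock(&ctx->completion_lock) */
    ctx->flushes++;                         /* io_commit_cqring_flush(), as a counter */
    if (!(ctx->flags & TOY_SETUP_DEFER_TASKRUN))
        ctx->wakeups++;                     /* __io_cqring_wake(): wake waiters */
}
```

With flags = 0, one call publishes the tail and issues one wakeup. With TOY_SETUP_DEFER_TASKRUN set, it publishes and flushes but issues no wakeup from this path.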

Xan Charbonnet

2025-01-27 16:50:01 UTC

The MariaDB developers are wondering whether another corruption bug,
MDEV-35334 ( https://jira.mariadb.org/browse/MDEV-35334 ) might be related.

The symptom was described as:
the first 1 byte of a .ibd file is changed from 0 to 1, or the first 4
bytes are changed from 0 0 0 0 to 1 0 0 0.
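To look for the pattern quoted above, you could check the first four bytes of each .ibd file. Below is a minimal C sketch, assuming the symptom is exactly as described. The function and enum names are hypothetical. A match only means the first byte is 1, not that the file is corrupt, so compare any hit against a known-good backup:

```c
#include <stddef.h>
#include <stdio.h>

/* How the first four bytes of a file compare with the reported symptom. */
enum ibd_head_match {
    IBD_HEAD_READ_ERROR = -1, /* could not open the file, or fewer than 4 bytes */
    IBD_HEAD_NO_MATCH = 0,    /* first byte is not 1 */
    IBD_HEAD_FIRST_BYTE = 1,  /* first byte is 1, bytes 1..3 not all zero */
    IBD_HEAD_FOUR_BYTES = 2   /* first four bytes are exactly 01 00 00 00 */
};

/* Classify the first four bytes of buf against the reported pattern. */
static enum ibd_head_match classify_ibd_head(const unsigned char *buf, size_t len)
{
    if (len < 4)
        return IBD_HEAD_READ_ERROR;
    if (buf[0] != 1)
        return IBD_HEAD_NO_MATCH;
    if (buf[1] == 0 && buf[2] == 0 && buf[3] == 0)
        return IBD_HEAD_FOUR_BYTES;
    return IBD_HEAD_FIRST_BYTE;
}

/* Read the first four bytes of the file at path and classify them. */
static enum ibd_head_match check_ibd_file(const char *path)
{
    unsigned char head[4];
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return IBD_HEAD_READ_ERROR;
    size_t got = fread(head, 1, sizeof head, f);
    fclose(f);
    return classify_ibd_head(head, got);
}
```

Run check_ibd_file() on copies of the tablespace files rather than on files the server has open for writing.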

Is it possible that an io_uring issue might be causing that as well?
Thanks.

-Xan