Discussion:
main thread pthread_exit/sys_exit bug!
(too old to reply)
Kaz Kylheku
2009-02-01 22:40:10 UTC
Permalink
Basically, if you call pthread_exit from the main thread of a process, and keep
other threads running, the behavior is ugly.

I logged this initially as a bug against glibc, but then resolved it
with a kernel patch against linux 2.6.26:

Please see:

http://sources.redhat.com/bugzilla/show_bug.cgi?id=9804

I've known about this for some time, first having reproduced it on 2.6.17;
finally got around to fixing it.

When the main thread of a POSIX threads process calls pthread_exit, the process
should stick around until all the other threads do the same, or until one of
them calls _exit or exit, or until the process terminates abnormally. During
this time, it would be nice if the process behaved normally: if it did not
appear defunct in the process list and if POSIX job control was possible on it.

An easy way to achieve this is to insert a wait into the top of sys_exit, so
that do_exit is not called unless all the other threads have terminated. This
is another special case like do_group_exit. In the group exit, we zap the other
threads. In this case, we must not zap the other threads, but neither should we
fall through do_exit and become defunct!

The patch involves a controversial move: returning -ERESTARTSYS from sys_exit.
This is because the main thread may be stuck in sys_exit and have to respond to
a signal, and then go back to sys_exit. It appears to be working fine.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Oleg Nesterov
2009-02-02 06:50:04 UTC
Permalink
Post by Kaz Kylheku
Basically, if you call pthread_exit from the main thread of a process, and keep
other threads running, the behavior is ugly.
Yes, known problem.

Please look at

[RFC,PATCH 3/3] do_wait: fix waiting for stopped group with dead leader
http://marc.info/?t=119713920000003

I'll try to re-do and re-send this patch this week.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Kaz Kylheku
2009-02-02 07:20:08 UTC
Permalink
Post by Oleg Nesterov
Post by Kaz Kylheku
Basically, if you call pthread_exit from the main thread of a process, and keep
other threads running, the behavior is ugly.
Yes, known problem.
Please look at
[RFC,PATCH 3/3] do_wait: fix waiting for stopped group with dead leader
http://marc.info/?t=119713920000003
I'll try to re-do and re-send this patch this week.
I believe that my straight-forward fix is pretty much good to go. I
checked into my distro, so we will see how it holds up.

It's a bad idea to allow the main thread to terminate. It should stick
around because it serves as a facade for the process as a whole. If
the main thread is allowed to bail all the way through do_exit, who
knows what kind of problems may show up because of that.

What if one of my developers is working on a server which has called
pthread_exit in the main thread, and wants to attach gdb to it? Will
that work if the main thread (a.k.a group leader) is a defunct
process?

I just tried this test case and it worked perfectly with my patch! gdb
attached to the process by the pid of teh group leader. It correctly
showed as that thread being stopped in __exit_thread. I can see the
other threads, etc.

bash:~# /projects/sw/kaz/bug-repro-programs/pthread-exit &
[1] 2093
bash:~# gdb -p 2093
GNU gdb 6.8
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "mips64-linux".
Attaching to process 2093
Reading symbols from /projects/sw/kaz/bug-repro-programs/pthread-exit...done.
Reading symbols from /lib32/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x2d46c4b0 (LWP 2098)]
[New Thread 0x2cc6c4b0 (LWP 2097)]
[New Thread 0x2c46c4b0 (LWP 2096)]
[New Thread 0x2bc6c4b0 (LWP 2095)]
[New Thread 0x2b46c4b0 (LWP 2094)]
Loaded symbols for /lib32/libpthread.so.0
Reading symbols from /lib32/libc.so.6...done.
Loaded symbols for /lib32/libc.so.6
Reading symbols from /lib32/ld.so.1...done.
Loaded symbols for /lib32/ld.so.1
Reading symbols from /lib32/libgcc_s.so.1...done.
Loaded symbols for /lib32/libgcc_s.so.1
0x2abd3da4 in __exit_thread () from /lib32/libc.so.6
(gdb) where
#0 0x2abd3da4 in __exit_thread () from /lib32/libc.so.6
#1 0x2ab18ab0 in __libc_start_main (main=0x10000710 <main>, argc=1,
ubp_av=0x7fcd7524, init=<value optimized out>, fini=<value optimized out>,
rtld_fini=<value optimized out>, stack_end=<value optimized out>)
at libc-start.c:245
#2 0x100005dc in _ftext ()

If I try this on an unpatched kernel that allows a main thread to bail
through do_exit, this is what happens:

Attaching to process 14651
ptrace: No such process.
/root/14651: No such file or directory.

I don't think that this is solved by any patch that allows the group
leader to bail through do_exit. It's not just a problem of waiting on
a dead group leader. If you want to maintain the illusion that the OS
provides a process that contains threads, and the group leader is the
representation of that process, then you have to keep that leader
alive; the lifetime of that leader cannot be shorter than that of the
process illusion.

Patch:

http://sourceware.org/bugzilla/attachment.cgi?id=3702
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Oleg Nesterov
2009-02-02 17:00:17 UTC
Permalink
Post by Kaz Kylheku
Post by Oleg Nesterov
Post by Kaz Kylheku
Basically, if you call pthread_exit from the main thread of a process, and keep
other threads running, the behavior is ugly.
Yes, known problem.
Please look at
[RFC,PATCH 3/3] do_wait: fix waiting for stopped group with dead leader
http://marc.info/?t=119713920000003
I'll try to re-do and re-send this patch this week.
I believe that my straight-forward fix is pretty much good to go. I
checked into my distro, so we will see how it holds up.
(the patch: http://sourceware.org/bugzilla/attachment.cgi?id=3702)

Well, perhaps something like your patch makes sense.

+/*
+ * A single thread is exiting, and it is the leader of the group.
+ * This coresponds to the main thread calling pthread_exit.
+ * In this case, the persists until all the other
+ * threads call pthread_exit, or someone calls exit or _exit.
+ * To implement this case, we park the thread leader in
+ * a loop which waits until the thread list becomes empty,
+ * or it receives a fatal signal.
^^^^^^^^^^^^^^
The comment is not exactly right, we return on every signal, not only fatal.

+int
+do_leader_exit(void)
+{
+ int ret = 0;
+
+ if (unlikely(!thread_group_empty(current))) {
+ DECLARE_WAITQUEUE(wait, current);
+
+ add_wait_queue(&current->signal->wait_chldexit, &wait);
+
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ do {
+ if (thread_group_empty(current))
+ break;
+
+ try_to_freeze();
+ schedule();
+ if (signal_pending(current))
+ ret = -ERESTARTSYS;
+ } while (ret == 0);
+
+ remove_wait_queue(&current->signal->wait_chldexit, &wait);
+
+ __set_current_state(TASK_RUNNING);
+ }
+
+ return ret;
+}

the above is just the open-coded
wait_event_interruptible(&current->signal->wait_chldexit,
thread_group_empty(current));


asmlinkage long sys_exit(int error_code)
{
+ if (thread_group_leader(current)) {
+ int ret = do_leader_exit();
+ if (ret != 0)
+ return ret;
+ }
do_exit((error_code&0xff)<<8);
}

afaics, -ERESTARTSYS is not exactly correct. We can dequeue the
signal without SA_RESTART. But we never should abort sys_exit().
ERESTARTNOINTR is better. But see below.

I am worried this patch can confuse the user-space. Because, when
the main thread does sys_exit(), the user-space has all rights
to assume it exits ;) But with this patch the main thread will
continue to handle the signals until the while group exits, I'm
afraid libpthread.so won't be happy.

And what if the signal handler does siglongjmp() and aborts sys_exit() ?
Post by Kaz Kylheku
It's a bad idea to allow the main thread to terminate. It should stick
around because it serves as a facade for the process as a whole. If
the main thread is allowed to bail all the way through do_exit, who
knows what kind of problems may show up because of that.
What if one of my developers is working on a server which has called
pthread_exit in the main thread, and wants to attach gdb to it? Will
that work if the main thread (a.k.a group leader) is a defunct
process?
And I think gdb is wrong. It can see this process has other live threads,
and attach to them. I didn't check gdb, but iirc "strace -f" works
correctly in this case.

I cced other people. Let's see what they think.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Kaz Kylheku
2009-02-02 20:20:13 UTC
Permalink
Post by Oleg Nesterov
Well, perhaps something like your patch makes sense.
+/*
+ * A single thread is exiting, and it is the leader of the group.
+ * This coresponds to the main thread calling pthread_exit.
+ * In this case, the persists until all the other
+ * threads call pthread_exit, or someone calls exit or _exit.
+ * To implement this case, we park the thread leader in
+ * a loop which waits until the thread list becomes empty,
+ * or it receives a fatal signal.
^^^^^^^^^^^^^^
The comment is not exactly right, we return on every signal, not only fatal.
Yes; that's stale text, referring to some earlier code that had been changed.
Post by Oleg Nesterov
the above is just the open-coded
wait_event_interruptible(&current->signal->wait_chldexit,
thread_group_empty(current));
Terrific! I will make the change locally.

Btw, it's ``hand coded''. ``Open coding'' is what both programmers and macros
do. Hand conding is the dumb thing I did instead of using the macro to open
code it. :)
Post by Oleg Nesterov
I am worried this patch can confuse the user-space. Because, when
the main thread does sys_exit(), the user-space has all rights
to assume it exits ;)
Only glibc knows about sys_exit. (Or are there run-times for other language
implementations that have their own binding to sys_exit, bypassing pthreads?)

The POSIX interface used by applications is pthread_exit, and there is no
assumption there about it being an exit-like system call. It has a number of
standard-defined user-space chores to do, in fact.
Post by Oleg Nesterov
But with this patch the main thread will
continue to handle the signals until the while group exits, I'm
afraid libpthread.so won't be happy.
And what if the signal handler does siglongjmp() and aborts sys_exit() ?
pthread_exit peforms cleanup unwinding (required by Unix and POSIX) and
destruction of thread-specific storage.

Glibc does this and then performs its own longjmp to a handler in the startup
code above main. At that point it's no longer correct to longjmp to any of
the frames that have been aborted by that action. It's still executing user
code; sys_exit has not been called yet, and the signal handler can go
off.

Other than that, there is actually nothing wrong with aborting sys_exit. It
hasn't done any cleanup yet through do_exit, so it can be nicely
reentered later.

I'd say that any programs that are broken by this patch are
probably ``fork in the toaster'' category anyway.

Note that programs can also abort the exit function with signals, too,
and there is nothing that can be done in the kernel about it.
Post by Oleg Nesterov
Post by Kaz Kylheku
It's a bad idea to allow the main thread to terminate. It should stick
around because it serves as a facade for the process as a whole. If
the main thread is allowed to bail all the way through do_exit, who
knows what kind of problems may show up because of that.
What if one of my developers is working on a server which has called
pthread_exit in the main thread, and wants to attach gdb to it? Will
that work if the main thread (a.k.a group leader) is a defunct
process?
And I think gdb is wrong. It can see this process has other live threads,
and attach to them.
But if this problem is patched in this very simple way in the kernel, then gdb
installations ``Just Work'' without even being recompiled.

Probably, the only way the gdb maintainers would accept another Linux-specific
hack would be if they didn't know that a kernel patch will fix it. :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Ulrich Drepper
2009-02-02 20:20:13 UTC
Permalink
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by Oleg Nesterov
I am worried this patch can confuse the user-space. Because, when
the main thread does sys_exit(), the user-space has all rights
to assume it exits ;) But with this patch the main thread will
continue to handle the signals until the while group exits, I'm
afraid libpthread.so won't be happy.
I haven't looked at the patch nor tried it.

If the patch changes the behavior that the main thread, after calling
sys_exit, still react to signals sent to this thread or to the process
as a whole, then the patch is wrong. The userlevel context of the
thread is not usable anymore. It will have run all kinds of
destructors. The current behavior is AFAIK that the main thread won't
react to any signal anymore. That is absolutely required.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkmHVO8ACgkQ2ijCOnn/RHTJvwCgodxkT+mg0tmrnlhf/IP8hUQc
RYIAn0YC7pTjPHHZa7kmvYSyu/Zw5IIT
=ehdX
-----END PGP SIGNATURE-----
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Kaz Kylheku
2009-02-02 20:40:12 UTC
Permalink
Post by Ulrich Drepper
The userlevel context of the
thread is not usable anymore. It will have run all kinds of
destructors. The current behavior is AFAIK that the main thread won't
react to any signal anymore. That is absolutely required.
Hey Ulrich,

Thanks for articulating that requirement. I think it can be met by
extending the patch a little bit. We can keep the main thread parked
inside sys_exit /and/ get it not to react to signals internally, yet
have the external behavior that the process reacts in the normal
way to certain signals like SIGTSTP, SIGCONT, etc.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Kaz Kylheku
2009-02-03 02:40:06 UTC
Permalink
Post by Kaz Kylheku
Post by Ulrich Drepper
The userlevel context of the
thread is not usable anymore. It will have run all kinds of
destructors. The current behavior is AFAIK that the main thread won't
react to any signal anymore. That is absolutely required.
Hey Ulrich,
Thanks for articulating that requirement. I think it can be met by
extending the patch a little bit.
I've now done that.

The exiting thread leader, if there are still other
threads alive, gets its own private signal handler array in which
every action is set to SIG_IGN, using the ignore_signals
function.

I experimented with blocking signals, but that approach
breaks the test case of being able to attach GDB to the
exiting thread.

As part of the patch, I found it convenient to extend the
incomplete sys_unshare functionality w.r.t. signal handlers,
rather than reinvent the wheel.

Cheers ...

http://sourceware.org/bugzilla/attachment.cgi?id=3702
http://sourceware.org/bugzilla/attachment.cgi?id=3705
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Oleg Nesterov
2009-02-03 13:40:12 UTC
Permalink
Post by Kaz Kylheku
Post by Kaz Kylheku
Post by Ulrich Drepper
The userlevel context of the
thread is not usable anymore. It will have run all kinds of
destructors. The current behavior is AFAIK that the main thread won't
react to any signal anymore. That is absolutely required.
Hey Ulrich,
Thanks for articulating that requirement. I think it can be met by
extending the patch a little bit.
I've now done that.
The exiting thread leader, if there are still other
threads alive, gets its own private signal handler array in which
every action is set to SIG_IGN, using the ignore_signals
function.
I experimented with blocking signals, but that approach
breaks the test case of being able to attach GDB to the
exiting thread.
As part of the patch, I found it convenient to extend the
incomplete sys_unshare functionality w.r.t. signal handlers,
rather than reinvent the wheel.
This is wrong, we can not and must not unshare ->sighand.
Post by Kaz Kylheku
Cheers ...
http://sourceware.org/bugzilla/attachment.cgi?id=3702
http://sourceware.org/bugzilla/attachment.cgi?id=3705
This adds multiple problems. Just for example, fs/proc/ takes
leader->sighand->siglock to protect the list of sub-threads.
Of course this doesn't work any longer after unsharing. And
there are numerous similar problems.

ignore_signals() in do_leader_exit() is not right too. This
thread group should hangle the group-wide signals even if
the main thread exits.

atomic_read(&sigh->count) in unshare_sighand() is racy, and
in fact bogus. (yes, the whole unshare_sighand() is bogus,
it never populates new_sighp).

The changing of ->sighand in do_unshare() is very wrong, we
can free the sighand_struct which is currently locked/used/etc.


Kaz, I don't really understand why you are trying to add these
complications to the kernel :(

If the thread exits - it should exit. Yes, we have problems
with the exited main thread, we should fix them.

Yes, gdb refuses to attach to the dead thread (I didn't check
this myself, but I think you are right). But there is nothing
wrong here, because we can't ptrace this thread. But, gdb
_can_ ptrace the process, and it can see it have other threads.

OK, if nothing else. Let's suppose your patch is correct. What
about robust futexes? How can we delay exit_robust_list() ?
I don't think we can.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Kaz Kylheku
2009-02-03 20:00:14 UTC
Permalink
Post by Oleg Nesterov
Post by Kaz Kylheku
Post by Kaz Kylheku
Post by Ulrich Drepper
The userlevel context of the
thread is not usable anymore. It will have run all kinds of
destructors. The current behavior is AFAIK that the main thread won't
react to any signal anymore. That is absolutely required.
Hey Ulrich,
Thanks for articulating that requirement. I think it can be met by
extending the patch a little bit.
I've now done that.
The exiting thread leader, if there are still other
threads alive, gets its own private signal handler array in which
every action is set to SIG_IGN, using the ignore_signals
function.
I experimented with blocking signals, but that approach
breaks the test case of being able to attach GDB to the
exiting thread.
As part of the patch, I found it convenient to extend the
incomplete sys_unshare functionality w.r.t. signal handlers,
rather than reinvent the wheel.
This is wrong, we can not and must not unshare ->sighand.
You are right; it breaks important invariant conditions which
connect the thread group together, like the one about the
lock, et cetera. The patch goes too far: rather than simply
delaying the finalization (relatively safe), it's messing with
the shared state (risky).

Well, it doesn't bother me that that has to be thrown out.
In fact, I do not agree with the requirement that the thread
which calls pthread_exit must not respond to signals;
the original patch works for me.

I.e. in my embedded GNU/Linux distro, that requirement
doesn't exist. And since I can't find it in the Single
Unix Specification, so much for that!

Nothing in the spec says that once pthread_exit is called,
signals are stopped. This function invokes cleanup handling,
and thread-specific-storage destruction. During any of those
tasks, signals can still be happening. Any of those
tasks can easily enter into an indefinite wait. What if
a cleanup handler performs a blocking RPC to a remote
server? Well, there you are, stuck in pthread_exit,
handling signals, and not cleaning up your robust list, etc.

I also don't require robust locks to be cleaned up
instantly if they are owned by a main thread that has
called pthread_exit.

My organization is a heavy user of robust mutexes;
they protect the integrity of a large, ``real time'' database
stored in shared memory. I don't think that this would
affect us in any way. The principal concern is that
a process dies, for whatever reason, while holding locks.
It's more to recover from catastrophic failures, not from
mutex locking mistakes. If a thread locks a mutex and
doesn't release it due to bad program logic, that is a
problem whether or not that thread dies. It's not
particularly useful that we can resolve that situation
with EOWNERDEAD in the kernel when that thread
happens to die, because that's just one case where
we are lucky, so to speak.
Post by Oleg Nesterov
Yes, gdb refuses to attach to the dead thread (I didn't check
this myself, but I think you are right). But there is nothing
wrong here, because we can't ptrace this thread. But, gdb
_can_ ptrace the process, and it can see it have other threads.
Face it, allowing the thread leader to exit is as wrong as doing
other stupid things to the leader, like unsharing the signal
handler.

Either way, you are breaking some little stick which is holding
up the illusion that there is a process which has threads.
Post by Oleg Nesterov
OK, if nothing else. Let's suppose your patch is correct.
Obviously, the second part isn't. I'm very happy to have
any excuse to throw this out, thanks!

The first part isn't Ulrich-compliant. That is signficant,
and duly noted; but it's not the law where I'm sitting.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Oleg Nesterov
2009-02-03 21:40:08 UTC
Permalink
Post by Kaz Kylheku
Well, it doesn't bother me that that has to be thrown out.
In fact, I do not agree with the requirement that the thread
which calls pthread_exit must not respond to signals;
the original patch works for me.
What about other users? We can't know what how much they
depend on the current behaviour.
Post by Kaz Kylheku
I.e. in my embedded GNU/Linux distro, that requirement
doesn't exist. And since I can't find it in the Single
Unix Specification, so much for that!
Nothing in the spec says that once pthread_exit is called,
signals are stopped. This function invokes cleanup handling,
and thread-specific-storage destruction. During any of those
tasks, signals can still be happening. Any of those
tasks can easily enter into an indefinite wait. What if
a cleanup handler performs a blocking RPC to a remote
server? Well, there you are, stuck in pthread_exit,
handling signals, and not cleaning up your robust list, etc.
I also don't require robust locks to be cleaned up
instantly if they are owned by a main thread that has
called pthread_exit.
OK, OK. Please forget about signals, futexes, etc.
Simple program:

pthread_t main_thread;

void *tfunc(void *a)
{
pthread_joni(main_thread, NULL);
return NULL;
}

int main(void)
{
pthread_t thr;

main_thread = pthread_self();
pthread_create(&thr, NULL, tfunc, NULL);
pthread_exit(NULL);
}

I bet this will hang with your patch applied. Because
we depend on sys_futex(->clear_child_tid, FUTEX_WAKE, ...).

Kaz, you know, it is not easy to say "you patch is wrong
in any case, no matter how much it will be improved" ;)
But even if the current behaviour is not optimal, we must not
change it unless we think it leads to bugs. We can't know
which application can suffer. The current behaviour is old.
Post by Kaz Kylheku
Face it, allowing the thread leader to exit is as wrong as doing
other stupid things to the leader, like unsharing the signal
handler.
Perhaps. That is why I said _something_ like your patch perhaps
makes sense. But this is tricky, and I don't see a simple/clean
way to improve things. And, otoh, I do not see _real_ problems
with the zombie leaders.


As for original problem, it should be fixed anyway. wait_task_stopped()
should take SIGNAL_STOP_STOPPED into account, not task->state.
Unless we are ptracer, of course.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Kaz Kylheku
2009-02-03 23:10:11 UTC
Permalink
Post by Oleg Nesterov
Post by Kaz Kylheku
Well, it doesn't bother me that that has to be thrown out.
In fact, I do not agree with the requirement that the thread
which calls pthread_exit must not respond to signals;
the original patch works for me.
What about other users? We can't know what how much they
depend on the current behaviour.
If they haven't run into this gaping job control issue,
they haven't done a whole lot of testing, obviously!

Those who have run into it would certainly have to implement
a workaround --- like not calling pthread_exit from the main
thread!

I.e. ``Q: Doctor, it hurts when I do this; A: So don't
do that!''.
Post by Oleg Nesterov
OK, OK. Please forget about signals, futexes, etc.
pthread_t main_thread;
void *tfunc(void *a)
{
pthread_joni(main_thread, NULL);
return NULL;
}
int main(void)
{
pthread_t thr;
main_thread = pthread_self();
pthread_create(&thr, NULL, tfunc, NULL);
pthread_exit(NULL);
}
This test case appears to be conforming, so it
has to work.

The initial thread is considered joinable.

For instance a Rationale note in Issue 6 of the
SUS claims that one reason for the existence of
pthread_detach is so that the initial thread could
be detached, which cannot be done through thread
creation attributes for that thread. So the intent
is clearly that the initial thread is joinable!
Post by Oleg Nesterov
I bet this will hang with your patch applied.
It almost certainly will, and it does have to do with
futexes.

The main thread hasn't gone through the step where
it clears the TID, so the lll_wait_tid
futex wait in pthread_join will block. There is no
short-circuit indication in the library to indicate that
the main thread has died.

This TID trick is analogous to the robust list clean up. It's the same
kind of thing: fixing up a value of some registered
user-space memory location, signaling.
Post by Oleg Nesterov
Kaz, you know, it is not easy to say "you patch is wrong
in any case, no matter how much it will be improved" ;)
Sure it is!

I will save you from that, because I do not believe in piling
hacks on top of hacks to fix something that may be
the wrong approach, even in situations where there is a
good chance that after some finite number of hacks,
it will finally be right. I did that in LinuxThreads once upon
a time (mind you, that was so great, FreeBSD had to have it!)

I do think there is a clean, non-hacky way to reason
about this.

If we think about the process-containing-threads model that
the kernel is trying to emulate, and what should happen
when threads exit, we come to the following reasoning:

When a thread exits, there does have to be certain cleanup
with respect to that thread. But the process-wide cleanup
is not performed until all the threads are gone (thread
count hits zero). This is easy to implement under the
process-contains-threads model.

These actions are not cleanly separated in Linux. There
is a do_exit function which handles both the thread-things
and the process-things in one swoop, so to speak.

The zombie problem occurs because do_exit goes too
far, cleaning up things that it shouldn't; things that
are necessary in holding up the POSIX-conforming
illusion that there is a process that contains threads.

My kneejerk approch was: hey wait, let's hold back
from doing /anything/ in do_exit; in fact let's not
call it at all if we're the initial thread and there are
still others. But obviously, it's not just anything in
do_exit that causes problems. Maybe some in-between
approach can work. The pthread_join test case can
be fixed in a clean way, as can robust cleanup.

The thread can signal pthread_join by resetting its TID
to zero and hitting the futex. It can do the robust
list cleanup, etc.

If you can identify a good separation about what to do
first, and what to do later, maybe you can some decent
compromise among the concerns. Breaking up the
do_exit logic into

do_exit_thread
do_exit_process

would probably not hurt. You then have to pick whether
each action belongs to one or the other and stick
it into the appropriate function, with some clear
guidelines about what goes where.

In sys_exit, the two pieces could be used somehow like this:

do_exit_thread();

if (leader and not_empty(group))
{ special_logic(); }

do_exit_process();

As one rule (for instance), any cleanup that threatens
the integrity of the process/thread model goes into do_exit_process.

So for instance if you ptrace that process, it still has all of its
memory areas intact (you don't have to look for a different
PID in the process list in order to find them).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Roland McGrath
2009-02-05 03:40:09 UTC
Permalink
I haven't seen the clear explanation of what specific actual problems there
are here. But I'm quite sure this is not the right approach to address them.

Kaz has said things that seemed to imply that the behavior is erratic or
the semantics are somehow ill-defined when the group leader has died with
other threads living on. In fact, this case is perfectly well-specified
and there is no mystery about it.

The group leader dies and becomes a zombie. The zombie group leader is
kept from reaping and parent notification by the delayed_group_leader()
logic and related code, until the last thread in the group dies. The tgid
(leader's tid), aka PID in POSIX terms, remains as the PID for the process
as a whole and signals to it work fine, etc.

Quite some time ago, there was some /proc bug wherein /proc/pid/task could
not be listed when the group leader had died. That prevented strace or gdb
from attaching to the process after its initial thread used pthread_exit.
I don't recall when that was fixed, but it's been fine for a good while.
That is the only problem for debuggability of this case that I recall
knowing about.

Certainly long ago there were many problems with job control signals in
multi-thread groups, and there have been many little corner cases fixed in
that over the 2.6.x period. I'm not aware of any such problems remaining.
But if there are some, they need to be fixed in the signals code. It's
certainly clear how it's supposed to work, and that's no different when the
group leader is dead.


Thanks,
Roland
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Kaz Kylheku
2009-02-05 05:00:07 UTC
Permalink
Post by Roland McGrath
I haven't seen the clear explanation of what specific actual problems there
are here. But I'm quite sure this is not the right approach to address them.
Kaz has said things that seemed to imply that the behavior is erratic or
the semantics are somehow ill-defined when the group leader has died with
other threads living on. In fact, this case is perfectly well-specified
and there is no mystery about it.
I haven't observed anything that could be called erratic. The behavior that
occurs, occurs reliably.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Oleg Nesterov
2009-02-05 16:20:14 UTC
Permalink
Post by Kaz Kylheku
Post by Roland McGrath
I haven't seen the clear explanation of what specific actual problems there
are here. But I'm quite sure this is not the right approach to address them.
Kaz has said things that seemed to imply that the behavior is erratic or
the semantics are somehow ill-defined when the group leader has died with
other threads living on. In fact, this case is perfectly well-specified
and there is no mystery about it.
I haven't observed anything that could be called erratic. The behavior that
occurs, occurs reliably.
Yes we have the bug, and wait_task_stopped() should be fixed. But it is
buggy anyway, even if we delay the death of the main thread. But I also
think we shouldn't.

(and I am sorry, I still can't find the time to redo my old patch, will
try to do this asap).

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Roland McGrath
2009-02-05 21:40:05 UTC
Permalink
Post by Oleg Nesterov
Yes we have the bug, and wait_task_stopped() should be fixed. But it is
buggy anyway, even if we delay the death of the main thread. But I also
think we shouldn't.
Sorry, I'd missed the actual bug report among all the tangential verbiage.
I wrote this test case for it. Is there any problem other than this one?


Thanks,
Roland

==========

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <pthread.h>
#include <assert.h>

static void *
thfunc (void *arg)
{
sleep (2);
puts ("stopping");
raise (SIGSTOP);
puts ("resumed");
exit (0);
}

int
main (void)
{
pthread_t th;
int rc = pthread_create (&th, NULL, &thfunc, NULL);
assert_perror (rc);
pthread_exit (0);
/*NOTREACHED*/
return 1;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Oleg Nesterov
2009-02-05 23:30:14 UTC
Permalink
Post by Roland McGrath
I wrote this test case for it. Is there any problem other than this one?
I don't know about other problems with the zombie leaders.

Except, I am worried whether the fix I have in mind is correct ;)
It is simple, wait_task_stopped() should do

if we tracer:

check ->state, eat ->exit_code

else:
check SIGNAL_STOP_STOPPED, use ->group_exit_code

This looks logical, and should fix the problem. But this is
the user-visible change. For example,

$ sleep 100
^Z
[1]+ Stopped sleep 100
$ strace -p `pidof sleep`
Process 11442 attached - interrupt to quit

strace hangs in do_wait(). But after the fix strace will happily
proceed. I can't know whether this behaviour change is bad or not.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Roland McGrath
2009-02-09 03:40:07 UTC
Permalink
Post by Oleg Nesterov
I don't know about other problems with the zombie leaders.
Ok, then we can just concentrate on the test case I posted.
Post by Oleg Nesterov
Except, I am worried whether the fix I have in mind is correct ;)
It is simple, wait_task_stopped() should do
I think the first problem is we'll never even get into wait_task_stopped().
We'll be in wait_consider_task() on the group leader, which is EXIT_ZOMBIE.
First we need to adjust this:

if (p->exit_state == EXIT_ZOMBIE && !delay_group_leader(p))
return wait_task_zombie(p, options, infop, stat_addr, ru);

to maybe:

if (p->exit_state == EXIT_ZOMBIE) {
if (delay_group_leader(p))
return wait_task_zombie_leader(p, options,
infop, stat_addr, ru);
return wait_task_zombie(p, options, infop, stat_addr, ru);
}

In wait_task_zombie_leader(), it will have to take the siglock and try to
figure out if there is a completed group stop to report.
Post by Oleg Nesterov
check ->state, eat ->exit_code
Being the ptracer does not affect the delay_group_leader logic.
It just affects individual vs group stop reports. So the existing
code path is right for ptrace.
Post by Oleg Nesterov
check SIGNAL_STOP_STOPPED, use ->group_exit_code
We don't want wait to change group_exit_code. But we need the "reported as
stopped" tracking that wait_task_stopped() gets by clearing exit_code. So
I think what we need is to get the zombie group_leader->exit_code to be set
to ->group_exit_code as it would have been if the leader were alive and had
participated in the group stop.
Post by Oleg Nesterov
This looks logical, and should fix the problem. But this is
the user-visible change. For example,
$ sleep 100
^Z
[1]+ Stopped sleep 100
$ strace -p `pidof sleep`
Process 11442 attached - interrupt to quit
strace hangs in do_wait(). But after the fix strace will happily
proceed. I can't know whether this behaviour change is bad or not.
I think this would only happen if the "reported as stopped" bookkeeping I
mentioned above were broken. The "Stopped" line means that the shell just
did do_wait(WUNTRACED), so wait_task_stopped() cleared ->exit_code when
reporting it as stopped. Now strace does PTRACE_ATTACH and then a wait;
it can't see a fresh wait result here because ->exit_code is still zero.

100% untested concept patch follows.


Thanks,
Roland

==========
diff --git a/kernel/exit.c b/kernel/exit.c
index f80dec3..0000000 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1437,7 +1437,8 @@ static int wait_task_stopped(int ptrace,
exit_code = 0;
spin_lock_irq(&p->sighand->siglock);

- if (unlikely(!task_is_stopped_or_traced(p)))
+ if (unlikely(!task_is_stopped_or_traced(p)) &&
+ (ptrace || p->exit_state != EXIT_ZOMBIE || !delay_group_leader(p)))
goto unlock_sig;

if (!ptrace && p->signal->group_stop_count > 0)
@@ -1598,9 +1599,20 @@ static int wait_consider_task(struct tas

/*
* We don't reap group leaders with subthreads.
+ * When the group leader is dead, it still serves as
+ * a moniker for the whole group for stop and continue.
+ * But for ptrace, stop and continue are reported per-thread.
*/
- if (p->exit_state == EXIT_ZOMBIE && !delay_group_leader(p))
- return wait_task_zombie(p, options, infop, stat_addr, ru);
+ if (p->exit_state == EXIT_ZOMBIE) {
+ if (!delay_group_leader(p))
+ return wait_task_zombie(p, options,
+ infop, stat_addr, ru);
+ *notask_error = 0;
+ if (unlikely(ptrace))
+ return 0;
+ return wait_task_stopped(p, options, infop, stat_addr, ru) ?:
+ wait_task_continued(p, options, infop, stat_addr, ru);
+ }

/*
* It's stopped or running now, so it might
diff --git a/kernel/signal.c b/kernel/signal.c
index b6b3676..0000000 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1653,6 +1653,27 @@ finish_stop(int stop_count)
}

/*
+ * Complete group stop bookkeeping after decrementing sig->group_stop_count,
+ * to new value stop_count. When it reaches zero, mark the process stopped.
+ *
+ * If the group leader is already dead, then it did not participate
+ * normally in the group stop. But its ->exit_code stands for the whole
+ * group in do_wait() bookkeeping, so we need it to reflect the stop.
+ */
+static inline void complete_group_stop(struct task_struct *tsk,
+ struct signal_struct *sig,
+ int stop_count)
+{
+ if (stop_count)
+ return;
+
+ sig->flags = SIGNAL_STOP_STOPPED;
+
+ if (tsk->group_leader->exit_state)
+ tsk->group_leader->exit_code = sig->group_exit_code;
+}
+
+/*
* This performs the stopping for SIGSTOP and other stop signals.
* We have to stop all threads in the thread group.
* Returns nonzero if we've actually stopped and released the siglock.
@@ -1696,8 +1717,7 @@ static int do_signal_stop(int signr)
sig->group_stop_count = stop_count;
}

- if (stop_count == 0)
- sig->flags = SIGNAL_STOP_STOPPED;
+ complete_group_stop(current, sig, stop_count);
current->exit_code = sig->group_exit_code;
__set_current_state(TASK_STOPPED);

@@ -1933,9 +1953,8 @@ void exit_signals(struct task_struct *ts
if (!signal_pending(t) && !(t->flags & PF_EXITING))
recalc_sigpending_and_wake(t);

- if (unlikely(tsk->signal->group_stop_count) &&
- !--tsk->signal->group_stop_count) {
- tsk->signal->flags = SIGNAL_STOP_STOPPED;
+ if (unlikely(tsk->signal->group_stop_count)) {
+ complete_group_stop(tsk, sig, --tsk->signal->group_stop_count);
group_stop = 1;
}
out:
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Oleg Nesterov
2009-02-09 05:00:15 UTC
Permalink
I am already sleep, will return tomorrow. Just a souple of quick notes...
Post by Roland McGrath
I think the first problem is we'll never even get into wait_task_stopped().
We'll be in wait_consider_task() on the group leader, which is EXIT_ZOMBIE.
Yes sure. I meant, instead of just checking task_is_stopped_or_traced() in
wait_consider_task(), we should do somthing like

int wait_is_stopped(p, ptrace)
{
if (ptrace)
// wait_task_stopped() will also check p->exit_code != 0
return task_is_stopped_or_traced(p);
else
// wait_task_stopped() will also check ->group_exit_code != 0
return !!(signal->flags & SIGNAL_STOP_STOPPED);
}
Post by Roland McGrath
In wait_task_zombie_leader(), it will have to take the siglock and try to
figure out if there is a completed group stop to report.
Post by Oleg Nesterov
check ->state, eat ->exit_code
Being the ptracer does not affect the delay_group_leader logic.
It just affects individual vs group stop reports. So the existing
code path is right for ptrace.
Post by Oleg Nesterov
check SIGNAL_STOP_STOPPED, use ->group_exit_code
We don't want wait to change group_exit_code. But we need the "reported as
stopped" tracking that wait_task_stopped() gets by clearing exit_code.
I never understood this.

Why do we mix the normal group stop with the ptrace "per-thread" stops?

Look. We have the main thread M and the sub-thread T. We stop this process,
its parent does do_wait() and clears M->exit_code.

Now, ptracer can attach to T (it still has ->exit_code != 0), but not to M.
This always looked very strange to me.

Or. ptracer attaches to the main thread and (say) does nothing. We send
SIGSTOP to another thread. The whole group stops, but its parent can't
see this. Why? Then ptracer does PTRACE_DETACH, and the parent still can't
(and will never can) see the stop unless the ptracer puts something reasonable
into ->exit_code. But even in this case we lost the notofication.
Post by Roland McGrath
So
I think what we need is to get the zombie group_leader->exit_code to be set
to ->group_exit_code as it would have been if the leader were alive and had
participated in the group stop.
Please see below.
Post by Roland McGrath
Post by Oleg Nesterov
$ sleep 100
^Z
[1]+ Stopped sleep 100
$ strace -p `pidof sleep`
Process 11442 attached - interrupt to quit
strace hangs in do_wait(). But after the fix strace will happily
proceed. I can't know whether this behaviour change is bad or not.
I think this would only happen if the "reported as stopped" bookkeeping I
mentioned above were broken. The "Stopped" line means that the shell just
did do_wait(WUNTRACED), so wait_task_stopped() cleared ->exit_code when
reporting it as stopped. Now strace does PTRACE_ATTACH and then a wait;
it can't see a fresh wait result here because ->exit_code is still zero.
Yes. And why ptracer should wait?
Post by Roland McGrath
100% untested concept patch follows.
it adds more special cases for the delay_group_leader() zombies :(
Post by Roland McGrath
+static inline void complete_group_stop(struct task_struct *tsk,
+ struct signal_struct *sig,
+ int stop_count)
+{
+ if (stop_count)
+ return;
+
+ sig->flags = SIGNAL_STOP_STOPPED;
+
+ if (tsk->group_leader->exit_state)
+ tsk->group_leader->exit_code = sig->group_exit_code;
+}
This doesn't look exactly right. Unless another thread does sys_exit_group()
later (they all can exit via sys_exit), wait_task_zombie() may report this
->exit_code == SIGSTOP.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Oleg Nesterov
2009-02-09 05:20:05 UTC
Permalink
Post by Oleg Nesterov
Yes sure. I meant, instead of just checking task_is_stopped_or_traced() in
wait_consider_task(), we should do somthing like
In short, please see the "patch" below. I doubt it can be compiled,
just for the illustration.

Oleg.

--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1417,6 +1417,19 @@ static int wait_task_zombie(struct task_
return retval;
}

+static int *wait_xxx(struct task_struct *p, int ptrace)
+{
+ if (ptrace) {
+ if (task_is_stopped_or_traced(p))
+ return &p->exit_code;
+ } else {
+ if (p->signal->flags & SIGNAL_STOPPED_STOPPED)
+ return &p->signal->group_exit_code;
+ }
+
+ return NULL;
+}
+
/*
* Handle sys_wait4 work for one task in state TASK_STOPPED. We hold
* read_lock(&tasklist_lock) on entry. If we return zero, we still hold
@@ -1427,7 +1440,7 @@ static int wait_task_stopped(int ptrace,
int options, struct siginfo __user *infop,
int __user *stat_addr, struct rusage __user *ru)
{
- int retval, exit_code, why;
+ int retval, exit_code, *p_code, why;
uid_t uid = 0; /* unneeded, required by compiler */
pid_t pid;

@@ -1437,22 +1450,16 @@ static int wait_task_stopped(int ptrace,
exit_code = 0;
spin_lock_irq(&p->sighand->siglock);

- if (unlikely(!task_is_stopped_or_traced(p)))
- goto unlock_sig;
-
- if (!ptrace && p->signal->group_stop_count > 0)
- /*
- * A group stop is in progress and this is the group leader.
- * We won't report until all threads have stopped.
- */
+ p_code = wait_xxx(p, ptrace);
+ if (unlikely(!p_code))
goto unlock_sig;

- exit_code = p->exit_code;
+ exit_code = *p_code;
if (!exit_code)
goto unlock_sig;

if (!unlikely(options & WNOWAIT))
- p->exit_code = 0;
+ *p_code = 0;

/* don't need the RCU readlock here as we're holding a spinlock */
uid = __task_cred(p)->uid;
@@ -1608,7 +1615,7 @@ static int wait_consider_task(struct tas
*/
*notask_error = 0;

- if (task_is_stopped_or_traced(p))
+ if (wait_xxx(p, ptrace))
return wait_task_stopped(ptrace, p, options,
infop, stat_addr, ru);


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Loading...