Killing Trees (the Windows way)
I’ve just been finalizing – in more ways than one – the 0.6 beta of Steel. It’s been “nearly done” for about a week or so now, but there have been a couple of problems, now hopefully fixed.
The main one was the interesting habit of Steel zapping your desktop when a Ruby debugging session ended. It didn’t happen very often, but occasionally - ZAP!! - when you once had Word, Excel, Outlook, Visual Studio, etc., all up and running, all that remained was a virgin desktop. It took me some time to track this one down, and it’s all down to one of the most undesirable ‘features’ of Windows I’ve ever come across.
It’s to do with terminating a process ‘tree’. To be tidy, I thought it was a good idea to clean up any potential sub-processes that a Ruby program had created while I was debugging it. No problem – just find the Windows KillTree API. Except that there isn’t one. It seems that you have to do it the hard way and figure out which process is a child of the main Ruby process. OK – a little strange that Microsoft hadn’t provided an API for the job when there are over 80,000 of the things to do everything from formatting your hard disk to cleaning the fluff out of the keyboard. And stranger, there didn’t seem to be any documentation on how to do it; normally, there’s an MSDN article on stuff like that. At this point, alarm bells should have been ringing. But a quick search of Google came up with a technique which seemed to work fine. Most of the time.
It turns out that a process does indeed have a reference to its parent process (the process that created it). This is the parent Process ID (PID). However, the parent process can have exited without killing off its child. Worse, far worse, Windows reuses PIDs! So not only can the parent PID of a process refer to a non-existent process (not too bad) – it can also point to a perfectly good process that isn’t its parent!
In my case, occasionally, just occasionally, the ‘parent’ of the desktop Explorer was my Ruby process. So killing the sub-tree of processes supposedly created by Ruby zapped the desktop Explorer and all its child processes – Outlook, Word, Visual Studio, etc. Baaaah...!!!
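To make the failure concrete, here is a minimal sketch of the naive approach, run against a made-up process snapshot rather than the real Windows APIs. The Proc record, the PIDs and the process names are all hypothetical; the only point is that a walk based purely on parent PIDs cannot tell a genuine child from a process whose long-dead parent happened to have the same PID.

```python
from collections import namedtuple

# Hypothetical snapshot record: a PID plus the parent PID recorded at creation.
Proc = namedtuple("Proc", "pid ppid name")

def naive_descendants(snapshot, root_pid):
    """Collect every process whose recorded parent chain leads to root_pid."""
    children = {}
    for p in snapshot:
        children.setdefault(p.ppid, []).append(p)
    found, stack, seen = [], [root_pid], {root_pid}
    while stack:
        for child in children.get(stack.pop(), []):
            if child.pid not in seen:
                seen.add(child.pid)
                found.append(child)
                stack.append(child.pid)
    return found

# Invented data: the Ruby process has been handed PID 4120, which is also the
# PID that Explorer's (long gone) creator once had.
snapshot = [
    Proc(4120, 988, "ruby.exe"),
    Proc(1500, 4120, "explorer.exe"),   # stale parent PID
    Proc(2200, 1500, "winword.exe"),
    Proc(2300, 1500, "outlook.exe"),
    Proc(5000, 4120, "cmd.exe"),        # the only genuine child of Ruby
]

victims = sorted(p.name for p in naive_descendants(snapshot, 4120))
print(victims)
```

Only cmd.exe really belongs to Ruby here, yet the walk also sweeps up Explorer and everything beneath it.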
On reflection, I can’t see an absolutely safe way of killing a process tree in Windows because of this PID re-use. There just isn’t a cast-iron guarantee that the ‘parent’ of a given process really is the parent. I haven’t found any reference to this ’feature’ on the web anywhere. It seems that the issue of PID re-use is reasonably widely known, but the basic fact that you can’t build a good KillTree isn’t.
Note that you can tell whether the Parent Process ID for a process is valid or not. The key is the time the process was created. Assume your child started at 12:00 and its parent ID is 513. You get the process creation time for process 513. If it is before 12:00 AND this process is still alive, then it must have been alive for the whole time (otherwise it would have a different creation time). If it is after 12:00, the PID has been reused and process 513 cannot be the real parent.
Thus you can tell whether the parent ID is valid or not.
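The check described above could be sketched roughly as follows. This again works on a made-up snapshot: the Proc record and its fields are hypothetical, and the creation times are plain numbers. On Windows the creation time would have to come from something like GetProcessTimes, which is not shown here. There is also still a window between taking the snapshot and actually terminating anything, so this narrows the problem rather than removing it.

```python
from collections import namedtuple

# Hypothetical snapshot record of a live process; 'created' is any
# monotonically increasing timestamp.
Proc = namedtuple("Proc", "pid ppid created name")

def is_real_parent(child, candidate):
    """A live candidate can only be the parent if it is at least as old as the child."""
    return candidate.pid == child.ppid and candidate.created <= child.created

def safe_descendants(snapshot, root):
    """Walk the tree, following a parent PID only when the creation times agree."""
    found, frontier, seen = [], [root], {root.pid}
    while frontier:
        parent = frontier.pop()
        for p in snapshot:
            if p.pid not in seen and is_real_parent(p, parent):
                seen.add(p.pid)
                found.append(p)
                frontier.append(p)
    return found

# Invented data: Explorer (created at t=100) records parent PID 4120, but the
# process that now owns PID 4120 was only created at t=1000.
ruby = Proc(4120, 988, 1000, "ruby.exe")
snapshot = [
    ruby,
    Proc(1500, 4120, 100, "explorer.exe"),
    Proc(2200, 1500, 150, "winword.exe"),
    Proc(5000, 4120, 1005, "cmd.exe"),
]

names = [p.name for p in safe_descendants(snapshot, ruby)]
print(names)
```

With the timestamp check in place, Explorer is rejected because it is older than its supposed parent, and its own children are never reached.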
You might have a look at getpids. It is a small tool (with easily parseable output) that does what the preceding post described.
It is curious. I’ve never had a similar issue when using TaskManager’s kill process tree functionality, but I can’t say for sure that it could never happen.
I can believe that you will almost never see this problem in the normal course of events. The way I managed to find it was to continuously create and kill a Ruby process via the IDE - it took about 10-15 minutes before this occurred. I would guess about 1 time in 100 - hard work.
The reason I was so concerned about catching this was that I thought it was something that I had done in my Visual Studio Debug Engine - and I really wanted to get it and fix it before the code got out into the wide world. I was also trying to track down any memory leaks (always troublesome when dealing with COM). I thought the two might be related and I was totally surprised when I found the real reason - nothing whatsoever to do with COM or Visual Studio.
I’ve been through two different APIs for finding process information, so I don’t think there is an absolutely bulletproof way of killing a process tree in Windows.
If you have studied OSes, this is not so strange. It is similar to the *nix process model.
Imagine that you have a Windows box with a very, very long uptime. In fact, imagine a Windows that is so stable that you never reboot it. If you have a fixed-size structure for your process ID (e.g. an integer), then eventually as processes are started, do work and end, you will run out of process IDs. That is, you will need to reuse process IDs.
Imagine that you want to start a process. Imagine that process, in turn, wishes to start a new process and have this new process continue to run long after the original process exits. In Windows, this long-running process might be something like a Service. In the *nix world it is called a daemon. Since the original parent process is now dead, we would like to assign *some* process ID as the ‘parent’ of this orphaned child process. In *nix, typically the ID of the first starting process (usually called ‘init’) is used.
I would recommend reading up on fork(), exec(), and clone() for an understanding of the process creation thinking that informed the Windows development teams, and
for a discussion of CreateProcess, etc.
Actually, I do find it strange.
A 32-bit PID allows for 4 billion values, so allowing for (excessive to my mind) 10 processes being created per second CONTINUOUSLY, that will give a PID re-use interval of 13 plus years. So it seems to me that a PID should almost never be re-used, even on a Unix system which I think uses processes far more freely than a Windows system.
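For what it’s worth, the arithmetic behind that figure can be checked in a few lines. This assumes, as the estimate itself does, that PIDs are handed out one after another across the whole 32-bit range; nothing here says that Windows actually allocates them that way.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

pid_space = 2 ** 32        # distinct values in a 32-bit PID
creations_per_second = 10  # the (deliberately excessive) rate assumed above

years = pid_space / creations_per_second / SECONDS_PER_YEAR
print(f"{years:.1f} years before the PID space wraps around")
```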
I don’t know much about Unix internals, but I do know quite a bit about VAX VMS internals - even though it’s over 15 years since I last programmed a VMS system (!). Under VMS, a process would typically create sub-processes (though not always, so you could do Unix daemon stuff), but the key thing was that killing a process automatically zapped its sub-processes. The sub-processes were absolutely linked to the parent. This allowed you to create a ‘job’ which could always be cleaned up correctly.
Further, a VMS PID was unique and never re-used (from what I remember), though I’m not too sure about how this worked across clusters of VAXes. Sadly, I threw out all my VMS internals documentation years ago, so I can’t check this.
I’m just surprised that PIDs are re-used in Windows when there is no need for them to be so. And I’m even more surprised to find that I could kill off the wrong process accidentally.
I’m afraid it strikes me as bad design - probably a legacy from the days when NT had to be squeezed into 16MB of RAM.
Thanks for the link (interesting) anyway - I’ll have a good look around there.
I wonder if there might be some information on the Sysinternals site since the Process Explorer app has a Kill Tree function.
I used Process Explorer to help me track this one down. I remember clearly looking at the parent process id of the desktop explorer in Process Explorer and seeing that it was the same as my Ruby process - and that Process Explorer reported the Ruby Process as the parent! It all clicked as to what was going on then. In fact, you can see in the Process Explorer that the desktop Explorer normally has a ‘non-existent process’ as its parent PID. Presumably, this is the PID of the logon process that created it.
There really doesn’t seem to be a bulletproof way of doing a ‘KillTree’. I think you can be more intelligent about doing it - checking to see if the process that is about to be zapped is the desktop, for example, but it all looks a bit messy.
I suspect that the trouble comes from the internal design of Windows: there doesn’t seem to be the concept of a process tree with child sub-processes. That, coupled with the PID re-use, is the core of the problem.
I’d certainly be interested in a really good KillTree mechanism, though.
Have you looked into WIN32 Job Objects or Console Process Groups?
Create your own Windows service that listens for __InstanceCreationEvent and __InstanceDeletionEvent with a ManagementEventWatcher using WMI.
Maintain the list of processes in a dictionary and, on receiving the above events, update it: remove terminated processes from the list and set the parent ID of their child processes to 0 or something.
Information about processes, including the parent PID (the ParentProcessId property), can be fetched from the Win32_Process class using WMI.
Use the InvokeMethod on ManagementObject to terminate a process using WMI. Repeat for all child processes in the dictionary.
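The bookkeeping side of this suggestion might look something like the sketch below. It is only the dictionary logic: the ProcessTracker class and its method names are hypothetical, and the WMI event subscription and the actual termination call are left out entirely. It uses None rather than 0 as the ‘no parent’ marker, and it adds one assumption of its own: when a creation event arrives for a PID, any existing entries that still name that PID as their parent must be stale, since a brand new process cannot already have children.

```python
class ProcessTracker:
    """Tracks pid -> parent pid from creation/deletion events (hypothetical helper)."""

    ORPHAN = None  # marker for "parent has gone away"

    def __init__(self):
        self.parent_of = {}

    def _orphan_children_of(self, pid):
        # Only values change here, so iterating over the dict is safe.
        for child, parent in self.parent_of.items():
            if parent == pid:
                self.parent_of[child] = self.ORPHAN

    def on_created(self, pid, ppid):
        self._orphan_children_of(pid)  # PID may be a reused one
        self.parent_of[pid] = ppid

    def on_deleted(self, pid):
        self.parent_of.pop(pid, None)
        self._orphan_children_of(pid)

    def descendants(self, root_pid):
        found, frontier = [], [root_pid]
        while frontier:
            parent = frontier.pop()
            for child, ppid in self.parent_of.items():
                if ppid == parent and child != root_pid and child not in found:
                    found.append(child)
                    frontier.append(child)
        return found

# Invented sequence of events: a logon process (PID 10) starts Explorer (PID 20)
# and exits; later a Ruby process is given PID 10 and starts PID 30.
tracker = ProcessTracker()
tracker.on_created(10, 1)
tracker.on_created(20, 10)
tracker.on_deleted(10)
tracker.on_created(10, 5)
tracker.on_created(30, 10)

print(tracker.descendants(10))
```

A tracker like this only knows about processes created after it started listening, so anything already running when the service comes up would need to be treated as having an unknown parent.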