11

I originally asked this question on Stack Overflow, then realised that this is probably a better place.

I have bluepill set up to monitor the delayed_job processes of my Ruby on Rails application.

Using Ubuntu 12.10.

I am starting and monitoring the bluepill service itself using Ubuntu's upstart. My upstart config is below (/etc/init/bluepill.conf).

description "Start up the bluepill service"

start on runlevel [2]
stop on runlevel [016]

expect daemon
exec sudo /home/deploy/.rvm/wrappers/<app_name>/bluepill load /home/deploy/websites/<app_name>/current/config/server/staging/delayed_job.bluepill

# Restart the process if it dies with a signal
# or exit code not given by the 'normal exit' stanza.
respawn

I have also tried with expect fork instead of expect daemon. I have also tried removing the expect... line completely.

When the machine boots, bluepill starts up fine.

$ ps aux | grep blue
root      1154  0.6  0.8 206416 17372 ?        Sl   21:19   0:00 bluepilld: <app_name>

The PID of the bluepill process is 1154 here. But upstart seems to be tracking the wrong PID. It is tracking a PID which does not exist.

$ initctl status bluepill
bluepill start/running, process 990

I think it is tracking the PID of the sudo process which started the bluepill process.
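
To make the mismatch easy to check, here is a minimal sketch that pulls the PID out of an `initctl status` line and tests whether a process with that PID still exists. The helper names `tracked_pid` and `pid_alive` are made up for illustration, and the parsing assumes the single-line status format shown above.

```shell
#!/bin/sh
# Sketch only: tracked_pid and pid_alive are hypothetical helper names.

# Extract the PID from a status line such as
# "bluepill start/running, process 990". Prints nothing if no PID is present.
tracked_pid() {
    printf '%s\n' "$1" | sed -n 's/.*, process \([0-9][0-9]*\).*/\1/p'
}

# Succeed if a process with the given PID currently exists.
# The /proc check works for processes owned by other users (Linux only);
# kill -0 is a fallback for systems without /proc.
pid_alive() {
    [ -d "/proc/$1" ] || kill -0 "$1" 2>/dev/null
}

# Usage on a machine running Upstart:
#   pid=$(tracked_pid "$(initctl status bluepill)")
#   pid_alive "$pid" || echo "Upstart tracks PID $pid, which does not exist"
```

If the tracked PID is not alive while `ps aux | grep blue` still shows bluepilld, Upstart has latched onto an intermediate process rather than the daemon.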

This is preventing the bluepill process from getting respawned if I forcefully kill bluepill using kill -9.

Moreover, I think because of the wrong PID being tracked, reboot / shutdown just hangs and I have to hard reset the machine every time.

What could be the issue here?

UPDATE:

The problem remains as of today (3 May 2015) on Ubuntu 14.04.2.

The problem is not because of using sudo. I am not using sudo anymore. My updated upstart config is this:

description "Start up the bluepill service"

start on runlevel [2]
stop on runlevel [016]

# Restart the process if it dies with a signal
# or exit code not given by the 'normal exit' stanza.
respawn

# Give up if restart occurs 10 times in 90 seconds.
respawn limit 10 90

expect daemon

script
    shared_path=/home/deploy/websites/some_app/shared

    bluepill load $shared_path/config/delayed_job.bluepill
end script

When the machine boots, the program loads up fine. But upstart still tracks the wrong PID, as described above.

The workaround mentioned in the comments may fix the hanging issue. I haven't tried it, though.

4
  • have you tried looking at what process 990 is? ps aux | grep 990 should do it but pstree 990 might be more informative.
    – Oli
    Jul 12, 2013 at 8:33
  • No process with the PID of 990 exists.
    – Anjan
    Jul 12, 2013 at 8:48
  • 2
    as far as the need to reboot to get upstart back into a good state - see this nice tool: github.com/ion1/workaround-upstart-snafu Jan 7, 2015 at 21:27
  • and you can speed up that tool with this command: $ echo 3000 | sudo tee /proc/sys/kernel/pid_max Jan 7, 2015 at 21:28

3 Answers

8

Quite late, but hopefully this can be of help to other users.

There is a documented bug in upstart which can cause initctl to track the wrong PID if you specify the wrong expect stanza in an upstart config: https://bugs.launchpad.net/upstart/+bug/406397

Upstart reads the expect stanza to decide how many forks it should follow before settling on the "true" PID of the program being controlled. If you specify expect fork or expect daemon but your program does not fork that many times, start will hang. If, on the other hand, your process forks more times than the stanza says, initctl will track the wrong PID. Theoretically, this should be documented in this section of the upstart cookbook, but as you can see in this situation there is a PID associated with the killed process when there shouldn't be.

The implications are explained in the bug tracker comments, but I'll summarize here. First, initctl cannot stop the daemon process and gets stuck in an undocumented/illegal state: <service> start/killed, process <pid>. Second, if the process belonging to that PID exits (and it usually will), the PID is freed up for re-use by the system.

If you then issue initctl stop <service> or service <service> stop, initctl will kill that PID the next time it appears. This means that, somewhere down the road, if you don't reboot after making this mistake, the next process to receive that PID will be killed immediately by initctl even though it is not the daemon. It could be something as simple as cat or as complex as ffmpeg, and you'd have a hard time figuring out why your software package crashed in the middle of some routine operation.

So, the issue is that the expect option you specified does not match the number of forks your daemon process actually makes. There is said to be an upstart rewrite that addresses this, but as of upstart 1.8 (latest Ubuntu 13.04/January 2014) the issue is still present.

Since you used expect daemon and ended up with this issue, I recommend trying expect fork.
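
Rather than guessing, the number of forks can be estimated from an strace log, which is the approach the Upstart Cookbook describes. The sketch below assumes the log was captured with something like `strace -f -o /tmp/strace.log <command that starts the daemon>`. The helper names `count_forks` and `suggest_expect` are made up for illustration. The count also includes forks made by child processes, so treat it as an upper bound and not as a definitive answer.

```shell
#!/bin/sh
# Sketch only: count_forks and suggest_expect are hypothetical helper names.

# Count fork() and clone() calls in an strace log, ignoring clone() calls
# that only create threads (those carry the CLONE_THREAD flag).
count_forks() {
    grep -E '(^|[^a-z_])(fork|clone)\(' "$1" | grep -vc 'CLONE_THREAD'
}

# Map the fork count to the expect stanza that corresponds to it:
# expect fork means exactly one fork, expect daemon means exactly two.
suggest_expect() {
    case "$(count_forks "$1")" in
        0) echo "no expect stanza" ;;
        1) echo "expect fork" ;;
        2) echo "expect daemon" ;;
        *) echo "more than two forks: not supported by expect" ;;
    esac
}

# Usage:
#   suggest_expect /tmp/strace.log
```

Filtering out CLONE_THREAD matters for a multi-threaded daemon, because every thread it starts would otherwise be counted as a fork.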

Edit: Here's an Ubuntu bash-compatible script (original by Wade Fitzpatrick, modified to use Ubuntu's sleep) that spawns processes until the available PID space is exhausted, at which point the kernel wraps around to low PIDs and the script works its way up to the "stuck" PID. A process is then spawned at the PID initctl is hung up on, and initctl kills it and resets.

#!/bin/bash

# usage: bash /tmp/upstart_fix.sh <pid>
# (run with bash, not sh: the script uses (( )) and declare, which dash lacks)

sleep 0.001 &
firstPID=$!
# first, let's exhaust the PID space so that allocation wraps around
while (( $! >= $firstPID ))
do
    sleep 0.001 &
done

# stop one PID short of the target so that the next process spawned gets it
declare -i testPID
testPID=$(($1 - 1))
while (( $! < $testPID ))
do
    sleep 0.001 &
done

# fork a background process then die so init reaps its pid
sleep 3 &
echo "Init will reap PID=$!"
kill -9 $$
# EOF
1
  • This answer has some useful and interesting information however it's unclear to me how this answer answers the initial question as @Anjan mentioned "I have also tried with expect fork instead of expect daemon. I have also tried removing the expect... line completely."
    – user12345
    Jul 27, 2014 at 20:22
5

For the provided example:

$ initctl status bluepill
bluepill start/running, process 990

a quick solution for me is:

# If upstart gets stuck for some job in stop/killed state
export PID=990
cd /usr/local/bin
wget https://raw.github.com/ion1/workaround-upstart-snafu/master/workaround-upstart-snafu
chmod +x workaround-upstart-snafu
./workaround-upstart-snafu $PID

source: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=582745#37

I hope this will be helpful. What is going on is explained in the other answers.

1
  • Nice script. This can take a minute or two. A reboot might sometimes be preferable and also fixes this. Jul 22, 2016 at 13:50
0

Unless you are running an Upstart user-level job or using the setuid stanza, your job is running as root.

Since Upstart is already running as root, why do you need to use sudo at all in your exec stanza?

Using sudo or su in the exec stanza has caused the same problems for me that you describe here.

Typically I will experience item 1 OR both 1 AND 2:

  1. upstart follows the incorrect PID
  2. upstart hangs when I try to stop the process

Of course, additionally you must have the expect stanza reflect the correct number of forks.

YMMV, but for me:

  • using sudo or su in the exec stanza with the correct number of forks specified generally results in situation 1 above.
  • incorrect number of forks specified (with or without sudo/su in exec) results in situation 1 AND 2 above.
