[Linuxha-users] Caught in a loop

Simon Edwards simon.edwards at linuxha.net
Tue Feb 1 22:05:42 GMT 2005


Hi Michael,
	On the "--restart" suggestion, I will probably add some simple
high-level shell scripts soon to do that and other similar tasks.
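
As a rough idea of what such a wrapper could look like, here is a minimal
sketch. The function name "clrestartapp" and the CLHALTAPP/CLSTARTAPP
override variables are hypothetical and not part of linuxha.net; the only
things taken from this thread are the "clhaltapp -A <app> -V" and
"clstartapp -A <app> -V" invocations. It also assumes both commands return
a non-zero exit status on failure.

```shell
#!/bin/sh
# Hypothetical restart wrapper: halt a clustered application, then start it.
# CLHALTAPP and CLSTARTAPP are illustrative overrides (handy for a dry run).
CLHALTAPP=${CLHALTAPP:-clhaltapp}
CLSTARTAPP=${CLSTARTAPP:-clstartapp}

clrestartapp() {
    app=$1
    if [ -z "$app" ]; then
        echo "usage: clrestartapp <application>" >&2
        return 2
    fi
    # Only attempt the start if the halt reported success
    # (assumes clhaltapp exits non-zero when it fails).
    $CLHALTAPP -A "$app" -V || return 1
    $CLSTARTAPP -A "$app" -V
}
```

With the function loaded, "clrestartapp apache" would run the two commands
one after the other, stopping early if the halt fails.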

	As for making alterations to the file systems when something breaks,
what I do currently depends on what breaks. If the start-up script for
the application does not run, then I typically alter the script to
temporarily insert an "exit 0" just after the start so that the file
systems still get mounted. Then I typically use "lemsctl" to pause
monitoring so I can fiddle;

# lemsctl --application X --msg "PAUSE"

Later you can start the monitoring again as necessary;

# lemsctl --application X --msg "RESUME"
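
The pause/fiddle/resume sequence can be wrapped so that monitoring is
always resumed afterwards. This is only a sketch: the function name
"with_monitoring_paused" and the LEMSCTL override variable are made up for
illustration, and it assumes lemsctl exits non-zero if the message cannot
be delivered.

```shell
#!/bin/sh
# Hypothetical helper: pause lems monitoring for an application, run a
# maintenance command, then resume monitoring again.
# LEMSCTL is an illustrative override (useful for a dry run).
LEMSCTL=${LEMSCTL:-lemsctl}

with_monitoring_paused() {
    app=$1
    shift
    # Assumes lemsctl returns non-zero if the PAUSE request fails.
    $LEMSCTL --application "$app" --msg "PAUSE" || return 1
    # Run the maintenance command and remember its exit status.
    "$@"
    status=$?
    # Resume monitoring whether or not the command succeeded.
    $LEMSCTL --application "$app" --msg "RESUME" || return 1
    return $status
}
```

For example, "with_monitoring_paused X vi /some/file" would pause
monitoring of application X, open the editor, and resume monitoring when
the editor exits.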

If everything is working and you simply wish to take down the
application for maintenance/changes then just the "lemsctl" commands can
be used. You can even remove the application IP address(es) if you wish
via;

# ifconfig eth0:1 inet 0.0.0.0

With this approach, running the "RESUME" option of the lemsctl command
later will re-assign the missing IP addresses.

Regards,
Simon.

On Tue, 2005-02-01 at 12:18 +1000, Michael Mansour wrote:
> Hi Simon,
> 
> I made the change to the PIDFILE setting below, but I also had to make the 
> modification to httpd.conf since that is the file which produces the pidfile 
> that apachectl reads.
> 
> Once I did the above, I was able to stop/start httpd with both local and 
> clustered setups independent of one another. I tried all combinations of 
> stop/start for local and clustered httpd and all worked fine.
> 
> The lems process monitor was also working as expected, and I ran this against 
> the /apache/admin/scripts/shutdown script (which didn't interfere with my 
> local httpd).
> 
> The only thing I could suggest as an improvement to this process would be 
> maybe adding a "--restart" application option to either clstartapp or 
> clhaltapp (or maybe a clrestartapp command?) ie. something that will perform 
> a:
> 
> # clhaltapp -A apache -V
> # clstartapp -A apache -V
> 
> in one go.
> 
> The other thing which concerned me was when I had to make a modification to a 
> file within the /apache clustered filesystem, but I couldn't start the apache 
> app so couldn't get access to the filesystem. What I had to do was use 
> drbd_tool to mount the drbd0 device on both nodes, mount the /apache 
> filesystem, make the change on the primary node, unmount the /apache 
> filesystem, shutdown the drbd0 device and then test the clstartapp again to 
> see if the app would start. Is there a better/easier way to perform this 
> trouble-shooting process?
> 
> Thanks.
> 
> Michael.
> 
> > Hi Michael,
> > 	On the lems front simply change the
> > "<process_string>httpd</process_string>" line in the "httpd.xml" 
> > file to tie it down to matching the correct processes, for example:
> > 
> > <process_string>httpd -f /apache/admin/conf/httpd.conf</process_string>
> > 
> > That should take care of that one.
> > 
> > As for the /apache/admin/scripts/apachectl script, it is possibly worth
> > changing the PIDFILE setting rather than setting RUNNING to 0, i.e.;
> > 
> > PIDFILE=/var/run/httpd-cluster.pid
> > 
> > Hope this works for you! Let me know if not.
> > 
> > Regards,
> > Simon.
> > 
> > On Mon, 2005-01-31 at 14:50 +1000, Michael Mansour wrote:
> > > Hi Simon,
> > > 
> > > After thinking about this a lot and running through some of my own setups
> > > for it, I couldn't get it to work. What I tried was to have a separate
> > > /etc/httpd/conf.d.cluster directory together with the normal
> > > /etc/httpd/conf directory.
> > > 
> > > So I decided to test your suggestion below, but found this didn't work
> > > either, with the problem being that the report shown in
> > > /var/log/cluster/apache.start.log is:
> > > 
> > > /apache/admin/scripts/apachectl start: httpd (pid 9605) already running
> > > 
> > > so it seemed no attempt at running another port 80 listener was made.
> > > 
> > > I'll explain this test setup for you. I set this up as:
> > > 
> > > 1. /etc/httpd_local be the ServerRoot for the local httpd (local:80)
> > > 
> > > 2. /etc/httpd be the ServerRoot for the cluster (clusternode1:80)
> > > 
> > > For 1, it starts up fine using the normal OS start scripts "service httpd 
> > > start":
> > > 
> > >  9605 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > >  9608 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > >  9609 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > >  9610 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > >  9611 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > >  9612 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > >  9613 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > >  9614 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 
> > > and:
> > > 
> > > tcp        0      0 local:80              0.0.0.0:*               LISTEN
> > > 
> > > But when I then execute 2, the apache app starts up fine with "clstartapp":
> > > 
> > > [root@node1 conf]# clstartapp -A apache -V
> > > INFO  31/01/2005 04:10:43 Validated checksum for cluster configuration
> > > INFO  31/01/2005 04:10:43 Checked that node names resolve to IP addresses
> > > INFO  31/01/2005 04:10:43 Validated Build run has completed against this configuration.
> > > INFO  31/01/2005 04:10:43 drbd kernel module loaded already on xxx.xxx.xxx.xxx
> > > INFO  31/01/2005 04:10:43 Checking heartbeats for any sign of life...
> > > WARN  31/01/2005 04:10:43 Attemping ICMP ping of xxx.xxx.xxx.xxx...
> > > INFO  31/01/2005 04:10:44 drbd kernel module loaded already on xxxxxxxxx
> > > INFO  31/01/2005 04:10:44 Local DRBD devices started successfully.
> > > INFO  31/01/2005 04:10:44 DRBD: Skipping ENBD decisioning and relying on meta data...
> > > INFO  31/01/2005 04:10:44 Attempting to start DRBD services on xxxxxxxxx
> > > INFO  31/01/2005 04:10:45 DRBD devices started successfully on xxxxxxxxx
> > > INFO  31/01/2005 04:10:45 Validated consistency of available data for DRBD.
> > > INFO  31/01/2005 04:10:45 Both data copies believed good.
> > > WARN  31/01/2005 04:10:45 Locking services not available.
> > > INFO  31/01/2005 04:10:45 Attempting to register application apache as starting...
> > > INFO  31/01/2005 04:10:46 Application registered successfully as starting.
> > > INFO  31/01/2005 04:10:46 Checking IP address for application is not in use...
> > > INFO  31/01/2005 04:10:48 Required application IP address is not pingable - continuing.
> > > INFO  31/01/2005 04:10:48 Attempting to make local DRBD devices primary...
> > > INFO  31/01/2005 04:10:48 All local DRBD now primary.
> > > INFO  31/01/2005 04:10:48 Running "/sbin/fsck -t ext3 -a /dev/drbd0"...
> > > INFO  31/01/2005 04:10:48 Running "PATH=$PATH:/sbin:/bin:/usr/sbin; mount -t ext3 -o rw /dev/drbd0 /apache"...
> > > INFO  31/01/2005 04:10:48 File systems mounted on DRBD devices.
> > > INFO  31/01/2005 04:10:48 choose_interface: Link beat ok on interface eth0
> > > INFO  31/01/2005 04:10:48 choose_interface: Assigning IP address xxx.xxx.xxx.xxx to interface eth0...
> > > INFO  31/01/2005 04:10:48 cmd=/sbin/ifconfig eth0:1 inet xxx.xxx.xxx.xxx netmask 255.255.255.0 2>&1
> > > INFO  31/01/2005 04:10:48 Running /sbin/cluster/tools/send_arp xxx.xxx.xxx.xxx xx:xx:xx:xx:xx:xx xxx.xxx.xxx.xxx FF:FF:FF:FF:FF:FF eth0 to send gratuitous arp request
> > > INFO  31/01/2005 04:10:48 choose_interface: Successfully assigned IP address xxx.xxx.xxx.xxx to eth0:1
> > > INFO  31/01/2005 04:10:48 choose_interface: Running IP level testing for interface eth0:1
> > > INFO  31/01/2005 04:10:48 choose_interface: Test for IP xxx.xxx.xxx.xxx(tcp) was OK.
> > > INFO  31/01/2005 04:10:48 choose_interface: IP level testing for interface eth0:1 succeeded
> > > INFO  31/01/2005 04:10:48 Applications start completed successfully
> > > 
> > > but the output of the apache.start.log being:
> > > 
> > > /apache/admin/scripts/apachectl start: httpd (pid 12088) already running
> > > 
> > > so there's no listen on port 80 of the cluster IP (clusternode1:80). 
> > > 
> > > So you know also, if I start 2 by itself (without first starting 1), the 
> > > listen on clusternode1:80 works.
> > > 
> > > So what I decided to do was a test, for the following bit:
> > > 
> > >     # check for pidfile
> > >     if [ -f $PIDFILE ] ; then
> > >         PID=`cat $PIDFILE`
> > >         if [ "x$PID" != "x" ] && kill -0 $PID 2>/dev/null ; then
> > >             STATUS="httpd (pid $PID) running"
> > >             RUNNING=1
> > > 
> > > in the /apache/admin/scripts/apachectl file, I modified it to RUNNING=0 to
> > > see what would happen, basically forcing the startup of apache on
> > > clusternode1. I then halted the app and stopped the local httpd.
> > > 
> > > I then restarted the local httpd (which establishes a local:80) and then
> > > clstartapp of the apache app, which successfully loaded the
> > > clusternode1:80 as we can see:
> > > 
> > > tcp        0      0 clusternode1:80        0.0.0.0:*               LISTEN
> > > tcp        0      0 local:80        0.0.0.0:*               LISTEN
> > > 
> > > and:
> > > 
> > > 16683 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16686 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16687 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16688 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16689 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16690 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16691 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16692 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16693 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16839 ?        S      0:00 /usr/sbin/httpd -f /apache/admin/conf/httpd.conf
> > > 16840 ?        S      0:00 /usr/sbin/httpd -f /apache/admin/conf/httpd.conf
> > > 16841 ?        S      0:00 /usr/sbin/httpd -f /apache/admin/conf/httpd.conf
> > > 16842 ?        S      0:00 /usr/sbin/httpd -f /apache/admin/conf/httpd.conf
> > > 16843 ?        S      0:00 /usr/sbin/httpd -f /apache/admin/conf/httpd.conf
> > > 16844 ?        S      0:00 /usr/sbin/httpd -f /apache/admin/conf/httpd.conf
> > > 16845 ?        S      0:00 /usr/sbin/httpd -f /apache/admin/conf/httpd.conf
> > > 16846 ?        S      0:00 /usr/sbin/httpd -f /apache/admin/conf/httpd.conf
> > > 16848 ?        S      0:00 /usr/sbin/httpd -f /apache/admin/conf/httpd.conf
> > > 
> > > as we can see, the modification produced the result I wanted.
> > > 
> > > I tested the websites and they all worked fine.
> > > 
> > > Halting the apache app also stopped just clusternode1:80 listen, again
> > > what I wanted.
> > > 
> > > I'm not sure what impact the modification above will make to the overall
> > > usage of linuxha.net, but it's the only way I was able to get this to work.
> > > 
> > > One other thing of note, when I ran:
> > > 
> > > # service httpd stop
> > > 
> > > while both local:80 and clusternode1:80 were up, the clusternode1:80 httpd
> > > processes were killed instead of local:80, not something I wanted. Also,
> > > while this was the case clstat showed the apache application was still
> > > running on clusternode1:
> > > 
> > >  Application       Node      State  Started  Monitor  Stale  Fail-over?
> > >       apache      local    STARTED  0:00:00  Running      0         Yes
> > > 
> > > even though this isn't the case since we see:
> > > 
> > > 16683 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16686 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16687 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16688 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16689 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16690 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16691 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16692 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 16693 ?        S      0:00 /usr/sbin/httpd -d /etc/httpd_local
> > > 
> > > which is the local:80 instance of httpd and not clusternode1:80 - I
> > > suspect the lems monitor is only assessing whether httpd daemons are
> > > running, and not specifically whether the httpd running is local:80 or
> > > clusternode1:80.
> > > 
> > > I'd really like to get this working properly, so if you have any
> > > suggestions as to how I can achieve this I'd really appreciate it.
> > > 
> > > Michael.
> > > 
> > > > Hello Michael,
> > > > 	Having two server roots (i.e. /etc/httpd and /etc/httpd_local) is
> > > > the way to go I would guess. Initially copy all the files from one
> > > > to the other and then customise each by changing the Listen entry in
> > > > each to the local and clustered IP addresses as appropriate.
> > > > 
> > > > 	The -d option on httpd startup can then be used to specify the
> > > > server root (/etc/httpd or /etc/httpd_local) as required.
> > > > 
> > > > Regards,
> > > > Simon.
> > > > 
> > > > On Fri, 2005-01-28 at 14:10 +1000, Michael Mansour wrote:
> > > > > Hi Simon,
> > > > > 
> > > > > Just so you know, I have only one apache instance which runs as the
> > > > > "apache" application starts up. I didn't realise I could have had two.
> > > > > 
> > > > > The way I have it set up is to not have any apache server started on
> > > > > system boot. I then form the cluster and start the apache app, which
> > > > > reads the other conf files from /etc/httpd/conf.d/*conf and starts up
> > > > > the virtual servers.
> > > > > 
> > > > > I'm going to think about how I can now have two apache instances
> > > > > running, the local one and the clustered one. Do I need to have two sets
> > > > > of /etc/httpd/conf.d directories in this case? one for the local and one
> > > > > for the clustered?
> > > > > 
> > > > > Thanks.
> > > > > 
> > > > > Michael.
> > > > > 
> > > > > > Hello Michael,
> > > > > > 	From what you're saying, when you use clhaltapp on node1 both Apache
> > > > > > daemons stop? If that is the case you will need to modify the script
> > > > > > that is in place for the "stopscript" command for the application in
> > > > > > /etc/cluster/apache/appconf.xml. The first thing to do is to test it:
> > > > > > with the apache application running on node1 as well as your local
> > > > > > Apache, run whatever the "stopscript" is. It should only stop the
> > > > > > clustered Apache instance. If no httpd processes are running you now
> > > > > > know that this script needs to be modified in some way.
> > > > > > 
> > > > > > 	Possibly both use the same pid file (/var/run/httpd.pid)?
> > > > > > 
> > > > > > Regards,
> > > > > > Simon
> 
> 
> 



