[Linuxha-users] Re: LinuxHA failover problems
Simon Edwards
simon.edwards at linuxha.net
Mon Jan 24 22:38:51 GMT 2005
Hello James,
Hopefully you should not need to load the drbd module yourself - this
should be done shortly after the failed node rejoins the cluster. If you
look in /var/log/cluster/lems/lems-mysql.log on the current primary
you will see the "fsmonitor" attempt to load the DRBD module. It always
tries first to switch the rejoined node to a normal secondary so that
only a partial resync is needed - but if that is unsuccessful after a
few attempts it then does a brute-force sync of everything... I'm
currently trying different failure scenarios to find out why the full
sync ends up happening so often.
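
As a rough illustration of the "try a partial resync a few times, then fall back to a full sync" behaviour described above, here is a minimal Python sketch. It is not the actual fsmonitor code: the function name, the two callables and the default of three attempts are all hypothetical, chosen only to show the control flow.

```python
from typing import Callable


def resync_with_fallback(try_partial: Callable[[], bool],
                         full_sync: Callable[[], None],
                         max_attempts: int = 3) -> str:
    """Attempt a partial resync up to max_attempts times, then fall back.

    try_partial and full_sync are hypothetical hooks standing in for
    whatever actually drives DRBD; they are not real linuxha calls.
    Returns "partial" or "full" to say which path was taken.
    """
    for _ in range(max_attempts):
        if try_partial():
            return "partial"
    # Every partial attempt failed, so resynchronise the whole device.
    full_sync()
    return "full"
```

Logging which path was taken, as the return value allows here, would make it easier to tell from the log whether a slow resync was a fallback to the full sync.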
If the resync definitely has not started a couple of minutes after the
failed node rejoins the cluster, please let me know.
I've also logged the "clstat" problem as something to look at. BTW -
could you mail me the /proc/drbd output whilst the synchronization is
taking place - your devices are probably larger than those I use and
that might be part of the reason for the problem.
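
The `8192)M` fragment in the Perl warning suggests the progress field has the shape `(left/total)M`, with a unit suffix that a plain split on `/` leaves attached to the number. Below is a minimal Python sketch of parsing such a field defensively. The field layout, the `K` alternative and the sample line in the usage note are assumptions and are not taken from real /proc/drbd output; clstat itself is Perl and may do this quite differently.

```python
import re

# Assumed shape of the progress field: "(left/total)M" or "(left/total)K".
PROGRESS_RE = re.compile(r"\((\d+)/(\d+)\)([KM])")


def parse_sync_progress(line: str):
    """Return (left, total, unit) from a progress line, or None if absent."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    left, total, unit = match.groups()
    return int(left), int(total), unit


def percent_complete(left: int, total: int) -> float:
    """Percentage already synchronised, guarding against a zero total."""
    if total == 0:
        return 100.0
    return 100.0 * (total - left) / total
```

With a made-up line such as `sync'ed: 0.1% (8190/8192)M`, `parse_sync_progress` returns the two numbers as integers with the unit separated, so the later division never sees a string like `8192)M`.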
Regards,
Simon.
On Mon, 2005-01-24 at 15:49 -0400, James MacLean wrote:
> Simon Edwards wrote:
>
> >James,
> > I made one or two small changes to 0.7.8 so let me run through the same
> >scenario as you...
> >
> >
> Hi Simon,
>
> Latest is much better on this end :). After the reboot, I get :
>
> # clform -join
> INFO 24/01/2005 19:36:35 Validated checksum for cluster configuration
> INFO 24/01/2005 19:36:35 Checking cluster status...
> INFO 24/01/2005 19:36:35 p-6.ednet.ns.ca is running - p-5.ednet.ns.ca
> will attempt tojoin cluster.
> INFO 24/01/2005 19:36:35 Starting cldaemon on p-5.ednet.ns.ca...
> INFO 24/01/2005 19:36:35 Waiting for p-5.ednet.ns.ca to join the cluster...
> INFO 24/01/2005 19:36:40 No response returned!
> INFO 24/01/2005 19:36:45 Connection made to p-5.ednet.ns.ca.
> INFO 24/01/2005 19:36:46 Node p-5.ednet.ns.ca successfully joined cluster.
>
> After which the cluster shows as active.
>
> But... then it appears that I must load the drbd module myself or the
> syncing doesn't startup. Maybe I do not wait long enough?
>
> Then the sync begins, but during this time I get a small perl error when
> I issue clstat :
>
> # clstat -application mysql
> Cluster: cluster1 - UP
>
>
> Application Node State Runnnig Monitor Stale Fail-over?
> mysql p-6 STARTED 0:00:29 Running 1 Yes
>
> File Systems
>
> Mount Point Valid Type State % Complete Completion
> Argument "8192)M" isn't numeric in division (/) at /sbin/cluster/clstat
> line 499.
> /var/lib/mysql local drbd Syncing 0 % 32:00
>
> Network Configuration for mysql on p-5.ednet.ns.ca
>
> Intfce Status Times used Time since use
> eth0 Active 1 0:00:29:22
>
> General Monitors
>
> Type Name Status
> Flag Check flag_check Running
> FS Monitor fsmonitor Running
> IP Monitor ip Running
> Link Monitor link Running
> IP Assignment move_ip Stopped
>
> Which goes away once the sync is complete.
>
> Also... :(, the sync takes as long as if I was syncing from nothing. I
> thought during my tests of drbd alone that if one in the drbd pair was
> off line for just a while then the sync happened quite fast? It appears
> to be syncing as if a new Secondary is being brought into the cluster?
> Of course maybe I'm just expecting it to be too fast ;).
>
> Thanks for your quick responses. Now to do more tests :).
>
> take care,
> JES