CLEANACCESS Archives

November 2006

CLEANACCESS@LISTSERV.MIAMIOH.EDU

Subject:
From:
"Kelley, Tim" <[log in to unmask]>
Reply To:
Cisco Clean Access Users and Administrators <[log in to unmask]>
Date:
Mon, 20 Nov 2006 10:57:07 -0800
Hi All,

I hope that someone on this listserv has a solution, or an idea why
our failover CAM (3.6.4) pair did not fail over when it was needed.  We
learned from a user that our users could not authenticate.  An attempt
to log in to the manager via the web pages could not connect, and
neither could an attempt to connect via ssh.  We next attempted to ssh
directly to our primary CAM (no luck), but we could ssh to our secondary
CAM.  We restarted the service on the secondary and received the
following:
	"Starting High-Availability services [  OK  ]
Please wait while bringing up service IP.
cl_status: 2006/11/20_08:55:43 ERROR: initiate_connection: connect failure: Connection refused
cl_status: 2006/11/20_08:55:43 ERROR: Cannot signon with heartbeat
..."

On connecting the console to our primary manager, we noticed it was
spamming the console with the following:
	"EXT3-fs error (device sda2) ext3_get_inode unable to read inode
ATA abnormal status 0xD0 on port 0x177"

A brief search on the web suggests that we might have a hard drive or
controller failure, but our larger concern right now, since the service
came back up after a reboot, is why the failover didn't work as planned.

The following is from our ha-log from the primary manager that spans the
downtime interval and reboot:

heartbeat: 2006/11/19_12:59:00 info: Daily informational memory statistics
heartbeat: 2006/11/19_12:59:00 info: MSG stats: 996/12954186 ms age 810 [pid2868/MST_CONTROL]
heartbeat: 2006/11/19_12:59:00 info: ha_malloc stats: 24000/345445184 1648924/774000 [pid2868/MST_CONTROL]
heartbeat: 2006/11/19_12:59:00 info: RealMalloc stats: 1652348 total malloc bytes. pid [2868/MST_CONTROL]
heartbeat: 2006/11/19_12:59:00 info: Current arena value: 2035712
heartbeat: 2006/11/19_12:59:00 info: MSG stats: 0/3 ms age 50049378 [pid2880/HBFIFO]
heartbeat: 2006/11/19_12:59:00 info: ha_malloc stats: 0/50  252/0 [pid2880/HBFIFO]
heartbeat: 2006/11/19_12:59:00 info: RealMalloc stats: 1468 total malloc bytes. pid [2880/HBFIFO]
heartbeat: 2006/11/19_12:59:00 info: Current arena value: 135168
heartbeat: 2006/11/19_12:59:00 info: MSG stats: 0/0 ms age 49155652 [pid2881/HBWRITE]
heartbeat: 2006/11/19_12:59:00 info: ha_malloc stats: 0/0  0/0 [pid2881/HBWRITE]
heartbeat: 2006/11/19_12:59:00 info: RealMalloc stats: 0 total malloc bytes. pid [2881/HBWRITE]
heartbeat: 2006/11/19_12:59:00 info: Current arena value: 0
heartbeat: 2006/11/19_12:59:00 info: MSG stats: 0/0 ms age 49155652 [pid2882/HBREAD]
heartbeat: 2006/11/19_12:59:00 info: ha_malloc stats: 0/17272104  14/0 [pid2882/HBREAD]
heartbeat: 2006/11/19_12:59:00 info: RealMalloc stats: 270 total malloc bytes. pid [2882/HBREAD]
heartbeat: 2006/11/19_12:59:00 info: Current arena value: 135168
heartbeat: 2006/11/19_12:59:00 info: These are nothing to worry about.
heartbeat: 2006/11/20_09:06:43 info: **************************
heartbeat: 2006/11/20_09:06:43 info: Configuration validated. Starting heartbeat 1.2.4
heartbeat: 2006/11/20_09:06:43 info: heartbeat: version 1.2.4
heartbeat: 2006/11/20_09:06:45 info: Heartbeat generation: 12
heartbeat: 2006/11/20_09:06:45 info: UDP Broadcast heartbeat started on port 694 (694) interface eth1
heartbeat: 2006/11/20_09:06:45 notice: Using watchdog device: /dev/watchdog
heartbeat: 2006/11/20_09:06:45 info: pid 3036 locked in memory.
heartbeat: 2006/11/20_09:06:45 info: Local status now set to: 'up'
heartbeat: 2006/11/20_09:06:46 info: pid 3049 locked in memory.
heartbeat: 2006/11/20_09:06:46 info: pid 3050 locked in memory.
heartbeat: 2006/11/20_09:06:46 info: pid 3051 locked in memory.
heartbeat: 2006/11/20_09:06:46 info: Link chi-nacman2:eth1 up.
heartbeat: 2006/11/20_09:06:46 info: Status update for node chi-nacman2: status active
heartbeat: 2006/11/20_09:06:46 info: Local status now set to: 'active'
heartbeat: 2006/11/20_09:06:46 info: Link chi-nacman1:eth1 up.
heartbeat: 2006/11/20_09:06:46 info: remote resource transition completed.
heartbeat: 2006/11/20_09:06:46 info: remote resource transition completed.
heartbeat: 2006/11/20_09:06:46 info: Local Resource acquisition completed. (none)
heartbeat: 2006/11/20_09:06:46 info: Initial resource acquisition complete (T_RESOURCES(them))
heartbeat: 2006/11/20_09:06:46 info: Running /etc/ha.d/rc.d/status status

This seems normal.
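
One way to check an ha-log like this for silent periods is to compare
consecutive timestamps.  The following is a minimal Python sketch, not
part of the CAM or heartbeat tooling; the function names and the
one-hour threshold are assumptions made for illustration.  Heartbeat
normally logs little besides the daily statistics, so a long gap by
itself does not prove that the node was hung.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the ha-log lines above, e.g. 2006/11/19_12:59:00
STAMP_FORMAT = "%Y/%m/%d_%H:%M:%S"


def parse_stamp(line):
    """Return the timestamp of a heartbeat log line, or None if the line
    has no leading 'heartbeat: <timestamp>' (e.g. a wrapped continuation)."""
    parts = line.split()
    if len(parts) < 2 or parts[0] != "heartbeat:":
        return None
    try:
        return datetime.strptime(parts[1], STAMP_FORMAT)
    except ValueError:
        return None


def find_gaps(lines, threshold=timedelta(hours=1)):
    """Return (start, end) pairs where consecutive timestamps are further
    apart than threshold.  The threshold is an arbitrary illustrative value."""
    gaps = []
    previous = None
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is None:
            continue
        if previous is not None and stamp - previous > threshold:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps


# Sample lines copied from the log excerpt above.
sample = [
    "heartbeat: 2006/11/19_12:59:00 info: These are nothing to worry about.",
    "heartbeat: 2006/11/20_09:06:43 info: **************************",
    "heartbeat: 2006/11/20_09:06:43 info: Configuration validated. Starting",
    "heartbeat 1.2.4",
]

for start, end in find_gaps(sample):
    print(f"no entries between {start} and {end} ({end - start})")
```

Run against the full ha-log (one line per list element), this would
report each interval longer than the threshold during which nothing
was logged.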

The hosts files include the service IP:

Primary (chi-nacman1): 132.241.200.10 chi-nacman1 chi-nacman1
Secondary (chi-nacman2): 132.241.200.10 chi-nacman2 chi-nacman2

Should this be reversed? Should chi-nacman1's hosts file be
"132.241.200.10 chi-nacman2 chi-nacman2"?

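To compare the two hosts files side by side, something like the
following Python sketch could be used.  It only reports which names each
file maps to the service IP; it does not say which mapping the CAM
expects.  The helper names are hypothetical, and the sample text is
limited to the two entries quoted above.

```python
def parse_hosts(text):
    """Map each IP address to the list of names found on its line(s)."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line:
            continue
        fields = line.split()
        table.setdefault(fields[0], []).extend(fields[1:])
    return table


def names_for(text, ip):
    """Return the distinct names that a hosts file maps to the given IP."""
    return sorted(set(parse_hosts(text).get(ip, [])))


SERVICE_IP = "132.241.200.10"

# Entries as quoted in this message; a real hosts file has more lines.
primary_hosts = "132.241.200.10 chi-nacman1 chi-nacman1\n"
secondary_hosts = "132.241.200.10 chi-nacman2 chi-nacman2\n"

for label, text in (("primary", primary_hosts), ("secondary", secondary_hosts)):
    print(label, SERVICE_IP, "->", names_for(text, SERVICE_IP))
```
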
Thanks for any help,

Tim

Tim Kelley
ResNet Coordinator
California State University, Chico
