RESCOMP Archives

February 2006

RESCOMP@LISTSERV.MIAMIOH.EDU

Subject:
RE: update 1-41830023 Infiniband linpack testing on Miami University
From:
Xudong Yu <[log in to unmask]>
Reply To:
Research Computing Support <[log in to unmask]>, Xudong Yu <[log in to unmask]>
Date:
Fri, 17 Feb 2006 17:31:43 -0500
Content-Type:
text/plain
Parts/Attachments:
text/plain (119 lines)
Hi Robin:
	If you mean:

Port State of Port #1 on HCA #0 ........ DOWN

	This is normal; port 1 on the HCA is usually unused.
	However, this message:

Port State of Port #0 on HCA #0 ........ UP 4X

	is different from what I have; I have:

Port State of Port #0 on HCA #0 ........ UP

	I don't have the 4X there. I am not sure whether that is normal.

	Also, I want you to know that I have created a host list at /home/xudong/hpl/bin/Linux_ATHLON_VSIPL;
you can run Linpack like:

	/usr/local/topspin/mpi/mpich/bin/mpirun_ssh -np 200 -paramfile paramfile -hostfile list1 ./xhpl
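
	(If you ever need to recreate the host file: it is just plain text with one node name per line. The entries below are placeholders only, not the real contents of list1.)

	4-1
	4-3
	4-5
	...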

	Of course, before you run it, you need to modify the HPL.dat file so that Ps * Qs = 200.
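
	For example, the process-grid section of HPL.dat could look like the lines below for 200 processes (the 10 x 20 split is just one possible choice; any Ps and Qs with Ps * Qs = 200 will work, and a grid as close to square as possible, with Ps <= Qs, usually performs best):

	1            # of process grids (P x Q)
	10           Ps
	20           Qs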

	Also, for quick testing I have set Ns to 500; that number is too small for a real performance measurement.
We need to increase it to something like 50000 for real testing.

	I have run the 36-CPU Linpack job; it came out at 2.1 Gflops with Ns = 500, and it will go up if we set Ns
to 50000.
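
	As a rough rule of thumb for picking Ns (the node count and per-node memory below are only assumptions for illustration): the HPL matrix takes 8 * Ns^2 bytes in total, so to fill about 80% of memory on, say, 100 nodes with 2 GB each, you would solve 8 * Ns^2 = 0.8 * 100 * 2e9, which gives Ns of roughly 140000. So 50000 is still conservative, but big enough for a meaningful measurement.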

thanks

xudong



-----Original Message-----
From: Robin [mailto:[log in to unmask]]
Sent: Friday, February 17, 2006 5:21 PM
To: Research Computing Support; Xudong Yu
Cc: [log in to unmask]; Madhusudan; [log in to unmask]; Patrick
Sutton
Subject: Re: update 1-41830023 Infiniband linpack testing on Miami
University


All,

It might be the module that Steve Cruz was talking about.
Steve Cruz mentioned that one of the ports in the switch is bad. It
does seem to be the entire module that is bad.

How can we get this done as soon as possible?

I'll re-flash those nodes' HCAs and see what happens.
Up 4X (whatever that means).

---- Performing InfiniBand HCA Self Test ----
Number of HCAs Detected ................ 1
PCI Device Check ....................... PASS
Kernel Arch ............................ ia32e
Host Driver Version .................... rhel3-2.4.21-32.EL-3.2.0-67
Host Driver RPM Check .................. PASS
HCA Type of HCA #0 ..................... CougarCub
HCA Firmware on HCA #0 ................. v3.3.5 build 3.2.0.67  
HCA.CougarCub.A1
HCA Firmware Check on HCA #0 ........... PASS
Host Driver Initialization ............. PASS
Number of HCA Ports Active ............. 1
Port State of Port #0 on HCA #0 ........ UP 4X
Port State of Port #1 on HCA #0 ........ DOWN
Error Counter Check .................... PASS
Kernel Syslog Check .................... PASS
Node GUID .............................. 00:05:ad:00:00:04:8c:b0
------------------ DONE ---------------------


Thanks,
Robin


On Feb 17, 2006, at 4:59 PM, Xudong Yu wrote:

> Hi Robin:
> 	From the host list, I found that both 4-2 and 4-19 are down.
> 	Also, when I tried to launch the simple cpi job at
> /home/xudong/hpl/bin/Linux_ATHLON_VSIPL like:
>
> /usr/local/topspin/mpi/mpich/bin/mpirun_ssh -np 81 -paramfile  
> paramfile -hostfile list1 ./cpi
>
> 	I saw an error message like:
>
> [6] Abort: Got an asynchronous event: VAPI_PORT_ERROR
> (VAPI_EV_SYNDROME_NONE) at line 362 in file mpid/vapi/viainit.c
>
> 	came out. Usually this indicates a bad InfiniBand card or something
> wrong inside the InfiniBand switch.
>
> 	I have a list of the bad nodes from my 80-node testing, and the
> list is growing; this is an incomplete list:
>
> 3-1
> 3-2
> 3-7
> 3-8
> 3-9
> 3-15
> 3-19
> 3-21
> 3-22
> 	The bad-host list is really too big, so I think something may be
> wrong in the switch; we need to make sure
> all the InfiniBand cards are operating well before we go on to any
> Linpack testing.
>
> thanks
>
> xudong
