Date: | Sun, 5 Mar 2006 18:50:50 -0500 |
Stephen Wright wrote:
>I Googled libimf.so; the first dozen hits were about this particular
>problem. Apparently the Intel 9.0 compilers specify libimf.so after
>libm.so when linking, whereas it should be the other way around. Various
>fixes are recommended, such as using Intel 8.1 compilers instead or
>explicitly specifying the order of -limf and -lm on the command line. I'll
>try again, see how it works, and let you know.
>
>Maybe the sysadmin can fix this in the compiler configuration?
>
>
This is related to LD_LIBRARY_PATH not being passed by SSH during a
non-interactive, password-less SSH session (mpirun_ssh).
I modified the bash startup script so that all default modules are loaded
in password-less *non-interactive* SSH sessions as well.
This makes all of your dynamically linked apps use Intel 9's libs, since
Intel 9 is the default module for everyone.
It also gets rid of the error message; you won't see it again.
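
For illustration, a startup-file change of this kind could look roughly like the fragment below. This is a sketch, not the actual change that was made: the default MODULES_INIT path and the function name load_default_modules are assumptions. Bash normally reads ~/.bashrc when it is started by sshd to run a command, so the fragment would have to sit above any "return if not interactive" guard in that file.

```shell
# Sketch of a ~/.bashrc fragment (not the actual change described above).
# MODULES_INIT is an assumed location; point it at the site's Modules init script.
MODULES_INIT=${MODULES_INIT:-/etc/profile.d/modules.sh}

# Source the Modules init script if it is readable, so that the default
# modules set PATH and LD_LIBRARY_PATH even in non-interactive SSH sessions.
load_default_modules() {
    if [ -r "$MODULES_INIT" ]; then
        . "$MODULES_INIT"
    fi
}

load_default_modules
```

If the init script is missing, the function does nothing, so the fragment is harmless on nodes without Modules.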
We had a very brief discussion of the issue on Thursday/Friday.
The core issue is to make mpirun_ssh 'pass' the appropriate PATH and
LD_LIBRARY_PATH so that they are consistent w/ Modules, rather than
having only the *default* PATH and LD_LIBRARY_PATH used.
This comes after talking to 2 different Cisco guys, Dave, and thinking
about the issues for a while. It is about maintaining consistency
between env variables (PATH, LD_LIBRARY_PATH) and our 'Modules' setup.
So as not to burden users unnecessarily, we ought to have a wrapper
script (/usr/local/topspin/mpi/mpich/bin/mpirun_ssh.wrap.sh).
If you use the wrapper script, it will make SSH pass in the appropriate
PATH and LD_LIBRARY_PATH (in a way that is consistent w/ Modules) for you.
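
One way a wrapper could carry the environment across is sketched below. It is an illustration under stated assumptions, not the contents of mpirun_ssh.wrap.sh: the helper names sq and make_launcher and the file name launch.sh are hypothetical. The idea is to write a small launcher script that restores the submitting shell's PATH and LD_LIBRARY_PATH and then execs the real program, and to hand that launcher to mpirun_ssh in place of the program.

```shell
#!/bin/sh
# Sketch only: not the contents of the real mpirun_ssh.wrap.sh.

# sq STRING: wrap STRING in single quotes so a shell reads it back literally.
sq() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# make_launcher PROGRAM LAUNCHER
# Write LAUNCHER, a small script that restores the caller's current PATH and
# LD_LIBRARY_PATH (as set up by Modules) and then execs PROGRAM with any
# arguments it was given.
make_launcher() {
    {
        printf '#!/bin/sh\n'
        printf 'PATH=%s\n' "$(sq "$PATH")"
        printf 'LD_LIBRARY_PATH=%s\n' "$(sq "${LD_LIBRARY_PATH:-}")"
        printf 'export PATH LD_LIBRARY_PATH\n'
        printf 'exec %s "$@"\n' "$(sq "$1")"
    } > "$2"
    chmod +x "$2"
}

# Hypothetical use inside a job script (launch.sh is a made-up name):
#   make_launcher "$PWD/mnm3mpi" "$PWD/launch.sh"
#   mpirun_ssh -np 2 -hostfile "$PBS_NODEFILE" ./launch.sh
```

The launcher has to exist at the same path on every node that runs a process, which holds trivially for a single-node job but would need a shared directory otherwise.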
This is about setting up a sane framework for handling *multiple* versions
of apps and multiple versions of libs, which is what Modules is for.
Once everyone agrees on the problem (it's rather subtle) and on the
solution, we can move the wrapper script to mpirun_ssh so that it's
transparent to users.
Otherwise, we should move on to a different solution.
I accidentally hit 'Send' on the previous email; please pardon the
unfinished sentences. By default, Linux does not include the current
directory in the PATH env variable.
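
The PATH behaviour can be checked with a short script like the one below. It is only an illustration; hello_demo and the variable names are made up for the demo.

```shell
#!/bin/sh
# Illustration only: file and variable names below are made up for the demo.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Create a tiny executable in the current directory.
printf '#!/bin/sh\necho hello\n' > hello_demo
chmod +x hello_demo

# Bare name: the shell searches only the directories listed in PATH.
# A PATH without "." is set in a subshell so the result does not depend
# on the caller's environment.
if (PATH=/usr/bin:/bin; hello_demo) >/dev/null 2>&1; then
    bare=found
else
    bare=missing
fi

# Explicit relative path: no PATH search takes place.
if ./hello_demo >/dev/null 2>&1; then
    dotted=found
else
    dotted=missing
fi

echo "bare name: $bare; ./hello_demo: $dotted"
```

With "." absent from PATH, the bare name should be reported as missing and the ./ form as found, which matches the first error in the job output quoted below.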
Thanks,
Robin
>- Steve
>
>
>
>>3. Problem: I tried to run the following PBS job:
>>
>>
>>
>>>#PBS -l nodes=1:ppn=2
>>>#PBS -l walltime=06:00:00
>>>#PBS -j oe
>>>set -x
>>>date
>>>cd $TMPDIR
>>>
>>># copy the executable
>>>cp ${HOME}/mnm/mnm3mpi .
>>>cp ${HOME}/mnm/interval .
>>>
>>>ls
>>>
>>># run the executable
>>>mpirun_ssh -np 2 -hostfile $PBS_NODEFILE mnm3mpi > trial-0000.out
>>>
>>>date
>>>
>>># copy the output
>>>cp ./trial-0000.out ${HOME}/mnm
>>>
>>>
>>Here's the output file showing an error:
>>
>>
>>
>>>++ date
>>>Sun Mar 5 14:18:54 EST 2006
>>>++ cd /tmp/pbs.738.mulnx31
>>>++ cp /home/wrightse/mnm/mnm3mpi .
>>>++ cp /home/wrightse/mnm/interval .
>>>++ ls
>>>interval
>>>mnm3mpi
>>>++ mpirun_ssh -np 2 -hostfile /var/spool/PBS/aux/738.mulnx31 mnm3mpi
>>>/usr/bin/env: mnm3mpi/usr/bin/env: mnm3mpi: No such file or directory
>>>: No such file or directory
>>>++ date
>>>Sun Mar 5 14:18:54 EST 2006
>>>++ cp ./trial-0000.out /home/wrightse/mnm
>>>
>>>
>>Suspecting that the path didn't include the '.' directory, I changed the
>>mpirun command to use './mnm3mpi' rather than 'mnm3mpi', which gave me the
>>following error instead:
>>
>>
>>
>>>++ date
>>>Sun Mar 5 14:22:40 EST 2006
>>>++ cd /tmp/pbs.739.mulnx31
>>>++ cp /home/wrightse/mnm/mnm3mpi .
>>>++ cp /home/wrightse/mnm/interval .
>>>++ ls
>>>interval
>>>mnm3mpi
>>>++ mpirun_ssh -np 2 -hostfile /var/spool/PBS/aux/739.mulnx31 ./mnm3mpi
>>>./mnm3mpi: error while loading shared libraries: libimf.so: cannot open
>>>shared object file: No such
>>>file or directory
>>>./mnm3mpi: error while loading shared libraries: libimf.so: cannot open
>>>shared object file: No such
>>>file or directory
>>>++ date
>>>Sun Mar 5 14:22:40 EST 2006
>>>++ cp ./trial-0000.out /home/wrightse/mnm
>>>
>>>
>>Any ideas what's going wrong here?
>>
>>Steve
>>
>>
>>