Tracing logs in the RB and the CE.
A job in LCG is identified by the EDG jobid (or simply jobid).
With the command edg-job-get-logging-info -v 1 <jobid>, it is possible to
retrieve some logging information about the job, but in some cases it is necessary
to investigate what happened with the job within the CE in more depth.
In order to do that, an identifier for the job within the CE must be provided to
the CE's site admin, so that she can identify which job we are referring to.
This identifier is called the JM-contact string. It is not returned by the
edg-job-get-logging-info command, but it can be obtained from the RB logging files.
The procedure is as follows:
- In the logging information, look for the path of the RB's log file
(search for "CondorG.log"). It looks something like
/var/edgwl/logmonitor/CondorG.log/CondorG.xxx.log
- Retrieve that file from the RB:
globus-url-copy gsiftp://<hname>/var/edgwl/logmonitor/CondorG.log/CondorG.xxx.log
file:`pwd`/RB_log
- In the retrieved file, look for the last part of the jobid.
For https://lxn1182.cern.ch:9000/HAZrszS8RzlM82MlAqrAIw,
it'd be: HAZrszS8RzlM82MlAqrAIw
- Get the string of the form xxxxx.0000.0000 associated with that part of the jobid.
- In the same retrieved file, search repeatedly for that string until the JM-contact string
(labelled as such) is found.
- When telling the site about the job, refer to both the EDG jobid and the JM-contact string.
All of this can be done automatically with a Perl script, which includes a description
of its arguments and functionality:
getJM.pl
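For readers who prefer to follow the manual steps, the lookup can be sketched in shell as below. This is only an illustration, not a replacement for getJM.pl: the helper names jobid_key and log_lines are hypothetical, RB_log is the file retrieved with globus-url-copy above, and the xxxxx.0000.0000 string still has to be read off the matching lines by eye, since the exact log layout is not assumed here.

```shell
#!/bin/sh
# Sketch of the manual lookup. Helper names are hypothetical.

# Print the last path component of an EDG jobid,
# e.g. https://host:9000/HAZrsz... -> HAZrsz...
jobid_key() {
    printf '%s\n' "${1##*/}"
}

# Print every line of a log file that contains a fixed string.
# -F matches literally, so the dots in xxxxx.0000.0000 are not wildcards.
log_lines() {
    grep -F -- "$1" "$2"
}

# Usage (uncomment and adapt):
# key=$(jobid_key "https://lxn1182.cern.ch:9000/HAZrszS8RzlM82MlAqrAIw")
# log_lines "$key" RB_log                # read the xxxxx.0000.0000 string from these lines
# log_lines "xxxxx.0000.0000" RB_log     # then look for the line labelled JM-contact
```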
Getting the BDII configuration of the RB.
In LCG2, the information about resources comes from the BDII.
In the UI and in the WNs where the jobs run, the BDII is defined by the environment
variable LCG_GFAL_INFOSYS (it will not necessarily have the
same value in both, unless we set it in the job's JDL).
Nevertheless, the Resource Broker may use yet another BDII, and that one
determines which CEs are seen during the matchmaking process.
The following recipe can be used to find out which BDII our RB uses.
- To find out which RB is being used by default in a UI, look at:
$EDG_WL_LOCATION/etc/<VO>/edg_wl_ui.conf
Take the hostname that appears as the value of the NSAddresses attribute.
- Retrieve the configuration file /opt/edg/etc/edg_wl.conf from that RB with globus-url-copy:
globus-url-copy gsiftp://<hname>/opt/edg/etc/edg_wl.conf file:`pwd`/RB_conf
- In the retrieved file, look for the value of the II_Contact parameter.
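The two lookups in this recipe amount to searching a configuration file for an attribute name, which can be sketched as follows. The helper name conf_attr is hypothetical, RB_conf is the file retrieved above, and nothing is assumed about the syntax around the attribute: the matching lines are printed as they are.

```shell
#!/bin/sh
# Print the lines of a configuration file that mention an attribute.
# Only the attribute name is matched (literally); the line is shown unchanged.
conf_attr() {
    grep -F -- "$1" "$2"
}

# Usage (uncomment and adapt; <VO> is a placeholder for your VO name):
# conf_attr NSAddresses "$EDG_WL_LOCATION/etc/<VO>/edg_wl_ui.conf"
# conf_attr II_Contact RB_conf
```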
Debugging authentication errors.
First, let us review the main points to get right in order to avoid authentication problems:
- Users must be registered in a VO and hold a valid proxy (created with grid-proxy-init).
- The host needs a correct grid-mapfile (/etc/grid-security/grid-mapfile).
  - It is updated every six hours (at a random minute) by a cronjob.
  - The update requires edg-mkgridmap.conf to be correct (including its format!).
- The host needs an up-to-date Certificate Revocation List (CRL) for each CA.
  - The CRLs are updated by a cronjob that downloads them from a web site.
  - Each CA must publish a new CRL before the old one expires!
- The Certification Authority RPMs must be up to date.
And when an authentication error occurs...
- Look at the error messages carefully: they are usually very informative (though dense).
- Try to get more detailed messages by calling globus-url-copy in debug mode to get a
file from the host causing problems:
globus-url-copy -dbg gsiftp://<hostname>/etc/group file:/tmp/myFile
- Check the log file /var/log/messages for authentication error messages
- Go through the checks listed above:
  - Have a look at the grid-mapfile: check the user's DN and VO.
  - Check the date of the CRL files (/etc/grid-security/certificates/xxxxxx.r0):
    grep 'ext Update' /etc/grid-security/certificates/*.r0
  - Check that the CA RPMs are installed.
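The grid-mapfile and CRL checks can be scripted roughly as shown below. This is a sketch under stated assumptions: the helper names gridmap_entries and crl_next_updates are hypothetical, the openssl command-line tool is assumed to be installed, and the CRL files are assumed to be PEM-encoded. It uses openssl crl -nextupdate as an alternative to the grep above, which only works when the .r0 files contain the CRL in text form.

```shell
#!/bin/sh
# Helper names are hypothetical; paths default to the ones used in this page.

# Print the grid-mapfile entries that contain a given DN (literal match).
gridmap_entries() {
    grep -F -- "$1" "${2:-/etc/grid-security/grid-mapfile}"
}

# Print the next-update date of every CRL (*.r0) in a directory.
# Assumes PEM-encoded CRLs and an available openssl binary.
crl_next_updates() {
    dir=${1:-/etc/grid-security/certificates}
    for crl in "$dir"/*.r0; do
        [ -e "$crl" ] || continue      # no CRL files: the glob stayed unexpanded
        printf '%s: ' "$crl"
        openssl crl -in "$crl" -noout -nextupdate || echo "unreadable"
    done
}

# Usage (uncomment and adapt; the DN is a made-up example):
# gridmap_entries "/C=XX/O=Example/CN=Jane Doe"
# crl_next_updates
```

A CRL whose next-update date lies in the past has expired, and authentication against certificates from that CA will then fail until a fresh CRL is downloaded.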