Causes of the Maradona error

Q. What does the error "Status Reason:Cannot read JobWrapper output, both from Condor and from Maradona" mean?

A. The error means that the resource broker (RB) is not able to determine the exit status of the user's job. The RB has two methods of determining the exit status: a short output file returned either via Condor-G (and hence Globus), or directly via a globus-url-copy from the worker node (the so-called Maradona output). When both methods fail, the broker reports this error, so the first thing to suspect is a file-transfer problem.


A good test to catch a file-transfer problem from a site back to the RB is something like:

globus-job-run yourCEhost/jobmanager-lcgpbs \
  /opt/globus/bin/globus-url-copy \
  file:///etc/group gsiftp://someoneElsesRBhost/tmp/junk

However, many other circumstances can produce the "Status Reason:Cannot read JobWrapper output, both from Condor and from Maradona" error.
Below I have tried to list the main causes of this error and the solutions adopted during the LHCb DC04.

Please note that a single bad WN can cause all jobs on a site to fail!

  1. Known causes on the RB side include its modified GridFTP server not respecting the configured TCP port range, and out-of-date CRLs (see the CRL check sketched after this list).
  2. For PBS/Torque batch systems, a problematic WN whose ssh keys are badly synchronized may prevent the upload of the output back to the CE, and this can lead to the error.

    To check this, the site administrator should run the following loop over the WNs:

    for I in lcg0*
    do
        echo $I
        ssh root@$I 'su - lhcb001 -c "ssh lcgce02.gridpp.rl.ac.uk /bin/true"'
    done | tee /tmp/ssh.out
    and check if everything is OK.

    Faults with the ssh keys may come from problems with the periodic execution of the command:

    /opt/edg/sbin/edg-pbs-knownhosts

    which is supposed to correctly export the ssh host keys between the CE and new WNs (experience at ce1.egee.fr.cgg.com). A sketch of the cron setup is given after this list.
  3. It could also be due to a file server serving the WNs being down or in trouble (see the mount checks sketched after this list).
    For instance, at FZK the reason was hanging mounts caused by changed mount points on the file server.
    Another example is ce-a.ccc.ucl.ac.uk, where the file system had been remounted read-only.
    In October 2005 NIKHEF had a problem with NFS: at a certain point all concurrent jobs started writing their output to the disk server (instead of locally), the load of the NFS server went through the roof, and writing the job output failed.
  4. It might be that the job never started running on the WN (bad configuration; for example, it cannot find the home directory, or the home directory is full (Toronto July 2004, CNAF August 2005)).
    Another possibility is that /tmp is full (FZK August 2004):
    jobs were aborted with this error when the /tmp partition was full at job startup (a quick disk-space check is sketched after this list).
    Submitting a job directly via Globus to such a node produced an error message like this one:

    /var/spool/PBS/mom_priv/jobs/333595.pbs-.SC: cannot create temp file for here document: No space left on device
  5. Another possibility is that the job was killed before it finished.
    From Steve Traylen (experienced at RAL):
    "It was a mistake from when we merged the two schedulers together last week. One of the schedulers had some global defaults on pcput (CPU time per process) which had not been overridden in the LCG queue settings." This caused the following output in the batch worker log (a qmgr check for such limits is sketched after this list):

    20041209:12/09/2004 05:47:35;0008;pbs_mom;Job;280518.csflnx353.rl.ac.uk;pcput exceeded limit 108000
    20041209:12/09/2004 05:47:35;0008;pbs_mom;Job;280518.csflnx353.rl.ac.uk;kill_task: killing pid 2365 task 1 with sig 15

  6. A rather interesting case was experienced in Toronto during the summer of 2004:
    the CE had been under a large load for some days, and
    when that happens ssh and scp processes tend to time out.
    On that occasion the solution was to increase the maximum number of
    unauthenticated ssh connections from the default of 10 to 30, which allowed
    scp processes to succeed more often (see the sshd_config sketch after this list).

    Note however that if the CE is suffering under a very heavy load due to too many globus/perl processes, this probably won't help much, as there are no resources available!
    In general the load should not be so excessive as to prevent transfers.

  7. A mismatch between the callback ports sent from a CE to the RB and the ports actually open for incoming calls in the firewall, resulting in the RB's return calls being blocked (RAL experience; see the GLOBUS_TCP_PORT_RANGE sketch after this list).
  8. At BUDAPEST the underlying cause of Maradona was a problem with Condor:
    under heavy load the batch system was killing user jobs.
    In practice, LHCb found that their jobs were killed 2 seconds after they started, and no output was produced.
  9. At TRIUMF another cause was the following:
    the PBS server uses the hostname in the job ID (lcgce01.triumf.ca), and this hostname is used in the PBS scp statement that copies stdout and stderr back to the PBS server.
    However, the WNs are on a private network. This is correctly(?) routed to the WAN (through the lcfg machine, in fact), but the scp connection is refused because it looks like it came from the lcfg node.
    The fix was to change the routing tables on the WNs to use lcgce01.lcg (the internal network interface) as the gateway for destination lcgce01.triumf.ca (see the route sketch after this list).
  10. Another possible explanation is that the site administrator has (for whatever reason) cleaned up the logged undelivered stdout and stderr.
  11. We also discovered a correlation between the use of the BQS batch system and a massive presence of this error at the site (Lyon, February 2005).

  12. Clock skew on the WN is another possible cause, as experienced at the site Lancs-LCG2 in April 2005 (see the NTP check sketched after this list).
  13. A reboot of the CE (for whatever reason) implies a lost connection with the WNs, and hence losing track of the jobs running there.
  14. CNAF (October 2005): a WN stuck in kernel panic due to AFS.
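
Below are quick sketches for some of the checks mentioned above. All hostnames, paths, schedules and values in them are examples, not site prescriptions; adapt them to your setup.

For item 1, an out-of-date CRL can be spotted with openssl, assuming the CRLs live in the standard /etc/grid-security/certificates directory:

    # print the expiry (nextUpdate) of every installed CRL
    for crl in /etc/grid-security/certificates/*.r0
    do
        echo -n "$crl: "
        openssl crl -in "$crl" -noout -nextupdate
    done

Any nextUpdate in the past means the CRL is stale and should be refreshed (normally the job of the fetch-crl cron task).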
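
For item 2, edg-pbs-knownhosts is normally run periodically from cron so that the host keys of newly added WNs are picked up. A minimal sketch of such an entry in /etc/crontab (the schedule is only an example):

    # refresh the system-wide ssh known hosts with the current WN host keys
    05 */6 * * * root /opt/edg/sbin/edg-pbs-knownhosts

After adding new WNs it is worth running the command once by hand and checking that the new hosts appear in the known-hosts file it maintains.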
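
For item 3, hung or read-only mounts on a WN can often be spotted quickly. A sketch (the awk pattern is a heuristic):

    # a df that hangs at this point usually indicates a dead NFS/file server
    df -h

    # list filesystems currently mounted read-only
    awk '$4 ~ /(^|,)ro(,|$)/ {print $2, "("$1")"}' /proc/mounts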
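
For item 4, a loop in the same style as the one in item 2 can catch full /tmp or home partitions across the WNs (wn-list.txt is a hypothetical file holding your WN hostnames):

    for I in $(cat wn-list.txt)
    do
        echo "== $I =="
        ssh root@$I 'df -h /tmp /home'
    done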
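
For item 5, on PBS/Torque the server- and queue-level CPU limits can be inspected with qmgr; a global resources_max.pcput default that is not overridden in the queue settings is exactly the RAL situation. A sketch:

    # show every pcput limit set at server or queue level
    qmgr -c 'print server' | grep -i pcput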
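
For item 6, the number of concurrent unauthenticated ssh connections is controlled by the MaxStartups directive of sshd. A sketch of the Toronto change in /etc/ssh/sshd_config on the CE:

    # allow 30 simultaneous unauthenticated connections instead of the default 10
    MaxStartups 30

followed by a reload of sshd (e.g. /etc/init.d/sshd reload).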
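
For item 7, the callback ports used by the Globus services are driven by the GLOBUS_TCP_PORT_RANGE environment variable, which must match what the firewall actually has open. A sketch (the 20000-25000 range is only an example):

    # must agree with the firewall rules for incoming connections
    GLOBUS_TCP_PORT_RANGE="20000,25000"
    export GLOBUS_TCP_PORT_RANGE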
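
For item 9, the TRIUMF fix amounts to a host route on each WN (hostnames as in the item; the exact syntax depends on your distribution):

    # reach the CE's public name via the internal interface instead of the lcfg gateway
    route add -host lcgce01.triumf.ca gw lcgce01.lcg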
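
For item 12, clock skew on a WN can be checked, and fixed, with NTP (ntp.example.org stands for your site's time server):

    # query the offset without touching the clock
    ntpdate -q ntp.example.org

    # if the offset is large, step the clock once and keep ntpd running
    ntpdate -b ntp.example.org && service ntpd start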

For auxiliary information visit the Wiki page.


Roberto Santinelli