Causes of Maradona error
Q. What does the error "Status Reason:Cannot read JobWrapper output, both
from Condor and from Maradona" mean?
R. The error means that the Resource Broker (RB) was not able to
determine the exit status of the user's job. The RB has two methods
to determine the exit status: a short output returned either via
Condor-G (and hence Globus), or directly via a globus-url-copy from
the worker node (the so-called "Maradona" output). When both methods
fail the broker reports the error - so the first possible problem is
a file transfer failure from the worker node back to the RB.
A good test to catch a file transfer problem from a site back to
the RB is something like:
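A sketch of such a test, to be run on a worker node (the RB hostname
and paths below are invented examples, and a valid grid proxy is
assumed; the original text does not give the exact command):

```shell
# Copy a small test file from the WN back to the RB's gridftp server,
# exercising the same path the "Maradona" output would take.
# rb-host.example.org and the destination path are placeholders.
echo "maradona test" > /tmp/maradona-test.txt
globus-url-copy \
    file:///tmp/maradona-test.txt \
    gsiftp://rb-host.example.org/var/tmp/maradona-test.txt
```

If this hangs or fails, the site has a transfer problem independent
of anything the job itself did.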
However there can be many other circumstances which produce the "Status
Reason:Cannot read JobWrapper output, both from Condor and from Maradona"
I have tried to list the main causes of this error and the possible
solutions adopted during the LHCb DC04.
Please note that a single bad WN can be responsible for all jobs to
fail on a site!
- One of the reasons can be the RB's modified gridftp server not
respecting a TCP_RANGE; out-of-date CRLs are another.
- For PBS/Torque batch systems a problematic WN with badly
synchronized ssh keys may prevent the upload of the output back to
the CE - and this can lead to the error.
To check this the site administrator can run (looping over the WNs):
for I in lcg0*; do
ssh root@$I 'su - lhcb001 -c "ssh lcgce02.gridpp.rl.ac.uk /bin/true"'
done | tee /tmp/ssh.out
and check that everything is OK.
Faults with the ssh keys may come from problems with the periodic
execution of the command that is supposed to correctly export the
ssh keys between the CE and new WNs (ce1.egee.fr.cgg.com experience).
- It could also be due to a file system server serving the WNs
which is down or in trouble. For instance at FZK the reason was
hanging mounts due to changed mount points of the file system.
Another example is ce-a.ccc.ucl.ac.uk, where the file system had
been remounted read-only.
In October 2005 NIKHEF had a problem with NFS: at a certain point
all concurrent jobs started writing their output to the disk server
(instead of locally), the load of the NFS server climbed without
bound, and writing the job output failed.
- It might be that the job never started to run on the WN (bad
configuration: for example it cannot find the home directory, or
the home directory has filled up (Toronto July 2004, CNAF August
2005)).
Another possibility is that /tmp is full (FZK August 2004):
jobs were aborted with this error when the "/tmp" partition was
full on a WN.
Submitting a job directly via Globus to such a node produced an
error message like this one:
/var/spool/PBS/mom_priv/jobs/333595.pbs-.SC: cannot create temp
file for here document: No space left on device
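A simple guard against this failure mode is to check /tmp usage on
each node, e.g. inside the ssh loop shown earlier. A sketch (the
function name and the 90% threshold are illustrative choices, not
from the original text):

```shell
# Report whether the /tmp partition on this node is dangerously full.
# df -P gives stable POSIX output; field 5 of line 2 is the use
# percentage, e.g. "42%".
check_tmp() {
    usage=$(df -P /tmp | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
    if [ "$usage" -ge 90 ]; then
        echo "WARN: /tmp is ${usage}% full"
    else
        echo "OK: /tmp is ${usage}% full"
    fi
}
check_tmp
```

Running this on every WN quickly identifies the single bad node that
can make all jobs at a site fail.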
- Another possibility is that the job has been killed before it
finished.
From Steve Traylen (experienced at RAL)
"It was a mistake in the merge of the two schedulers that we did
last week. One of the schedulers had some global defaults on the
pcput (CPU time per process) which had not been overridden in the
LCG queue settings."
causing the following output in the batch worker log:
limit 108000 20041209:12/09/2004 05:47:35;0008;
killing pid 2365 task 1 with sig 15
- A rather interesting case was experienced in Toronto during the
Summer of 2004: the CE had been under a large load for some days,
and when that happens ssh or scp processes tend to time out.
On that occasion the solution was to increase the maximum number of
unauthenticated ssh connections from the default of 10 to 30, which
allowed the scp processes to succeed.
Note however that if the CE is suffering under a very heavy
load due to too many globus/perl processes, this probably won't help
much as there are no resources available! In general the load should
not be so excessive as to prevent transfers.
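The sshd setting involved here is most likely MaxStartups in
/etc/ssh/sshd_config (an assumption - the original text does not
name the directive), which caps the number of concurrent
unauthenticated connections:

```
# /etc/ssh/sshd_config
# Raise the limit on concurrent unauthenticated ssh connections
# from the default of 10 to 30, then restart sshd.
MaxStartups 30
```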
- A mismatch between the callback ports being sent from a CE to the
RB and the ports actually open for incoming connections in the
firewall, resulting in the RB's return calls being blocked (RAL).
- At BUDAPEST the underlying cause of Maradona was a problem with
the batch system: under heavy load it was killing user jobs.
In effect, LHCb experienced that their jobs got killed 2 seconds
after they started - and no output was produced.
- At TRIUMF another cause was:
the PBS server uses the hostname in the job ID (lcgce01.triumf.ca),
and this is used in the PBS scp statement to copy stdout and stderr
to the PBS server.
However the WNs are on a private network. This is correctly(?)
routed (through the lcfg machine, actually) to the WAN, but the scp
connection is refused because it looks like it came from the lcfg
node.
The fix was to change the routing tables on the WNs to use
lcgce01.lcg (the internal network interface) as the gateway for
that destination.
- Another possible explanation is that the site administrator has
cleaned up the logging of undelivered stdout and stderr (for
whatever reason).
- Recently we also discovered a correlation between the use of the
BQS batch system and the massive presence of this error at the site
(Lyon, February).
- A clock skew on the WN is another possible cause, as experienced
at the site Lancs-LCG2 in April 2005.
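A quick way to spot such skew is to compare a WN's clock with the
local one. A sketch (the helper name and the 60-second tolerance are
illustrative assumptions):

```shell
# skew_ok REMOTE_EPOCH: succeed if the remote clock (seconds since
# the epoch, e.g. obtained with: ssh root@$WN date +%s) is within
# 60 seconds of the local clock.
skew_ok() {
    remote=$1
    now=$(date +%s)
    diff=$((now - remote))
    # strip a leading minus sign to take the absolute value
    [ "${diff#-}" -le 60 ]
}
```

Example use: skew_ok "$(ssh root@$WN date +%s)" || echo "skew on $WN"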
- A reboot of the CE (for whatever reason) implies losing the
connections with the WNs, and hence losing track of the jobs
running there.
- CNAF (October 2005): WN in kernel panic status due to AFS.