Report from Roberto Santinelli
==============================

Overview
--------

We performed tests of T&S (Tank & Spark) at different times and at
different sites. Preliminary functional tests were performed on the EIS
machines. The same setup (1 CE, 1 SE and 2 WNs) was used later on to allow
for a more detailed monitoring of the performance of the services, as well
as for debugging activities and for checking the development environment.

Subsequent tests were performed on our first "guinea pig" site in Pisa.
These tests (dated August 2004) proved that all basic T&S functionalities
were in place. However, that version of T&S suffered from a memory leak,
which the test exposed after a few days of running: the memory of the
insecure service, receiving connection requests every 5 minutes from 9
different WNs (and 2 VOs), grew from the initial 3 MB up to 10 MB with a
linear trend of 500 KB per day. Afterwards the service itself crashed.

Since then we have improved the mechanism, adding further features
(described on the T&S web page,
http://grid-deployment.web.cern.ch/grid-deployment/eis/docs/configuration_of_tankspark)
and fixing this problem. We then released version 2.0.3 and let the
Legnaro site (70 WNs, 5 VOs) install it.

August-2005 Legnaro tests
*********************************************

Test setup: Legnaro farm, 70 worker nodes
-----------------------------------------

Tank server (CE):  dual Xeon 2.4 GHz, 2 GB RAM
Rsync server (SE): dual Xeon 2.8 GHz, 4 GB RAM
Spark (WN):        5 IBM/Intel Blade Centers, each with 14 dual Xeon
                   nodes; CPU clocks from 2.4 to 3.0 GHz, 2 GB RAM per node
Network:           each Blade Center has one GE uplink towards the central
                   switch (serving its 14 nodes), although each node has
                   its own GE network adapter

The site was configured with the Experiment Software Area shared among all
the WNs.

Functionality tests
-------------------

We tested the full installation process, from the job submission to the
final notification via e-mail from the server. We installed 1.4 GB of
software (Geant4) through an installation job, and afterwards we received
the notification from the server that all 70 nodes had upgraded
successfully. This means that all the steps of the propagation worked
fine: all communications between WN/CE/SE succeeded, and many of the
functionalities are in place (flag publication in the InfoSys, mail
notification with all information correctly collected, local tag, DB
update, permission checks, server-side logs, and so on).

For instance, this is the output of a query against the Tank DB:

mysql> select * from flags;
+-----------------+--------+----------+----------------+----------------+----------------------------+--------------------------------------+---------+
| value           | status | owner    | add_time       | rem_time       | e_mail                     | guid                                 | counter |
+-----------------+--------+----------+----------------+----------------+----------------------------+--------------------------------------+---------+
| lcg-utils-2.5.6 | OK     | dteamsgm | 20050719095706 | 00000000000000 | roberto.santinelli@cern.ch | 8345b0d8-7f56-45ed-acea-76046036005b |   11111 |
+-----------------+--------+----------+----------------+----------------+----------------------------+--------------------------------------+---------+

We also performed a removal process, with the same good results.

This test also stressed the system because, due to a bug in the
documentation, the flag directory was set to local on each node, although
the WNs did share the experiment software area. As a consequence, at each
new installation (or removal) each WN requested an upgrade and contacted
the rsync server, causing a very CPU-intensive checksum process on the
server side. We did not measure the CPU consumed at that time.
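As an aside, whether the flag directory is really shared can be verified
with a simple probe. This is only an illustrative sketch: the node names
(wn01, wn02) and the flag path (/opt/flags) are assumptions to be adapted
to the local setup:

# Create a marker file in the flag directory of one WN
# (wn01/wn02 and /opt/flags are placeholder names).
ssh wn01 'touch /opt/flags/.shared_probe'

# If the same file is visible from a second WN, the flag directory is
# shared; otherwise each node keeps its own copy and will request its
# own synchronization against the rsync server.
ssh wn02 'test -f /opt/flags/.shared_probe && echo shared || echo local'

# Clean up the marker.
ssh wn01 'rm -f /opt/flags/.shared_probe'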
The functional tests at Legnaro also proved backward compatibility with
the old way of installing software (not using T&S). The site was indeed in
a production state, serving real job execution requests.

Alongside these functional tests we can also count other tests performed
on the EIS testbed, where we had more control. These tests simulated
possible failures during the propagation/installation: NFS area not
correctly mounted, lost connection with the server, server down, WN down,
and all the "dangerous" situations one can run into. In all these extreme
cases T&S reacted correctly. We also ran concurrent installation processes
and studied the response of the service: everything was OK.

Reliability, stability and scalability
--------------------------------------

The service was up for more than 40 days under the same load (70*5
requests every 10 minutes) without any memory growth (see below). The
system scales well from a 2-node to a 70-node topology (from 4 MB to 9 MB
of memory usage), and all the installation requests performed were
successful.

Performance tests
-----------------

We also measured performance, focusing on a few metrics. We measured the
memory and CPU usage on the server side for both the secure and the
insecure service. After one month of running, the two processes showed the
following:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 4738 edguser   15   0  6872 6848  3688 S     0.0  0.3   0:00   0 lcg-tank

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 4746 edguser   15   0  9500 9476  2820 S     0.0  0.4   5:15   0 lcg-utank

At startup the memory usage was 9.5 MB for lcg-utank and 2.5 MB for
lcg-tank. The CPU load was negligible (0.1%).

We also measured the status of an rsync server machine while it was
serving a request for a 1.4 GB synchronization. This was possible on the
EIS testbed. At the same time we measured the time needed to synchronize
this amount of data (divided into many small files). The process took
slightly more than 4 minutes:

time command> wrote 1328048924 bytes  read 716631 bytes  5120483.83 bytes/sec
total size is 1324988251  speedup is 1.00
creating the file /opt/flags/dteam/direct_test

real    4m30.885s
user    1m10.430s
sys     0m29.420s

Meanwhile, on the server side (a dual-processor PIII 800 MHz) the rsync
process reached a CPU usage of 16% and 6.4 MB of memory:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
26448 dteamsgm  25   0  6476 6476   804 S    17.2  1.2   0:01   0 rsync

Outstanding issues
------------------

We still need to understand why the memory usage of the secure service
grows with the number of requests. Strong suspicion falls on the GSI-gSOAP
plugin used for authentication, because apart from that the code of the
secure service does not differ much from that of the insecure one.

Fixed problems
--------------

The experience at Legnaro highlighted some problems in the documentation,
as well as a bug in the startup script, which is also used to renew the
credential used by the secure service through the host certificate.
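The report does not detail the renewal command itself; a minimal sketch of
how a service credential could be refreshed from the host certificate,
using the standard Globus grid-proxy-init tool, might look like the
following (the output path /tmp/x509up_tank and the 24-hour lifetime are
assumptions, not taken from the actual script):

# Hypothetical renewal step, run periodically from the startup script or
# from cron: derive a fresh proxy for the service from the host
# certificate and key in the standard grid-security locations.
grid-proxy-init \
    -cert /etc/grid-security/hostcert.pem \
    -key  /etc/grid-security/hostkey.pem \
    -out  /tmp/x509up_tank \
    -hours 24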
Preliminary conclusions
-----------------------

The tests we have performed showed a generally good behavior of the T&S
service. It scales very well up to 70 WNs. The service is stable and
reliable enough for its task, and its load and memory consumption are not
significant.

We have not fully tested the rsync load, because Legnaro provided WNs
sharing the software area. We nevertheless proved that the checksum for 70
nodes and 1.4 GB of data in small files works fine. We should check what
happens with a non-shared file system: in that case the 1.4 GB is also
shipped to 70 different nodes, generating a network storm locally in the
farm. Even though T&S handles this storm by allowing a scheduled,
node-by-node synchronization (see the sketch after the list below), we
cannot foresee what happens with big farms and large amounts of data to be
synchronized among their WNs. A more limited experience at the Pisa site
(and on the EIS testbed as well) has shown, however, that at least with
about 10 machines such a "storm" is not critical for the site.

Next steps
----------

o Network and rsync stress tests on a non-shared farm with O(100) nodes.
o Functional and performance tests on a very big site, O(1000) nodes.
o Stress tests on a very big site.
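To illustrate the scheduled, node-by-node synchronization mentioned in the
conclusions, here is a minimal sketch of the idea: the WNs of a non-shared
farm are synchronized one at a time against the rsync server, so that the
server never checksums and ships data to more than one node at once. The
host names (wn01...wn70, se.example.org), the rsync module (sw_area) and
the target path (/opt/exp_soft) are all hypothetical, and this is not the
actual T&S scheduling code:

#!/bin/sh
# Serialize the synchronization of 70 non-shared WNs: each node pulls
# the software area from the rsync server in turn, so the server load
# and the farm-local network traffic stay bounded.
for wn in $(seq -f "wn%02g" 1 70); do
    ssh "$wn" "rsync -az rsync://se.example.org/sw_area/ /opt/exp_soft/" \
        || echo "synchronization failed on $wn"
done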