Report from Roberto Santinelli
==============================

Overview
--------

We performed tests of T&S (Tank & Spark) at different times and at
different sites. Preliminary functional tests were performed on the EIS
machines. The same setup (1 CE, 1 SE and 2 WNs) was used later on to allow
for a more detailed monitoring of the performance of the services, as well
as for debugging activities and for checking the development environment.

Subsequent tests were performed on our first "guinea pig" site in Pisa.
These tests (dated August 2004) proved that all basic T&S functionalities
were in place. However, that version of T&S suffered from a memory leak,
which the test exposed after a few days of running: the memory of the
insecure service, receiving connection requests every 5 minutes from 9
different WNs (and 2 VOs), grew from the initial 3 MB up to 10 MB with a
linear trend of 500 KB per day. Afterwards the service itself crashed.

Since then we have improved the mechanism, adding further features
(described on the T&S web page,
http://grid-deployment.web.cern.ch/grid-deployment/eis/docs/configuration_of_tankspark)
and fixing this problem. We then released version 2.0.3 and let the
Legnaro site (70 WNs, 5 VOs) install it.

August-2005 Legnaro tests
*********************************************

Test setup: Legnaro farm, 70 worker nodes
-----------------------------------------

Tank server (CE):  dual Xeon 2.4 GHz, 2 GB RAM
Rsync server (SE): dual Xeon 2.8 GHz, 4 GB RAM
Spark (WN):        5 IBM/Intel Blade Centers, each with 14 dual Xeon
                   nodes; CPU clocks from 2.4 to 3.0 GHz, 2 GB RAM per node
Network:           each Blade Center has one GE uplink towards the central
                   switch (serving its 14 nodes), although each node has
                   its own GE network adapter

The site was configured with the Experiment Software Area shared among all
the WNs.

Functionality tests
-------------------

We tested the full installation process, from the job submission to the
final notification via e-mail from the server. We installed 1.4 GB of
software (Geant4) through an installation job, and afterwards we received
the notification from the server that all 70 nodes had upgraded
successfully. This means that all the steps of the propagation worked
fine: all communications between WN/CE/SE succeeded, and many of the
functionalities are in place (flag publication in the InfoSys, mail
notification with all information correctly collected, local tag, DB
update, permission checks, server-side logs, and so on).

For instance, this is the output of a query against the Tank DB:

mysql> select * from flags;
+-----------------+--------+----------+----------------+----------------+----------------------------+--------------------------------------+---------+
| value           | status | owner    | add_time       | rem_time       | e_mail                     | guid                                 | counter |
+-----------------+--------+----------+----------------+----------------+----------------------------+--------------------------------------+---------+
| lcg-utils-2.5.6 | OK     | dteamsgm | 20050719095706 | 00000000000000 | roberto.santinelli@cern.ch | 8345b0d8-7f56-45ed-acea-76046036005b |   11111 |
+-----------------+--------+----------+----------------+----------------+----------------------------+--------------------------------------+---------+

We also performed a removal process, with the same good results.

This test also stressed the system because, due to a bug in the
documentation, the flag directory was set to local on each node, although
the WNs did share the experiment software area. As a consequence, at each
new installation (or removal) each WN requested an upgrade and contacted
the rsync server, causing a very CPU-intensive checksum process on the
server side. We did not measure the CPU consumed at that time.
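As an aside, whether the flag directory is really shared can be verified
with a simple probe. This is only an illustrative sketch: the node names
(wn01, wn02) and the flag path (/opt/flags) are assumptions to be adapted
to the local setup:

# Create a marker file in the flag directory of one WN
# (wn01/wn02 and /opt/flags are placeholder names).
ssh wn01 'touch /opt/flags/.shared_probe'

# If the same file is visible from a second WN, the flag directory is
# shared; otherwise each node keeps its own copy and will request its
# own synchronization against the rsync server.
ssh wn02 'test -f /opt/flags/.shared_probe && echo shared || echo local'

# Clean up the marker.
ssh wn01 'rm -f /opt/flags/.shared_probe'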
The functional tests at Legnaro also proved backward compatibility with
the old way of installing software (not using T&S). The site was indeed in
a production state, serving real job execution requests.

Alongside these functional tests we can also count other tests performed
on the EIS testbed, where we had more control. These tests simulated
possible failures during the propagation/installation: NFS area not
correctly mounted, lost connection with the server, server down, WN down,
and all the "dangerous" situations one can run into. In all these extreme
cases T&S reacted correctly. We also ran concurrent installation processes
and studied the response of the service: everything was OK.

Reliability, stability and scalability
--------------------------------------

The service was up for more than 40 days under the same load (70*5
requests every 10 minutes) without any memory growth (see below). The
system scales well from a 2-node to a 70-node topology (from 4 MB to 9 MB
of memory usage), and all the installation requests performed were
successful.

Performance tests
-----------------

We also measured performance, focusing on a few metrics. We measured the
memory and CPU usage on the server side for both the secure and the
insecure service. After one month of running, the two processes showed the
following:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 4738 edguser   15   0  6872 6848  3688 S     0.0  0.3   0:00   0 lcg-tank

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 4746 edguser   15   0  9500 9476  2820 S     0.0  0.4   5:15   0 lcg-utank

At startup the memory usage was 9.5 MB for lcg-utank and 2.5 MB for
lcg-tank. The CPU load was negligible (0.1%).

We also measured the status of an rsync server machine while it was
serving a request for a 1.4 GB synchronization. This was possible on the
EIS testbed. At the same time we measured the time needed to synchronize
this amount of data (divided into many small files). The process took
slightly more than 4 minutes:

time command> wrote 1328048924 bytes  read 716631 bytes  5120483.83 bytes/sec
total size is 1324988251  speedup is 1.00
creating the file /opt/flags/dteam/direct_test

real    4m30.885s
user    1m10.430s
sys     0m29.420s

Meanwhile, on the server side (a dual-processor PIII 800 MHz) the rsync
process reached a CPU usage of 16% and 6.4 MB of memory:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
26448 dteamsgm  25   0  6476 6476   804 S    17.2  1.2   0:01   0 rsync

Outstanding issues
------------------

We still need to understand why the memory usage of the secure service
grows with the number of requests. Strong suspicion falls on the GSI-gSOAP
plugin used for authentication, because apart from that the code of the
secure service does not differ much from that of the insecure one.

Fixed problems
--------------

The experience at Legnaro highlighted some problems in the documentation,
as well as a bug in the startup script, which is also used to renew the
credential used by the secure service through the host certificate.
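The report does not detail the renewal command itself; a minimal sketch of
how a service credential could be refreshed from the host certificate,
using the standard Globus grid-proxy-init tool, might look like the
following (the output path /tmp/x509up_tank and the 24-hour lifetime are
assumptions, not taken from the actual script):

# Hypothetical renewal step, run periodically from the startup script or
# from cron: derive a fresh proxy for the service from the host
# certificate and key in the standard grid-security locations.
grid-proxy-init \
    -cert /etc/grid-security/hostcert.pem \
    -key  /etc/grid-security/hostkey.pem \
    -out  /tmp/x509up_tank \
    -hours 24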
Preliminary conclusions
-----------------------

The tests we have performed showed a generally good behavior of the T&S
service. It scales very well up to 70 WNs. The service is stable and
reliable enough for its task, and its load and memory consumption are not
significant.

We have not fully tested the rsync load, because Legnaro provided WNs
sharing the software area. We nevertheless proved that the checksum for 70
nodes and 1.4 GB of data in small files works fine. We should check what
happens with a non-shared file system: in that case the 1.4 GB is also
shipped to 70 different nodes, generating a network storm locally in the
farm. Even though T&S handles this storm by allowing a scheduled,
node-by-node synchronization (see the sketch after the list below), we
cannot foresee what happens with big farms and large amounts of data to be
synchronized among their WNs. A more limited experience at the Pisa site
(and on the EIS testbed as well) has shown, however, that at least with
about 10 machines such a "storm" is not critical for the site.

Next steps
----------

o Network and rsync stress tests on a non-shared farm with O(100) nodes.
o Functional and performance tests on a very big site, O(1000) nodes.
o Stress tests on a very big site.
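To illustrate the scheduled, node-by-node synchronization mentioned in the
conclusions, here is a minimal sketch of the idea: the WNs of a non-shared
farm are synchronized one at a time against the rsync server, so that the
server never checksums and ships data to more than one node at once. The
host names (wn01...wn70, se.example.org), the rsync module (sw_area) and
the target path (/opt/exp_soft) are all hypothetical, and this is not the
actual T&S scheduling code:

#!/bin/sh
# Serialize the synchronization of 70 non-shared WNs: each node pulls
# the software area from the rsync server in turn, so the server load
# and the farm-local network traffic stay bounded.
for wn in $(seq -f "wn%02g" 1 70); do
    ssh "$wn" "rsync -az rsync://se.example.org/sw_area/ /opt/exp_soft/" \
        || echo "synchronization failed on $wn"
done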