A QUICK START WITH TANK & SPARK

The purpose of this document is to describe how to manually install, configure and switch on the "Tank & Spark" system on a site, how to test the installation and, finally, how to remove it from a site and/or turn the system off.
The page should also help site administrators or volunteers who want to try out the mechanism to obtain the latest release of this software, which is under continuous development. Feedback or suggestions can be addressed to the authors (see the contact at the end of the document).

  1. Manual Installation of Tank & Spark

  2. Manual Configuration of Tank & Spark

  3. Testing the installation

  4. Shutting down the Tank service

  5. More documentation about Experiment Software Installation on LCG-2


  1. Manual Installation of Tank & Spark

    The mechanism is distributed as a series of rpms that have to be installed on the CE, on each WN and on the SE.
    The "official" version has the following well-known limitations:

    1. Rsync is not configurable (873 must be the port and /etc/rsyncd.conf the configuration file).
    2. Rsync does not allow for an authenticated connection and requires the VO-dedicated repository areas (on the SE) to be world-writable.
    3. The server shows a growth in memory usage and has a limit on the total number of connections allowed.
    The rpms available on this page fix these limitations.

    N.B. Starting from version 2.0-4 we no longer support the RH7.3 platform.

    There are in total four different rpms to download and install (an example installation command line is given after the list):

    No architecture rpm
    1. lcg-ManageSoftware-2.0-3.noarch.rpm
    to be installed on each WN (it installs lcg-ManageSoftware and lcg-ManageVOTag).
    and, depending on which operating system one is using:
    RH7.3 rpms
    2. lcg-tank-gcc32dbg-2.0-3.i386.rpm
    to be installed on the Computing Element (it installs the Tank service)

    3. lcg-spark-gcc32dbg-2.0-3.i386.rpm
    to be installed on each WN (it installs the Spark client).

    4. lcg-tankspark-conf-2.0-4.i386.rpm
    to be installed on CE, SE and each WN (it installs the configuration scripts used at the next step, Manual Configuration of Tank & Spark).


    or
    SLC3 rpms
    2. lcg-tank-gcc32dbg-2.1-1_sl3.i386.rpm
    to be installed on the Computing Element (it installs the Tank service)

    3. lcg-spark-gcc32dbg-2.1-1_sl3.i386.rpm
    to be installed on each WN (it installs the Spark client).

    4. lcg-tankspark-conf-2.1-1_sl3.i386.rpm
    to be installed on CE, SE and each WN (it installs the configuration scripts used at the next step, Manual Configuration of Tank & Spark).
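
    As a minimal sketch, the installation of the SLC3 rpms listed above could look like this (run as root from the directory where the rpms have been downloaded; the RH7.3 case is analogous with the corresponding file names):

      bash> rpm -ivh lcg-tank-gcc32dbg-2.1-1_sl3.i386.rpm lcg-tankspark-conf-2.1-1_sl3.i386.rpm                                        # on the CE
      bash> rpm -ivh lcg-ManageSoftware-2.0-3.noarch.rpm lcg-spark-gcc32dbg-2.1-1_sl3.i386.rpm lcg-tankspark-conf-2.1-1_sl3.i386.rpm   # on each WN
      bash> rpm -ivh lcg-tankspark-conf-2.1-1_sl3.i386.rpm                                                                             # on the SE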


    Dependencies
    Tank needs:

    CGSI_gSOAP_2.3 >= 1.1.2
    MySQL-client >= 4.0.13
    MySQL-server >= 4.0.13
    MySQL-shared >= 4.0.13
    mysql++_1.7.9_mysql.4.0.13__LCG_rh73_gcc32
    rpmlib(PayloadFilesHavePrefix) <= 4.0-1
    rpmlib(CompressedFileNames) <= 3.0.4-1

    Spark needs:

    CGSI_gSOAP_2.3 >= 1.1.2
    rsync >= 2.5.7
    rpmlib(PayloadFilesHavePrefix) <= 4.0-1
    rpmlib(CompressedFileNames) <= 3.0.4-1
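
    Before installing, you can check whether the main dependencies are already satisfied, for instance (the package names are the ones listed above):

      bash> rpm -q CGSI_gSOAP_2.3 MySQL-client MySQL-server MySQL-shared rsync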




  2. Manual Configuration of Tank & Spark

    The three different components to be configured (Tank, Spark and Rsync) need just one configuration file:

    $LCG_LOCATION/etc/tankspark/lcgtankspark.conf

    Once you have installed the RPMs, you have to modify this file on the CE where the Tank service runs, on each WN where the Spark client runs and on the SE where the rsync server runs. If you did not install through LCFG (which takes care of creating this configuration file for you on the basis of the information you provided through the LCFG objects), you have to modify it by hand. This is an example of the configuration file:

    se=pcitgdeis569.crn.ch
    edgVarLoc=/opt/edg/var
    rsyncport=873
    rsyncuser=tango
    ce=lxb0706.cern.ch
    vo=cms alice atlas lhcb dteam
    rsyncrep=/opt/repository
    dbuser=tank
    dbpasswd=lcg_test
    tankconf=/opt/lcg/etc/tank.conf
    rsyncconf=/etc/rsyncd.conf
    sparkconf=/opt/lcg/etc/spark.conf
    flagdir=/opt/flags
    expsoftdir=.
    explocdir=/opt/ext_soft
    sitename=lxb0706
    ldapport=2170
    afsprincipal=
    lifetime=25

    Let us now explain these parameters in more detail:

    "se" is the storage element where r-rync is running. This is the machine in which the experiment software is centrally stored for each VO.

    "rsyncport" is the port number used by the rsync daemon (D=873).

    "rsyncuser" is the user to be used by the client to be authenticated on the rsync server

    "ce" is the Computing Element in which Tanks runs

    "vo" is a list of VOs supported by the site and separated by blank space

    "rsyncrep" is the root directory (for all the experiments on the SE) of the central software repository.

    "dbuser " is the user used by Tank to connect it self to the mysql DB (wn_list).

    "dbpasswd" is the password used by Tank to connect it self to the mysql DB (wn_list) and by every WN in order to connect to the rsync server as "rsyncuser" user

    "tankconf" is the path of the configuration file (created automatically) which is used by Tank

    "rsyncconf" is the path of the configuration file used by the r-sync daemon.

    "sparkconf" is the path of the configuration file used by spark (please leave these last three fields to their defaults)

    "flagdir" is the path in which spark will save the tag files used to track which version of software are installed locally. This is the path for all the vos. Each VO will have its own sub directory with the right ownership.
    If the node is sharing the experiment software area you have to provide this flag area visible for all nodes.
    It means that - if for instance the experiment software is under /opt/exp_soft/some_vo - you have to create under /opt/exp_soft a subdirectory called whatever you want (we suggest flags) and this is the value for flagdir attribute. Each VO will have its own subdirectory under flagdir.
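
    As a sketch, assuming a software area shared under /opt/exp_soft and the dteam VO (whose software manager account is dteamsgm), the flag area could be prepared by hand like this:

      bash> mkdir -p /opt/exp_soft/flags/dteam
      bash> chown dteamsgm:dteam /opt/exp_soft/flags/dteam

    and flagdir would then be set to /opt/exp_soft/flags.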

    "expsoftdir" must be set to "." if NO shared file system is there;
    otherwise must be set as the root experiment software dir (common for all the experiments) (for instance /opt/ext_software)

    "explocdir" is the local experiment software root dir common to all the VOs. Specific vo sub directories will be created automatically. This directory *MUST* be the same as "expsoftdir" in case of shared filesystem on the WN.

    "sitename" name of the CE

    "ldapport" port to be used to query the ldapserver on the sitename

    "edgVarLog" is the value of the EDG_VAR_LOC variable on the CE

    "afsprincipal" is the name of the server in which runs the gssklogd daemon used for the conversion of GSI credentials into AFS Kerberos tokens; leave it blank if no AFS shared file system is there (almost always)

    "lifetime" is the lifetime of generated AFS tokens.


    After having filled in this file on each machine, you are ready to start.
    At this initial stage you have, however, to take into account whether the farm on which you are going to start Tank & Spark already has the Experiment Software installed or not.

    In the former case you need to perform some extra actions manually**:
    1. For each "vo_name" you have to duplicate the content of $VO_EXP_SW_DIR under the rsyncrep/"vo_name" directory on the rsync machine (i.e. the Storage Element); see the example below. This will prevent accidental loss of data through the automatic rsync mechanism.
    2. You have to update the Tank database manually, adding entries into the "flags" table according to the information published by the site.
    **The authors can help you with these operations until more automatic tools are available.
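
    As an example of step 1, assuming the dteam VO and the repository path of the example configuration above, the copy could be done with something like (the SE host name is a placeholder; any method that preserves the directory content is equally fine):

      bash> rsync -av -e ssh $VO_DTEAM_SW_DIR/ <your_SE_host>:/opt/repository/dteam/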
    From now on we describe how to switch on the whole mechanism. For each component you have to do the following:

    Tank

    1. Run the command

      bash> $LCG_LOCATION/etc/tankspark/lcfg-tank.sh $LCG_LOCATION/etc/tankspark/lcgtankspark.conf

      Tank is almost completely installed!

      It will perform the following actions:

      1. Create a file "command.sql" that has to be executed later on (step2)
      2. Create the configuration file used by the server application (on the basis of the information set on the lcgtankspark.conf file)
      3. Modify the /etc/sudoers file in order to let edguser do the chown command
      4. Install a new cronjob that renews every 6 hours the server credentials
      5. Install a watch_dog cronjob
      6. Start the services
    2. Run the command

      bash> mysql -u root -p < $LCG_LOCATION/etc/tankspark/command.sql
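
    You can then check that the service has actually started (the same check is used in the tests below):

      bash> /opt/lcg/sbin/tank status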

    Spark

    1. Run the command

      bash> $LCG_LOCATION/etc/tankspark/lcfg-spark.sh $LCG_LOCATION/etc/tankspark/lcgtankspark.conf

    Spark is now installed!

    The command will perform the following actions:

    1. Create the local software directory for each VO, whose root is set in the configuration file
    2. Create the local flag directory for each VO (whose root has been set in the configuration file). This directory contains the flag files indicating which versions are installed on the WN.
    3. Create the configuration file used by the client application (usually $LCG_LOCATION/etc/spark.conf) with the info provided through the general configuration file
    4. Install for each VO the cronjob that every 10 minutes will contact the server (in a non-secure way) and query (and eventually upgrade) the local experiment software area; see the example below.
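
    Instead of waiting for the first cronjob run, a synchronisation can be triggered by hand as the VO software manager; with the VO and the CE of the example configuration above this would be:

      bash> /opt/lcg/sbin/lcg-asis-client.sh dteam lxb0706.cern.ch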

    R-sync

    1. Run the command

      bash> $LCG_LOCATION/etc/tankspark/lcfg-rsync.sh $LCG_LOCATION/etc/tankspark/lcgtankspark.conf

    ...and rsync is ready!

    1. Create the repository directory for each VO (whose root has been set in the configuration file)
    2. Create the configuration file used by rsync (usually /etc/rsyncd.conf)
    3. Start rsync as a daemon listening on port 873 (by default); you can verify it as shown below
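
    A quick check that the daemon answers is to list the exported modules from any WN (replace the placeholder with your SE host name):

      bash> rsync <your_SE_host>::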


  3. Testing your installation

    There are basically a few tests to see whether everything has been correctly installed. We propose here some basic functional tests and possible solutions in case of problems.

    1. Test A

      After the installation of both the CE and the WNs, within a while (30 minutes at most) you should see all the nodes of the site registered in the MySQL DB. This happens automatically if the installation is OK.

      1. [root@a_CE root]#> mysql -u DBuser -p wn_list

        Enter password: < DBpasswd

      2. mysql> select * from hosts;
      and the output will look like:

      | lxshare0203 | 128.142.65.180 | ON | 20050414102004 | 20050314151123 | 00000000000000 | NORMAL |
      | lxb0708.cern.ch | 128.142.65.23 | OFF | 20050401121006 | 20050314151503 | 20050401124124 | NORMAL |
      2 rows in set (0.00 sec)
      If you cannot see anything in this table, make sure the daemons lcg-tank and lcg-utank are running on the CE (/opt/lcg/sbin/tank status).
      If the daemon is running, check whether the table monitors in the wn_list DB (mysql) is filled correctly with all the VOs your site supports.
      Check whether the cronjobs are correctly installed on each WN for each ESM user.
      Check whether these cronjobs are pointing to the right CE machine.
      Check on the CE and on the WNs the existence of /opt/lcg/etc/tank.conf and /opt/lcg/etc/spark.conf respectively.
      Check whether the fields in these files are correctly set for your site.
      Check whether there are old (from a previous installation) flags named < hostname > on each WN under the corresponding "flagdir" directory. In this case the tool will not write to the DB even if the hosts table is empty. The first few checks can be scripted, as in the sketch below.
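
      For instance, these checks can be done quickly as root (dteamsgm is just one example of an ESM account; the paths are the ones used above):

      /opt/lcg/sbin/tank status                              # on the CE: are lcg-tank and lcg-utank running?
      crontab -l -u dteamsgm | grep lcg-asis-client.sh       # on a WN: is the cronjob installed and pointing to the right CE?
      ls -l /opt/lcg/etc/tank.conf /opt/lcg/etc/spark.conf   # on the CE and on a WN respectively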

      Check whether you can run manually (from a WN, as dteamsgm) the command line that the cronjob invokes every 10 minutes, like for instance:

      /opt/lcg/sbin/lcg-asis-client.sh dteam lxb0706.cern.ch

      If you succeed in running the command manually and you see output like this:
      /opt/flags/dteam/
      /opt/exp_soft/dteam/
      /opt/flags/dteam/cmsfarmbl01.lnl.infn.it no present
      host not registered: upgrading functionality called
      Using the configuration file: /opt/lcg/etc/spark.conf***************
      host is http://t2-ce-02.lnl.infn.it:18084
      action is :upgradehost
      the vo used is :dteam
      ###############################################
      ##### Welcome to the spark client program #####
      ###############################################
      #### action is : upgradehost###########
      We are going to contact the server : http://t2-ce-02.lnl.infn.it:18084
      No updates found for this node.

      then the problem is in the syntax of the cronjob itself. This could be the case for tcsh/csh accounts.
      If the output shows that the server cannot be contacted (gSOAP error), then there is a problem in the communication, which might need further investigation (do you have port numbers 18084 and 18085 open?).

    2. Test B

      This test relies on the full machinery of the Experiment Software Installation. You have to be sure that lcg-ManageSoftware-2.0.1 is installed at your site.

      1. Write these JDLs (prepend the host name of your CE to the GlueCEUniqueID value in the Requirements expression):

        install.jdl
        Executable = "/opt/lcg/bin/lcg-ManageSoftware";
        InputSandbox = {"install_sw","validate_lcg","uninstall_sw"};
        OutputSandbox = {"stdout", "stderror"};
        stdoutput = "stdout";
        stderror = "stderror";
        Arguments = "--install --validate --validate_script validate_lcg --vo dteam --tag lcg-utils-4.3 --notify ";
        Requirements = other.GlueCEUniqueID == ":2119/jobmanager-pbs-long";
        uninstall.jdl
        Executable = "/opt/lcg/bin/lcg-ManageSoftware";
        InputSandbox = {"install_sw","validate_lcg","uninstall_sw"};
        OutputSandbox = {"stdout", "stderror"};
        stdoutput = "stdout";
        stderror = "stderror";
        Arguments = "--uninstall --vo dteam --tag lcg-utils-4.3 --notify ";
        Requirements = other.GlueCEUniqueID == ":2119/jobmanager-pbs-long";

      2. Write the scripts that are usually provided by the experiments:
        install_sw
        #!/bin/bash
        export TAR_LOC=`pwd`
        wget http://grid-deployment.web.cern.ch/grid-deployment/eis/docs/lcg-util-client.tar.gz
        cd $MAINPATH
        mkdir lcg-utils-4.3
        cd lcg-utils-4.3
        echo 'running the command : tar xzvf $TAR_LOC/lcg-util-client.tar.gz'
        tar xzvf $TAR_LOC/lcg-util-client.tar.gz
        exit $?
        uninstall_sw
        #!/bin/bash
        export TAR_LOC=`pwd`
        cd $MAINPATH
        rm -rf lcg-utils-4.3
        exit 0
        validate_lcg
        #!/bin/bash
        cd $MAINPATH
        cd lcg-utils-4.3
        if [ -f README ]; then
        exit 0
        else
        exit -1
        fi
      3. submit the job : edg-job-submit --vo dteam install.jdl

        In case of:

        SUCCESS

        1. You will receive an e-mail with a short report, node by node.
        2. The command lcg-ManageVOTag --list -vo dteam -host < your site> will return the tag VO-dteam-lcg-utils-4.3 published in the Information System.
        3. In the mysql database you see a new row in the table wn_list.flags whose value is lcg-utils-4.3.
        4. On each WN, in the "explocdir"/dteam directory, you have a new directory called lcg-utils-4.3; the same under the "rsyncrep"/dteam directory on the Storage Element.

        FAILURE

        The test has to be considered failed if one of the above points is not observed!

        Possible workarounds in the light of the results of TEST A (auto-registration of hosts):

        1. Is the directory (on the CE) /opt/edg/var/info/dteam owned by dteamsgm:dteam?
        2. Is the cronjob for the ESM of the VO you used in your test running correctly? Check exactly whether /opt/lcg/sbin/lcg-asis-client.sh runs against the right service and against the right VO on the WNs.
        3. Are the local installation directory and the local flag directory set appropriately? Check the file /opt/lcg/etc/spark.conf for this VO. Check the ownership of these directories and check that they are valid paths.
        4. Is the hostname of the disk repository the wrong one? Or is the hostname correct but the machine not properly configured, or rsync not running there? Is there enough disk space? Are the user and the password set on the WNs (rsyncuser and dbpasswd) the same as the ones set in the /etc/rsyncd.secrets file on the SE? Is the port number the same on the SE and the WNs?
        5. If the failure is only partial (for instance the tag published in the Information System is VO-dteam-lcg-utils-4.3-processing-validate) and you receive the e-mail, it surely means that sudo is not correctly configured on the CE. Add the line:

          edguser ALL = NOPASSWD: /bin/chown

          to the file /etc/sudoers if the file exists; otherwise install sudo and add this line!

        6. If the tag published is VO-dteam-lcg-utils-4.3-aborted-validate AND you receive the report via e-mail, it means that the system has been installed correctly and the e-mail you receive will allow you to debug the problem node by node (maybe too little local disk space or some other local WN problem).
        7. A certain number of problems come from rsync. If you see (in your job output) a line like:
          1. rsync error: some files could not be transferred (code 23) at main.c(620):
            The local path (to be synchronised) does not exist or points to a directory that is not accessible on the WN. Check your /opt/lcg/etc/spark.conf file to see where the vo_locdir points.
          2. Connection refused rsync error: error in socket IO (code 10) at clientserver.c(83) :
            The rsync server is not running on the SE. Log in to the SE as root and run /etc/init.d/rsyncd start.
          3. @ERROR: auth failed on module ... :
            The rsync server has not authorized the user. Check the /etc/rsyncd.secrets file and the /etc/rsyncd.conf file on the SE and compare these values with the corresponding /opt/lcg/etc/tank.conf (on the CE) for both password and remote_user.
          4. @ERROR: Unknown module :
            The rsync server did not find the module. Check in /opt/lcg/etc/tank.conf (on the CE) the corresponding remote_module field.
      4. edg-job-submit --vo dteam uninstall.jdl

        1. It will remove the TAG from the Information System (if the flavour is not aborted-install or processing-install).
        2. It will clean up the VO_DTEAM_SW_DIR on each WN and the central repository on the SE. In the mysql wn_list.flags table the status of the lcg-utils-4.3 row becomes 'DEL'.
        No failures are expected for step 4 if step 3 ran successfully! You can double-check the result as shown below.
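
        To double-check the clean-up directly in the database, in the same way as in Test A:

          [root@a_CE root]#> mysql -u DBuser -p wn_list
          mysql> select * from flags;

        and verify that the row for lcg-utils-4.3 is now marked 'DEL'.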

  4. Shutting down the Tank service

    There is no script that switches off the service on the Computing Element. Nevertheless, a site administrator who does not want to keep the service running on his Computing Element can follow this recipe:

    1. log on to the Computing Element as root
    2. run the command: $LCG_LOCATION/sbin/tank stop
    3. modify the crontab by removing (or commenting out) the following jobs (see the sketch after the list):

      - 19 2,8,14,20 * * * /opt/lcg/sbin/tank proxy > /dev/null 2>&1 >>tmp1.$$
      - 0-59/5 * * * * /opt/lcg/sbin/tank watch_dog > /dev/null 2>&1 >>tmp1.$$
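
    One possible way to drop both entries at once (a sketch; review the output of crontab -l before applying it) is:

      bash> crontab -l | grep -v '/opt/lcg/sbin/tank' | crontab -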

    In this way Tank is OFF.
    There could, however, be the need to switch off the cronjobs on the WNs for all VOs.
    To do that you have to:

    1. loop over the WNs of the site and log in as root
    2. on each WN remove, for each VO, the cronjob (see the sketch below):

      0-59/10 * * * * /opt/lcg/sbin/lcg-asis-client.sh < vo> < your_ce>
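
    Assuming, as suggested by the checks in Test A, that this cronjob lives in the crontab of each ESM account, it could be removed for instance for the dteam ESM with:

      bash> crontab -l -u dteamsgm | grep -v lcg-asis-client.sh | crontab -u dteamsgm -

    to be repeated for each supported VO.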


  5. More documentation about Experiment Software Installation on LCG-2

    1. Original Requirements

    2. FAQ (in Italian)

    3. Software Installation General Procedure

    4. Tank & Spark in a Nutshell

    5. More documentation about Tank & Spark

    6. lcg-asis available there

    7. Recent results from tests performed within the INFN activity ECGI

    8. Talk given in Melbourne (Dec. 2005)




    Roberto Santinelli's home page

    Roberto Santinelli