Attending: Duncan, Elena, Gareth, Jens, John B, John H, Sam, Matt D, Winnie, David, Luke, Ewan, George V.
Apologies: Tom, Alastair, Brian

0. Operational blog posts

- Including problems or requests from the usual LHC VOs, if any
John H had upgraded a pool node to DPM 1.8.10 (from 1.8.8) using puppet, and found that a checksum operation now fails, which in turn causes the transfer to fail [Sam reported in chat that it is the storing of the checksum that fails, not the calculation]. The issue might lie between a newer pool node and an older head node, or between the local puppet configuration and the DPM one - this is not yet known. John also had problems getting the DPM puppet and the xroot puppet configurations to coexist peacefully. None of these issues had been found in pre-release testing. A CMS-specific issue is supposed to be fixed in 1.8.10; however, at the moment we cannot recommend upgrading.

- Any loose ends for the QR for Pete
Please send them to Jens soon (today or tomorrow) - specifically things not already in the minutes of meetings, such as submitted publications.

1. General round-up of other issues - i.e. where are we with respect to:

- Support for DiRAC
A new site (Leicester) is about to join and needs its certificates sorting out. There are proxy problems with VOMS proxies; otherwise very long-lived plain proxies are being used, which work fine.

- and other "newish" VOs (LIGO? UKQCD?)
Some of this was reported at the UKT0 workshop (see item 2 below). Paul (LIGO) didn't really have time to work with us as he has other priorities. LIGO already have catalogues; could we work with those?

- Filedumps and syncat
https://twiki.cern.ch/twiki/bin/view/LCG/ConsistencyChecksSEsDumps#Format_of_SE_dumps
Sam intends to do some more work/testing on this. The accounting work is also being resurrected; there is a new script which is basically a cron job, which Matt fixed and ran. Traditionally there are two kinds of storage accounting record: a current-state record (like the BDII) and a delta record which records transfers over a set period (or even individual transfers). One can add up the latter but not usually the former, thus Glasgow find themselves the proud owners of several exabytes (a small illustrative sketch appears at the end of this item).

- GridFTP to CEPH
Issues reported (see mail from Alastair below). Sam reports that the issues are with dynamically setting user/role/file parameters - similar to the URLs Sebastien had used for CASTOR, which had extra information hidden in them in a non-standard (i.e. non-RFC) way. It would probably be better to use an RFC-compliant URL, since there are plenty of standard ways of passing in extra information (although one could debate how much of it should be stored as part of the SURL). Sam has been working on a compliant URL parser for GridFTP for CEPH (a rough sketch of the idea also appears at the end of this item).

- T2C and T2D proposals/testing - Oxford as a T2C?
This was suggested for the Friday technical meeting. Ewan reports not much progress, due to people being away; so far discussions have been about bringing the right people together. However, there is interest from LHCb in testing this, so we may not need CMS (Chris Brew in this instance) to test it. ACTION on Ewan to send a mail to the relevant people.

- Feedback on the E2E workshop presentation (aide-memoire for next week; Brian is on leave this week)
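On the accounting point above: the reason the current-state records cannot simply be added up is that each snapshot already includes the bytes counted in the previous one, whereas delta records cover disjoint periods. A minimal illustration follows; the figures and the record layout are invented for the example and are not the APEL schema.

    # Illustrative only: why delta records can be summed but state records cannot.
    # Record layout and numbers are made up for the example.

    state_records = [  # point-in-time snapshots of space used (bytes), one per hour
        {"timestamp": "2015-10-28T10:00Z", "used": 2_000_000_000_000},
        {"timestamp": "2015-10-28T11:00Z", "used": 2_000_000_000_000},
        {"timestamp": "2015-10-28T12:00Z", "used": 2_100_000_000_000},
    ]

    delta_records = [  # bytes written during each hour
        {"period": "10:00-11:00", "bytes_in": 0},
        {"period": "11:00-12:00", "bytes_in": 100_000_000_000},
    ]

    # Valid: the deltas cover disjoint intervals, so their sum is the total written.
    total_written = sum(r["bytes_in"] for r in delta_records)

    # Not valid: each snapshot already contains the previous one's bytes,
    # so summing them counts the same ~2 TB three times over.
    naive_total = sum(r["used"] for r in state_records)

    print("written over the period: {:.0f} GB".format(total_written / 1e9))
    print("'sum of snapshots' (meaningless): {:.1f} TB".format(naive_total / 1e12))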
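On the GridFTP-to-CEPH point: the gist of the "RFC-compliant URL" suggestion is that per-transfer settings can be carried in a standard query string and extracted with a stock parser, rather than being hidden in the path in a bespoke way. The sketch below is illustrative only - the hostname, path and parameter names are made up and are not those of the actual plugin.

    # Sketch: carrying extra per-transfer settings in an RFC 3986 query string,
    # parsed with the standard library rather than ad hoc string surgery.
    # Hostname, path and parameter names are hypothetical.
    from urllib.parse import urlparse, parse_qs

    url = ("gsiftp://gridftp.example.ac.uk:2811/cephpool/atlas/file1"
           "?user=atlas001&role=production")

    parsed = urlparse(url)
    params = parse_qs(parsed.query)

    print(parsed.scheme, parsed.hostname, parsed.port)  # gsiftp gridftp.example.ac.uk 2811
    print(parsed.path)                                  # /cephpool/atlas/file1
    print(params.get("user"), params.get("role"))       # ['atlas001'] ['production']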
2. Summary of the UKT0 workshop last week

It was generally considered a good workshop with lots of interesting presentations, which can be found here: https://eventbooking.stfc.ac.uk/news-events/uk-t0-workshop-296?agenda=1

GridPP was well represented in both directions: people from GridPP presented achievements and ongoing work, and people from other science communities and infrastructures expressed an interest in working with GridPP.

Is there interest in an S3 interface? LOFAR reported good experiences with the S3 interface to AWS (albeit with slow transfer rates?!), so Alastair suggested RAL might offer an S3 interface to the Tier 1 CEPH cluster. However, Ewan asked whether it would not be better to get over the hurdle of setting things up and getting them working, so that a fully distributed infrastructure is available, rather than poking away at data held at a specific site/endpoint. Even for new VOs, while we recommend that they initially start with one or two sites, they can then scale up to several sites later without any additional effort. The history with LOFAR is that they liked the cloudy way of doing things; they used a Spanish cloud which later joined the EGI FedCloud. They also tried the Tier 1 cloud at RAL but found it harder to use.

It is clear that many of the communities need to manage data, and that includes data transfers. While the DiRAC experience wasn't entirely smooth, it did show that it can be done, and the documentation produced for DiRAC will also help: we should see whether it really will be easier for Leicester, building on the experience from Durham. Other user communities have also expressed an interest in the transfer-data-to-RAL HOWTO.

Staying with the data transfer theme, LSST mentioned they would be replicating databases with xroot. When questioned about this, it turned out to be more of a plan than something actually operational. So why do they want to replicate with xroot rather than HTTP or something else? RAL has a laser plasma physics group writing data directly into CASTOR with xroot, but they needed command-line (and non-certificate) access to CASTOR from a cluster. This is something where we can contribute additional documentation. GlobusConnect is also an option; Jens had set up GlobusConnect endpoints at RAL, but they are probably dead now - something we could look at again.

There seems to be specific interest in a data (e)infrastructure workshop, in which GridPP would again obviously be involved. Similarly for authentication and authorisation: there is a lot of interest in single sign-on and federated identity management, and there had previously been a few discussions about how one might link DiRAC and GridPP, for example. EUDAT is also working on things like this (aka B2STAGE). We discussed whether SARoNGS could be used to generate certificates for people, and perhaps higher-LoA certificates with longer lifetimes - e.g. making SARoNGS an IGTF IOTA CA, obtaining additional attributes via REFEDS Research and Scholarship (which is not generally supported in the UK), and linking caportal (and, perhaps less likely, CertWizard) with federated identities. Some of this would require changes in the UKAMF, and some would require development in a CA which currently has little (or no) development effort available. There is also the wider perspective of AARC and activities in other projects. But one should of course ask the question, and sometimes an afternoon's focused hacking can bring excellent results...
... and SARoNGS itself is in fact alive and working (not sure about the VOMS part, which hasn't been tested recently).

We will be returning to the topic of UKT0 later: it is an opportunity for GridPP, and an area where in many ways we are leading the way.

3. There is a November GDB coming up shortly (next week): http://indico.cern.ch/event/319753/ - potentially interesting items include transfer metrics, cloud accounting (if it includes storage), possibly a T0 update(?), and maybe the HEPiX summary.

------------------------------------------------------------------------

Alastair writes:

Hi,

I will be at the extremely exciting "SCD Finance Workshop" tomorrow, so I won't be able to attend. There is something I would like to report as part of 2).

During the UK T0 meeting LOFAR mentioned how easy it was to use S3 (and other AWS services) compared to Grid tools*. It was agreed shortly after the meeting that RAL would open up access to the S3 API backed by our new Ceph cluster. Catalin has been using it (since HEPiX) for testing the CVMFS stratum 1 setup and has written ~2 million files (a few hundred GB) to the cluster.

It would be extremely useful if a few people external to RAL tested it as well (in the next few weeks), so we can ensure any firewall/network problems are solved before we let users on to it. We decided not to provide a 'dteam' account for everyone to share; instead we will provide individual user accounts on request. If you want access to the RAL S3 instance, please drop me an email. We will probably set up default quotas of 100 GB and 10,000 files. These can be increased (especially if you want to do something interesting) on request.

I should stress that this is a PRE-production service and hence we provide no guarantee on the resilience of your data. Your data is stored on new production-quality hardware; however, we do not yet have significant experience running Ceph as a service. At some point before April 2016, when we intend to declare the service production quality, we will take the entire cluster down for a major update and you WILL lose any data stored on the cluster. We will try to give notice of this upgrade.

I also note that GridFTP to CEPH is on the agenda - has somebody been specifically asked to talk about this? There are some "issues" with our plugin and I would appreciate the advice of someone familiar with GridFTP.

Alastair

* They didn't upload their talk to the agenda, but I have a copy of it if you want me to send it around.

[A sketch of a simple external test of the S3 endpoint is appended after the chat log below.]

------------------------------------------------------------------------

Chat log

John Hill: (28/10/2015 10:01:30)
ouch

Samuel Cadellin Skipsey: (10:02 AM)
That was me, I didn't hear any noise, but apparently you just got noise?
I was going to volunteer John Hill's DPM 1.8.10 checksum issue (which is now in a JIRA ticket for DPM)
For people who want more detail: the issue is not with checksum calculation, it's that the DPNS doesn't want to store it once calculated. (The transfer actually succeeds, but the command gives the error because it got an error code from checksum *saving*)

John Hill: (10:09 AM)
Also to make clear - I've updated to the DPM 1.8.10 components on WNs without problem.

John Bland: (10:14 AM)
any link for instructions?

Samuel Cadellin Skipsey: (10:14 AM)
https://wiki.egi.eu/wiki/APEL/Storage

John Bland: (10:15 AM)
thanks

Matt Doidge: (10:18 AM)
I had to edit the star-accounting.py script to get it to parse the "validduration" option correctly.
< d = datetime.datetime.utcnow() + datetime.timedelta(seconds=int(validduration))
---
> d = datetime.datetime.utcnow() + datetime.timedelta(seconds=validduration)
And the simple command line to conjure it was:
mkdir -p /var/spool/apel/outgoing/`date +%Y%m%d` && /usr/share/lcgdm/scripts/star-accounting.py --reportgroups --nsconfig=/usr/etc/NSCONFIG --site="UKI-NORTHGRID-LANCS-HEP" --validduration=86400 > /var/spool/apel/outgoing/`date +%Y%m%d`/`date +%Y%m%d%H%M%S` && ssmsend

Ewan Mac Mahon: (10:21 AM)
Don't think we've actually done anything on this, mainly just a matter of getting people together.

Lukasz Kreczko: (10:26 AM)
CMS is using 1.8 million PB at Bristol ;)

Jens Jensen: (10:26 AM)
https://eventbooking.stfc.ac.uk/news-events/uk-t0-workshop-296?agenda=1

Ewan Mac Mahon: (10:28 AM)
I think the right message is, as always, grid certs for everyone, ra, ra, ra!

Paige Winslowe Lacesso: (10:33 AM)
DID YOU REALLY SAY IT TOOK *TWO MONTHS* to xfer 50TB? I couldn't believe my ears!

Ewan Mac Mahon: (10:35 AM)
Amazon's solution for 50TB transfers is that they'll post you a box and you can courier it back to them with the data on.
https://aws.amazon.com/blogs/aws/aws-importexport-snowball-transfer-1-petabyte-per-week-using-amazon-owned-storage-appliances/

Samuel Cadellin Skipsey: (10:37 AM)
Indeed, I note that Amazon Snowball was explicitly namechecked by at least one presentation. (And, of course, sneakernet solutions are always higher bandwidth than fibres.)

Ewan Mac Mahon: (10:38 AM)
If anyone has a need for a snowball equivalent, I imagine we'd be able to come up with something suitable. It doesn't /sound/ hard. Box full of 'archive' drives set up as a classic SE, ship it to a decently connected site (RAL/Imperial?) and then FTS everything off it.

Jens Jensen: (10:53 AM)
https://cts.ngs.ac.uk/
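[Appended note, re Alastair's S3 testing request above: a rough sketch of what a minimal external test might look like, using boto3 against an S3-compatible endpoint. The endpoint URL, bucket name and credentials below are placeholders rather than the real service details - use whatever is supplied when an account is created.]

    # Minimal round-trip test of an S3-compatible endpoint (e.g. radosgw in
    # front of Ceph). Endpoint URL, bucket name and credentials are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example.ac.uk",     # placeholder endpoint
        aws_access_key_id="YOUR_ACCESS_KEY",
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

    bucket = "my-test-bucket"
    s3.create_bucket(Bucket=bucket)

    # Write a small object, read it back and check it survived the round trip.
    s3.put_object(Bucket=bucket, Key="hello.txt", Body=b"hello from outside RAL")
    body = s3.get_object(Bucket=bucket, Key="hello.txt")["Body"].read()
    assert body == b"hello from outside RAL"

    # Tidy up, to stay well within the default 100 GB / 10,000 file quotas.
    s3.delete_object(Bucket=bucket, Key="hello.txt")
    s3.delete_bucket(Bucket=bucket)
    print("round trip OK")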