Attending: Duncan, Elena, Gareth, Jens, John B, John H, Sam, Matt D, Winnie, David, Luke, Ewan, George V.
Apologies: Tom, Alastair, Brian

0. Operational blog posts

- Including problems or requests from the usual LHC VOs, if any
John H had upgraded a pool node to DPM 1.8.10 (from 1.8.8) using puppet, and found that a checksum operation now fails, which in turn causes the transfer to fail [Sam reported in chat that it is the storing of the checksum that fails, not the calculation]. The issue might lie between a newer pool node and an older head node, or between the local puppet configuration and the DPM one - this is not yet known. John also had problems getting the DPM puppet and the xroot puppet configurations to coexist peacefully. None of these issues had been found in pre-release testing. A CMS-specific issue is supposed to be fixed in 1.8.10; however, at the moment we cannot recommend upgrading.

- Any loose ends for the QR for Pete
Please send them to Jens soon (today or tomorrow) - specifically things not already in the minutes of meetings, such as submitted publications.

1. General round-up of other issues - i.e. where are we with respect to:

- Support for DiRAC
A new site (Leicester) is about to join and needs its certificates sorting out. There are proxy problems with VOMS proxies; otherwise very long-lived plain proxies are being used, which work fine.

- and other "newish" VOs (LIGO? UKQCD?)
Some of this was reported at the UKT0 workshop (see item 2 below). Paul (LIGO) didn't really have time to work with us as he has other priorities. LIGO already have catalogues; could we work with those?

- Filedumps and syncat
https://twiki.cern.ch/twiki/bin/view/LCG/ConsistencyChecksSEsDumps#Format_of_SE_dumps
Sam intends to do some more work/testing on this. The accounting work is also being resurrected; there is a new script which is basically a cron job, which Matt fixed and ran. Traditionally there are two kinds of storage accounting record: a current-state record (like the BDII) and a delta record which records transfers over a set period (or even individual transfers). One can add up the latter but not usually the former, thus Glasgow find themselves the proud owners of several exabytes (a small illustrative sketch appears at the end of this item).

- GridFTP to CEPH
Issues reported (see mail from Alastair below). Sam reports that the issues are with dynamically setting user/role/file parameters - similar to the URLs Sebastien had used for CASTOR, which had extra information hidden in them in a non-standard (i.e. non-RFC) way. It would probably be better to use an RFC-compliant URL, since there are plenty of standard ways of passing in extra information (although one could debate how much of it should be stored as part of the SURL). Sam has been working on a compliant URL parser for GridFTP for CEPH (a rough sketch of the idea also appears at the end of this item).

- T2C and T2D proposals/testing - Oxford as a T2C?
This was suggested for the Friday technical meeting. Ewan reports not much progress, due to people being away; so far discussions have been about bringing the right people together. However, there is interest from LHCb in testing this, so we may not need CMS (Chris Brew in this instance) to test it. ACTION on Ewan to send a mail to the relevant people.

- Feedback on the E2E workshop presentation (aide-memoire for next week; Brian is on leave this week)
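On the accounting point above: the reason the current-state records cannot simply be added up is that each snapshot already includes the bytes counted in the previous one, whereas delta records cover disjoint periods. A minimal illustration follows; the figures and the record layout are invented for the example and are not the APEL schema.

    # Illustrative only: why delta records can be summed but state records cannot.
    # Record layout and numbers are made up for the example.

    state_records = [  # point-in-time snapshots of space used (bytes), one per hour
        {"timestamp": "2015-10-28T10:00Z", "used": 2_000_000_000_000},
        {"timestamp": "2015-10-28T11:00Z", "used": 2_000_000_000_000},
        {"timestamp": "2015-10-28T12:00Z", "used": 2_100_000_000_000},
    ]

    delta_records = [  # bytes written during each hour
        {"period": "10:00-11:00", "bytes_in": 0},
        {"period": "11:00-12:00", "bytes_in": 100_000_000_000},
    ]

    # Valid: the deltas cover disjoint intervals, so their sum is the total written.
    total_written = sum(r["bytes_in"] for r in delta_records)

    # Not valid: each snapshot already contains the previous one's bytes,
    # so summing them counts the same ~2 TB three times over.
    naive_total = sum(r["used"] for r in state_records)

    print("written over the period: {:.0f} GB".format(total_written / 1e9))
    print("'sum of snapshots' (meaningless): {:.1f} TB".format(naive_total / 1e12))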
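On the GridFTP-to-CEPH point: the gist of the "RFC-compliant URL" suggestion is that per-transfer settings can be carried in a standard query string and extracted with a stock parser, rather than being hidden in the path in a bespoke way. The sketch below is illustrative only - the hostname, path and parameter names are made up and are not those of the actual plugin.

    # Sketch: carrying extra per-transfer settings in an RFC 3986 query string,
    # parsed with the standard library rather than ad hoc string surgery.
    # Hostname, path and parameter names are hypothetical.
    from urllib.parse import urlparse, parse_qs

    url = ("gsiftp://gridftp.example.ac.uk:2811/cephpool/atlas/file1"
           "?user=atlas001&role=production")

    parsed = urlparse(url)
    params = parse_qs(parsed.query)

    print(parsed.scheme, parsed.hostname, parsed.port)  # gsiftp gridftp.example.ac.uk 2811
    print(parsed.path)                                  # /cephpool/atlas/file1
    print(params.get("user"), params.get("role"))       # ['atlas001'] ['production']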
2. Summary of the UKT0 workshop last week

It was generally considered a good workshop with lots of interesting presentations, which can be found here: https://eventbooking.stfc.ac.uk/news-events/uk-t0-workshop-296?agenda=1

GridPP was well represented in both directions: people from GridPP presented achievements and ongoing work, and people from other science communities and infrastructures expressed an interest in working with GridPP.

Is there interest in an S3 interface? LOFAR reported good experiences with the S3 interface to AWS (albeit with slow transfer rates?!), so Alastair suggested RAL might offer an S3 interface to the Tier 1 CEPH cluster. However, Ewan asked whether it would not be better to get over the hurdle of setting things up and getting them working, so that a fully distributed infrastructure is available, rather than poking away at data held at a specific site/endpoint. Even for new VOs, while we recommend that they initially start with one or two sites, they can then scale up to several sites later without any additional effort. The history with LOFAR is that they liked the cloudy way of doing things; they used a Spanish cloud which later joined the EGI FedCloud. They also tried the Tier 1 cloud at RAL but found it harder to use.

It is clear that many of the communities need to manage data, and that includes data transfers. While the DiRAC experience wasn't entirely smooth, it did show that it can be done, and the documentation produced for DiRAC will also help: we should see whether it really will be easier for Leicester, building on the experience from Durham. Other user communities have also expressed an interest in the transfer-data-to-RAL HOWTO.

Staying with the data transfer theme, LSST mentioned they would be replicating databases with xroot. When questioned about this, it turned out to be more of a plan than something actually operational. So why do they want to replicate with xroot rather than HTTP or something else? RAL has a laser plasma physics group writing data directly into CASTOR with xroot, but they needed command-line (and non-certificate) access to CASTOR from a cluster. This is something where we can contribute additional documentation. GlobusConnect is also an option; Jens had set up GlobusConnect endpoints at RAL, but they are probably dead now - something we could look at again.

There seems to be specific interest in a data (e)infrastructure workshop, in which GridPP would again obviously be involved. Similarly for authentication and authorisation: there is a lot of interest in single sign-on and federated identity management, and there had previously been a few discussions about how one might link DiRAC and GridPP, for example. EUDAT is also working on things like this (aka B2STAGE). We discussed whether SARoNGS could be used to generate certificates for people, and perhaps higher-LoA certificates with longer lifetimes - e.g. making SARoNGS an IGTF IOTA CA, obtaining additional attributes via REFEDS Research and Scholarship (which is not generally supported in the UK), and linking caportal (and, perhaps less likely, CertWizard) with federated identities. Some of this would require changes in the UKAMF, and some would require development in a CA which currently has little (or no) development effort available. There is also the wider perspective of AARC and activities in other projects. But one should of course ask the question, and sometimes an afternoon's focused hacking can bring excellent results...
... and SARoNGS itself is in fact alive and working (not sure about the VOMS part, which hasn't been tested recently).

We will be returning to the topic of UKT0 later: it is an opportunity for GridPP, and an area where in many ways we are leading the way.

3. There is a November GDB coming up shortly (next week): http://indico.cern.ch/event/319753/ - potentially interesting items include transfer metrics, cloud accounting (if it includes storage), possibly a T0 update(?), and maybe the HEPiX summary.

------------------------------------------------------------------------

Alastair writes:

Hi,

I will be at the extremely exciting "SCD Finance Workshop" tomorrow, so I won't be able to attend. There is something I would like to report as part of 2).

During the UK T0 meeting LOFAR mentioned how easy it was to use S3 (and other AWS services) compared to Grid tools*. It was agreed shortly after the meeting that RAL would open up access to the S3 API backed by our new Ceph cluster. Catalin has been using it (since HEPiX) for testing the CVMFS stratum 1 setup and has written ~2 million files (a few hundred GB) to the cluster.

It would be extremely useful if a few people external to RAL tested it as well (in the next few weeks), so we can ensure any firewall/network problems are solved before we let users on to it. We decided not to provide a 'dteam' account for everyone to share; instead we will provide individual user accounts on request. If you want access to the RAL S3 instance, please drop me an email. We will probably set up default quotas of 100 GB and 10,000 files. These can be increased (especially if you want to do something interesting) on request.

I should stress that this is a PRE-production service and hence we provide no guarantee on the resilience of your data. Your data is stored on new production-quality hardware; however, we do not yet have significant experience running Ceph as a service. At some point before April 2016, when we intend to declare the service production quality, we will take the entire cluster down for a major update and you WILL lose any data stored on the cluster. We will try to give notice of this upgrade.

I also note that GridFTP to CEPH is on the agenda - has somebody been specifically asked to talk about this? There are some "issues" with our plugin and I would appreciate the advice of someone familiar with GridFTP.

Alastair

* They didn't upload their talk to the agenda, but I have a copy of it if you want me to send it around.

[A sketch of a simple external test of the S3 endpoint is appended after the chat log below.]

------------------------------------------------------------------------

Chat log

John Hill: (28/10/2015 10:01:30)
ouch

Samuel Cadellin Skipsey: (10:02 AM)
That was me, I didn't hear any noise, but apparently you just got noise?
I was going to volunteer John Hill's DPM 1.8.10 checksum issue (which is now in a JIRA ticket for DPM)
For people who want more detail: the issue is not with checksum calculation, it's that the DPNS doesn't want to store it once calculated. (The transfer actually succeeds, but the command gives the error because it got an error code from checksum *saving*)

John Hill: (10:09 AM)
Also to make clear - I've updated to the DPM 1.8.10 components on WNs without problem.

John Bland: (10:14 AM)
any link for instructions?

Samuel Cadellin Skipsey: (10:14 AM)
https://wiki.egi.eu/wiki/APEL/Storage

John Bland: (10:15 AM)
thanks

Matt Doidge: (10:18 AM)
I had to edit the star-accounting.py script to get it to parse the "validduration" option correctly.
< d = datetime.datetime.utcnow() + datetime.timedelta(seconds=int(validduration))
---
> d = datetime.datetime.utcnow() + datetime.timedelta(seconds=validduration)
And the simple command line to conjure it was:
mkdir -p /var/spool/apel/outgoing/`date +%Y%m%d` && /usr/share/lcgdm/scripts/star-accounting.py --reportgroups --nsconfig=/usr/etc/NSCONFIG --site="UKI-NORTHGRID-LANCS-HEP" --validduration=86400 > /var/spool/apel/outgoing/`date +%Y%m%d`/`date +%Y%m%d%H%M%S` && ssmsend

Ewan Mac Mahon: (10:21 AM)
Don't think we've actually done anything on this, mainly just a matter of getting people together.

Lukasz Kreczko: (10:26 AM)
CMS is using 1.8 million PB at Bristol ;)

Jens Jensen: (10:26 AM)
https://eventbooking.stfc.ac.uk/news-events/uk-t0-workshop-296?agenda=1

Ewan Mac Mahon: (10:28 AM)
I think the right message is, as always, grid certs for everyone, ra, ra, ra!

Paige Winslowe Lacesso: (10:33 AM)
DID YOU REALLY SAY IT TOOK *TWO MONTHS* to xfer 50TB? I couldn't believe my ears!

Ewan Mac Mahon: (10:35 AM)
Amazon's solution for 50TB transfers is that they'll post you a box and you can courier it back to them with the data on.
https://aws.amazon.com/blogs/aws/aws-importexport-snowball-transfer-1-petabyte-per-week-using-amazon-owned-storage-appliances/

Samuel Cadellin Skipsey: (10:37 AM)
Indeed, I note that Amazon Snowball was explicitly namechecked by at least one presentation. (And, of course, sneakernet solutions are always higher bandwidth than fibres.)

Ewan Mac Mahon: (10:38 AM)
If anyone has a need for a snowball equivalent, I imagine we'd be able to come up with something suitable. It doesn't /sound/ hard. Box full of 'archive' drives set up as a classic SE, ship it to a decently connected site (RAL/Imperial?) and then FTS everything off it.

Jens Jensen: (10:53 AM)
https://cts.ngs.ac.uk/
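[Appended note, re Alastair's S3 testing request above: a rough sketch of what a minimal external test might look like, using boto3 against an S3-compatible endpoint. The endpoint URL, bucket name and credentials below are placeholders rather than the real service details - use whatever is supplied when an account is created.]

    # Minimal round-trip test of an S3-compatible endpoint (e.g. radosgw in
    # front of Ceph). Endpoint URL, bucket name and credentials are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example.ac.uk",     # placeholder endpoint
        aws_access_key_id="YOUR_ACCESS_KEY",
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

    bucket = "my-test-bucket"
    s3.create_bucket(Bucket=bucket)

    # Write a small object, read it back and check it survived the round trip.
    s3.put_object(Bucket=bucket, Key="hello.txt", Body=b"hello from outside RAL")
    body = s3.get_object(Bucket=bucket, Key="hello.txt")["Body"].read()
    assert body == b"hello from outside RAL"

    # Tidy up, to stay well within the default 100 GB / 10,000 file quotas.
    s3.delete_object(Bucket=bucket, Key="hello.txt")
    s3.delete_bucket(Bucket=bucket)
    print("round trip OK")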