Attending: Jens (chair+mins), Steve, Alessandra, John B, John H, Daniel, Marcus, Winnie, Brian, Kashif, Sam, Elena, Matt, Raja, Raul, Pete.

Apologies: Tom. Also apologies from Alastair, who was originally to have reported on HammerCloud testing but hasn't finished the results; he will report next week.

0. Operational blog posts

No new blog posts - and no operational issues.

Is the egress waiver from cloud providers for academic users interesting? Brian points out one should read the small print... Jens had recently met people from Microsoft, which is now waiving egress (from Azure) for academic users. If your result dataset is (much) smaller than the original dataset then the waiver matters less: you can just delete the original dataset and download only the result (the original dataset would be a replica anyway). Maybe there is an opportunity to do interesting things with CSPs?

1. There's a GDB next week with lots of GridPP folks speaking. http://indico.cern.ch/event/394782/

Alessandra is speaking for ATLAS but also reporting on GridPP, as will Andrew McNab, who will not just speak for LHCb.

Federated storage:
* Is it only ATLAS?
* Duncan monitored CMS use (AAA); Glasgow and QMUL run T3s for CMS, so they use AAA.
* VAC is used by ATLAS - only UCL has no storage.

Part of the question is the future evolution of the T2s. To have fewer endpoints one either has to run distributed SEs a la NorduGrid (which is obviously possible with dCache and should be possible "soon" with DPM, cf. the discussion last week), or one has to run fewer sites with storage, a la the T2Cs/T2Ds model. Technically it is possible, but in the case of DPM it will still need some work/testing; however, there may also be political issues, and issues with how well it works for the experiments, as the "federation" now happens at a different level and there are again different ways to approach the problem. For example, storage at a non-endpoint T2 could be:

(a) a cache;
(b1) part of a distributed SE;
(b2) part of a distributed SE holding replicas of data;
(b3) part of a distributed SE holding cunningly coded data blocks to minimise the risk of loss and maximise availability while keeping inter-site transfers low; or
(c) federated (in the sense that jobs will always call out and pick stuff off a remote SE).

Among these, (a) currently looks most attractive. Generally more testing is needed, also because DPM's caching mode is still being developed (see last week). The (b) options may be less attractive, partly politically, but also because a badly performing site can drag the whole thing down (SAM availability tests failed with what was thought to be a GFAL problem but turned out to be a timeout issue, where the timeouts in the tests were shorter than those used in production; see the gfal2 sketch at the end of this item). The (b) options will also need migration of data from the old SE to the new SE. Option (c) is basically what we are working with at the moment. Option (a) will raise questions about latency on the initial read; CMS files are generally considered "more cache efficient". Option (b3) in particular will need Cunning and Clever Stuff, which is of course always exciting but may not be palatable to the experiments... Also, as Matt points out, debugging becomes harder the more distributed and complicated your infrastructure is. Balancing and tuning is not effortless even with DPM, as seen at an Ewan-less Oxford. In general we should also not test things that will not be options for the future: for example, if sites won't run distributed SEs, then there is no point in testing that.
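On the timeout point: a minimal sketch of where such a client-side timeout lives, using the gfal2 Python bindings. The endpoint URLs and the 300-second value below are invented for illustration and this is not the actual SAM test code; the point is only that a test harness configured with a shorter timeout than production will fail first on a slow but otherwise healthy SE.

    import gfal2

    ctx = gfal2.creat_context()   # the binding really does spell it "creat_context"

    params = ctx.transfer_parameters()
    params.timeout = 300          # seconds; a test harness might use a short value like this
    params.overwrite = True
    params.create_parent = True

    # Placeholder URLs - substitute real endpoints.
    src = "srm://source.example.ac.uk/dpm/example.ac.uk/home/vo/testfile"
    dst = "srm://dest.example.ac.uk/dpm/example.ac.uk/home/vo/testfile"

    try:
        ctx.filecopy(params, src, dst)
    except gfal2.GError as err:
        # A transfer that would have completed under the longer production
        # timeout shows up here as a failure against the shorter test timeout.
        print("transfer failed: %s" % err)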
It is expected that a site will be able to choose: if they feel comfortable running a particular infrastructure, then they should be able to do so. It follows that the various options could/should have an estimated level of effort attached to them. Other things will feed into this process, such as the new DPM developments (see last week once again) and the Rucio/cache integration we have also talked about before (Durham). ATLAS's "event service" should also be nearing production, where individual events are addressed instead of the whole ROOT file.

How is this GDB going to decide on things to move forward? It is likely to take several iterations, but fewer than infinite in number, and to lead to the creation of a task force where GridPP ought to be represented (so we have a say in how things evolve, and because we are good at doing useful stuff). In the past, task forces have tended to be quite open, although with the merger between CERN IT and WLCG ops things seem to have become more formalised.

2. Round table updates - what would you be working on if you had time to do something Interesting™?

This type of session is useful to give people who do not always say something a chance to speak, and to identify common items of interest (e.g. DOME, cloud, etc.), but as we were running out of time we decided to postpone it.

3. AOB

Alessandra Forti: (04/05/2016 10:07:08) I am

Daniel Peter Traynor: (10:14 AM) Last week (27/4 to 4 May) data rates at QM: WAN IN avg 3.8 Gb/s (peak 12.8 Gb/s); OUT avg 4.4 Gb/s (peak 14.4 Gb/s). Internal storage activity: writes avg 4.8 Gb/s (peak 14.6 Gb/s), essentially equal to the WAN IN; reads avg 11.1 Gb/s (peak 26.4 Gb/s). It's interesting to see more data leave QM than enter.

Jens Jensen: (10:15 AM) thanks Daniel

Daniel Peter Traynor: (10:25 AM) For sites with an existing storage infrastructure (Lustre/GPFS/HDFS/GlusterFS) you need grid software to sit on top (StoRM / dpmlite). A Ceph-based cluster could be used in different ways by different users, so it is probably most useful for multi-tenant clusters (e.g. a separate POSIX partition for local users, xrootd for grid). If a site can afford to run a separate storage system then DPM's fine, but it is a grid-only solution (I would say). I would guess a lot more resources will be provided via shared clusters.

Marcus Ebert: (10:33 AM) or a single gridpp-endpoint...

Matt Doidge: (10:40 AM) Debugging problems is the hard thing I find. Is it the headnode? Is it the pool node? Is it the client?

Daniel Peter Traynor: (10:43 AM) a cache that could be used by anyone/anysite, be it a T2/T3/laptop
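Following up Daniel's last point, a hedged sketch of what "a cache usable by anyone" looks like from the client side, using the XRootD Python bindings. The cache host name and file path are invented for illustration; any xrootd-speaking caching proxy sitting in front of an origin SE could play this role.

    from XRootD import client

    CACHE = "root://cache.example.ac.uk:1094/"   # hypothetical caching proxy endpoint
    PATH = "/atlas/datafile.root"                # hypothetical file path

    f = client.File()
    status, _ = f.open(CACHE + PATH)
    if not status.ok:
        raise RuntimeError("open failed: %s" % status.message)

    # The first read is served from the origin SE (and populates the cache);
    # later reads of the same blocks come from the cache's local disk,
    # whether the client is a T2 worker node, a T3 box or a laptop.
    status, data = f.read(offset=0, size=1024 * 1024)
    print("read %d bytes" % len(data))
    f.close()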