Attending: Jens (chair+mins), Steve, Alessandra, John B, John H, Daniel, Marcus, Winnie, Brian, Kashif, Sam, Elena, Matt, Raja, Raul, Pete.

Apologies: Tom. Also apologies from Alastair, who was originally to have reported on HammerCloud testing but hasn't finished the results; he will report next week.

0. Operational blog posts

No new blog posts - and no operational issues.

Is the egress waiver from cloud providers for academic users interesting? Brian points out one should read the small print... Jens had recently met people from Microsoft, which is now waiving egress (from Azure) for academic users. If your result dataset is (much) smaller than the original dataset then the waiver matters less: you can just delete the original dataset and download only the result (the original dataset would be a replica anyway). Maybe there is an opportunity to do interesting things with CSPs?

1. There's a GDB next week with lots of GridPP folks speaking. http://indico.cern.ch/event/394782/

Alessandra is speaking for ATLAS but also reporting on GridPP, as will Andrew McNab, who will not just speak for LHCb.

Federated storage:
* Is it only ATLAS?
* Duncan monitored CMS use (AAA); Glasgow and QMUL run T3s for CMS, so they use AAA.
* VAC is used by ATLAS - only UCL has no storage.

Part of the question is the future evolution of the T2s. To have fewer endpoints one either has to run distributed SEs a la NorduGrid (which is obviously possible with dCache and should be possible "soon" with DPM, cf. the discussion last week), or one has to run fewer sites with storage, a la the T2Cs/T2Ds model. Technically it is possible, but in the case of DPM it will still need some work/testing; however, there may also be political issues, and issues with how well it works for the experiments, as the "federation" now happens at a different level and there are again different ways to approach the problem. For example, storage at a non-endpoint T2 could be:

(a) a cache;
(b1) part of a distributed SE;
(b2) part of a distributed SE holding replicas of data;
(b3) part of a distributed SE holding cunningly coded data blocks to minimise the risk of loss and maximise availability while keeping inter-site transfers low; or
(c) federated (in the sense that jobs will always call out and pick stuff off a remote SE).

Among these, (a) currently looks most attractive. Generally more testing is needed, also because DPM's caching mode is still being developed (see last week). The (b) options may be less attractive, partly politically, but also because a badly performing site can drag the whole thing down (SAM availability tests failed with what was thought to be a GFAL problem but turned out to be a timeout issue, where the timeouts in the tests were shorter than those used in production; see the gfal2 sketch at the end of this item). The (b) options will also need migration of data from the old SE to the new SE. Option (c) is basically what we are working with at the moment. Option (a) will raise questions about latency on the initial read; CMS files are generally considered "more cache efficient". Option (b3) in particular will need Cunning and Clever Stuff, which is of course always exciting but may not be palatable to the experiments... Also, as Matt points out, debugging becomes harder the more distributed and complicated your infrastructure is. Balancing and tuning is not effortless even with DPM, as seen at an Ewan-less Oxford. In general we should also not test things that will not be options for the future: for example, if sites won't run distributed SEs, then there is no point in testing that.
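On the timeout point: a minimal sketch of where such a client-side timeout lives, using the gfal2 Python bindings. The endpoint URLs and the 300-second value below are invented for illustration and this is not the actual SAM test code; the point is only that a test harness configured with a shorter timeout than production will fail first on a slow but otherwise healthy SE.

    import gfal2

    ctx = gfal2.creat_context()   # the binding really does spell it "creat_context"

    params = ctx.transfer_parameters()
    params.timeout = 300          # seconds; a test harness might use a short value like this
    params.overwrite = True
    params.create_parent = True

    # Placeholder URLs - substitute real endpoints.
    src = "srm://source.example.ac.uk/dpm/example.ac.uk/home/vo/testfile"
    dst = "srm://dest.example.ac.uk/dpm/example.ac.uk/home/vo/testfile"

    try:
        ctx.filecopy(params, src, dst)
    except gfal2.GError as err:
        # A transfer that would have completed under the longer production
        # timeout shows up here as a failure against the shorter test timeout.
        print("transfer failed: %s" % err)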
It is expected that a site will be able to choose: if they feel comfortable running a particular infrastructure, then they should be able to do so. It follows that the various options could/should have an estimated level of effort attached to them. Other things will feed into this process, such as the new DPM developments (see last week once again) and the Rucio/cache integration we have also talked about before (Durham). ATLAS's "event service" should also be nearing production, where individual events are addressed instead of the whole ROOT file.

How is this GDB going to decide on things to move forward? It is likely to take several iterations, but fewer than infinite in number, and to lead to the creation of a task force where GridPP ought to be represented (so we have a say in how things evolve, and because we are good at doing useful stuff). In the past, task forces have tended to be quite open, although with the merger between CERN IT and WLCG ops things seem to have become more formalised.

2. Round table updates - what would you be working on if you had time to do something Interesting™?

This type of session is useful to give people who do not always say something a chance to speak, and to identify common items of interest (e.g. DOME, cloud, etc.), but as we were running out of time we decided to postpone it.

3. AOB

Alessandra Forti: (04/05/2016 10:07:08) I am

Daniel Peter Traynor: (10:14 AM) Last week (27/4 to 4 May) data rates at QM: WAN IN avg 3.8 Gb/s (peak 12.8 Gb/s); OUT avg 4.4 Gb/s (peak 14.4 Gb/s). Internal storage activity: writes avg 4.8 Gb/s (peak 14.6 Gb/s), essentially equal to the WAN IN; reads avg 11.1 Gb/s (peak 26.4 Gb/s). It's interesting to see more data leave QM than enter.

Jens Jensen: (10:15 AM) thanks Daniel

Daniel Peter Traynor: (10:25 AM) For sites with an existing storage infrastructure (Lustre/GPFS/HDFS/GlusterFS) you need grid software to sit on top (StoRM / dpmlite). A Ceph-based cluster could be used in different ways by different users, so it is probably most useful for multi-tenant clusters (e.g. a separate POSIX partition for local users, xrootd for grid). If a site can afford to run a separate storage system then DPM's fine, but it is a grid-only solution (I would say). I would guess a lot more resources will be provided via shared clusters.

Marcus Ebert: (10:33 AM) or a single gridpp-endpoint...

Matt Doidge: (10:40 AM) Debugging problems is the hard thing I find. Is it the headnode? Is it the pool node? Is it the client?

Daniel Peter Traynor: (10:43 AM) a cache that could be used by anyone/anysite, be it a T2/T3/laptop
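Following up Daniel's last point, a hedged sketch of what "a cache usable by anyone" looks like from the client side, using the XRootD Python bindings. The cache host name and file path are invented for illustration; any xrootd-speaking caching proxy sitting in front of an origin SE could play this role.

    from XRootD import client

    CACHE = "root://cache.example.ac.uk:1094/"   # hypothetical caching proxy endpoint
    PATH = "/atlas/datafile.root"                # hypothetical file path

    f = client.File()
    status, _ = f.open(CACHE + PATH)
    if not status.ok:
        raise RuntimeError("open failed: %s" % status.message)

    # The first read is served from the origin SE (and populates the cache);
    # later reads of the same blocks come from the cache's local disk,
    # whether the client is a T2 worker node, a T3 box or a laptop.
    status, data = f.read(offset=0, size=1024 * 1024)
    print("read %d bytes" % len(data))
    f.close()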