Attending: Gareth, Jens (chair+mins), John B, John H, Adam, Winnie, Sam, Brian, David, Elena, Matt D, Jeremy, Duncan, Steve, Robert, Govind, Ewan, Peter L; Dave C at 10.28. Special guest stars (from RAL): Tom Byrne, Bruno Canning, Alistair Dewhurst, George Vasilakakos; Tom presenting.

0. Operational blog posts

Sam has things he ought to report in the blog. Many people have things they ought to report in the blog...

1. Wahid's report from last week's pre-GDB

Wahid's presentation: https://indico.cern.ch/event/319743/session/1/contribution/2/material/slides/1.pdf

* CMS still use SRM.
* Issues with DAV for deletion; SRM is quicker for bulk deletion.
* Choice of space if not using a space token (e.g. via xroot or WebDAV).
* CEPH needs an extra layer anyway to manage the ingest and retrieval of data.
* Support for third-party transfer is essential - like GridPP.
* Need to remove RFIO as a protocol, also for DPM's internal communication.
* If only BringOnline is used in the SRM API, there would be little need for SRM even for tape, as BringOnline alone could be implemented more simply (a gfal2-based sketch appears at the end of these minutes).
* lcg-utils is still being used despite being first deprecated and now unsupported.

Note the inertia: people are still using RFIO with CASTOR (at CERN), and still using lcg-utils. It takes a long time for people to change, particularly when they have few resources to modify their middleware.

dCache talked about standards. There are three kinds, if you will: those from a standards body such as the IETF or OGF (HTTP and DAV from the former, SRM and GridFTP from the latter); those that are de facto standards but arising out of our own community (such as xroot); and those that are de facto standards but owned by someone outside our community (like S3). The risk is obviously higher with the latter, as the chance of the protocol being changed by circumstances outside our control is higher.

2. Update from the RAL CEPH team

EOS Diamond (as usual, no relation to either EOS or Diamond): a plugin for xroot; this worked on Firefly, which is the older CEPH release. The infrastructure is currently running on older hardware; RAL saw bad performance and then the cluster died. Experience shows that one OSD per core is needed, and that usage is not necessarily balanced. Guidelines: 1 GHz per OSD and 1 GB RAM per OSD (RAL is seeing ~700 MB/OSD). All the nodes in the cluster should be kept up to spec. RAL is using erasure coding (EC), not replication.

A major upgrade of the (not heavily loaded) test cluster took 30 minutes; a similar upgrade on a production cluster went well, except that some OSDs needed restarting, and one was wiped because its filesystem had corrupted and it was easier to rebuild it than to try to rescue it.

Gateway: it is easier to get things out the same way they were put in. S3 support in FTS: you will need to give it your key...
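As a rough illustration of the S3-in-FTS point (a sketch, not a tested recipe): submitting a transfer whose destination is an S3 endpoint with the FTS3 Python "easy" bindings. The FTS endpoint and both URLs are hypothetical, and the S3 access/secret key pair has to be registered with the FTS service beforehand - that is the "give it your key" step.

    # Sketch (untested): submit an FTS3 transfer with an S3 destination.
    # Assumes the fts3-rest Python client is installed and a valid grid proxy
    # is available; all names/URLs below are made up for illustration.
    import fts3.rest.client.easy as fts3

    endpoint = 'https://fts3.example.ac.uk:8446'  # hypothetical FTS3 REST endpoint
    source = 'gsiftp://gridftp.example.ac.uk/dpm/example.ac.uk/home/vo/file.dat'
    destination = 's3://s3.example.ac.uk/mybucket/file.dat'  # S3 keys registered with FTS beforehand

    context = fts3.Context(endpoint)              # picks up the grid proxy by default
    transfer = fts3.new_transfer(source, destination)
    job = fts3.new_job([transfer], verify_checksum=False)
    print(fts3.submit(context, job))              # prints the job id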
3. AOB

Chat log:

wahid: (21/01/2015 10:04:12)
https://indico.cern.ch/event/319817/
https://indico.cern.ch/event/319743/session/1/contribution/2/material/slides/1.pdf

Samuel Cadellin Skipsey: (10:11 AM)
It should be noted that there's a conflation of "protocol" and "device type" when talking about CEPH. CephFS, S3 and RBD "storage" hosted on CEPH aren't mutually compatible at the low level, but that's not because of the protocol they are accessed by - it's because they represent entirely different types of "virtual storage space". There's nothing stopping you from writing multiple *protocol* shims that all back onto the same kind of CEPH high-level storage paradigm.

Ewan Mac Mahon: (10:33 AM)
So is that one core per disk? That's somewhat CPU heavy compared with what we do at the moment.

Michael Adam James Huffman: (10:34 AM)
Isn't it 1G RAM per OSD? (as the recommendation)

Ewan Mac Mahon: (10:36 AM)
That's a /lot/ of CPU. Say a 1PB cluster =~ 250 4TB disks (or 750 disks for 3x replication); that number of cores is a respectable compute cluster in its own right.

Jens Jensen: (10:43 AM)
:-) (May need to speed up a bit - we should finish by 11 at the latest...)

Samuel Cadellin Skipsey: (10:51 AM)
(It's called radosstriper for those looking for it in the source.)
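Footnote to item 1: a rough, untested sketch of the "BringOnline only" idea using the gfal2 Python bindings - staging a tape-resident file and polling until it is online, with none of the other SRM machinery. The SURL and timings below are made up, and the exact return convention may differ between gfal2 versions.

    # Sketch (untested): stage a tape-resident file via BringOnline only,
    # using the gfal2 Python bindings. SURL and timings are hypothetical.
    import time
    import gfal2

    surl = 'srm://castor.example.ac.uk/castor/example.ac.uk/vo/file.dat'  # made-up SURL
    pintime = 3600    # seconds the replica should stay pinned on disk
    timeout = 86400   # how long we are prepared to wait for staging

    ctx = gfal2.creat_context()
    status, token = ctx.bring_online(surl, pintime, timeout, True)  # asynchronous request
    while status == 0:            # 0 is taken here to mean "still queued on the tape system"
        time.sleep(60)
        status = ctx.bring_online_poll(surl, token)
    print('file is online, request token:', token)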