GridPP Storage Meeting minutes / notes
11 Mar 2020 [minutes by Sam]

Ceph post mortem at CERN

The issue was actually due to a bug in the upstream lz4 library (CERN turned on block compression in the Ceph Block Store in late December 2019). The bug has since been fixed in lz4, but CentOS 7 has an old, pre-fix version. It has also since been fixed in a recent Nautilus release.

Question as to whether we use lz4 in other systems: ZFS with compression has lz4 as an option [but not the default]. The bug would only be an issue if the stream passed to lz4 was misaligned or memory was fragmented, so it depends on ZFS internals...

----

Teng reported on the DOMA ACCESS meeting & white paper; the next meeting will be a joint HSF and ACCESS meeting. Teng will be presenting on cache performance with info from UK sites [links to items in chat log].

Matt from Lancaster discussed our policies of cannibalising old nodes to replace missing disks from RAID sets (the "gradual phasing out" policy for out-of-warranty kit).

Do we need enterprise disks? Backblaze did a study on this. Pete notes Oxford have tried this locally using non-enterprise disks in a chassis. $250 for 8TB disks?

Pete also asked: what should core sites be doing with their money? Resilience? Removing out-of-warranty disks? Can a core site "grow larger" if it decommissions storage as soon as the warranty is exceeded?

Rob Appleyard noted that the Tier-1 phases things out: 10% 1 year post warranty, 20% after 2 (and then removes).

Matt: 1 week after warranty expired, we lost disks (at Lancs). [Do disks just last less long now than they used to?] [We don't tend to get "cold spares for free" anymore, as presumably margins are also narrower at our providers!]

--

Matt notes that Lancs is doing much better than the other DPM sites with Rucio/SKA [this has been observed by Alessandra]. Matt thinks this might be because he has *not* done updates as recently as Manchester? [1.10.3.0 or 3.1?]

ACTION: people should report the version of DPM they're running with DOME for comparison.

---

SARS-CoV-2 / COVID-19 planning

Operational model (from RAL): we all work from home except for skeleton staff for essential activities.

Pete is self-isolating! Oxford don't have a full plan yet, but are working on it.

RAL test: had a "whole Tier1 meeting" morning & evening, mostly raising issues in Zoom. Slack for coordinating within groups. "We can mostly work from home, so..."

---------------

Ceph @ Glasgow is now at Nautilus (and xrootd & gridftp have been rebuilt for it). Adding new disks slowly; to talk to Rob Appleyard offline about the best practice for this [adding OSDs one by one versus adding all of them at low weight and then adjusting the weight up over time].

------------------

Pete: how do we report IRIS storage money? Would like to change this to report allocation per VO?

SRR: Ste noted that this work now sits with the wlcg-information reporting taskforce.

Rationale: we need to show we're providing resources to IRIS. Can we just provide a "UK website"?

--

ACTION: Sam to collect current SRR status for sites, plan for repo/visualisation report.

------------------

Chat log:

From Duncan Rand to Everyone: 10:06 AM
Is there a link to the repore?
report

From Winnie Lacesso to Everyone: 10:06 AM
https://indico.cern.ch/event/893845/

From Me to Everyone: 10:09 AM
https://tracker.ceph.com/issues/39525

From Teng to Everyone: 10:12 AM
Sorry, Rob trying to talk
Something wrong with his mic

From rob-currie to Everyone: 10:13 AM
Yes ZFS, uses LZ4, sorry couldn't type there even...

From Teng to Everyone: 10:23 AM
https://indico.cern.ch/event/895814/
https://indico.cern.ch/event/895815/

From Vip (Oxford) to Everyone: 10:29 AM
sorry £200 for 8TB disk

From Daniel Traynor to Everyone: 10:30 AM
If we decommissioned out of warranty storage we would lose 3 PB
We decommission 1 or 2 nodes for spares

From Daniel Traynor to Everyone: 10:37 AM
We have to pay for cold spares!

From Vip (Oxford) to Everyone: 10:41 AM
we dpm 1.13.0-1 on all our pool nodes
have

From Daniel Traynor to Everyone: 10:46 AM
It's all fine unless something physically breaks.
Better than daytime TV

From Matt Doidge to Everyone: 10:54 AM
https://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/dteam/SRR/storagesummary.json
https://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/atlas/SRR/storagesummary.json
That one's up to date.

From Winnie Lacesso to Everyone: 10:54 AM
"Crusty"?!? old?!

From Daniel Traynor to Everyone: 10:54 AM
Vapour?
Replacing rebus

From Ste to Everyone: 10:57 AM
https://twiki.cern.ch/twiki/bin/view/EGEE/WLCGISEvolution

From Daniel Traynor to Everyone: 10:57 AM
REM JSON is replacing SRM, QMUL still has SRM
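
------------------

Note on the SRR action and the "allocation per VO" question: a minimal sketch of how a per-VO summary might be pulled out of a site's storagesummary.json (such as the ones Matt linked). This is illustrative only. It assumes the file has a top-level "storageservice" object containing a "storageshares" list whose entries carry "totalsize", "usedsize" and "vos" fields; that layout should be checked against what each site actually publishes. The sample document and the function name allocation_per_vo are made up for the example, and no site is contacted.

```python
import json
from collections import defaultdict

# Hypothetical sample, standing in for a downloaded storagesummary.json.
# The field names are an ASSUMPTION about the SRR layout; the values are invented.
SAMPLE_SRR = """
{"storageservice": {
   "name": "example-se",
   "storageshares": [
     {"name": "atlas-disk", "totalsize": 4000, "usedsize": 1000, "vos": ["atlas"]},
     {"name": "shared",     "totalsize": 1000, "usedsize": 250,  "vos": ["dteam", "ska"]}
   ]}}
"""


def allocation_per_vo(srr_text):
    """Sum totalsize/usedsize per VO over all storage shares.

    A share listing several VOs is counted in full for each of them,
    so the per-VO figures can add up to more than the site total.
    """
    doc = json.loads(srr_text)
    shares = doc.get("storageservice", {}).get("storageshares", [])
    totals = defaultdict(lambda: {"totalsize": 0, "usedsize": 0})
    for share in shares:
        for vo in share.get("vos", []):
            totals[vo]["totalsize"] += share.get("totalsize", 0)
            totals[vo]["usedsize"] += share.get("usedsize", 0)
    return dict(totals)


if __name__ == "__main__":
    for vo, sizes in sorted(allocation_per_vo(SAMPLE_SRR).items()):
        print(vo, sizes["totalsize"], sizes["usedsize"])
```

Running the same function over each site's file would give one table per site, which could feed the repo/visualisation report; how shares used by more than one VO should be split is a policy choice the sketch does not make.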