GridPP Storage Meeting minutes / notes 11 Mar 2020, [minutes by Sam]

Ceph post mortem at CERN

Issue was actually due to a bug in the upstream lz4 library (CERN turned
on block compression in the Ceph Block Store in late December 2019).
Bug has since been fixed in lz4, but CentOS 7 has an old, pre-fix version.
Also since fixed in a recent Nautilus release.

Question as to whether we use lz4 in other systems: ZFS with compression
has lz4 as an option [but not the default]. Bug would only be an issue
if stream to lz4 was misaligned or memory fragmented, so depends on
ZFS internals...

----

Teng reported on the DOMA ACCESS meeting & white paper; the next meeting
will be joint HSF * Access. Teng will be presenting on cache performance
with info from UK sites [links to items in chat log]


Matt from Lancaster discussing our policies of cannibalising old nodes
to replace missing disks from RAIDsets (the "gradual phasing out"
policy for out of warranty kit). Do we need Enterprise Disks?

Backblaze did a study on this.

Pete notes Oxford have tried this locally using non-enterprise disks
in a chassis. $250 for 8TB disks?

Pete also asked: What should core sites be doing with their money?
Resilience? Removing out of warranty disks? Can a core site "grow
larger" if it decommissions storage as soon as warranty is exceeded?


Rob Appleyard noted that the Tier-1 phases things out: 10% 1 year post
warranty, 20% after 2 (and then removes).

Matt: 1 week after warranty expires, we lost disks (at Lancs).

[Do disks just last less long now than they used to?] [We don't tend
to get "cold spares for free" anymore, as presumably margins are also
narrower at our providers!]


--
Matt notes that Lancs is doing much better than the other DPM sites
with Rucio/SKA [it has been observed by Alessandra]. Matt thinks this
might be because he's *not* done updates as recently as Manchester?
[1.10.3.0 or 3.1?]

ACTION: people should report the version of DPM they're running with
DOME for comparison.

---
SARS-CoV-2 / COVID-19 planning

Operational model (from RAL) - we all work from home except for
skeleton staff for essential activities.

Pete is self-isolating! Oxford don't have a full plan yet, but are
working on it.

RAL test - had a "whole Tier1 meeting" morning & evening mostly
raising issues in Zoom. Slack for coordinating within groups.

"We can mostly work from home, so..."
---------------

Ceph @ Glasgow now at Nautilus (and rebuilt xrootd & gridftp for it).
Adding new disks slowly - to talk to Rob Appleyard offline about the
best practice for this [OSDs one by one versus adding all of them at
low weight and then adjusting weight up over time]
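
As a rough illustration of the second option (adding all new OSDs at low
weight and stepping the weight up), the sketch below just prints batches
of "ceph osd crush reweight" commands for an evenly spaced ramp. The OSD
ids, target weight and number of steps are hypothetical values chosen for
the example, not anything agreed at the meeting, and it is not a statement
of what Glasgow or RAL actually do.

```python
def weight_ramp(target_weight, steps):
    """Return `steps` evenly spaced CRUSH weights, ending at target_weight."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [round(target_weight * i / steps, 4) for i in range(1, steps + 1)]


def reweight_commands(osd_ids, target_weight, steps):
    """Yield one batch of 'ceph osd crush reweight' command strings per step."""
    for weight in weight_ramp(target_weight, steps):
        yield [f"ceph osd crush reweight osd.{osd} {weight}" for osd in osd_ids]


if __name__ == "__main__":
    # Hypothetical values: three new OSDs, each ramped to weight 8.0 in 4 steps.
    for batch in reweight_commands([120, 121, 122], 8.0, 4):
        print("\n".join(batch))
        print("# wait for backfill to finish before applying the next batch")
```

The commands are only printed, not run, so an operator can check cluster
health between batches before applying the next one.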


------------------

Pete: how do we report IRIS storage money? Would like to change this
to report allocation per VO?

SRR

Ste noted that this is now work with the wlcg-information reporting taskforce

Rationale: need to show we're providing resources to IRIS.

Can we just provide a "UK website"?

--
ACTION: Sam to collect current SRR status for sites, plan for
repo/visualisation report.
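
A minimal sketch of how per-VO numbers might be pulled out of a site's SRR
storagesummary.json for such a report. The field names used here
("storageservice", "storageshares", "vos", "totalsize", "usedsize") are an
assumption about the SRR layout and should be checked against what sites
actually publish; the sample document and its sizes are made up for
illustration.

```python
import json
from collections import defaultdict


def per_vo_totals(srr):
    """Sum totalsize and usedsize (bytes) per VO across all storage shares.

    A share listing several VOs is counted in full for each of them.
    """
    totals = defaultdict(lambda: {"totalsize": 0, "usedsize": 0})
    for share in srr.get("storageservice", {}).get("storageshares", []):
        for vo in share.get("vos", []):
            totals[vo]["totalsize"] += share.get("totalsize", 0)
            totals[vo]["usedsize"] += share.get("usedsize", 0)
    return dict(totals)


# Hypothetical SRR-style document; names and sizes are illustrative only.
SAMPLE = json.loads("""
{"storageservice": {"name": "example-se", "storageshares": [
  {"name": "atlas-disk", "vos": ["atlas"], "totalsize": 1000, "usedsize": 600},
  {"name": "atlas-scratch", "vos": ["atlas"], "totalsize": 200, "usedsize": 50},
  {"name": "dteam", "vos": ["dteam"], "totalsize": 10, "usedsize": 1}
]}}
""")

if __name__ == "__main__":
    for vo, sizes in sorted(per_vo_totals(SAMPLE).items()):
        print(vo, sizes["usedsize"], "/", sizes["totalsize"], "bytes")
```

In practice the input would be the JSON fetched from each site's published
SRR URL rather than an embedded sample.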


 From Duncan Rand to Everyone:  10:06 AM
Is there a link to the report?
 From Winnie Lacesso to Everyone:  10:06 AM
https://indico.cern.ch/event/893845/
 From Me to Everyone:  10:09 AM
https://tracker.ceph.com/issues/39525
 From Teng to Everyone:  10:12 AM
Sorry, Rob trying to talk
Something wrong with his mic
 From rob-currie to Everyone:  10:13 AM
Yes ZFS, uses LZ4, sorry couldn't type there even...
 From Teng to Everyone:  10:23 AM
https://indico.cern.ch/event/895814/
https://indico.cern.ch/event/895815/
 From Vip (Oxford) to Everyone:  10:29 AM
sorry £200 for 8TB disk
 From Daniel Traynor to Everyone:  10:30 AM
If we decommissioned out of warranty storage we would lose 3 PB
We decommission 1 or 2 nodes for spares
 From Daniel Traynor to Everyone:  10:37 AM
We have to pay for cold spares!
 From Vip (Oxford) to Everyone:  10:41 AM
we have dpm 1.13.0-1 on all our pool nodes
 From Daniel Traynor to Everyone:  10:46 AM
It's all fine unless something physically breaks.
Better than daytime TV
 From Matt Doidge to Everyone:  10:54 AM
https://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/dteam/SRR/storagesummary.json
https://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/atlas/SRR/storagesummary.json
That one's up to date.
 From Winnie Lacesso to Everyone:  10:54 AM
"Crusty"?!? old?!
 From Daniel Traynor to Everyone:  10:54 AM
Vapour? Replacing rebus
 From Ste to Everyone:  10:57 AM
https://twiki.cern.ch/twiki/bin/view/EGEE/WLCGISEvolution
 From Daniel Traynor to Everyone:  10:57 AM
REM JSON is replacing SRM, QMUL still has SRM