Attending: Jens (chair+mins), Ste, Teng, Wenlong, Winnie, Raja, Dan, Rob, Sam, Matt

Ste: DOME upgrade successful. Work on resource reporting for LHCb. Some minor tickets to resolve. Support for mu3e (https://en.wikipedia.org/wiki/Mu3e, anti-muon decay): mu3e may be using DFC, with CVMFS for software; expecting 20 TB per run, and should be scaling up towards 2024.

Teng: xcache monitoring at Edinburgh and Birmingham; it is enabled but not yet public due to security concerns with the ELK stack. Also Rucio for DUNE, and Rucio monitoring for SKA.

Wenlong: testing EOSfs storage, comparing it to ZFS for performance and redundancy (i.e. durability); expecting to report by the end of the year.

Winnie: storage has gone from 100% working to 100% kablooie. Luke is not available to sherlockholmes it, but Sam offered to take a look, although he'd not have the same powers of intervention that Luke has. Could be volume related - PhEDEx load tests?

Raja: two tickets open, both RAL ECHO related:
https://ggus.eu/index.php?mode=ticket_info&ticket_id=142350
https://ggus.eu/?mode=ticket_info&ticket_id=143323
Also, problems with user jobs trying to clean up data areas and crashing - a problem with the xroot proxy? The proxy developers don't think so, though.

Dan: buying new storage for GridPP/IRIS: LTS Lustre, expecting to copy the old stuff over to the new and get more space. There is CentOS 8 client support. Also reported a reworking of StoRM, based on nginx.

Rob: DOME upgrade successful; had expected to do the CentOS upgrade first and then DOME+Puppet, but it ended up being the other way around; mostly no issues in the DOME+Puppet upgrade. Machines on CentOS 6.4 or thereabouts now need upgrading to CentOS 7, or maybe CentOS 8. Also, RDF needs DOME, but there may be issues with GridFTP support - expecting to upgrade by the end of the year.

Sam: CEPH currently powered down for a machine room intervention and power testing, but once it is back up, expects to do some local load testing (of CEPH, not the power). There are external and internal endpoints, which means endpoints can be protected in different ways, like supporting certificate-less access for local users and Argus-banned access on DPM for external users. Also reported the loss of a very old disk server: a disk failure led it to try to rebuild, but additional failures during the rebuild led to 28 TB of data loss. The moral is not to run stuff on ancient hardware - which Sam of course knows perfectly well and had warned people about, but upgrading hardware also costs money.

Matt: script to do the Argus banning for DPM (linked in the chat below; a rough sketch of the idea follows at the end of these minutes). Sam and Matt reported the differences in approach: Glasgow had called out to Argus for every user access, whereas Matt caches a local banning file, which could also be used by other tools. It could potentially also be used with the UK Argus rather than the site one, reducing the number of services a site needs to run. The latency of banning was tested with David Crooks.

Jens: wants to do more on bringing docs up to date (moaning about it on a regular basis), so hoping to have a bit of dedicated time in Q1 2020. This includes work on testing Rucio as a data management recommendation for new VOs/IRIS (a sketch follows below). Slightly limited time this quarter, though, so hoping to catch up next year.
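As a non-authoritative illustration of what a Rucio recommendation for a new VO might eventually document, here is a minimal client-side sketch. It assumes a working rucio.cfg and valid credentials; the account, scope, dataset, file path and RSE names are all invented placeholders, not an agreed setup.

    """Minimal Rucio client sketch for a hypothetical new VO.
    Assumes a configured rucio.cfg and valid credentials; all names
    below (account, scope, dataset, file, RSEs) are placeholders."""
    from rucio.client import Client
    from rucio.client.uploadclient import UploadClient

    client = Client()

    # Register a scope and a dataset to hold the VO's files
    # (the account and scope names are invented for illustration).
    client.add_scope(account="newvo_admin", scope="newvo")
    client.add_dataset(scope="newvo", name="run-2019-test")

    # Upload one file and attach it to the dataset in a single call.
    UploadClient().upload([{
        "path": "/data/sample.root",   # placeholder local file
        "rse": "UKI-EXAMPLE-DISK",     # placeholder RSE name
        "did_scope": "newvo",
        "dataset_scope": "newvo",
        "dataset_name": "run-2019-test",
    }])

    # Ask Rucio to keep two replicas of the dataset on Tier-2 disk.
    client.add_replication_rule(
        dids=[{"scope": "newvo", "name": "run-2019-test"}],
        copies=2,
        rse_expression="tier=2",
    )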
From Me to Everyone: 10:05 AM
    Brian suggested they were mumumue
From Daniel Traynor to Everyone: 10:06 AM
    We (you) can document, and we can update the "how to start on the grid" docs on the web
From Ste to Everyone: 10:11 AM
    Here it is, for now...
    http://hep.ph.liv.ac.uk/~sjones/user-guides/data-on-the-grid/data-on-the-grid.html
From Daniel Traynor to Everyone: 10:27 AM
    I still have 1.3 PB of usable storage using R510s
From Nandakumar, Raja (STFC,RAL,PPD) to Everyone: 10:33 AM
    Apologies - I need to leave now...
From Daniel Traynor to Everyone: 10:33 AM
    Also I’ve helped put together a case for a 100G WAN link upgrade for QMUL, being led by Research ITS (Chris Walker). It should go in in the next few weeks. Internally, QMUL is upgrading to a 100G+ backbone, with data transfer zones included at the start.
From Matt Doidge to Everyone: 10:37 AM
    https://github.com/mdoidge/dpmban/blob/master/dpmban.sh
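For the cached-ban-file approach from Matt's item, here is a rough sketch of the shape such a script could take. This is not a transcription of the dpmban.sh linked just above (see that for the real implementation): the cache paths and the ban/unban commands below are invented placeholders, and the actual DPM banning call is site-specific.

    #!/usr/bin/env python3
    """Sketch of a cached-ban-file workflow in the spirit of Matt's
    approach: keep a local cache of banned DNs (reusable by other
    tools) and only call out to DPM when an entry changes. Paths and
    commands are illustrative placeholders, not the real dpmban.sh."""
    import subprocess
    from pathlib import Path

    BAN_CACHE = Path("/var/cache/dpmban/banned_dns.txt")  # hypothetical cache
    NEW_LIST = Path("/var/cache/dpmban/banned_dns.new")   # freshly fetched list

    def read_dns(path):
        """Return the set of DNs in a file, one per line, skipping
        blanks and comments."""
        if not path.exists():
            return set()
        return {line.strip() for line in path.read_text().splitlines()
                if line.strip() and not line.startswith("#")}

    def main():
        old, new = read_dns(BAN_CACHE), read_dns(NEW_LIST)
        # Only touch DPM for DNs whose status actually changed.
        for dn in sorted(new - old):
            # Placeholder command: the real banning call is
            # site-specific in DOME-era DPM.
            subprocess.run(["site-dpm-ban", dn], check=True)
        for dn in sorted(old - new):
            subprocess.run(["site-dpm-unban", dn], check=True)
        NEW_LIST.replace(BAN_CACHE)  # promote the new list to the cache

    if __name__ == "__main__":
        main()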