Jens (chair+mins), Sam, Luke, Marcus, Winnie, Steve, John H, Chris, John B, Robert, David, RobC, Govind
Apologies: Simon (following ATLAS Jamboree)

0. Operational blog posts

We only had three posts for the previous quarter... a reminder to people to blog when they do something interesting.

Operational issue reported by Luke: the DMLite SE is having problems with many simultaneous deletions, resulting in out-of-memory errors.

1. ZFS follow up questions from last week and summary

Additional discussion on the list. This includes discussion of the kernel module: there is a repo for ZFS on Linux, and a CERN one with an older ZFS. Also worth noting that Marcus had no problems with the update on CentOS7. Chris is using the dkms packages, version 0.6.5.8.

Are there any licence issues? Source code can be distributed; a precompiled module may be a problem. With dkms, the disk server will automatically recompile the module when the kernel is upgraded, but it follows that disk servers then need to have compilers on them! Alternatively one would need to do some local juju to build the module and distribute it for the upgrade.

As regards performance, there has been some discussion of the different ways of implementing cache etc., and of how differences between read and write could arise. In parallel reads, cache could be useless for large files - we need to do some testing with multiple accesses.

Who else might be interested in the expertise we develop in GridPP, or potentially in collaborating with us? DESY were previously working with ZFS, and there was interest at CHEPiX. Some previous experiences were less encouraging.

2. Update on non-LHC VOs, specifically (if available) LSST, DiRAC, LIGO, SNO+, NeISS, and CERN@School.

LSST: US groups - new workflow. New work is expected to start at the end of Jan. There will be two steps: first the generation of MC data, then the processing of the data. LSST don't care which sites they use, they will just use "the grid" - we tend to recommend that they use just one site for storage, at least initially.
Would be worth following up with them but they have no one available at the moment.

DiRAC: currently Durham is working but not transferring at the moment (no need), ditto Leicester. Cambridge has two parts, and the HPC part (ie Stuart) is currently debugging its connection to RAL. EPCC testing/debugging is also ongoing. This leaves just DAMTP (ie the other Cambridge site).

LIGO: no news. The most interesting aspect is the intention to test secure CVMFS to distribute data, based on HTTPS. Of course, HTTPS eliminates the possibility of having intermediate caches, because the point-to-point authentication and in-flight encryption would see an intermediate cache as a man-in-the-middle. There'd be the stratum 0 and the WNs; however, BB had intended to invent a WN-local cache via an object store.

CERN@School - who's going to take over (if anyone) after Tom?

3. Status of storage related documents as of morning of 18 Jan.

StoRM - Daniel 2014 (red)
DPM - Wahid 2014 (red)
Performance and Tuning - Wahid 2014 (red)
dCache - Brian 2015 (red)
SRM file loss - Brian 2015 (red)
Suitable hardware for grid SE - Brian 2015 (red)
New VO deployment(?) - Jens 2016 (red)
Grid Storage (overview) - Jens 2016 (July, Amber)

4. AOB

Lukasz Kreczko: (18/01/2017 10:01:04) one operational issue: DMLite SE is going out of memory when many deletions are going on.
Samuel Cadellin Skipsey: (10:01 AM) Via HTTP?
Lukasz Kreczko: (10:02 AM)
    Jan 17 05:34:41 lcgse01 kernel: Out of memory: Kill process 6577 (globus-gridftp-) score 5 or sacrifice child
    Jan 17 05:34:50 lcgse01 kernel: Out of memory: Kill process 11471 (globus-gridftp-) score 5 or sacrifice child
    Jan 17 05:35:50 lcgse01 kernel: Out of memory: Kill process 14584 (httpd.event) score 12 or sacrifice child
Paige Winslowe Lacesso: (10:02 AM) It also oom-killed mysqld & xrootd
David Crooks: (10:03 AM) Sam's trying to talk but his mic isn't working
Samuel Cadellin Skipsey: (10:03 AM) My microphone isn't working, one moment. I don't really have a solution, as I've never seen that happen before. (We've had other services get jammed up, but ATLAS deletions often happen via HTTP, and there's another failure mode / file descriptor leak issue with http/dmlite.)
Lukasz Kreczko: (10:05 AM) dpm.x86_64 1.9.0-1.el6 @epel
Samuel Cadellin Skipsey: (10:06 AM) Huh, so it could be a bug introduced in the 1.9.x branch. Luke - you've not seen it before?
Lukasz Kreczko: (10:06 AM) we've seen the load "explode" before > 400
Samuel Cadellin Skipsey: (10:07 AM) On the head node?
Lukasz Kreczko: (10:07 AM) Winnie? yes
Samuel Cadellin Skipsey: (10:07 AM) huh, that's very odd. Where was the load actually originating from?
Lukasz Kreczko: (10:08 AM) from gridFTP. Deletions == lots of writes to the MySQL database which runs on the SE
Chris Brew: (10:08 AM)
    [root@heplni008 ~]# rpm -qa | grep zfs
    zfs-0.6.5.8-1.el7.centos.x86_64
    zfs-release-1-3.el7.centos.noarch
    libzfs2-0.6.5.8-1.el7.centos.x86_64
    zfs-dkms-0.6.5.8-1.el7.centos.noarch
    [root@heplni008 ~]# rpm -qa | grep spl
    spl-dkms-0.6.5.8-1.el7.centos.noarch
    spl-0.6.5.8-1.el7.centos.x86_64
Lukasz Kreczko: (10:09 AM) load is stable now:
    top - 10:08:02 up 33 days, 18:47, 5 users, load average: 13.25, 32.09, 19.79
    Tasks: 532 total, 2 running, 530 sleeping, 0 stopped, 0 zombie
Marcus Ebert: (10:09 AM) yes, that should be the plain dkms versions.
Would be interesting to see what happens at the next kernel update with it.
Samuel Cadellin Skipsey: (10:09 AM) Hm, that's possibly why we've not seen it - our deletions from ATLAS are often on http, so it's possibly a gridftp-related issue (or one with gridftp/hdfs)
Lukasz Kreczko: (10:10 AM) each gridftp process uses 0.7% of 32 GB, we have 178 such processes
Samuel Cadellin Skipsey: (10:10 AM) So, we've seen a lot of gridftp-related load on pool nodes before, but not on the head. But the config for hdfs and DPM is a bit different for hand-off, isn't it?
Lukasz Kreczko: (10:12 AM) it is. The work should be done on the gateway, which does happen. But deletions seem special. Only 40 gridftp processes on the gateway
Samuel Cadellin Skipsey: (10:14 AM) Hm. Is this CMS deletions, out of interest?
Lukasz Kreczko: (10:14 AM) yes
Samuel Cadellin Skipsey: (10:15 AM) Okay, so I think it's worth pushing this up to dpm-user-support or dpm-dev, as it sounds like a particular load-related bug which I've not seen before.
Lukasz Kreczko: (10:16 AM) OK, preparing an email. Thx
Marcus Ebert: (10:16 AM) https://indico.cern.ch/event/505613/contributions/2230928/attachments/1347288/2041281/oral-final-533.pdf pg 12
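
Note added to the minutes: the memory figures Luke quoted in the chat (each gridftp process using 0.7% of 32 GB, with 178 such processes) can be checked with a quick calculation. The sketch below is only a rough estimate. It assumes the 0.7% is the per-process resident memory share as reported by a tool such as top, and that all processes are the same size; shared pages may be counted more than once, so the real total could be lower. The function and variable names are made up for illustration.

```python
# Figures quoted in the chat. Treating 0.7% as per-process resident
# memory is an assumption, not something stated in the meeting.
TOTAL_RAM_GB = 32
PER_PROCESS_PERCENT = 0.7
PROCESS_COUNT = 178


def estimate_usage_gb(per_process_percent, total_ram_gb, process_count):
    """Return (per-process GB, total GB) for identically sized processes."""
    per_process_gb = per_process_percent / 100 * total_ram_gb
    return per_process_gb, per_process_gb * process_count


per_process_gb, total_gb = estimate_usage_gb(
    PER_PROCESS_PERCENT, TOTAL_RAM_GB, PROCESS_COUNT
)

# Report the estimate and how it compares with the installed RAM.
print(f"per process: {per_process_gb * 1024:.0f} MB")
print(f"all processes: {total_gb:.1f} GB "
      f"({total_gb / TOTAL_RAM_GB:.0%} of {TOTAL_RAM_GB} GB RAM)")
```

Under these assumptions the gridftp processes alone would account for roughly 40 GB, more than the 32 GB installed, which would be consistent with the OOM kills seen on lcgse01.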