Jens (chair+mins), Sam, Luke, Marcus, Winnie, Steve, John H, Chris, John B, Robert, David, RobC, Govind

Apologies: Simon (following ATLAS Jamboree)


0. Operational blog posts

   We only had three posts for the previous quarter... reminder to people to blog when they do something interesting.

   Operational issue reported by Luke: the DMLite SE has problems when many deletions are in progress
   simultaneously, resulting in out-of-memory errors.
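
   To see which services the OOM killer hit, kernel log lines of the form pasted in the AOB chat can be tallied by
   process name.  The following is a minimal sketch only: the sample lines are copied from the chat, while the regular
   expression and the function name count_oom_victims are illustrative assumptions, not part of any DPM tooling.

```python
import re
from collections import Counter

# Sample lines copied from the kernel log pasted in the AOB chat.
LOG_LINES = [
    "Jan 17 05:34:41 lcgse01 kernel: Out of memory: Kill process 6577 (globus-gridftp-) score 5 or sacrifice child",
    "Jan 17 05:34:50 lcgse01 kernel: Out of memory: Kill process 11471 (globus-gridftp-) score 5 or sacrifice child",
    "Jan 17 05:35:50 lcgse01 kernel: Out of memory: Kill process 14584 (httpd.event) score 12 or sacrifice child",
]

# Assumed pattern, written to match the message format shown above.
OOM_RE = re.compile(r"Out of memory: Kill process (\d+) \(([^)]*)\) score (\d+)")


def count_oom_victims(lines):
    """Return a Counter mapping process name -> number of OOM kills."""
    victims = Counter()
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            victims[match.group(2)] += 1
    return victims


if __name__ == "__main__":
    for name, kills in count_oom_victims(LOG_LINES).most_common():
        print(f"{name}: {kills}")
```

   In practice the lines would be read from the system log rather than a hard-coded list.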

1. ZFS follow up questions from last week and summary

   There was additional discussion on the list, including discussion of the kernel module: there's a repo for ZFS on
   Linux and a CERN one with an older ZFS.  Also worth noting that Marcus had no problems with the update on CentOS7.
   Chris is using the dkms packages, version 0.6.5.8.

   Are there any licence issues?  The source code can be distributed, but a precompiled module may be a problem.  The
   disk server will automatically recompile the module when the kernel is upgraded, but it follows that disk servers
   then need to have compilers on them!  Alternatively one would need to do some local juju to build the module and
   distribute it for the upgrade.

   As regards performance, there has been some discussion regarding the different ways of implementing cache etc, and
   how differences for read/write could arise.  In parallel reads, cache could be useless for large files - need to do
   some testing with multiple accesses.
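
   A test with multiple accesses could start from something like the sketch below, which reads the same file with a
   varying number of concurrent readers and reports the aggregate throughput.  This is only an illustration under
   stated assumptions: the helper names, chunk size and reader counts are made up for the example, and the demo file
   is tiny, so it will be served from cache.  A real test would point parallel_read at a file on the ZFS pool that is
   much larger than the available cache.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1024 * 1024  # read in 1 MiB chunks (arbitrary choice)


def read_whole_file(path):
    """Read the file sequentially and return the number of bytes read."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            data = handle.read(CHUNK)
            if not data:
                break
            total += len(data)
    return total


def parallel_read(path, readers):
    """Read `path` with `readers` concurrent threads; return (bytes, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=readers) as pool:
        totals = list(pool.map(read_whole_file, [path] * readers))
    return sum(totals), time.perf_counter() - start


def make_test_file(size_bytes):
    """Create a temporary file of random data and return its path."""
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(os.urandom(size_bytes))
        return handle.name


if __name__ == "__main__":
    path = make_test_file(8 * CHUNK)  # small demo file, not a realistic size
    try:
        for readers in (1, 2, 4, 8):
            nbytes, seconds = parallel_read(path, readers)
            print(f"{readers} readers: {nbytes / seconds / 1e6:.1f} MB/s")
    finally:
        os.remove(path)
```

   Comparing the figures for one reader against several readers, and for reads against writes, would give a first
   indication of where the cache stops helping.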

   Who else might be interested in the expertise we develop in GridPP, or potentially in collaborating with us?
   DESY were previously working with ZFS, and there was interest at CHEPiX.  Some previous experiences were less
   encouraging.

2. Update on non-LHC VOs, specifically (if available) LSST, DiRAC, LIGO, SNO+, NeISS, and CERN@School.

   LSST: US groups - new workflow.  New work expected to start end of Jan.  There will be two steps, first the generation of
   MC data, then the processing of the data.  LSST don't care which sites they use, they will just use "the grid" - we tend
   to recommend that they use just one site for storage, at least initially.  Would be worth following up with them but they
   have no one available at the moment.

   DiRAC: Durham is working but not transferring at the moment (no need), ditto Leicester.
   Cambridge has two parts, and the HPC part (ie Stuart) is currently debugging the connection to RAL.
   EPCC testing/debugging is also ongoing.  This leaves just DAMTP (ie the other Cambridge site).

   LIGO: no news.  Most interesting aspect is the intention to test secure CVMFS to distribute data, based on HTTPS.
   Of course, HTTPS eliminates the possibility of having intermediate caches because the point-to-point authentication
   and in-flight encryption would see an intermediate cache as a man-in-the-middle.  There'd be the stratum 0 and the WNs;
   however, BB had intended to invent a WN-local cache via an object store.

   CERN@School - who's going to take over (if anyone) after Tom?


3. Status of storage related documents as of morning of 18 Jan.

	StoRM - Daniel 2014 (red)
	DPM - Wahid 2014 (red)
	Performance and Tuning - Wahid 2014 (red)
	dCache - Brian 2015 (red)
	SRM file loss - Brian 2015 (red)
	Suitable hardware for grid SE - Brian 2015 (red)
	New VO deployment(?) - Jens 2016 (red)
	Grid Storage (overview) - Jens 2016 (July, Amber)

4. AOB

Lukasz Kreczko: (18/01/2017 10:01:04)
one operational issue:
DMLite SE is going out of memory when many deletions are going on.
Samuel Cadellin Skipsey: (10:01 AM)
Via HTTP?
Lukasz Kreczko: (10:02 AM)
Jan 17 05:34:41 lcgse01 kernel: Out of memory: Kill process 6577 (globus-gridftp-) score 5 or sacrifice child
Jan 17 05:34:50 lcgse01 kernel: Out of memory: Kill process 11471 (globus-gridftp-) score 5 or sacrifice child
Jan 17 05:35:50 lcgse01 kernel: Out of memory: Kill process 14584 (httpd.event) score 12 or sacrifice child
Paige Winslowe Lacesso: (10:02 AM)
It also oom-killed mysqld & xrootd
David Crooks: (10:03 AM)
Sam's trying to talk but his mic isn't working
Samuel Cadellin Skipsey: (10:03 AM)
My microphone isn't working, one moment
I don't really have a solution, as I've never seen that happen before.
(We've had other services get jammed up, but ATLAS deletions often happen via HTTP, and there's another failure mode / file descriptor leak issue with http/dmlite.)
Lukasz Kreczko: (10:05 AM)
dpm.x86_64 1.9.0-1.el6 @epel
Samuel Cadellin Skipsey: (10:06 AM)
Huh, so it could be a bug introduced in the 1.9.x branch, Luke - you've not seen it before?
Lukasz Kreczko: (10:06 AM)
we've seen the load "explode" before
> 400
Samuel Cadellin Skipsey: (10:07 AM)
On the head node?
Lukasz Kreczko: (10:07 AM)
Winnie?
yes
Samuel Cadellin Skipsey: (10:07 AM)
huh, that's very odd. Where was the load actually originating from?
Lukasz Kreczko: (10:08 AM)
from gridFTP. Deletions == lots of writes to the MySQL database, which runs on the SE
Chris Brew: (10:08 AM)
[root@heplni008 ~]# rpm -qa | grep zfs
zfs-0.6.5.8-1.el7.centos.x86_64
zfs-release-1-3.el7.centos.noarch
libzfs2-0.6.5.8-1.el7.centos.x86_64
zfs-dkms-0.6.5.8-1.el7.centos.noarch
[root@heplni008 ~]# rpm -qa | grep spl
spl-dkms-0.6.5.8-1.el7.centos.noarch
spl-0.6.5.8-1.el7.centos.x86_64

Lukasz Kreczko: (10:09 AM)
load is stable now:
top - 10:08:02 up 33 days, 18:47, 5 users, load average: 13.25, 32.09, 19.79
Tasks: 532 total, 2 running, 530 sleeping, 0 stopped, 0 zombie
Marcus Ebert: (10:09 AM)
yes, that should be the plain dkms versions. Would be interesting to see what happens at the next kernel update with it
Samuel Cadellin Skipsey: (10:09 AM)
Hm, that's possibly why we've not seen it - our deletions from ATLAS often are on http, so it's possibly a gridftp-related issue (or one with gridftp/hdfs)
Lukasz Kreczko: (10:10 AM)
each gridftp process uses 0.7% of 32 GB, we have 178 such processes
Samuel Cadellin Skipsey: (10:10 AM)
So, we've seen a lot of gridftp-related-load on pool nodes before, but not on the head.
But the config for hdfs and DPM is a bit different for hand-off, isn't it?
Lukasz Kreczko: (10:12 AM)
it is. The work should be done on the gateway, which does happen. But deletions seem special
only 40 gridftp processes on the gateway
Samuel Cadellin Skipsey: (10:14 AM)
Hm. Is this CMS deletions, out of interest?
Lukasz Kreczko: (10:14 AM)
yes
Samuel Cadellin Skipsey: (10:15 AM)
Okay, so I think it's worth pushing this up to dpm-user-support or dpm-dev, as it sounds like a particular load-related bug which I've not seen before.
Lukasz Kreczko: (10:16 AM)
OK, preparing an email
thx
Marcus Ebert: (10:16 AM)
https://indico.cern.ch/event/505613/contributions/2230928/attachments/1347288/2041281/oral-final-533.pdf
pg 12
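
As a rough check of the figures Luke quotes in the chat (each gridftp process at 0.7% of 32 GB, 178 such processes), the aggregate can be worked out as below. This is a back-of-the-envelope sketch: the function name is made up for the example, and the sum is an upper bound at best, since the per-process memory percentage reported by top counts resident memory including shared pages.

```python
TOTAL_RAM_GB = 32.0        # from the chat: "0.7% of 32 GB"
PER_PROCESS_PERCENT = 0.7  # memory share per gridftp process, as quoted
PROCESS_COUNT = 178        # number of gridftp processes, as quoted


def aggregate_memory_gb(total_ram_gb, percent_each, count):
    """Return (GB per process, GB for all processes together)."""
    per_process_gb = total_ram_gb * percent_each / 100.0
    return per_process_gb, per_process_gb * count


if __name__ == "__main__":
    per_proc, total = aggregate_memory_gb(
        TOTAL_RAM_GB, PER_PROCESS_PERCENT, PROCESS_COUNT
    )
    print(f"per process: {per_proc:.3f} GB")
    print(f"all processes: {total:.1f} GB "
          f"({100.0 * total / TOTAL_RAM_GB:.1f}% of RAM)")
```

On these numbers the gridftp processes alone would add up to roughly 40 GB, more than the 32 GB installed, which is consistent with the OOM killer being triggered.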