Attending: Brian, Jens, John B, John H, Winnie, Sam, Wahid, Gareth, Elena,
Raja, Matt D, Chris B, David, Steve, Raul, Govind, Ewan

Apologies: Tom


0. Operational blog posts as usual

   Discussion on list about gsoap errors reported in transfers - also seen in
   CASTOR, although they may not be the same errors.  They don't seem to be
   associated with failed transfers, though.  Sometimes seen for a particular
   VO; may be associated with FTS?  Also seen at Lancaster.

1. There are some potentially interesting things in next week's pre-GDB
    http://indico.cern.ch/event/319819/

2. Round table (postponed from last week) of storage related things.

   Edinburgh: puppet on hold, but intend to get back to it.  Also need to look
   at interfaces and namespace/space reporting.

   Liverpool: "if it ain't broke, don't fix it."  WebDAV died, however, and
   refused to restart.  Could it be caused by an HTTP update that failed to
   apply correctly?

   Glasgow: CEPH plugins for GFAL2 testing imminent (could turn into a blog
   post but also need to turn into a CHEP talk).  Also work on Puppet with
   Edinburgh.

   Brunel: about to upgrade to 1.8.9.  Sent question to list: tracing files is
   much better supported in 1.8.9.  Fabricio is mining log data.

   LHCb: nothing immediate, but LHCb will need to start some testing.

   Bristol: everything OK.

   Lancaster: there was a problem with the DNS: a change of DNS server was not
   picked up by DPM, but there shouldn't be any special caching in DPM itself.
   Maybe worth restarting something.  Several people had reported similar
   problems.  Also some systems still on SL5.

   Cambridge: Also on 1.8.8; ready to move to 1.8.9 but no specific timeframe.

   RHUL: Nothing to report.  Still on 1.8.8.  Priority is getting services
   outside of firewall.

   Oxford: deliberately remaining on stable 1.8.8.  Growing backlog of stuff,
   including SL5 and YAIM, so at some point will save data and reinstall to
   clear all the issues.  Maybe a few days of downtime, not hugely concerned
   with scheduling otherwise.

   Sheffield: disk servers on 1.8.8.  Some badly behaving disk servers were
   taken offline, and some had to be downgraded.  Are there any instructions
   on the upgrade to 1.8.9?  We used to have some general installation and
   upgrade experiences recorded in the wiki.  We could try to resurrect this,
   but for now it's probably simplest and quickest for Elena to ask on the
   list when particular problems pop up.

   RALPP: Upgraded to dCache 2.10 last week.  Mostly went smoothly, but the
   SRM didn't start, possibly due to database oddities, duplicate entries.
   All storage nodes on SL6 - used the new servers to park the data while
   upgrading the old ones.  Got it all on puppet; expecting to upload puppet
   recipes to github.  Interested in whether CEPH can provide a backend to
   dCache?

   RAL (Brian): investigating filesystem draining rates and a problem with
   GFAL for SNO+.  A GFAL test will become critical, so it would be
   advantageous for it to work.  Also looking at empty directories in CASTOR.
   A namespace dump takes 10 days.

3. AOB

Next week!  An audience with LIGO.  Oops, except I forgot I will
probably be at cloud expo. Hm. Maybe you can have a chat with LIGO
without me.

LIGO will be using the T1 and SouthGrid initially (although I saw some
messages on tb-support which seemed to suggest NorthGrid?)


wahid: (04/03/2015 10:06:21)
ssl.conf file in http
it dies a lot anyway
1.8.8 version for sure and I think still with 1.8.9
not restarting - there is the ssl.conf issue - but not aware of another issue
so if it happens again the error or something would be interesting to know
Matt Doidge: (10:11 AM)
There's still a lot of crappy noise in the 1.8.9 logs
John Bland: (10:11 AM)
the error seems to be something about "Error string not specified yet: mod_gridsite: mod_ssl_with_insecure_reneg = 1"
Jens Jensen: (10:11 AM)
RAL is using elasticsearch to search and summarise the CASTOR logs
John Bland: (10:11 AM)
which may well be the ssl thing, looks like webdav was down for longer than I thought
Matt Doidge: (10:12 AM)
The standard logrotate for dmlite doesn't compress
John Hill: (10:12 AM)
John - That error message looks familiar
I stopped the problem recurring by creating an empty ssl.conf
wahid: (10:13 AM)
so the one thing is putting
cat /etc/dmlite.conf
LoadPlugin plugin_config /usr/lib64/dmlite/plugin_config.so
LogLevel 1
Include /etc/dmlite.conf.d/*.conf
Paige Winslowe Lacesso: (10:13 AM)
no update, all is well!
wahid: (10:13 AM)
the other is in /etc/rsyslog.conf
$IncludeConfig /etc/rsyslog.d/*.conf
John Bland: (10:14 AM)
john: yes, the messages sound like your ssl problem, I just discounted it at first as I was sure I'd tested webdav recently but it must have been something else. We have no ssl.conf after yaiming.
wahid: (10:14 AM)
then I think they provide a file 
/etc/rsyslog.d/20-log-dmlite.conf
## send all the dmlite log message to dmlite.log
$RepeatedMsgReduction off # log every message
:msg,contains," dmlite " -/var/log/dmlite/dmlite.log
$RepeatedMsgReduction on
I think matt is correct though - this isn't being log rotated - so need to add this conf too
John Hill: (10:16 AM)
yaim renames ssl.conf to ssl.conf.yaimsave
John Bland: (10:17 AM)
john: yep, that's what we've got
raul: (10:17 AM)
wahid: thanks. I might email you for help.
John Hill: (10:18 AM)
Which means on the next http update a new ssl.conf is created
wahid: (10:18 AM)
people might start paying more attention once there's a 'task force'
Ewan Mac Mahon: (10:18 AM)
I think that's normal. We can hear you.
Possibly not the other way around though,
wahid: (10:18 AM)
(http task force that was) 
John Bland: (10:19 AM)
john: yes, I might add a local puppet rule for the headnode to ensure ssl.conf is absent
John Hill: (10:20 AM)
It also applies to the pool nodes - I've had the same issue there
wahid: (10:22 AM)
matt on the logrotate 
of dmlite log
do you have a /etc/logrotate.d/dmlite
mine seems to have strange content
John Bland: (10:24 AM)
john: yes, our pool nodes saw the same as well, but they were reyaimed a while ago so we never noticed
The moral of the story is we/wlcg need to monitor webdav better.
wahid: (10:27 AM)
hi - it was this wiki where we were writing general upgrade experiences
https://www.gridpp.ac.uk/wiki/DPMUpgradeTips
anyone is welcome to add more recent ones
it is currently useless for 1.8.9
Chris Brew: (10:28 AM)
Oh, we also got NFS4.1 working so we have file:// access as well
wahid: (10:30 AM)
there is no particular issue
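
Note on the ssl.conf discussion above: yaim renames ssl.conf to
ssl.conf.yaimsave, a later http update can recreate ssl.conf, and the
workarounds mentioned were an empty ssl.conf or a puppet rule keeping it
absent.  A minimal sketch of a check for this, which could be run on head and
pool nodes, might look as follows.  The directory /etc/httpd/conf.d and the
function name ssl_conf_state are assumptions for illustration; the chat does
not give the path.

```python
import os

# Assumed location of the web server config snippets; not stated in the chat.
DEFAULT_CONF_DIR = "/etc/httpd/conf.d"


def ssl_conf_state(conf_dir=DEFAULT_CONF_DIR):
    """Classify ssl.conf in conf_dir as 'absent', 'empty' or 'populated'.

    'absent' matches the state left by yaim (ssl.conf renamed away),
    'empty' matches the empty-file workaround, and 'populated' is the
    state that would be worth investigating, e.g. after a package update
    has recreated the file.
    """
    path = os.path.join(conf_dir, "ssl.conf")
    if not os.path.exists(path):
        return "absent"
    if os.path.getsize(path) == 0:
        return "empty"
    return "populated"


if __name__ == "__main__":
    state = ssl_conf_state()
    # Also report whether the copy renamed by yaim is still around.
    saved = os.path.exists(os.path.join(DEFAULT_CONF_DIR, "ssl.conf.yaimsave"))
    print("ssl.conf:", state, "| ssl.conf.yaimsave present:", saved)
```

A 'populated' result is only a hint that the file has come back; whether it
actually breaks WebDAV on a given node would still need checking in the logs.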