Storage Minutes, 13 April 2011

Brian - ATLAS PRODDISK has been filling up because T1 was missing a release. Brian is looking at formalising the ATLAS quota size. This is also a use case for CVMFS (in that, at the least, all sites would then have the same software).

Hopefully we'll see Sonar test improvements, due to the increase in parallel streams on STAR-T2 channels (which are what Sonar uses for most of the T2s).

Wahid raised the issue of FTS tuning for startup time and async latency. Brian - we've shown before that if an SE can run synchronously, that improves performance. SEs aren't synchronous because they have to accommodate tape latency in T1 use cases - but this isn't an issue at T2s. Even in asynchronous mode, there are potential gains to be made by decreasing the polling delay between requesting a transfer and checking whether it has succeeded. This is a potential issue for sites (say, QMUL) with only one gridftp server handling all their requests.

QMUL have a problem with their new router - it is giving a gigabit simplex, not duplex. Also, it can take a considerable time for the checksum to be calculated at the end of a transfer, which can cause FTS to time out. Chris asked that verify-checksum mode be turned on for QMUL FTS ATLAS transfers - it seems to be off at the moment (probably because StoRM was a special case for some time).

Metrics discussion (designed to test the effectiveness of the Storage group, not individual sites). Some tweaks:

Publication - extend to "publications of similar quality".
Blogging - extend to other outreach activities.

There was a long discussion/argument about what kind of metrics we want (and whether we should overlap with the site performance metrics). Ewan (and Brian) wanted process-related metrics (i.e. reward working on the DPM toolkit, as we approve of that). Wahid and Chris wanted more outcome-related metrics (i.e. reward us for getting the UK sites to have good overall storage performance).

Some sample statements:

Chris: I agree with Wahid.
Step back - what's the point of the storage group? It's to make sure that we can deal with the demands on the storage! Tools, monitoring of the storage, and consistency checks are things we should do more of; the RAL tests were a good thing. How do you make a metric out of "solving problems when we discover them"?

Ewan: you don't want a metric of "bugs fixed", as that assumes we'll have bugs to fix. The problem with the storage performance stuff is that it's already folded into the site performance metrics. The group metrics need to be more group-based.

Wahid: but if you consider storage availability across the UK, then that's a measure of our ability to get things to work.

Brian: obviously, if there are metrics that have already been calculated, trying to ensure those metrics are met can itself be a metric ("a metric about getting sites to meet the site metrics").

Ewan: there's a philosophical divide on what we should be metricising: group activities vs outcomes.

Brian: some of the metrics are imposed on us (they are interested in both aspects!), and the fewer we come up with ourselves, the more we may have forced upon us.

The eventual decision went as far as: UK Storage Performance metrics (hard numbers), plus fluffier items (things we think we want to be doing), where the outcome is more judgement-based (did this work?, not "number of releases" etc.). It was also agreed that metrics should encourage support for small VOs.

In general, we deferred the rest of the discussion until a meeting with Jens present.

Sam