Hi Jens,

Things that happened at this meeting:


1) Blog post discussion - Brian and I both have posts in development, but neither will be finished this quarter.

2) Main event: George's Ceph @ RAL update. The benchmarking results are interesting: they seem to suggest that the EC (erasure-coded) config is quite capable of breaking Ceph if the stripe width is a large fraction of the available nodes/OSDs.

3) AOB

Discussion of the need for Iliya's ATLAS FAX tests to have xrootd4.2.x redirectors on SE heads. This is an issue for the UK, as most of our DPM heads are still on xrootd3.x: it's the baseline release, and we had no particular reason to risk disrupting our availability. We discussed how to deal with this.

* Action - Sam: look at upgrading the Glasgow DPM head to xrootd4 and document this.

Also a discussion, relevant to the t2evo group, about xrootd redirection. Ewan noted that at the moment the FAX/AAA systems are vulnerable to a head node being completely down, as they need its redirector to be up to push them up to the regional redirector. Should we default to talking to a less local node of the redirector tree to get more resilience?

In the context of t2evo, this is actually close to Sam's initial sketch solution for diskless T2s (albeit with the idea of a "regional T2 redirector" being added for this purpose). There are open questions about the metadata / request load on the redirector service which need to be resolved.
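
For reference, a rough sketch of the arithmetic behind item 2 and the 16+3 vs 8+2 question in the chat log below. It is not taken from George's slides: the function name `ec_profile` and the 30-host cluster size are made up for illustration, and it assumes each chunk of an object lands on a different host.

```python
from fractions import Fraction


def ec_profile(k, m, hosts):
    """Summarise a k+m erasure-coding pattern on a cluster of `hosts` hosts.

    Assumption (not from the meeting): one chunk per host, so the pattern
    needs at least k+m hosts before it can be placed at all.
    """
    width = k + m  # chunks written per object (data + coding)
    if hosts < width:
        raise ValueError(f"{k}+{m} needs at least {width} hosts, got {hosts}")
    return {
        "width": width,
        # Extra raw space per unit of user data, e.g. 3/16 for 16+3.
        "overhead": Fraction(m, k),
        # Share of the cluster that every single object touches.
        "hosts_per_object": Fraction(width, hosts),
    }


if __name__ == "__main__":
    HOSTS = 30  # hypothetical cluster size, for illustration only
    for k, m in [(16, 3), (8, 2), (4, 1)]:
        p = ec_profile(k, m, HOSTS)
        print(
            f"{k}+{m}: overhead {float(p['overhead']):.1%}, "
            f"each object touches {p['width']} of {HOSTS} hosts"
        )
```

On that made-up 30-host cluster, 16+3 touches 19 hosts per object for roughly 19% overhead, while 8+2 touches 10 hosts for 25%. The trade-off is the one raised in the meeting: a lower overhead needs a wider stripe, and a wider stripe needs more nodes before it is a small fraction of the cluster.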


Chat log:

-------

Samuel Cadellin Skipsey: (30/09/2015 09:56)

hi chaps, will be starting in about 5 mins

Ewan Mac Mahon: (10:04 AM)

Post a trailer.

Gareth Douglas Roy: (10:04 AM)

In a world....

Matt Doidge: (10:05 AM)

Careful - someone might start expecting us to do podcasts.

Samuel Cadellin Skipsey: (10:16 AM)

(The latest Coverity scans on the Ceph git head do show some memory leaks, but I'm not sure if they're relevant (didn't check where) )

Ewan Mac Mahon: (10:18 AM)

The choice of 16+3 is interesting; other than obviously changing the overhead, is there any particular reason not to go for something smaller (e.g. 8+2)?

This way seems to have the effect of requiring a high minimum number of nodes to be sane.

And hence also a minimum price for a viable system.

Samuel Cadellin Skipsey: (10:19 AM)

Yeah, I do wonder if the half-cluster test is partly problematic because almost all the nodes are involved with every file.

Ewan Mac Mahon: (10:24 AM)

That does appear to be the/a conclusion here. That it should be possible to work out a sensible minimum number of nodes for a given erasure coding pattern.

And that essentially rules out small erasure coded ceph systems for quite large values of 'small'.

Samuel Cadellin Skipsey: (10:26 AM)

well, at least, for small overheads

I mean, RAID5 style x+1 ec is probably accessible to most ceph installs.

Ewan Mac Mahon: (10:31 AM)

It would be interesting to try the half cluster test with a pattern like that and see what actually happens if we could - i.e. try something in a more T2-esque scale.

Can someone blog this?

/s

*sniff*

I don't wanna update.

Matt Doidge: (10:39 AM)

who wants to go first?

Paige Winslowe Lacesso: (10:39 AM)

Sam, what version is wanted? We have 4.1.1-1

That's on the gridftp servers, the dm-lite se has 4.2.2-1

Ewan Mac Mahon: (10:49 AM)

la la la la la, can't hear you.