Present: Jens (chair+mins), Duncan, Luke, Matt, RobC, Sam, Teng, Dan,
Winnie, Brian, Steve, Govind

0. Operational blog posts and baseline versions

   New blog post from Rob!

   No operational issues reported.

1. DPM workshop has been announced: https://indico.cern.ch/event/699602/

   Sam was planning on going, but it should not necessarily stop
   others from going: if you work with the DPM developers, it is
   useful to meet them in person at least once to build a
   <french>rapport</french> with them.  We should also have a think
   about what we can present - other than a generic "UK status".


2. Agenda for GridPP40: we have a goodish section to cover data topics,
so let's make sure we have some interesting stuff to say!  (In fact
having a précis of the smorgasbord of goodnessy goodness for someone
like Pete C to take to the UKT0 meeting in March some time might be useful?)

The agenda slot is two hours, so we could split it into six slots of
20 mins as an initial proposal; if there is something controversial we
could have a panel discussion but the discussion today did not support
that.  Each slot could be done by one person but it would be good to
do double or triple acts so more people get to say something,
particularly if we can stick to the time.

Current proposals for topics are:

   * Current status and evolutions of T2s, particularly if we can get
     an update from Bham.
   * Hadoop - Bristol has one, Dan's been looking at Hadoop for
     Lustre, and James has/had one at T1.
   * xroot proxy cache
   * Data lakes (see also next topic), real and imagined.  Storage for
     ML? or infrastructure for ML.  CVMFS for experiment data?
   * Hybrid infrastructures - BYOC, EUDAT/EGI(?), DiRAC, iRODS, CDMI
   * Learning from non-LHC (e.g. climate) and supporting non-LHC
     (i.e. what can they learn from us, e.g. SKA, UKT0?)

Dan mentioned that iRODS is not currently widely used in physics but
there are examples such as HyperK's JP/UK link.  Also EUDAT uses iRODS
as a storage node for one of its services (= B2SAFE) with GridFTP as an
interface (= B2STAGE).  But there is now iRODS for Lustre... and there
is a StoRM plugin for CDMI.

There doesn't seem to be a dedicated T1 slot on the agenda yet, so we
should incorporate T1 contributions in the above.  Conversely, we
could also contribute to other topics elsewhere on the agenda,
e.g. storage accounting or networking.


3. Technology watch vs buzzword watch. We regularly talk about "UKT0",
i.e. GridPP as a DTZ (data transfer zone), and more recently "Data
Lake". Now CERN has a "Data Lake" proposal; I have extended my demo
that you caught a brief glimpse of at hepsysman. The PMB tells us to
get engaged more, I think. If Pete G joins, he can tell us more (the
minutes of the PMBs are sent around but only much later...)

   * Andrew suggests the PMB want us to do more tracking of peripheral
     technologies, like the stuff we regularly talk about as
     potentially worth exploring but rarely have time to actually do
     something with (see topics above.)
   * CERN proposal for "data lake" supporting SKA.  The initial document
     calls for participation, but doesn't seem radically different from
     what we are currently doing with linking data centres (e.g. T2s)
     through federated storage (either at the protocol level, like
     dynafed, or linked as in the NorduGrid dCache.)
   * The CERN proposal covers linking data centres (in a performant way),
     prototyping and understanding different underlying storage
     platforms, and investigating hierarchical management.  Brian
     points out (see chatlog) there are some thoughts on the analysis
     side as well.  For the storage, however, CERN plan to use EOS, so
     for GridPP to be involved, our (limited) experiences with EOS may
     be useful.  We should follow up.
   * The PMB wishes GridPP to be involved, possibly in part because the
     proposal is to support SKA, and supporting SKA is a Good
     Thing(tm) because (a) they benefit from our experiences, (b) big
     data projects should work together, and (c) they will be doing
     some cool science.  Brian has been volunteered with some 10-20%
     of his time.

4. AOB

   NOB


... thanks to Matt for saving the chatlog; due to a pasteo the minuter
managed to lose it.


jens: (07/02/2018 10:03)
https://indico.cern.ch/event/699602/
https://indico.cern.ch/event/684659/timetable/
Daniel Peter Traynor: (10:08 AM)
interesting projects we could look at at QMUL in time for GridPP: hadoop
for lustre https://github.com/intel-hpdd/lustre-connector-for-hadoop ;
irods for lustre https://github.com/irods-contrib/irods_tools_lustre ;
with storm plugin for CDMI https://www.snia.org/cdmi
http://italiangrid.github.io/storm/documentation/sysadmin-guide/1.11.12/#cdmistorm
Duncan Rand: (10:18 AM)
NDGF
Brian: (10:20 AM)
https://indico.cern.ch/event/656157/contributions/2866315/attachments/1594891/2525563/Datalakes_EOS_Workshop_Feb2018.pdf
Lukasz Kreczko: (10:21 AM)
More on Data Lake:
https://indico.cern.ch/event/702591/contributions/2881715/attachments/1595179/2526156/DOMA-post-CWP.pdf
Brian: (10:22 AM)
such as data transfer zones
Duncan Rand: (10:22 AM)
https://news.aarnet.edu.au/cerns-new-data-services-for-science-and-collaborations-with-aarnet/
Ste Jones: (10:23 AM)
A data lake is one step up from a data puddle, but smaller than a data sea.
Samuel Cadellin Skipsey: (10:24 AM)
It turns out that, genuinely, Ste, there's a term "Data Swamp" for a 
Data Lake with lots of stale data
Lukasz Kreczko: (10:33 AM)
and "Big CVMFS"