Last reviewed and polished 24 Oct 2006.

Overview of T2 storage

* Only disk storage (no tape) is required at Tier 2s.
* The quality of disk is classified (in the new SRM terminology) as:
  - "Replica" (the user is expected to be able to recover the data easily from another site),
  - "Output" (data that would be expensive to recreate, e.g. job output),
  - "Custodial" (data that cannot be recreated and will not be lost unless a meteor hits the building).
* Different VOs require different CPU/storage ratios.
* Different sites support different VOs. Most sites support most or all GridPP VOs with one or more smaller allocations, and a main VO with most of their resources.
* All sites are expected to provide CPU and storage in a specific ratio, where the ratio depends on the VO (and possibly on time).
* As of Q2 2006, we understood these ratios were required:
  - Atlas 2:1
  - CMS 3:1
  - LHCb 333:1
  Here CPU is measured in KSi2K (kilo-SPECint2000) and storage in terabytes (10^12 bytes, we assume). Thus a site with a 600 KSi2K cluster dedicated wholly to CMS must provide 200 TB of storage.

Q: Are these ratios still valid?
Q: Will these ratios change? (We expect LHCb's will change at some point.)
Q: What quality of service do experiments expect?
  - Uptime we know ("95%").
  - Guaranteed allocations we know ("yes for the main VOs").
  - One or more of custodial, output, replica?
  - Other QoS factors?
  - Performance targets.

Currently the only way to guarantee that space is available to a VO is to allocate physical storage resources (e.g. a disk partition) to the VO and not share them with other VOs. Storage middleware does not support quotas.

Site recommendations:

* Sites should allocate the "obvious" amount of dedicated storage for the main VO (or main VOs) in which they are active (i.e. apply each VO's CPU/storage ratio to the fraction of the farm allocated to that VO) and allocate extra shared storage for the other GridPP VOs. Example of what "obvious" means, just in case it isn't obvious: suppose a site is 20% Atlas and 80% CMS, and the ratio is 2:1 for Atlas and 3:1 for CMS. Then if the farm has 600 KSi2K, the site needs to provide 600 * 0.5 * 0.2 + 600 * (1/3) * 0.8 = 60 + 160 = 220 TB of storage (a small sketch of this calculation is given below). This raises the question of how to provide storage for the remaining VOs that all sites support. In the absence of requirements saying otherwise, it seems sensible to provide a small percentage for the rest - say 2% or 5%. Example: Lancaster has 80% allocated to Atlas and 20% to everybody else (which includes Atlas). Example: Bristol has bought 1:1 CPU/storage hardware by cost, but high-end storage (GPFS too). They get 600 KSi2K and 150 TB, so 4:1, and are not so concerned about absolute numbers. Bristol is a CMS site, so it should really provide 3:1 (200 TB, assuming 100% CMS).
* When buying hardware, sites should aim to meet the CPU/storage ratio rather than the absolute numbers from planning documents and MoUs. Sites should share their purchasing and operational experiences with the rest of the storage group (but bear in mind that the storage mailing list is world readable), along with performance and optimisation hints.
* Performance targets - sites need to optimise their SEs and hardware.
* Keep the timescale in mind. For example, if LHCb requires almost nothing today but a lot next year, then sites supporting LHCb should take that into account.
* Sites should try to guarantee allocations for their major VO(s): ideally by physically dedicating partitions to them, but in general by allocating pools or pool groups (dCache) to them, not shared with other VOs.
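The worked ratio example above can be expressed as a minimal Python sketch. This is a cross-check of the arithmetic only, not a planning tool: the ratios are the Q2 2006 figures quoted earlier, and the function name, VO shares and farm size are illustrative.

    # Q2 2006 understanding: kSI2K of CPU per TB of disk, per VO.
    CPU_PER_TB = {"atlas": 2.0, "cms": 3.0, "lhcb": 333.0}

    def storage_target_tb(total_ksi2k, vo_shares):
        """Disk (TB) implied by the CPU/storage ratios.

        vo_shares maps VO name -> fraction of the farm allocated to that VO.
        """
        return sum(total_ksi2k * share / CPU_PER_TB[vo]
                   for vo, share in vo_shares.items())

    # Worked example from the text: 600 kSI2K farm, 20% Atlas and 80% CMS.
    print(storage_target_tb(600, {"atlas": 0.2, "cms": 0.8}))   # -> 220.0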
* What is the best way to guarantee (physical) storage allocations?

Probably the best approach is to allocate large physical storage units to the main VO(s), but it might also be useful to have smaller physical units that can be reallocated, e.g. to support new VOs or to give VOs more guaranteed space temporarily. This would require middleware that can drain pools or transparently reallocate the storage, and even then it is a rather inefficient use of storage. (The FNAL production dCache operates with ~300 pools and hundreds of TB of disk.) Quotas are not likely to be implemented on a realistic timescale, and SRM spaces will not be supported by all SRMs, but high-level static SRM spaces (set up by the sysadmin) may be a reasonable way to provide allocations for VOs.

Big (mostly open) questions:

* How much are sites short of delivering the expected storage capacity for their main VO(s)?
* What quality of disk do they need to provide? Is it as simple as custodial/output/replica (mapped to hardware in some suitable way)?
* How can sites best enable storage on WNs? The storage group does not currently support distributed filesystems, but sites have been talking about Lustre, GPFS, ...
* What is the largest capacity that DPM can realistically manage?

More things to ponder:

* The NGS has a different model: two core sites provide extra storage, and the other two provide extra CPU. GridPP does not use that model: all sites must provide both storage and CPU.
* Pooling storage between Tier 2s is probably not a good idea (pool communication is usually insecure).
* There may be a risk that a site, aiming to meet an N TB target, will purchase cheap disk - some VOs will need higher-quality storage at Tier 2s as well (e.g. for databases).
* Are many sites finding their disk underused?
* Is it a problem that we are not meeting the absolute numbers (which are quite ambitious)?
* The cost of storage may include staff, space, power and air conditioning. Creating multiple copies of files, for resiliency or for parallel reading streams, means the physical capacity has to be N times larger.

Other comments:

* Sites should publish their quality of service according to best practice (the GLUE schema).
* SE implementations should publish available and used space per VO. For VOs sharing space, that space is potentially published twice (unless one cheats and publishes a fraction). Version 1.3 of the GLUE schema was meant to remedy this, but the subject of available and used space is still a hot potato: both the definition and the usefulness of "available" and "used" are under heated debate in the SRM protocol group.
* Sites should optimise their installations according to the recommendations from the storage group.
* The HEPiX storage task force is investigating prices and technologies.
* Middleware does not support quotas. We need to investigate middleware support for guaranteed reservations (spaces in SRM 2.2). A test case was proposed for S2 (the SRM 2.2 test client), but we do not know whether it was built. There are no guaranteed reservations in SRM 1.x.
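On the point above about publishing available and used space per VO: the following is a hypothetical sketch, not a supported tool, of how a site could inspect what it currently publishes by querying a site BDII for GLUE 1.x GlueSA records with python-ldap. The host name and VO are placeholders, and the units of the published values should be checked against the GLUE schema documentation rather than assumed.

    # Hypothetical helper: list the GlueSA records published for one VO and
    # print the used/available space values as published (units per the GLUE
    # schema definition of GlueSAStateUsedSpace/GlueSAStateAvailableSpace).
    import ldap  # python-ldap

    BDII_URI = "ldap://site-bdii.example.ac.uk:2170"   # placeholder site BDII
    VO = "atlas"                                       # placeholder VO

    conn = ldap.initialize(BDII_URI)
    conn.protocol_version = ldap.VERSION3
    conn.simple_bind_s("", "")                         # anonymous bind

    results = conn.search_s(
        "o=grid",
        ldap.SCOPE_SUBTREE,
        "(&(objectClass=GlueSA)(GlueSAAccessControlBaseRule=*%s*))" % VO,
        ["GlueChunkKey", "GlueSAStateUsedSpace", "GlueSAStateAvailableSpace"],
    )

    def first(attrs, name):
        """Return the first value of an attribute, decoded, or '?' if absent."""
        return attrs.get(name, [b"?"])[0].decode()

    for dn, attrs in results:
        print(dn)
        print("  SE:        ", first(attrs, "GlueChunkKey"))
        print("  used:      ", first(attrs, "GlueSAStateUsedSpace"))
        print("  available: ", first(attrs, "GlueSAStateAvailableSpace"))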