Merge "Update Global EC docs with reference to composite rings"

Jenkins 2017-06-13 06:26:54 +00:00 committed by Gerrit Code Review
commit 41c8f1330f
3 changed files with 129 additions and 70 deletions


@ -92,7 +92,7 @@ advantage of many well-known C libraries such as:
* Or write your own! * Or write your own!
PyECLib uses a C based library called liberasurecode to implement the plug in PyECLib uses a C based library called liberasurecode to implement the plug in
infrastructure; liberasure code is available at: infrastructure; liberasurecode is available at:
* liberasurecode: https://github.com/openstack/liberasurecode * liberasurecode: https://github.com/openstack/liberasurecode
@ -179,7 +179,7 @@ Performance Considerations
In general, EC has different performance characteristics than replicated data. In general, EC has different performance characteristics than replicated data.
EC requires substantially more CPU to read and write data, and is more suited EC requires substantially more CPU to read and write data, and is more suited
for larger objects that are not frequently accessed (eg backups). for larger objects that are not frequently accessed (e.g. backups).
Operators are encouraged to characterize the performance of various EC schemes Operators are encouraged to characterize the performance of various EC schemes
and share their observations with the developer community. and share their observations with the developer community.
@ -269,41 +269,45 @@ container that was created with the target durability policy.
Global EC Global EC
********* *********
Since the initial release of EC, it has not been recommended that an EC scheme The following recommendations are made when deploying an EC policy that spans
span beyond a single region. Initial performance and functional validation has multiple regions in a :doc:`Global Cluster <overview_global_cluster>`:
shown that using sufficiently large parity schemas to ensure availability
across regions is inefficient, and rebalance is unoptimized across high latency
bandwidth constrained WANs.
Region support for EC polices is under development! `EC Duplication` provides * The global EC policy should use :ref:`ec_duplication` in conjunction with a
a foundation for this. :ref:`Composite Ring <composite_rings>`, as described below.
* Proxy servers should be :ref:`configured to use read affinity
<configuring_global_clusters>` to prefer reading from their local region for
the global EC policy. :ref:`proxy_server_per_policy_config` allows this to be
configured for individual policies.
.. note::
Before deploying a Global EC policy, consideration should be given to the
:ref:`global_ec_known_issues`, in particular the relatively poor
performance anticipated from the object-reconstructor.
.. _ec_duplication:
EC Duplication EC Duplication
============== ==============
.. warning::
EC Duplication is an experimental feature that has some serious known
issues which make it currently unsuitable for use in production.
EC Duplication enables Swift to make duplicated copies of fragments of erasure EC Duplication enables Swift to make duplicated copies of fragments of erasure
coded objects. If an EC storage policy is configured with a non-default coded objects. If an EC storage policy is configured with a non-default
``ec_duplication_factor`` of ``N > 1``, then the policy will create ``N`` ``ec_duplication_factor`` of ``N > 1``, then the policy will create ``N``
duplicates of each unique fragment that is returned from the configured EC duplicates of each unique fragment that is returned from the configured EC
engine. engine.
Duplication of EC fragments is optimal for EC storage policies which require Duplication of EC fragments is optimal for Global EC storage policies, which
dispersion of fragment data across failure domains. Without duplication, common require dispersion of fragment data across failure domains. Without fragment
EC parameters will not distribute enough unique fragments between large failure duplication, common EC parameters will not distribute enough unique fragments
domains to allow for a rebuild using fragments from any one domain. For between large failure domains to allow for a rebuild using fragments from any
example a uniformly distributed ``10+4`` EC policy schema would place 7 one domain. For example a uniformly distributed ``10+4`` EC policy schema
fragments in each of two failure domains, which is less in each failure domain would place 7 fragments in each of two failure domains, which is less in each
than the 10 fragments needed to rebuild a missing fragment. failure domain than the 10 fragments needed to rebuild a missing fragment.
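The arithmetic behind this example can be sketched as follows. This is an illustration only: the helper names ``fragments_per_domain`` and ``can_rebuild_in_one_domain`` are hypothetical and not part of Swift, and perfectly uniform placement across failure domains is assumed.

```python
def fragments_per_domain(num_data, num_parity, num_domains):
    """Unique fragments held by each failure domain.

    Assumes (hypothetically) that fragments are spread perfectly
    evenly across the failure domains.
    """
    return (num_data + num_parity) // num_domains


def can_rebuild_in_one_domain(num_data, num_parity, num_domains):
    """True if a single domain holds at least num_data unique fragments."""
    return fragments_per_domain(num_data, num_parity, num_domains) >= num_data


# 10+4 over two domains: 7 fragments per domain, fewer than the 10 needed.
print(can_rebuild_in_one_domain(10, 4, 2))
# 10+18 over two domains: 14 fragments per domain, enough to rebuild.
print(can_rebuild_in_one_domain(10, 18, 2))
```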
Without duplication support, an EC policy schema must be adjusted to include Without fragment duplication, an EC policy schema must be adjusted to include
additional parity fragments in order to guarantee the number of fragments in additional parity fragments in order to guarantee the number of fragments in
each failure domain is greater than the number required to rebuild. For each failure domain is greater than the number required to rebuild. For
example, a uniformally distributed ``10+18`` EC policy schema would place 14 example, a uniformly distributed ``10+18`` EC policy schema would place 14
fragments in each of two failure domains, which is more than sufficient in each fragments in each of two failure domains, which is more than sufficient in each
failure domain to rebuild a missing fragment. However, empirical testing has failure domain to rebuild a missing fragment. However, empirical testing has
shown encoding a schema with ``num_parity > num_data`` (such as ``10+18``) is shown encoding a schema with ``num_parity > num_data`` (such as ``10+18``) is
@ -326,10 +330,10 @@ The ``ec_duplication_factor`` option may be configured in `swift.conf` in each
.. warning:: .. warning::
The ``ec_duplication_factor`` option should only be set for experimental EC duplication is intended for use with Global EC policies. To ensure
and development purposes. EC Duplication is an experimental feature that independent availability of data in all regions, the
has some serious known issues which make it currently unsuitable for use in ``ec_duplication_factor`` option should only be used in conjunction with
production. :ref:`composite_rings`, as described in this document.
In this example, a ``10+4`` schema and a duplication factor of ``2`` will In this example, a ``10+4`` schema and a duplication factor of ``2`` will
result in ``(10+4)x2 = 28`` fragments being stored (we will use the shorthand result in ``(10+4)x2 = 28`` fragments being stored (we will use the shorthand
@ -342,25 +346,46 @@ respect to a ``10+18`` configuration not only because reads from data fragments
will be more common and more efficient, but also because a ``10+4x2`` can grow will be more common and more efficient, but also because a ``10+4x2`` can grow
into a ``10+4x3`` to expand into another region. into a ``10+4x3`` to expand into another region.
Known Issues EC duplication with composite rings
============ -----------------------------------
Unique Fragment Dispersion It is recommended that EC Duplication is used with :ref:`composite_rings` in
-------------------------- order to disperse duplicate fragments across regions.
Currently, Swift's ring placement does **not** guarantee the dispersion of When EC duplication is used, it is highly desirable to have one duplicate of
fragments' locations being robust to disaster recovery in the case each fragment placed in each region. This ensures that a set of
of Global EC. While the goal is to have one duplicate of each ``ec_num_data_fragments`` unique fragments (the minimum needed to reconstruct
fragment placed in each region, it is currently possible for duplicates of an object) can always be assembled from a single region. This in turn means
the same fragment to be placed in the same region (and consequently for that objects are robust in the event of an entire region becoming unavailable.
another region to have no duplicates of that fragment). Since a set of
``ec_num_data_fragments`` unique fragments is required to reconstruct an
object, a suboptimal distribution of duplicates across regions may, in some
cases, make it impossible to assemble such a set from a single region.
For example, if we have a Swift cluster with two regions, ``r1`` and ``r2``, This can be achieved by using a :ref:`composite ring <composite_rings>` with
the 12 fragments for an object in a ``4+2x2`` EC policy schema could have the following properties:
pathologically sub-optimal placement::
* The number of component rings in the composite ring is equal to the
``ec_duplication_factor`` for the policy.
* Each *component* ring has a number of ``replicas`` that is equal to the sum
of ``ec_num_data_fragments`` and ``ec_num_parity_fragments``.
* Each component ring is populated with devices in a unique region.
This arrangement results in each component ring in the composite ring, and
therefore each region, having one copy of each fragment.
For example, consider a Swift cluster with two regions, ``region1`` and
``region2`` and a ``4+2x2`` EC policy schema. This policy should use a
composite ring with two component rings, ``ring1`` and ``ring2``, having
devices exclusively in regions ``region1`` and ``region2`` respectively. Each
component ring should have ``replicas = 6``. As a result, the first 6
fragments for an object will always be placed in ``ring1`` (i.e. in
``region1``) and the second 6 duplicate fragments will always be placed in
``ring2`` (i.e. in ``region2``).
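The ``4+2x2`` composite ring example can be modelled with a short sketch. It is not Swift code: ``place_fragments`` is a hypothetical helper, and it assumes that each block of ``replicas`` consecutive positions in the composite ring belongs to one component ring and that a position maps to a fragment index by taking it modulo the number of unique fragments.

```python
def place_fragments(num_data, num_parity, regions):
    """Map each region to the fragment indexes its component ring holds.

    Assumes one component ring per region, each with
    replicas = num_data + num_parity, and a duplication factor
    equal to the number of regions.
    """
    unique = num_data + num_parity
    placement = {region: [] for region in regions}
    for position in range(unique * len(regions)):
        region = regions[position // unique]         # component ring for this position
        placement[region].append(position % unique)  # fragment index at this position
    return placement


placement = place_fragments(4, 2, ["region1", "region2"])
for region, indexes in placement.items():
    # Each region holds every unique fragment index exactly once.
    print(region, indexes)
```

Under these assumptions every region holds all 6 unique fragment indexes, so any 4 of them can be gathered without leaving the region.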
Conversely, a conventional ring spanning the two regions may give a suboptimal
distribution of duplicates across the regions; it is possible for duplicates of
the same fragment to be placed in the same region, and consequently for another
region to have no copies of that fragment. This may make it impossible to
assemble a set of ``ec_num_data_fragments`` unique fragments from a single
region. For example, the conventional ring could have a pathologically
sub-optimal placement such as::
r1 r1
<timestamp>#0#d.data <timestamp>#0#d.data
@ -377,43 +402,64 @@ pathologically sub-optimal placement::
<timestamp>#5#d.data <timestamp>#5#d.data
<timestamp>#5#d.data <timestamp>#5#d.data
In this case, ``r1`` has only the fragments with index ``0, 2, 4`` and ``r2`` In this case, the object cannot be reconstructed from a single region;
has the other 3 indexes, but we need 4 unique indexes to be able to rebuild an ``region1`` has only the fragments with index ``0, 2, 4`` and ``region2`` has
object in a single region. To resolve this issue, a composite ring feature is the other 3 indexes, but we need 4 unique indexes to be able to rebuild an
being developed which will provide the operator with greater control over object.
duplicate fragment placement::
https://review.openstack.org/#/c/271920/ .. _global_ec_known_issues:
Known Issues
============
Efficient Node Selection for Read Efficient Node Selection for Read
--------------------------------- ---------------------------------
Since EC policies requires a set of unique fragment indexes to decode the Proxy servers require a set of *unique* fragment indexes to decode the original
original object, it is increasingly likely with EC duplication that some object when handling a GET request to an EC policy. With a conventional EC
responses from backend storage nodes will include fragments which the proxy has policy, this is very likely to be the outcome of reading fragments from a
already received from another node. Currently Swift iterates over the nodes random selection of backend nodes. With an EC Duplication policy it is
ordered by a sorting method defined in the proxy server config (i.e. either significantly more likely that responses from a random selection of backend
shuffle, node_timing, or read_affinity) - but these configurations will nodes might include some duplicated fragments.
not offer optimal request patterns for EC policies with duplicated
fragments. In this case Swift may frequently issue more than the optimal
``ec_num_data_fragments`` backend requests in order to gather
``ec_num_data_fragments`` **unique** fragments, even if there are no failures
amongst the object-servers.
In addition to better placement and read affinity support, ideally node The recommended use of EC Duplication in combination with Composite Rings and
iteration for EC duplication policies could predict which nodes are likely proxy server read affinity is designed to mitigate for this; a proxy server
to hold duplicates and prioritize requests to the most suitable nodes. will first attempt to read fragments from nodes in its local region, which
are guaranteed to be unique with respect to each other. However, should enough
of those local reads fail to return a fragment, the proxy server may proceed to
read fragments from other regions. This can be relatively inefficient because
it is possible that nodes in other regions return fragments that are duplicates
of those the proxy server has already received. The proxy server will ignore
those responses and issue yet more requests to nodes in other regions.
Work is in progress to improve the proxy server node selection strategy such
that when it is necessary to read from other regions, nodes that are likely to
have useful fragments are preferred over those that are likely to return a
duplicate.
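The cost of duplicated responses described in this section can be illustrated with a small sketch. ``reads_needed`` is a hypothetical helper, not part of the proxy server; it simply counts how many fragment responses are consumed before enough unique fragment indexes have been seen.

```python
def reads_needed(fragment_indexes, num_data):
    """Count responses consumed until num_data unique indexes are seen.

    fragment_indexes is the order in which backend responses arrive.
    Returns None if the responses never yield enough unique fragments.
    """
    seen = set()
    for count, index in enumerate(fragment_indexes, start=1):
        seen.add(index)
        if len(seen) == num_data:
            return count
    return None


# 4+2x2 policy, all local reads succeed: 4 responses, no duplicates.
print(reads_needed([0, 1, 2, 3], 4))
# Only 3 local reads succeed, then the remote region happens to
# return duplicates of those 3 before a useful fragment arrives.
print(reads_needed([0, 1, 2, 0, 1, 2, 3], 4))
```

In the second case three of the seven responses are wasted, which is the inefficiency that a smarter node selection strategy would aim to avoid.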
Efficient Cross Region Rebuild Efficient Cross Region Rebuild
------------------------------ ------------------------------
Since fragments are duplicated between regions it may in some cases be more Work is also in progress to improve the object-reconstructor efficiency for
attractive to restore failed fragments from their duplicates in another region Global EC policies. Unlike the proxy server, the reconstructor does not apply
instead of rebuilding them from other fragments in the local region. any read affinity settings when gathering fragments. It is therefore likely to
Conversely to avoid WAN transfer it may be more attractive to rebuild fragments receive duplicated fragments (i.e. make wasted backend GET requests) while
from local parity. During rebalance it will always be more attractive to performing *every* fragment reconstruction.
revert a fragment from its old-primary to its new primary rather than
rebuilding or transferring a duplicate from the remote region. Additionally, other reconstructor optimisations for Global EC are under
investigation:
* Since fragments are duplicated between regions it may in some cases be more
attractive to restore failed fragments from their duplicates in another
region instead of rebuilding them from other fragments in the local region.
* Conversely, to avoid WAN transfer it may be more attractive to rebuild
fragments from local parity.
* During rebalance it will always be more attractive to revert a fragment from
its old-primary to its new primary rather than rebuilding or transferring a
duplicate from the remote region.
************** **************
Under the Hood Under the Hood


@ -17,9 +17,19 @@ cluster: region 1 in San Francisco (SF), and region 2 in New York
(NY). Each region shall contain within it 3 zones, numbered 1, 2, and (NY). Each region shall contain within it 3 zones, numbered 1, 2, and
3, for a total of 6 zones. 3, for a total of 6 zones.
.. _configuring_global_clusters:
--------------------------- ---------------------------
Configuring Global Clusters Configuring Global Clusters
--------------------------- ---------------------------
.. note::
The proxy-server configuration options described below can be given generic
settings in the ``[app:proxy-server]`` configuration section and/or given
specific settings for individual policies using
:ref:`proxy_server_per_policy_config`.
~~~~~~~~~~~~~ ~~~~~~~~~~~~~
read_affinity read_affinity
~~~~~~~~~~~~~ ~~~~~~~~~~~~~


@ -350,6 +350,9 @@ Ring Builder Analyzer
--------------------- ---------------------
.. automodule:: swift.cli.ring_builder_analyzer .. automodule:: swift.cli.ring_builder_analyzer
.. _composite_rings:
--------------- ---------------
Composite Rings Composite Rings
--------------- ---------------