Merge "Update Global EC docs with reference to composite rings"
commit
41c8f1330f
@ -92,7 +92,7 @@ advantage of many well-known C libraries such as:
|
|||||||
* Or write your own!
|
* Or write your own!
|
||||||
|
|
||||||
PyECLib uses a C based library called liberasurecode to implement the plug in
|
PyECLib uses a C based library called liberasurecode to implement the plug in
|
||||||
infrastructure; liberasure code is available at:
|
infrastructure; liberasurecode is available at:
|
||||||
|
|
||||||
* liberasurecode: https://github.com/openstack/liberasurecode
|
* liberasurecode: https://github.com/openstack/liberasurecode
|
||||||
|
|
||||||
@ -179,7 +179,7 @@ Performance Considerations
|
|||||||
|
|
||||||
In general, EC has different performance characteristics than replicated data.
|
In general, EC has different performance characteristics than replicated data.
|
||||||
EC requires substantially more CPU to read and write data, and is more suited
|
EC requires substantially more CPU to read and write data, and is more suited
|
||||||
for larger objects that are not frequently accessed (eg backups).
|
for larger objects that are not frequently accessed (e.g. backups).
|
||||||
|
|
||||||
Operators are encouraged to characterize the performance of various EC schemes
|
Operators are encouraged to characterize the performance of various EC schemes
|
||||||
and share their observations with the developer community.
|
and share their observations with the developer community.
|
||||||
@ -269,41 +269,45 @@ container that was created with the target durability policy.
|
|||||||
Global EC
|
Global EC
|
||||||
*********
|
*********
|
||||||
|
|
||||||
Since the initial release of EC, it has not been recommended that an EC scheme
|
The following recommendations are made when deploying an EC policy that spans
|
||||||
span beyond a single region. Initial performance and functional validation has
|
multiple regions in a :doc:`Global Cluster <overview_global_cluster>`:
|
||||||
shown that using sufficiently large parity schemas to ensure availability
|
|
||||||
across regions is inefficient, and rebalance is unoptimized across high latency
|
|
||||||
bandwidth constrained WANs.
|
|
||||||
|
|
||||||
Region support for EC polices is under development! `EC Duplication` provides
|
* The global EC policy should use :ref:`ec_duplication` in conjunction with a
|
||||||
a foundation for this.
|
:ref:`Composite Ring <composite_rings>`, as described below.
|
||||||
|
* Proxy servers should be :ref:`configured to use read affinity
|
||||||
|
<configuring_global_clusters>` to prefer reading from their local region for
|
||||||
|
the global EC policy. :ref:`proxy_server_per_policy_config` allows this to be
|
||||||
|
configured for individual policies.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
Before deploying a Global EC policy, consideration should be given to the
|
||||||
|
:ref:`global_ec_known_issues`, in particular the relatively poor
|
||||||
|
performance anticipated from the object-reconstructor.
|
||||||
|
|
||||||
|
.. _ec_duplication:
|
||||||
|
|
||||||
EC Duplication
|
EC Duplication
|
||||||
==============
|
==============
|
||||||
|
|
||||||
.. warning::
|
|
||||||
|
|
||||||
EC Duplication is an experimental feature that has some serious known
|
|
||||||
issues which make it currently unsuitable for use in production.
|
|
||||||
|
|
||||||
EC Duplication enables Swift to make duplicated copies of fragments of erasure
|
EC Duplication enables Swift to make duplicated copies of fragments of erasure
|
||||||
coded objects. If an EC storage policy is configured with a non-default
|
coded objects. If an EC storage policy is configured with a non-default
|
||||||
``ec_duplication_factor`` of ``N > 1``, then the policy will create ``N``
|
``ec_duplication_factor`` of ``N > 1``, then the policy will create ``N``
|
||||||
duplicates of each unique fragment that is returned from the configured EC
|
duplicates of each unique fragment that is returned from the configured EC
|
||||||
engine.
|
engine.
|
||||||
|
|
||||||
Duplication of EC fragments is optimal for EC storage policies which require
|
Duplication of EC fragments is optimal for Global EC storage policies, which
|
||||||
dispersion of fragment data across failure domains. Without duplication, common
|
require dispersion of fragment data across failure domains. Without fragment
|
||||||
EC parameters will not distribute enough unique fragments between large failure
|
duplication, common EC parameters will not distribute enough unique fragments
|
||||||
domains to allow for a rebuild using fragments from any one domain. For
|
between large failure domains to allow for a rebuild using fragments from any
|
||||||
example a uniformly distributed ``10+4`` EC policy schema would place 7
|
one domain. For example, a uniformly distributed ``10+4`` EC policy schema
|
||||||
fragments in each of two failure domains, which is less in each failure domain
|
would place 7 fragments in each of two failure domains, which is less in each
|
||||||
than the 10 fragments needed to rebuild a missing fragment.
|
failure domain than the 10 fragments needed to rebuild a missing fragment.
|
||||||
|
|
||||||
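The arithmetic behind the ``10+4`` example can be sketched in a few lines of Python. This is an illustrative calculation only and is not part of Swift; the function names are hypothetical, and perfectly uniform placement with no fragment duplication is assumed.

```python
def fragments_per_domain(num_data, num_parity, num_domains):
    """Fragments held by each failure domain, assuming uniform placement
    and no duplication (so every fragment is unique)."""
    return (num_data + num_parity) // num_domains


def domain_can_rebuild(num_data, num_parity, num_domains):
    """True if a single domain holds at least num_data unique fragments."""
    return fragments_per_domain(num_data, num_parity, num_domains) >= num_data


# 10+4 across two failure domains: 7 fragments each, fewer than the 10 needed.
print(fragments_per_domain(10, 4, 2))  # 7
print(domain_can_rebuild(10, 4, 2))    # False
```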
Without duplication support, an EC policy schema must be adjusted to include
|
Without fragment duplication, an EC policy schema must be adjusted to include
|
||||||
additional parity fragments in order to guarantee the number of fragments in
|
additional parity fragments in order to guarantee the number of fragments in
|
||||||
each failure domain is greater than the number required to rebuild. For
|
each failure domain is greater than the number required to rebuild. For
|
||||||
example, a uniformally distributed ``10+18`` EC policy schema would place 14
|
example, a uniformly distributed ``10+18`` EC policy schema would place 14
|
||||||
fragments in each of two failure domains, which is more than sufficient in each
|
fragments in each of two failure domains, which is more than sufficient in each
|
||||||
failure domain to rebuild a missing fragment. However, empirical testing has
|
failure domain to rebuild a missing fragment. However, empirical testing has
|
||||||
shown encoding a schema with ``num_parity > num_data`` (such as ``10+18``) is
|
shown encoding a schema with ``num_parity > num_data`` (such as ``10+18``) is
|
||||||
@ -326,10 +330,10 @@ The ``ec_duplication_factor`` option may be configured in `swift.conf` in each
|
|||||||
|
|
||||||
.. warning::
|
.. warning::
|
||||||
|
|
||||||
The ``ec_duplication_factor`` option should only be set for experimental
|
EC duplication is intended for use with Global EC policies. To ensure
|
||||||
and development purposes. EC Duplication is an experimental feature that
|
independent availability of data in all regions, the
|
||||||
has some serious known issues which make it currently unsuitable for use in
|
``ec_duplication_factor`` option should only be used in conjunction with
|
||||||
production.
|
:ref:`composite_rings`, as described in this document.
|
||||||
|
|
||||||
In this example, a ``10+4`` schema and a duplication factor of ``2`` will
|
In this example, a ``10+4`` schema and a duplication factor of ``2`` will
|
||||||
result in ``(10+4)x2 = 28`` fragments being stored (we will use the shorthand
|
result in ``(10+4)x2 = 28`` fragments being stored (we will use the shorthand
|
||||||
@ -342,25 +346,46 @@ respect to a ``10+18`` configuration not only because reads from data fragments
|
|||||||
will be more common and more efficient, but also because a ``10+4x2`` can grow
|
will be more common and more efficient, but also because a ``10+4x2`` can grow
|
||||||
into a ``10+4x3`` to expand into another region.
|
into a ``10+4x3`` to expand into another region.
|
||||||
|
|
||||||
Known Issues
|
EC duplication with composite rings
|
||||||
============
|
-----------------------------------
|
||||||
|
|
||||||
Unique Fragment Dispersion
|
It is recommended that EC Duplication is used with :ref:`composite_rings` in
|
||||||
--------------------------
|
order to disperse duplicate fragments across regions.
|
||||||
|
|
||||||
Currently, Swift's ring placement does **not** guarantee the dispersion of
|
When EC duplication is used, it is highly desirable to have one duplicate of
|
||||||
fragments' locations being robust to disaster recovery in the case
|
each fragment placed in each region. This ensures that a set of
|
||||||
of Global EC. While the goal is to have one duplicate of each
|
``ec_num_data_fragments`` unique fragments (the minimum needed to reconstruct
|
||||||
fragment placed in each region, it is currently possible for duplicates of
|
an object) can always be assembled from a single region. This in turn means
|
||||||
the same fragment to be placed in the same region (and consequently for
|
that objects are robust in the event of an entire region becoming unavailable.
|
||||||
another region to have no duplicates of that fragment). Since a set of
|
|
||||||
``ec_num_data_fragments`` unique fragments is required to reconstruct an
|
|
||||||
object, a suboptimal distribution of duplicates across regions may, in some
|
|
||||||
cases, make it impossible to assemble such a set from a single region.
|
|
||||||
|
|
||||||
For example, if we have a Swift cluster with two regions, ``r1`` and ``r2``,
|
This can be achieved by using a :ref:`composite ring <composite_rings>` with
|
||||||
the 12 fragments for an object in a ``4+2x2`` EC policy schema could have
|
the following properties:
|
||||||
pathologically sub-optimal placement::
|
|
||||||
|
* The number of component rings in the composite ring is equal to the
|
||||||
|
``ec_duplication_factor`` for the policy.
|
||||||
|
* Each *component* ring has a number of ``replicas`` that is equal to the sum
|
||||||
|
of ``ec_num_data_fragments`` and ``ec_num_parity_fragments``.
|
||||||
|
* Each component ring is populated with devices in a unique region.
|
||||||
|
|
||||||
|
This arrangement results in each component ring in the composite ring, and
|
||||||
|
therefore each region, having one copy of each fragment.
|
||||||
|
|
||||||
|
For example, consider a Swift cluster with two regions, ``region1`` and
|
||||||
|
``region2`` and a ``4+2x2`` EC policy schema. This policy should use a
|
||||||
|
composite ring with two component rings, ``ring1`` and ``ring2``, having
|
||||||
|
devices exclusively in regions ``region1`` and ``region2`` respectively. Each
|
||||||
|
component ring should have ``replicas = 6``. As a result, the first 6
|
||||||
|
fragments for an object will always be placed in ``ring1`` (i.e. in
|
||||||
|
``region1``) and the second 6 duplicate fragments will always be placed in
|
||||||
|
``ring2`` (i.e. in ``region2``).
|
||||||
|
|
||||||
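The ``4+2x2`` layout described above can be modelled with a short Python sketch. It does not use Swift's ring code; ``place_fragments`` is a hypothetical helper, and it assumes that each component ring maps to exactly one region and that duplicate fragments reuse the fragment indexes ``0`` to ``ec_num_data_fragments + ec_num_parity_fragments - 1``.

```python
def place_fragments(num_data, num_parity, regions):
    """Map each region to the fragment indexes it would hold.

    Assumes one component ring per region, each with
    replicas = num_data + num_parity (hypothetical model, not Swift code).
    """
    unique = num_data + num_parity
    placement = {}
    for ring_number, region in enumerate(regions):
        # Node indexes served by this component ring.
        node_indexes = range(ring_number * unique, (ring_number + 1) * unique)
        # Assumption: duplicates wrap around to the same fragment indexes.
        placement[region] = [node % unique for node in node_indexes]
    return placement


placement = place_fragments(4, 2, ["region1", "region2"])
for region, indexes in placement.items():
    print(region, indexes)
```

Under these assumptions every region ends up holding one copy of each of the six fragment indexes, which is the property the composite ring is intended to provide.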
|
Conversely, a conventional ring spanning the two regions may give a suboptimal
|
||||||
|
distribution of duplicates across the regions; it is possible for duplicates of
|
||||||
|
the same fragment to be placed in the same region, and consequently for another
|
||||||
|
region to have no copies of that fragment. This may make it impossible to
|
||||||
|
assemble a set of ``ec_num_data_fragments`` unique fragments from a single
|
||||||
|
region. For example, the conventional ring could have a pathologically
|
||||||
|
sub-optimal placement such as::
|
||||||
|
|
||||||
r1
|
r1
|
||||||
<timestamp>#0#d.data
|
<timestamp>#0#d.data
|
||||||
@ -377,43 +402,64 @@ pathologically sub-optimal placement::
|
|||||||
<timestamp>#5#d.data
|
<timestamp>#5#d.data
|
||||||
<timestamp>#5#d.data
|
<timestamp>#5#d.data
|
||||||
|
|
||||||
In this case, ``r1`` has only the fragments with index ``0, 2, 4`` and ``r2``
|
In this case, the object cannot be reconstructed from a single region;
|
||||||
has the other 3 indexes, but we need 4 unique indexes to be able to rebuild an
|
``region1`` has only the fragments with index ``0, 2, 4`` and ``region2`` has
|
||||||
object in a single region. To resolve this issue, a composite ring feature is
|
the other 3 indexes, but we need 4 unique indexes to be able to rebuild an
|
||||||
being developed which will provide the operator with greater control over
|
object.
|
||||||
duplicate fragment placement::
|
|
||||||
|
|
||||||
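Whether a given placement allows single-region reconstruction can be checked with a small Python sketch. The helper name is hypothetical, and the pathological placement below is reconstructed from the description above (indexes ``0, 2, 4`` in one region, the other three in the second) rather than taken from a real ring.

```python
def regions_able_to_rebuild(placement, num_data):
    """For each region, report whether it holds at least num_data
    unique fragment indexes (the minimum needed to rebuild an object)."""
    return {
        region: len(set(indexes)) >= num_data
        for region, indexes in placement.items()
    }


# 4+2x2 policy: 4 unique fragment indexes are needed to rebuild.
# Assumed pathological placement: both duplicates land in the same region.
conventional = {"region1": [0, 0, 2, 2, 4, 4], "region2": [1, 1, 3, 3, 5, 5]}
# Placement with one copy of every fragment index in each region.
composite = {"region1": [0, 1, 2, 3, 4, 5], "region2": [0, 1, 2, 3, 4, 5]}

print(regions_able_to_rebuild(conventional, 4))
print(regions_able_to_rebuild(composite, 4))
```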
https://review.openstack.org/#/c/271920/
|
.. _global_ec_known_issues:
|
||||||
|
|
||||||
|
Known Issues
|
||||||
|
============
|
||||||
|
|
||||||
Efficient Node Selection for Read
|
Efficient Node Selection for Read
|
||||||
---------------------------------
|
---------------------------------
|
||||||
|
|
||||||
Since EC policies requires a set of unique fragment indexes to decode the
|
Proxy servers require a set of *unique* fragment indexes to decode the original
|
||||||
original object, it is increasingly likely with EC duplication that some
|
object when handling a GET request to an EC policy. With a conventional EC
|
||||||
responses from backend storage nodes will include fragments which the proxy has
|
policy, this is very likely to be the outcome of reading fragments from a
|
||||||
already received from another node. Currently Swift iterates over the nodes
|
random selection of backend nodes. With an EC Duplication policy it is
|
||||||
ordered by a sorting method defined in the proxy server config (i.e. either
|
significantly more likely that responses from a random selection of backend
|
||||||
shuffle, node_timing, or read_affinity) - but these configurations will
|
nodes might include some duplicated fragments.
|
||||||
not offer optimal request patterns for EC policies with duplicated
|
|
||||||
fragments. In this case Swift may frequently issue more than the optimal
|
|
||||||
``ec_num_data_fragments`` backend requests in order to gather
|
|
||||||
``ec_num_data_fragments`` **unique** fragments, even if there are no failures
|
|
||||||
amongst the object-servers.
|
|
||||||
|
|
||||||
In addition to better placement and read affinity support, ideally node
|
The recommended use of EC Duplication in combination with Composite Rings and
|
||||||
iteration for EC duplication policies could predict which nodes are likely
|
proxy server read affinity is designed to mitigate this; a proxy server
|
||||||
to hold duplicates and prioritize requests to the most suitable nodes.
|
will first attempt to read fragments from nodes in its local region, which
|
||||||
|
are guaranteed to be unique with respect to each other. However, should enough
|
||||||
|
of those local reads fail to return a fragment, the proxy server may proceed to
|
||||||
|
read fragments from other regions. This can be relatively inefficient because
|
||||||
|
it is possible that nodes in other regions return fragments that are duplicates
|
||||||
|
of those the proxy server has already received. The proxy server will ignore
|
||||||
|
those responses and issue yet more requests to nodes in other regions.
|
||||||
|
|
||||||
|
Work is in progress to improve the proxy server node selection strategy such
|
||||||
|
that when it is necessary to read from other regions, nodes that are likely to
|
||||||
|
have useful fragments are preferred over those that are likely to return a
|
||||||
|
duplicate.
|
||||||
|
|
||||||
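The cost of duplicated responses can be illustrated with a simplified Python model. It is not the proxy server's implementation: requests are treated as strictly sequential, ``gather_unique_fragments`` is a hypothetical helper, and the response order (local region first, then the remote region) is an assumption standing in for read affinity.

```python
def gather_unique_fragments(responses, num_data):
    """Consume backend responses in order until num_data unique
    fragment indexes have been collected.

    responses: fragment indexes in the order nodes are tried,
               with None standing for a failed read.
    Returns (unique_indexes, requests_made, wasted_duplicates).
    """
    seen = set()
    requests = 0
    wasted = 0
    for frag_index in responses:
        if len(seen) >= num_data:
            break
        requests += 1
        if frag_index is None:
            continue  # failed read, try the next node
        if frag_index in seen:
            wasted += 1  # duplicate of a fragment already received
            continue
        seen.add(frag_index)
    return seen, requests, wasted


# 4+2x2 policy: three local reads fail, so the proxy falls back to the
# remote region, whose first response duplicates a fragment already held.
local = [0, None, None, None, 4, 5]
remote = [0, 1, 2, 3, 4, 5]
seen, requests, wasted = gather_unique_fragments(local + remote, num_data=4)
print(sorted(seen), requests, wasted)  # [0, 1, 4, 5] 8 1
```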
Efficient Cross Region Rebuild
|
Efficient Cross Region Rebuild
|
||||||
------------------------------
|
------------------------------
|
||||||
|
|
||||||
Since fragments are duplicated between regions it may in some cases be more
|
Work is also in progress to improve the object-reconstructor efficiency for
|
||||||
attractive to restore failed fragments from their duplicates in another region
|
Global EC policies. Unlike the proxy server, the reconstructor does not apply
|
||||||
instead of rebuilding them from other fragments in the local region.
|
any read affinity settings when gathering fragments. It is therefore likely to
|
||||||
Conversely to avoid WAN transfer it may be more attractive to rebuild fragments
|
receive duplicated fragments (i.e. make wasted backend GET requests) while
|
||||||
from local parity. During rebalance it will always be more attractive to
|
performing *every* fragment reconstruction.
|
||||||
revert a fragment from it's old-primary to it's new primary rather than
|
|
||||||
rebuilding or transferring a duplicate from the remote region.
|
Additionally, other reconstructor optimisations for Global EC are under
|
||||||
|
investigation:
|
||||||
|
|
||||||
|
* Since fragments are duplicated between regions it may in some cases be more
|
||||||
|
attractive to restore failed fragments from their duplicates in another
|
||||||
|
region instead of rebuilding them from other fragments in the local region.
|
||||||
|
|
||||||
|
* Conversely, to avoid WAN transfer it may be more attractive to rebuild
|
||||||
|
fragments from local parity.
|
||||||
|
|
||||||
|
* During rebalance it will always be more attractive to revert a fragment from
|
||||||
|
its old-primary to its new primary rather than rebuilding or transferring a
|
||||||
|
duplicate from the remote region.
|
||||||
|
|
||||||
|
|
||||||
**************
|
**************
|
||||||
Under the Hood
|
Under the Hood
|
||||||
|
@ -17,9 +17,19 @@ cluster: region 1 in San Francisco (SF), and region 2 in New York
|
|||||||
(NY). Each region shall contain within it 3 zones, numbered 1, 2, and
|
(NY). Each region shall contain within it 3 zones, numbered 1, 2, and
|
||||||
3, for a total of 6 zones.
|
3, for a total of 6 zones.
|
||||||
|
|
||||||
|
.. _configuring_global_clusters:
|
||||||
|
|
||||||
---------------------------
|
---------------------------
|
||||||
Configuring Global Clusters
|
Configuring Global Clusters
|
||||||
---------------------------
|
---------------------------
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
The proxy-server configuration options described below can be given generic
|
||||||
|
settings in the ``[app:proxy-server]`` configuration section and/or given
|
||||||
|
specific settings for individual policies using
|
||||||
|
:ref:`proxy_server_per_policy_config`.
|
||||||
|
|
||||||
~~~~~~~~~~~~~
|
~~~~~~~~~~~~~
|
||||||
read_affinity
|
read_affinity
|
||||||
~~~~~~~~~~~~~
|
~~~~~~~~~~~~~
|
||||||
|
@ -350,6 +350,9 @@ Ring Builder Analyzer
|
|||||||
---------------------
|
---------------------
|
||||||
.. automodule:: swift.cli.ring_builder_analyzer
|
.. automodule:: swift.cli.ring_builder_analyzer
|
||||||
|
|
||||||
|
|
||||||
|
.. _composite_rings:
|
||||||
|
|
||||||
---------------
|
---------------
|
||||||
Composite Rings
|
Composite Rings
|
||||||
---------------
|
---------------
|
||||||
|