When running in multiprocess mode, the object reconstructor would
periodically aggregate its workers' recon data into a single recon
measurement. However, at the end of the run, all that was left in
recon was the last periodic measurement; any work that took place
after that point was not recorded in the aggregate, although it was
recorded in the per-disk stats that the worker processes emitted.
This commit adds a final recon aggregation after the worker processes
have finished.
Change-Id: Ia6a3a931e9e7a23824765b2ab111a5492e509be8
When we're pushing data to a remote node using ssync, we end up
walking the entire partition's directory tree. We were already
removing reclaimable (i.e. old) tombstones and non-durable EC data
files plus their containing hash dirs, but we were leaving the suffix
dirs around for future removal, and we weren't cleaning up partition
dirs at all. Now we remove as much of the directory structure as we
can, even up to the partition dir, as soon as we observe that it's
empty.
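
A minimal sketch of the "remove as much as we can" idea, not the actual
diskfile code. The helper name prune_empty_dirs and its arguments are
hypothetical; it simply tries rmdir on a hash dir, then its suffix dir,
then the partition dir, and stops at the first directory that still has
entries.

```python
import errno
import os


def prune_empty_dirs(hash_dir, partition_dir):
    """Remove hash_dir and then each empty parent, up to partition_dir.

    Hypothetical helper for illustration; returns the list of removed dirs.
    """
    current = os.path.abspath(hash_dir)
    top = os.path.abspath(partition_dir)
    if current != top and not current.startswith(top + os.sep):
        raise ValueError('hash_dir must live inside partition_dir')
    removed = []
    while True:
        try:
            os.rmdir(current)  # only succeeds when the dir is empty
        except OSError as err:
            if err.errno in (errno.ENOTEMPTY, errno.EEXIST, errno.ENOENT):
                break  # something still lives here (or it is already gone)
            raise
        removed.append(current)
        if current == top:
            break  # never climb above the partition dir
        current = os.path.dirname(current)
    return removed
```

Because rmdir refuses to remove a non-empty directory, the walk upward
stops by itself as soon as a sibling hash dir or suffix dir remains.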
Change-Id: I2849a757519a30684646f3a6f4467c21e9281707
Closes-Bug: 1706321
Much like the multiprocess object replicator, the reconstructor runs
multiple concurrent worker processes that all log to the same
destination. We re-use the same solution: prepend a prefix with the
worker index and the pid to all the logs emitted from each worker
process.
Example log line:
[worker 12/24 pid=8539] I did a thing
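
One way to get such a prefix is sketched below using only the stdlib
logging module. WorkerPrefixAdapter and get_worker_logger are
hypothetical names for illustration and are not claimed to be the
helpers this commit uses.

```python
import logging
import os


class WorkerPrefixAdapter(logging.LoggerAdapter):
    """Prepend a fixed prefix to every message logged through the adapter."""

    def process(self, msg, kwargs):
        return '%s%s' % (self.extra['prefix'], msg), kwargs


def get_worker_logger(logger, worker_index, worker_count, pid=None):
    """Wrap logger so each line carries the worker index and the pid."""
    pid = os.getpid() if pid is None else pid
    prefix = '[worker %d/%d pid=%d] ' % (worker_index, worker_count, pid)
    return WorkerPrefixAdapter(logger, {'prefix': prefix})
```

Each worker would wrap the shared logger once after forking, so the rest
of the code can keep calling logger.info() and friends unchanged.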
Change-Id: Ie2f98201193952be4d387bbb01c7c6fccc017a8a
The object reconstructor will now fork all available worker processes
when operating on a subset of local devices.
Example:
A system has 24 disks, named "d1" through "d24"
reconstructor_workers = 8
invoked with --override-devices=d1,d2,d3,d4,d5,d6
In this case, the reconstructor will now use 6 worker processes, one
per disk. The old behavior was to use 2 worker processes, one for d1,
d3, and d5 and the other for d2, d4, and d6 (because 24 / 8 = 3, so we
assigned 3 disks per worker before creating another).
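
The difference can be sketched as follows. Both functions are
illustrations reconstructed from the description above, not the daemon's
actual code, and their names are hypothetical.

```python
import math


def assign_devices(devices, max_workers):
    """New behavior (sketch): one worker per device, up to max_workers."""
    devices = list(devices)
    n_workers = min(max_workers, len(devices))
    if n_workers <= 0:
        return []
    # Round-robin the devices over the workers.
    return [devices[i::n_workers] for i in range(n_workers)]


def assign_devices_old(devices, max_workers, total_local_devices):
    """Old behavior (sketch): size workers by the total local device count."""
    devices = list(devices)
    if max_workers <= 0 or not devices:
        return []
    per_worker = math.ceil(total_local_devices / max_workers)  # 24 / 8 = 3
    n_workers = math.ceil(len(devices) / per_worker)           # 6 / 3 = 2
    return [devices[i::n_workers] for i in range(n_workers)]
```

With six override devices and reconstructor_workers = 8, the first
function yields six single-device workers while the second yields two
workers with three devices each; with all 24 devices both yield eight
workers of three devices.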
I think the new behavior better matches operators' expectations. If I
give a concurrent program six tasks to do and tell it to operate on up
to eight at a time, I'd expect it to do all six tasks at once, not run
two concurrent batches of three tasks apiece.
This has no effect when --override-devices is not specified. When
operating on all local devices instead of a subset, the new and old
code produce the same result.
The reconstructor's behavior now matches the object replicator's
behavior.
Change-Id: Ib308c156c77b9b92541a12dd7e9b1a8ea8307a30
Currently, our integrity checking for objects is pretty weak when it
comes to object metadata. If the extended attributes on a .data or
.meta file get corrupted in such a way that we can still unpickle it,
we don't have anything that detects that.
This could be especially bad with encrypted etags; if the encrypted
etag (X-Object-Sysmeta-Crypto-Etag or whatever it is) gets some bits
flipped, then we'll cheerfully decrypt the cipherjunk into plainjunk,
then send it to the client. Net effect is that the client sees a GET
response with an ETag that doesn't match the MD5 of the object *and*
Swift has no way of detecting and quarantining this object.
Note that, with an unencrypted object, if the ETag metadatum gets
mangled, then the object will be quarantined by the object server or
auditor, whichever notices first.
As part of this commit, I also ripped out some mocking of
getxattr/setxattr in tests. It appears to be there to allow unit tests
to run on systems where /tmp doesn't support xattrs. However, since
the mock is keyed off of inode number and inode numbers get re-used,
there's lots of leakage between different test runs. On a real FS,
unlinking a file and then creating a new one of the same name will
also reset the xattrs; this isn't the case with the mock.
The mock was pretty old; Ubuntu 12.04 and up all support xattrs in
/tmp, and recent Red Hat / CentOS releases do too. The xattr mock was
added in 2011; maybe it was to support Ubuntu Lucid Lynx?
Bonus: now you can pause a test with the debugger, inspect its files
in /tmp, and actually see the xattrs along with the data.
Since this patch now uses a real filesystem for testing filesystem
operations, tests are skipped if the underlying filesystem does not
support setting xattrs (e.g. tmpfs or more than 4k of xattrs on ext4).
References to "/tmp" have been replaced with calls to
tempfile.gettempdir(). This will allow setting the TMPDIR envvar in
test setup and getting an XFS filesystem instead of ext4 or tmpfs.
THIS PATCH SIGNIFICANTLY CHANGES TESTING ENVIRONMENTS
With this patch, every test environment will require TMPDIR to be
using a filesystem that supports at least 4k of extended attributes.
Neither ext4 nor tmpfs supports this. XFS is recommended.
So why all the SkipTests? Why not simply raise an error? We still need
the tests to run on the base image for OpenStack's CI system. Since
we were previously mocking out xattr, there wasn't a problem, but we
also weren't actually testing anything. This patch adds functionality
to validate xattr data, so we need to drop the mock.
`test.unit.skip_if_no_xattrs()` is also imported into `test.functional`
so that functional tests can import it from the functional test
namespace.
The related OpenStack CI infrastructure changes are made in
https://review.openstack.org/#/c/394600/.
Co-Authored-By: John Dickinson <me@not.mn>
Change-Id: I98a37c0d451f4960b7a12f648e4405c6c6716808
For test purposes (e.g. saio probetests) even if mount_check is False,
still require check_dir for account/container server storage when real
mount points are not used.
This behavior is consistent with the object-server's checks in diskfile.
Co-Author: Clay Gerrard <clay.gerrard@gmail.com>
Related lp bug #1693005
Related-Change-Id: I344f9daaa038c6946be11e1cf8c4ef104a09e68b
Depends-On: I52c4ecb70b1ae47e613ba243da5a4d94e5adedf2
Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
If you're watching a (new) node's reconstruction_last time to ensure that
a cycle has finished since the last ring rebalance, you won't ever see
reconstructors with no devices drop recon stats.
Change-Id: I84c07fc6841119b00d1a74078fe53f4ce637187b
When a fragment of an expired object was missing, the reconstructor
ssync job would send a DELETE sub-request. This leads to a situation
where, for the same object and timestamp, some nodes have a data file,
while others can have a tombstone file.
This patch forces the reconstructor to reconstruct a data file, even
for expired objects. DELETE requests are only sent for tombstoned
objects.
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Closes-Bug: #1652323
Change-Id: I7f90b732c3268cb852b64f17555c631d668044a8
We have a test for get_local_devices, but let's make some broader
assertions as well.
Related-Bug: #1707595
Change-Id: Ifa696207ffdb3b39650dfeaa3e7c6cfda94050db
Since the related change, object-reconstructor gathers the local devices
for EC policies via the get_local_devices method, but the method raises a
TypeError when attempting *reduce* on an empty list of sets. The list can be
empty when no EC config is found in swift.conf.
This patch fixes get_local_devices to return an empty set, without errors,
even when there is no EC config in swift.conf.
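
The failure mode is easy to reproduce with plain Python. The sketch
below is an illustration only: union_all is a hypothetical stand-in and
is not claimed to be how get_local_devices was actually fixed. Passing
an initial value to reduce is one way to make the empty case safe.

```python
from functools import reduce


def union_all(device_sets):
    """Union a list of sets; an empty list yields an empty set."""
    # The third argument is the initial value, so reduce() never has to
    # pull its starting point from an empty sequence.
    return reduce(set.union, device_sets, set())


def union_all_unsafe(device_sets):
    """Same union without an initial value; raises TypeError on []."""
    return reduce(set.union, device_sets)
```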
Co-Authored-By: Kirill Zaitsev <k.zaitsev@me.com>
Change-Id: Ic121fb547966787a43f9eae83c91bb2bf640c4be
Related-Change: 701a172afac37229b85ea762f20428f6f422d29b
Closes-Bug: #1707595
Add a test that verifies that get_all_devices does
fetch devices from the ring.
Related-Change: I28925a37f3985c9082b5a06e76af4dc3ec813abe
Change-Id: Ie2f83694f14f9a614b5276bbb859b9a3c0ec5dcb
This change adds a new Strategy concept to the daemon module similar to
how we manage WSGI workers. We need to leverage multiple python
processes to get the concurrency properties we need. More workers will
rebalance much faster on dense chassis with many devices.
Currently the default is still only one process, and no workers. Set
reconstructor_workers in the [object-reconstructor] section to some
whole number <= the number of devices on a node to get that many
reconstructor workers.
Each worker will operate on a different subset of disks.
Once mode works as before, but tends to want to update recon drops a
little bit more.
If you change the rings, the strategy will shut down workers and spawn
new ones.
You can kill the worker pids and the daemon strategy will respawn them.
New per-disk reconstructor stats are dumped to recon under the
object_reconstruction_per_disk key. To maintain legacy compatibility
and replication monitoring based on cycle times they are aggregated
every stats_interval (default 5 mins).
Change-Id: I28925a37f3985c9082b5a06e76af4dc3ec813abe
While reviewing https://review.openstack.org/#/c/464982/ I noticed that
assertions like
self.assertEqual(len(captured_ssync), 4)
aren't terribly helpful when they fail.
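
The message above does not say what replaced these assertions. As an
illustration of the problem only, the sketch below compares the failure
message of a bare length check with one that passes the captured list as
the msg argument; the list contents are made up.

```python
import unittest

case = unittest.TestCase()
captured = ['job-a', 'job-b', 'job-c']  # hypothetical captured calls


def failure_message(check):
    """Run check() and return the assertion message, or None on success."""
    try:
        check()
    except AssertionError as err:
        return str(err)
    return None


# Only reports the two numbers that differ.
bare = failure_message(
    lambda: case.assertEqual(len(captured), 4))
# Also reports what was actually captured.
with_context = failure_message(
    lambda: case.assertEqual(len(captured), 4, captured))
```

The first message only says that the two lengths differ; the second also
shows the captured items, which is usually what you need to debug it.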
Change-Id: I5c5df7ed60e58c1d1bca5a5bfef9352a39a41f2f
The reconstructor handoffs_only needs to aggressively avoid erroneous
I/O related to rehash of primary suffixes.
While in handoffs_only mode the reconstructor won't even look at primary
partitions.
This has a *huge* impact on cycle time once the node has completed
processing handoffs; which results in a much faster and stronger signal
that it's either time to rebalance again or turn off handoffs_only.
Related-Change-Id: Idde4b6cf92fab6c45f2c0c2733277701eb436898
Change-Id: If4bbb778d511efe13713590639c8b91615556f22
This patch adds methods to increase the partition power of an existing
object ring without downtime for the users using a 3-step process. Data
won't be moved to other nodes; objects using the new increased partition
power will be located on the same device and are hardlinked to avoid
data movement.
1. A new setting "next_part_power" will be added to the rings, and once
the proxy server reloaded the rings it will send this value to the
object servers on any write operation. Object servers will now create a
hard-link in the new location to the original DiskFile object. Already
existing data will be relinked using a new tool in the new locations
using hardlinks.
2. The actual partition power itself will be increased. Servers will now
use the new partition power to read from and write to. No longer
required hard links in the old object location have to be removed now by
the relinker tool; the relinker tool reads the next_part_power setting
to find object locations that need to be cleaned up.
3. The "next_part_power" flag will be removed.
This mostly implements the spec in [1]; however it's not using an
"epoch" as described there. The idea of the epoch was to store data
using different partition powers in their own namespace to avoid
conflicts with auditors and replicators as well as being able to abort
such an operation and just remove the new tree. This would require some
heavy change of the on-disk data layout, and other object-server
implementations would be required to adopt this scheme too.
Instead the object-replicator is now aware that there is a partition
power increase in progress and will skip replication of data in that
storage policy; the relinker tool should be simply run and afterwards
the partition power will be increased. This shouldn't take that much
time (it's only walking the filesystem and hardlinking); impact should
be low therefore. The relinker should be run on all storage nodes at the
same time in parallel to decrease the required time (though this is not
mandatory). Failures during relinking should not affect cluster
operations - relinking can be even aborted manually and restarted later.
Auditors are not quarantining objects written to a path with a different
partition power and therefore working as before (though they are reading
each object twice in the worst case before the no longer needed hard
links are removed).
Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
[1] https://specs.openstack.org/openstack/swift-specs/specs/in_progress/increasing_partition_power.html
Change-Id: I7d6371a04f5c1c4adbb8733a71f3c177ee5448bb
Revert jobs for tombstone-only parts used to try to talk to all primary
nodes; this fixes them to randomize selection within part_nodes. The
corresponding probe test is modified to reflect this change.
The primary improvement of this patch is that the reconstructor at a
handoff node is able to delete local tombstones when it succeeds in
syncing to fewer than all primary nodes. (Before this patch, it required
all nodes to be responsive to the REVERT requests.)
The number of primary nodes the reconstructor must communicate with is
open to more discussion but, right now with this patch, it's
(replicas - k + 1), which is enough to prevent stale reads.
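
The arithmetic behind (replicas - k + 1) can be sketched as below. The
function names are hypothetical, and the reasoning in the comments is an
interpretation of the sentence above rather than code from the patch.

```python
def revert_quorum(replicas, ec_k):
    """Number of primaries that must accept the tombstone."""
    return replicas - ec_k + 1


def stale_read_possible(replicas, ec_k, synced):
    """True if ec_k primaries could still lack the tombstone.

    A reader needs ec_k fragments to decode, so a stale object can only
    be served if at least ec_k primaries have not seen the tombstone.
    """
    return replicas - synced >= ec_k
```

For a 10+4 policy (14 replicas, k = 10) the quorum is 5: once 5
primaries hold the tombstone only 9 can still hold stale fragments,
which is one short of what a decode needs.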
*BONUS*
- Fix a test mis-setting (was setting fewer replicas than ec_k + ec_m)
for the reconstructor ring in the unit test
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I05ce8fe75f1c4a7971cc8995b003df818b69b3c1
Closes-Bug: #1668857
We don't know the order in which connections will be made, so grab the
ip/port/dev from the mocked conn's request log.
Change-Id: I3b4486c99ad85173d5027b46e1c0613202d0f75a
Often, we want the current timestamp. May as well improve the ergonomics
a bit and provide a class method for it.
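
The shape of such a helper, on a deliberately simplified stand-in class
(this is not Swift's real Timestamp implementation, and the method name
now is an assumption here):

```python
import time


class Timestamp(object):
    """Toy timestamp wrapper used only to illustrate the class method."""

    def __init__(self, value):
        self.timestamp = float(value)

    @classmethod
    def now(cls):
        # cls (not Timestamp) so that subclasses get their own type back.
        return cls(time.time())

    def __repr__(self):
        return '%s(%r)' % (type(self).__name__, self.timestamp)
```

Callers can then write Timestamp.now() instead of
Timestamp(time.time()).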
Change-Id: I3581c635c094a8c4339e9b770331a03eab704074
Object paths can have non-ascii characters. Device dicts will
have unicode values. Forming a string using both will cause the
object path to be coerced to UTF8, which currently causes a
UnicodeDecodeError. This causes _get_response() to not return
and the reconstructor hangs.
The call to _full_path() is moved outside of _get_response()
(where its result is used in the exception handler logging)
so that _get_response() will always return even if _full_path()
raises an exception.
Unit tests are refactored to split out a new class with those
tests using an object name and the _full_path method, so that
the class can be subclassed to use an object name with non-ascii
characters.
Existing probe tests are subclassed to repeat using non-ascii
chars in object paths.
Change-Id: I4c570c08c770636d57b1157e19d5b7034fd9ed4e
Closes-Bug: 1679175
SSYNC is designed to limit concurrent incoming connections in order to
prevent IO contention. The reconstructor should expect remote
replication servers to fail ssync_sender when the remote is too busy.
When the remote rejects SSYNC - it should avoid forcing additional IO
against the remote with a REPLICATE request which causes suffix
rehashing.
Suffix rehashing via REPLICATE verbs takes two forms:
1) an initial pre-flight call to REPLICATE /dev/part will cause a remote
primary to rehash any invalid suffixes and return a map for the local
sender to compare so that a sync can be performed on any mis-matched
suffixes.
2) a final call to REPLICATE /dev/part/suf1-suf2-suf3[-sufX[...]] will
cause the remote primary to rehash the *given* suffixes even if they are
*not* invalid. This is a requirement for rsync replication because
after a suffix is synced via rsync the contents of a suffix dir will
likely have changed and the remote server needs to update its hashes.pkl
to reflect the new data.
SSYNC does not *need* to send a post-sync REPLICATE request. Any
suffixes that are modified by the SSYNC protocol will call _finalize_put
under the hood as it is syncing. It is however not harmful and
potentially useful to go ahead and refresh hashes after an SSYNC while the
inodes of those suffixes are warm in the cache.
However, that only makes sense if the SSYNC conversation actually synced
any suffixes - if SSYNC is rejected for concurrency before it ever got
started there is no value in the remote performing a rehash. It may be
that *another* reconstructor is pushing data into that same partition
and the suffixes will become immediately invalidated.
If an ssync_sender does not successfully finish a sync, the reconstructor
should skip the REPLICATE call entirely and move on to the next
partition without causing any useless remote IO.
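
The control flow described above can be sketched as follows. The two
callables are hypothetical placeholders for the ssync sender and for the
post-sync REPLICATE request; this is not the reconstructor's actual
code.

```python
def sync_suffixes(ssync_sender, send_replicate, suffixes):
    """Push suffixes via ssync; only ask for a rehash if the sync worked.

    ssync_sender(suffixes) -> bool and send_replicate(suffixes) are
    hypothetical callables supplied by the caller.
    """
    success = ssync_sender(suffixes)
    if not success:
        # Remote was too busy or the sync failed: do not cause any
        # further I/O on the remote, just move on to the next partition.
        return False
    send_replicate(suffixes)
    return True
```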
Closes-Bug: #1665141
Change-Id: Ia72c407247e4525ef071a1728750850807ae8231
Some public functions in the diskfile manager expect or return full
file paths. It implies a filesystem diskfile implementation.
To make it easier to plug alternate diskfile implementations, patch
functions to take more generic arguments.
This commit changes DiskFileManager _get_hashes() arguments from:
- partition_path, recalculate=None, do_listdir=False
to :
- device, partition, policy, recalculate=None, do_listdir=False
Callers are modified accordingly, in diskfile.py, reconstructor.py,
and replicator.py
Change-Id: I8e2d7075572e466ae2fa5ebef5e31d87eed90fec
Some public functions in the diskfile manager expect or return full
file paths. It implies a filesystem diskfile implementation.
To make it easier to plug alternate diskfile implementations, patch
functions to take more generic arguments.
This commit changes DiskFileManager yield_hashes() returned values
from:
- object_path, object_hash, timestamps
to:
- object_hash, timestamps
object_path was not used by any caller.
Change-Id: I914fb1ec8ce7c9c26d22e1d07f03bd03f4504176
Suffix hash invalidations in hashes.invalid can be lost when two
concurrent calls to get_hashes consolidate the hashes of a new
partition with no hashes.pkl:
- suffix S has been invalidated and is listed in hashes.invalid
- process X calls get_hashes when there is no existing hashes.pkl
- process X removes the hashes.invalid file in consolidate_hashes
- process X calculates the hash of suffix S (note, process X has
not yet written hashes.pkl)
- process Y invalidates suffix S, appends S to hashes.invalid, so the
hash of suffix S *should* be recalculated at some point
- process Z calls get_hashes->consolidate_hashes, deletes hashes.invalid
because there is still no hashes.pkl
- process Z fails
- process X writes hashes.pkl with stale hash value for suffix S
- the invalidation of S that was made by process Y is lost
The solution is to never remove hashes.invalid during consolidate_hashes
without first recording any invalid suffixes in hashes and writing hashes
to disk in hashes.pkl. This is already the behaviour when hashes.pkl
exists. The cost of an additional write to hashes.pkl, introduced by this
patch, is only incurred once, when get_hashes first encounters a
partition with no hashes.pkl.
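
The ordering that matters is sketched below: record the invalidations
in hashes.pkl first, and only then remove hashes.invalid. This is a
simplified illustration; the real code also takes a partition lock,
which is omitted here, and the one-suffix-per-line file format and the
use of None as the "needs rehash" marker are assumptions of the sketch.

```python
import os
import pickle


def consolidate_hashes(partition_dir):
    """Fold hashes.invalid into hashes.pkl, writing before removing."""
    pkl_path = os.path.join(partition_dir, 'hashes.pkl')
    invalid_path = os.path.join(partition_dir, 'hashes.invalid')
    try:
        with open(pkl_path, 'rb') as f:
            hashes = pickle.load(f)
    except FileNotFoundError:
        hashes = {}  # new partition, no hashes.pkl yet
    try:
        with open(invalid_path) as f:
            invalid = [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        invalid = []
    for suffix in invalid:
        hashes[suffix] = None  # mark the suffix as needing a rehash
    # Persist the invalidations even when hashes.pkl did not exist before.
    tmp_path = pkl_path + '.tmp'
    with open(tmp_path, 'wb') as f:
        pickle.dump(hashes, f)
    os.replace(tmp_path, pkl_path)
    # Only now is it safe to drop the invalidation journal.
    try:
        os.remove(invalid_path)
    except FileNotFoundError:
        pass
    return hashes
```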
Related-Change: Ia43ec2cf7ab715ec37f0044625a10aeb6420f6e3
Change-Id: I08c8cf09282f737103e580c1f57923b399abe58c
The Related-Change removed the frag_index expected from a
node response from the full_path included in log messages.
This made sense because the *expected* frag_index is not
necessarily the index that was actually received from the
node. However, it would be useful to include the *actual*
received frag_index in log messages.
This patch also:
- makes _full_path a module level function
- renames unique_index to be resp_frag_index to aid
understanding of the various indexes we deal with during
reconstruction.
Change-Id: Ic932835b3c1ed51a8456fce775fb59445fcb834b
Related-Change: I8096202f5f8d91296963f7a409a29d57fa7828e4
To address Alistair's comment at
https://review.openstack.org/#/c/219165.
This includes:
- Fix reconstructor log message to avoid redundant frag index info
- Fix incorrect FabricatedRing setting to have ec_k + ec_m replicas
- Use policy.ec_n_unique_fragments for testing frag index election
- Plus some various minor cleanup and docs additions
Huge refactoring around TestECMixin at the test/unit/proxy/test_server.py
is in https://review.openstack.org/#/c/440466/ to clarify the change.
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I8096202f5f8d91296963f7a409a29d57fa7828e4
Tighten up test to verify that 404 response results in a None
return from reconstructor _get_response. Merge this test case
into the test_get_response method to make use of the do_test()
infrastructure. Add check for 503 response status.
Also, use assertFalse to verify empty log lines, since a failure
will then result in any log lines being shown in failure message.
Related-Change: Iba86b495a14c15fc6eca8bf8a7df7d110256b0af
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: Ia83517e6d4c2f5eeb136abd4c04ddab639d40b9e
This is a follow-up for https://review.openstack.org/#/c/436522/
I'd like to use the same assertion when the code goes down the same path.
Both Exception and Timeout result in an exception log that starts with
"Trying to GET"; "Timeout" is just an extra word that appears in the log.
In addition, this adds assertions that the return value from
get_response is None for error cases.
Change-Id: Iba86b495a14c15fc6eca8bf8a7df7d110256b0af
Some part of the test coverage was omitted in related change
and some has been missing. This change fixes it.
Change-Id: I403b493bd8e59f6bcb586b4263a8e8c267728505
Related-Change-Id: I69e4c4baee64fd2192cbf5836b0803db1cc71705
Follow up for related change:
- fix typos
- use common helper methods
- refactor some tests to reduce duplicate code
Related-Change: Idd155401982a2c48110c30b480966a863f6bd305
Change-Id: I2f91a2f31e4c1b11f3d685fa8166c1a25eb87429
This patch enables efficient PUT/GET for globally distributed clusters[1].
Problem:
Erasure coding can reduce the amount of data actually stored compared
with the replicated model. For example, with ec_k=6, ec_m=3 the stored
data is 1.5x the original, which is smaller than 3x for replication.
However, unlike replication, erasure coding requires availability of at
least some ec_k fragments of the total ec_k + ec_m fragments to service
read (e.g. 6 of 9 in the case above). As such, if we stored the
EC object into a swift cluster on 2 geographically distributed data
centers which have the same volume of disks, it is likely the fragments
will be stored evenly (about 4 and 5) so we still need to access a
faraway data center to decode the original object. In addition, if one
of the data centers was lost in a disaster, the stored objects will be
lost forever, and we have to cry a lot. To ensure highly durable
storage, you would think of making *more* parity fragments (e.g.
ec_k=6, ec_m=10), unfortunately this causes *significant* performance
degradation due to the cost of mathematical calculation for erasure
coding encode/decode.
How this resolves the problem:
EC Fragment Duplication extends on the initial solution to add *more*
fragments from which to rebuild an object similar to the solution
described above. The difference is making *copies* of encoded fragments.
With experimental results[1][2], employing small ec_k and ec_m shows
enough performance to store/retrieve objects.
On PUT:
- Encode incoming object with small ec_k and ec_m <- faster!
- Make duplicated copies of the encoded fragments. The # of copies
is determined by 'ec_duplication_factor' in swift.conf
- Store all fragments in Swift Global EC Cluster
The duplicated fragments increase pressure on existing requirements
when decoding objects in service to a read request. All fragments are
stored with their X-Object-Sysmeta-Ec-Frag-Index. In this change, the
X-Object-Sysmeta-Ec-Frag-Index represents the actual fragment index
encoded by PyECLib, there *will* be duplicates. Anytime we must decode
the original object data, we must only consider the ec_k fragments as
unique according to their X-Object-Sysmeta-Ec-Frag-Index. On decode no
duplicate X-Object-Sysmeta-Ec-Frag-Index may be used when decoding an
object, duplicate X-Object-Sysmeta-Ec-Frag-Index should be expected and
avoided if possible.
On GET:
This patch includes the following changes:
- Change GET Path to sort primary nodes grouping as subsets, so that
each subset will include unique fragments
- Change Reconstructor to be more aware of possibly duplicate fragments
For example, with this change, a policy could be configured such that
swift.conf:
ec_num_data_fragments = 2
ec_num_parity_fragments = 1
ec_duplication_factor = 2
(object ring must have 6 replicas)
At Object-Server:
node index (from object ring):  0 1 2 3 4 5 <- keep node index for
                                               reconstruct decision
X-Object-Sysmeta-Ec-Frag-Index: 0 1 2 0 1 2 <- each object keeps actual
                                               fragment index for
                                               backend (PyECLib)
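
The mapping in the table above is consistent with taking the node index
modulo the number of unique fragments (ec_k + ec_m). The sketch below
assumes that mapping and uses hypothetical function names; it also
shows the decode-side rule of using each fragment index at most once.

```python
def frag_index_for_node(node_index, ec_k, ec_m):
    """Backend fragment index for a ring node index (assumed modulo mapping)."""
    return node_index % (ec_k + ec_m)


def pick_unique_fragments(responses, ec_k):
    """Choose ec_k responses with distinct fragment indexes.

    responses is an iterable of (node_index, frag_index) pairs. Returns a
    dict of frag_index -> node_index, or None if fewer than ec_k distinct
    fragment indexes are available.
    """
    chosen = {}
    for node_index, frag_index in responses:
        if frag_index not in chosen:  # skip duplicated fragments
            chosen[frag_index] = node_index
        if len(chosen) == ec_k:
            return chosen
    return None
```

With ec_k = 2, ec_m = 1 and a duplication factor of 2, nodes 0 and 3
hold the same fragment, so responses from just those two nodes are not
enough to decode.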
Additional improvements to Global EC Cluster Support will require
features such as Composite Rings, and more efficient fragment
rebalance/reconstruction.
1: http://goo.gl/IYiNPk (Swift Design Spec Repository)
2: http://goo.gl/frgj6w (Slide Share for OpenStack Summit Tokyo)
Doc-Impact
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: Idd155401982a2c48110c30b480966a863f6bd305
Now that we're shuffling parts before going through them, those stats no
longer make sense -- device completion would always be 100%.
Also, always use delete_partition for cleanup, so we have one place to
make improvements. This means we'll properly clean up non-numeric
directories.
Also also, put more I/O in the tpool in delete_partition.
Change-Id: Ie06bb16c130d46ccf887c8fcb252b8d018072d68
Related-Change: I69e4c4baee64fd2192cbf5836b0803db1cc71705
This is a follow-up for https://review.openstack.org/#/c/425493
This patch includes:
- Add more tests on the configuration with handoffs_first and
handoffs_only
- Remove unnecessary space in a warning log line. (2 places)
- Change test conf from True/False to "True"/"False" (string) because in
the conf dict, those value should be string.
Co-Authored-By: Janie Richling <jrichli@us.ibm.com>
Change-Id: Ida90c32d16481a15fa68c9fdb380932526c366f6
The handoffs_first mode in the replicator has the useful behavior of
processing all handoff parts across all disks until there aren't any
handoffs anymore on the node [1] and then it seemingly tries to drop
back into normal operation. In practice I've only ever heard of
handoffs_first used while rebalancing and turned off as soon as the
rebalance finishes - it's not recommended to run with handoffs_first
mode turned on and it emits a warning on startup if the option is enabled.
The handoffs_first mode on the reconstructor doesn't work - it was
prioritizing handoffs *per-part* [2] - which is really unfortunate
because in the reconstructor during a rebalance it's often *much* more
attractive from a disk/network efficiency perspective to revert a
partition from a handoff than it is to rebuild an entire partition from
another primary using the other EC fragments in the cluster.
This change deprecates handoffs_first in favor of handoffs_only in the
reconstructor which is far more useful - and just like handoffs_first
mode in the replicator - it gives the operator the option of forcing the
consistency engine to focus on rebalance. The handoffs_only behavior is
somewhat consistent with the replicator's handoffs_first option (any
error on any handoff in the replicator will make it essentially handoff
only forever) but the option does what you want and is named correctly
in the reconstructor.
For consistency with the replicator the reconstructor will mostly honor
the handoffs_first option, but if you set handoffs_only in the config it
always takes precedence. Having handoffs_first in your config always
results in a warning, but if handoffs_only is not set and handoffs_first
is true the reconstructor will assume you need handoffs_only and behaves
as such.
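
The precedence rules in the paragraph above can be sketched as follows.
This is an illustration only: resolve_handoffs_only, the warning text
and the set of accepted truthy strings are assumptions, not the
reconstructor's actual code.

```python
TRUE_VALUES = {'true', '1', 'yes', 'on', 't', 'y'}


def is_true(value):
    """Interpret a conf value (usually a string) as a boolean."""
    return value is True or (
        isinstance(value, str) and value.lower() in TRUE_VALUES)


def resolve_handoffs_only(conf):
    """Return (handoffs_only, warnings) for a conf dict of strings."""
    warnings = []
    handoffs_first = conf.get('handoffs_first')
    handoffs_only = conf.get('handoffs_only')
    if handoffs_first is not None:
        # Having the old option in the config always warns.
        warnings.append('handoffs_first is deprecated; use handoffs_only')
    if handoffs_only is None:
        # Fall back to the old option only when the new one is unset.
        handoffs_only = handoffs_first
    return is_true(handoffs_only), warnings
```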
When running in handoffs_only mode the reconstructor will start to log a
warning every cycle if you leave it running in handoffs_only after it
finishes reverting handoffs. However you should be monitoring on-disk
partitions and disable the option as soon as the cluster finishes the
full rebalance cycle.
1. Ia324728d42c606e2f9e7d29b4ab5fcbff6e47aea fixed replicator
handoffs_first "mode"
2. Unlike replication, each partition in an EC policy can have a different
kind of job per frag_index, but the cardinality of jobs is typically
only one (either sync or revert) unless there's been a bunch of errors
during write, and then handoff partitions may hold a number of
different fragments.
Known-Issues:
handoffs_only is not documented outside of the example config, see lp
bug #1626290
Closes-Bug: #1653018
Change-Id: Idde4b6cf92fab6c45f2c0c2733277701eb436898