swift

Author	SHA1	Message	Date
Zuul	3de21d945b	Merge "Remove empty part dirs during ssync replication"	2018-06-23 02:19:18 +00:00
Samuel Merritt	ecf47553b5	Make final stats dump after reconstructor runs once When running in multiprocess mode, the object reconstructor would periodically aggregate its workers' recon data into a single recon measurement. However, at the end of the run, all that was left in recon was the last periodic measurement; any work that took place after that point was not recorded in the aggregate. However, it was recorded in the per-disk stats that the worker processes emitted. This commit adds a final recon aggregation after the worker processes have finished. Change-Id: Ia6a3a931e9e7a23824765b2ab111a5492e509be8	2018-06-04 15:24:45 -07:00
Samuel Merritt	a19548b3e6	Remove empty part dirs during ssync replication When we're pushing data to a remote node using ssync, we end up walking the entire partition's directory tree. We were already removing reclaimable (i.e. old) tombstones and non-durable EC data files plus their containing hash dirs, but we were leaving the suffix dirs around for future removal, and we weren't cleaning up partition dirs at all. Now we remove as much of the directory structure as we can, even up to the partition dir, as soon as we observe that it's empty. Change-Id: I2849a757519a30684646f3a6f4467c21e9281707 Closes-Bug: 1706321	2018-05-01 17:18:22 -07:00
Samuel Merritt	26538d3f62	Make multiprocess reconstructor's logs more readable. Much like the multiprocess object replicator, the reconstructor runs multiple concurrent worker processes who all log to the same destination. We re-use the same solution: prepend a prefix with the worker index and the pid to all the logs emitted from each worker process. Example log line: [worker 12/24 pid=8539] I did a thing Change-Id: Ie2f98201193952be4d387bbb01c7c6fccc017a8a	2018-04-25 11:18:35 -07:00
Samuel Merritt	c4751d0d55	Make reconstructor go faster with --override-devices The object reconstructor will now fork all available worker processes when operating on a subset of local devices. Example: A system has 24 disks, named "d1" through "d24" reconstructor_workers = 8 invoked with --override-devices=d1,d2,d3,d4,d5,d6 In this case, the reconstructor will now use 6 worker processes, one per disk. The old behavior was to use 2 worker processes, one for d1, d3, and d5 and the other for d2, d4, and d6 (because 24 / 8 = 3, so we assigned 3 disks per worker before creating another). I think the new behavior better matches operators' expectations. If I give a concurrent program six tasks to do and tell it to operate on up to eight at a time, I'd expect it to do all six tasks at once, not run two concurrent batches of three tasks apiece. This has no effect when --override-devices is not specified. When operating on all local devices instead of a subset, the new and old code produce the same result. The reconstructor's behavior now matches the object replicator's behavior. Change-Id: Ib308c156c77b9b92541a12dd7e9b1a8ea8307a30	2018-04-25 11:18:35 -07:00
Samuel Merritt	728b4ba140	Add checksum to object extended attributes Currently, our integrity checking for objects is pretty weak when it comes to object metadata. If the extended attributes on a .data or .meta file get corrupted in such a way that we can still unpickle it, we don't have anything that detects that. This could be especially bad with encrypted etags; if the encrypted etag (X-Object-Sysmeta-Crypto-Etag or whatever it is) gets some bits flipped, then we'll cheerfully decrypt the cipherjunk into plainjunk, then send it to the client. Net effect is that the client sees a GET response with an ETag that doesn't match the MD5 of the object and Swift has no way of detecting and quarantining this object. Note that, with an unencrypted object, if the ETag metadatum gets mangled, then the object will be quarantined by the object server or auditor, whichever notices first. As part of this commit, I also ripped out some mocking of getxattr/setxattr in tests. It appears to be there to allow unit tests to run on systems where /tmp doesn't support xattrs. However, since the mock is keyed off of inode number and inode numbers get re-used, there's lots of leakage between different test runs. On a real FS, unlinking a file and then creating a new one of the same name will also reset the xattrs; this isn't the case with the mock. The mock was pretty old; Ubuntu 12.04 and up all support xattrs in /tmp, and recent Red Hat / CentOS releases do too. The xattr mock was added in 2011; maybe it was to support Ubuntu Lucid Lynx? Bonus: now you can pause a test with the debugger, inspect its files in /tmp, and actually see the xattrs along with the data. Since this patch now uses a real filesystem for testing filesystem operations, tests are skipped if the underlying filesystem does not support setting xattrs (eg tmpfs or more than 4k of xattrs on ext4). References to "/tmp" have been replaced with calls to tempfile.gettempdir(). 
This will allow setting the TMPDIR envvar in test setup and getting an XFS filesystem instead of ext4 or tmpfs. THIS PATCH SIGNIFICANTLY CHANGES TESTING ENVIRONMENTS With this patch, every test environment will require TMPDIR to be using a filesystem that supports at least 4k of extended attributes. Neither ext4 nor tmpfs supports this. XFS is recommended. So why all the SkipTests? Why not simply raise an error? We still need the tests to run on the base image for OpenStack's CI system. Since we were previously mocking out xattr, there wasn't a problem, but we also weren't actually testing anything. This patch adds functionality to validate xattr data, so we need to drop the mock. `test.unit.skip_if_no_xattrs()` is also imported into `test.functional` so that functional tests can import it from the functional test namespace. The related OpenStack CI infrastructure changes are made in https://review.openstack.org/#/c/394600/. Co-Authored-By: John Dickinson <me@not.mn> Change-Id: I98a37c0d451f4960b7a12f648e4405c6c6716808	2017-11-03 13:30:05 -04:00
Pavel Kvasnička	163fb4d52a	Always require device dir for containers For test purposes (e.g. saio probetests) even if mount_check is False, still require check_dir for account/container server storage when real mount points are not used. This behavior is consistent with the object-server's checks in diskfile. Co-Author: Clay Gerrard <clay.gerrard@gmail.com> Related lp bug #1693005 Related-Change-Id: I344f9daaa038c6946be11e1cf8c4ef104a09e68b Depends-On: I52c4ecb70b1ae47e613ba243da5a4d94e5adedf2 Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764	2017-09-01 10:32:12 -07:00
Clay Gerrard	63ca3a74ef	Drop reconstructor stats when worker has no devices If you're watching (new) node's reconstruction_last time to ensure a cycle finishes since the last ring rebalance you won't ever see reconstructors with no devices drop recon stats. Change-Id: I84c07fc6841119b00d1a74078fe53f4ce637187b	2017-08-21 17:50:10 +01:00
Romain LE DISEZ	69df458254	Allow to rebuild a fragment of an expired object When a fragment of an expired object was missing, the reconstructor ssync job would send a DELETE sub-request. This leads to situation where, for the same object and timestamp, some nodes have a data file, while others can have a tombstone file. This patch forces the reconstructor to reconstruct a data file, even for expired objects. DELETE requests are only sent for tombstoned objects. Co-Authored-By: Alistair Coles <alistairncoles@gmail.com> Closes-Bug: #1652323 Change-Id: I7f90b732c3268cb852b64f17555c631d668044a8	2017-08-04 23:05:08 +02:00
Tim Burke	8d05325f03	Test reconstruct() with no EC policies We have a test for get_local_devices, but let's make some broader assertions as well. Related-Bug: #1707595 Change-Id: Ifa696207ffdb3b39650dfeaa3e7c6cfda94050db	2017-08-01 09:18:07 +01:00
Kota Tsuyuzaki	45cc1d02d0	Fix reconstructer to be able to run non ec policy environment Since the related change, the object-reconstructor gathers the local devices for EC policies via the get_local_devices method, but the method raises TypeError when attempting to reduce an empty list of sets. The list can be empty when no EC config is found in swift.conf. This patch fixes get_local_devices to return an empty set, without errors, even when there is no EC config in swift.conf. Co-Authored-By: Kirill Zaitsev <k.zaitsev@me.com> Change-Id: Ic121fb547966787a43f9eae83c91bb2bf640c4be Related-Change: 701a172afac37229b85ea762f20428f6f422d29b Closes-Bug: #1707595	2017-07-31 18:46:22 +09:00
Alistair Coles	56a18ac9b7	Add unit test for ObjectReconstructor.is_healthy Add a test that verifies that get_all_devices does fetch devices from the ring. Related-Change: I28925a37f3985c9082b5a06e76af4dc3ec813abe Change-Id: Ie2f83694f14f9a614b5276bbb859b9a3c0ec5dcb	2017-07-27 14:14:26 +01:00
Clay Gerrard	701a172afa	Add multiple worker processes strategy to reconstructor This change adds a new Strategy concept to the daemon module similar to how we manage WSGI workers. We need to leverage multiple python processes to get the concurrency properties we need. More workers will rebalance much faster on dense chassis with many devices. Currently the default is still only one process, and no workers. Set reconstructor_workers in the [object-reconstructor] section to some whole number <= the number of devices on a node to get that many reconstructor workers. Each worker will operate on a different subset of disks. Once mode works as before, but tends to want to update recon drops a little bit more. If you change the rings, the strategy will shutdown workers and spawn new ones. You can kill the worker pids and the daemon strategy will respawn them. New per-disk reconstructor stats are dumped to recon under the object_reconstruction_per_disk key. To maintain legacy compatibility and replication monitoring based on cycle times they are aggregated every stats_interval (default 5 mins). Change-Id: I28925a37f3985c9082b5a06e76af4dc3ec813abe	2017-07-26 16:55:10 -07:00
Jenkins	83b62b4f39	Merge "Add Timestamp.now() helper"	2017-07-18 03:27:50 +00:00
Tim Burke	20a0d67340	Clean up some assertions in test_reconstructor While reviewing https://review.openstack.org/#/c/464982/ I noticed that assertions like self.assertEqual(len(captured_ssync), 4) aren't terribly helpful when they fail. Change-Id: I5c5df7ed60e58c1d1bca5a5bfef9352a39a41f2f	2017-07-10 15:32:07 -07:00
Jenkins	8472cb6538	Merge "Fix sporadic failures in test_reconstructor.py"	2017-07-10 17:18:53 +00:00
Jenkins	ded0de7aa5	Merge "Don't rehash primaries in reconstructor handoffs_only mode"	2017-07-10 16:16:00 +00:00
Clay Gerrard	44c63c6990	Don't rehash primaries in reconstructor handoffs_only mode The reconstructor handoffs_only needs to aggressively avoid erroneous I/O related to rehash of primary suffixes. While in handoffs_only mode the reconstructor won't even look at primary partitions. This has a huge impact on cycle time once the node has completed processing handoffs; which results in a much faster and stronger signal that it's either time to rebalance again or turn off handoffs_only. Related-Change-Id: Idde4b6cf92fab6c45f2c0c2733277701eb436898 Change-Id: If4bbb778d511efe13713590639c8b91615556f22	2017-07-07 15:16:00 -07:00
Jenkins	e94b383655	Merge "Add support to increase object ring partition power"	2017-07-05 14:40:42 +00:00
Christian Schwede	e1140666d6	Add support to increase object ring partition power This patch adds methods to increase the partition power of an existing object ring without downtime for the users using a 3-step process. Data won't be moved to other nodes; objects using the new increased partition power will be located on the same device and are hardlinked to avoid data movement. 1. A new setting "next_part_power" will be added to the rings, and once the proxy server reloaded the rings it will send this value to the object servers on any write operation. Object servers will now create a hard-link in the new location to the original DiskFile object. Already existing data will be relinked using a new tool in the new locations using hardlinks. 2. The actual partition power itself will be increased. Servers will now use the new partition power to read from and write to. No longer required hard links in the old object location have to be removed now by the relinker tool; the relinker tool reads the next_part_power setting to find object locations that need to be cleaned up. 3. The "next_part_power" flag will be removed. This mostly implements the spec in [1]; however it's not using an "epoch" as described there. The idea of the epoch was to store data using different partition powers in their own namespace to avoid conflicts with auditors and replicators as well as being able to abort such an operation and just remove the new tree. This would require some heavy change of the on-disk data layout, and other object-server implementations would be required to adopt this scheme too. Instead the object-replicator is now aware that there is a partition power increase in progress and will skip replication of data in that storage policy; the relinker tool should be simply run and afterwards the partition power will be increased. This shouldn't take that much time (it's only walking the filesystem and hardlinking); impact should be low therefore. 
The relinker should be run on all storage nodes at the same time in parallel to decrease the required time (though this is not mandatory). Failures during relinking should not affect cluster operations - relinking can be even aborted manually and restarted later. Auditors are not quarantining objects written to a path with a different partition power and therefore working as before (though they are reading each object twice in the worst case before the no longer needed hard links are removed). Co-Authored-By: Alistair Coles <alistair.coles@hpe.com> Co-Authored-By: Matthew Oliver <matt@oliver.net.au> Co-Authored-By: Tim Burke <tim.burke@gmail.com> [1] https://specs.openstack.org/openstack/swift-specs/specs/in_progress/ increasing_partition_power.html Change-Id: I7d6371a04f5c1c4adbb8733a71f3c177ee5448bb	2017-06-15 15:08:48 -07:00
Jenkins	d46b0f29f9	Merge "Limit number of revert tombstone SSYNC requests"	2017-06-08 18:10:08 +00:00
Mahati Chamarthy	188c07e12a	Limit number of revert tombstone SSYNC requests Revert tombstone only parts try to talk to all primary nodes - this fixes it to randomize selection within part_nodes. Corresponding probe test is modified to reflect this change. The primary improvement of this patch is that the reconstructor at a handoff node is able to delete local tombstones when it succeeds to sync to less than all primary nodes. (Before this patch, it requires all nodes are responsible for the REVERT requests) The number of primary nodes to communicate with the reconstructor can be discussed further but, right now with this patch, it's (replicas - k + 1) that is able to prevent stale read. BONUS - Fix mis-testsetting (was setting less replicas than ec_k + ec_m) for reconstructor ring in the unit test Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp> Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com> Change-Id: I05ce8fe75f1c4a7971cc8995b003df818b69b3c1 Closes-Bug: #1668857	2017-06-08 07:07:42 +00:00
Tim Burke	75290ec6ec	Fix sporadic failures in test_reconstructor.py We don't know the order in which connections will be made, so grab the ip/port/dev from the mocked conn's request log. Change-Id: I3b4486c99ad85173d5027b46e1c0613202d0f75a	2017-05-02 14:00:34 -07:00
Jenkins	d7a6d6e1e9	Merge "Do not sync suffixes when remote rejects reconstructor revert"	2017-05-01 20:38:07 +00:00
Tim Burke	85d6cd30be	Add Timestamp.now() helper Often, we want the current timestamp. May as well improve the ergonomics a bit and provide a class method for it. Change-Id: I3581c635c094a8c4339e9b770331a03eab704074	2017-04-27 14:19:00 -07:00
Alistair Coles	83750cf79c	Fix UnicodeDecodeError in reconstructor _full_path function Object paths can have non-ascii characters. Device dicts will have unicode values. Forming a string using both will cause the object path to be coerced to UTF8, which currently causes a UnicodeDecodeError. This causes _get_response() to not return and the reconstructor hangs. The call to _full_path() is moved outside of _get_response() (where its result is used in the exception handler logging) so that _get_response() will always return even if _full_path() raises an exception. Unit tests are refactored to split out a new class with those tests using an object name and the _full_path method, so that the class can be subclassed to use an object name with non-ascii characters. Existing probe tests are subclassed to repeat using non-ascii chars in object paths. Change-Id: I4c570c08c770636d57b1157e19d5b7034fd9ed4e Closes-Bug: 1679175	2017-04-18 14:07:01 +01:00
Jenkins	a22208043f	Merge "Modify _get_hashes() arguments to be more generic"	2017-04-10 22:50:11 +00:00
Jenkins	b3e69acb43	Merge "Fix race when consolidating new partition"	2017-04-08 00:55:23 +00:00
Clay Gerrard	a0fcca1e05	Do not sync suffixes when remote rejects reconstructor revert SSYNC is designed to limit concurrent incoming connections in order to prevent IO contention. The reconstructor should expect remote replication servers to fail ssync_sender when the remote is too busy. When the remote rejects SSYNC - it should avoid forcing additional IO against the remote with a REPLICATE request which causes suffix rehashing. Suffix rehashing via REPLICATE verbs takes two forms: 1) an initial pre-flight call to REPLICATE /dev/part will cause a remote primary to rehash any invalid suffixes and return a map for the local sender to compare so that a sync can be performed on any mis-matched suffixes. 2) a final call to REPLICATE /dev/part/suf1-suf2-suf3[-sufX[...]] will cause the remote primary to rehash the given suffixes even if they are not invalid. This is a requirement for rsync replication because after a suffix is synced via rsync the contents of a suffix dir will likely have changed and the remote server needs to update its hashes.pkl to reflect the new data. SSYNC does not need to send a post-sync REPLICATE request. Any suffixes that are modified by the SSYNC protocol will call _finalize_put under the hood as it is syncing. It is however not harmful and potentially useful to go ahead and refresh hashes after an SSYNC while the inodes of those suffixes are warm in the cache. However, that only makes sense if the SSYNC conversation actually synced any suffixes - if SSYNC is rejected for concurrency before it ever got started there is no value in the remote performing a rehash. It may be that another reconstructor is pushing data into that same partition and the suffixes will become immediately invalidated. If a ssync_sender does not successfully finish a sync the reconstructor should skip the REPLICATE call entirely and move on to the next partition without causing any useless remote IO. 
Closes-Bug: #1665141 Change-Id: Ia72c407247e4525ef071a1728750850807ae8231	2017-04-06 17:37:34 +01:00
Alexandre Lécuyer	95905b0174	Modify _get_hashes() arguments to be more generic Some public functions in the diskfile manager expect or return full file paths. It implies a filesystem diskfile implementation. To make it easier to plug alternate diskfile implementations, patch functions to take more generic arguments. This commit changes DiskFileManager _get_hashes() arguments from: - partition_path, recalculate=None, do_listdir=False to : - device, partition, policy, recalculate=None, do_listdir=False Callers are modified accordingly, in diskfile.py, reconstructor.py, and replicator.py Change-Id: I8e2d7075572e466ae2fa5ebef5e31d87eed90fec	2017-03-29 14:57:40 +02:00
Alexandre Lécuyer	cff7455a68	Remove unused returned value object_path from yield_hashes() Some public functions in the diskfile manager expect or return full file paths. It implies a filesystem diskfile implementation. To make it easier to plug alternate diskfile implementations, patch functions to take more generic arguments. This commit changes DiskFileManager yield_hashes() returned values from: - object_path, object_hash, timestamps to: - object_hash, timestamps object_path was not used by any caller. Change-Id: I914fb1ec8ce7c9c26d22e1d07f03bd03f4504176	2017-03-27 17:18:28 +02:00
Alistair Coles	52a23ddb3c	Fix race when consolidating new partition Suffix hash invalidations in hashes.invalid can be lost when two concurrent calls to get_hashes consolidate the hashes of a new partition with no hashes.pkl: - suffix S has been invalidated and is listed in hashes.invalid - process X calls get_hashes when there is no existing hashes.pkl - process X removes hashes.invalids file in consolidate_hashes - process X calculates the hash of suffix S (note, process X has not yet written hashes.pkl) - process Y invalidates suffix S, appends S to hashes.invalid, so the hash of suffix S should be recalculated at some point - process Z calls get_hashes->consolidate_hashes, deletes hashes.invalid because there is still no hashes.pkl - process Z fails - process X writes hashes.pkl with stale hash value for suffix S - the invalidation of S that was made by process Y is lost The solution is to never remove hashes.invalid during consolidate_hashes without first recording any invalid suffixes in hashes and writing hashes to disk in hashes.pkl. This is already the behaviour when hashes.pkl exists. The cost of an additional write to hashes.pkl, introduced by this patch, is only incurred once, when get_hashes first encounters a partition with no hashes.pkl. Related-Change: Ia43ec2cf7ab715ec37f0044625a10aeb6420f6e3 Change-Id: I08c8cf09282f737103e580c1f57923b399abe58c	2017-03-20 12:49:57 +00:00
Alistair Coles	56349e022d	Include received frag_index in reconstructor log warnings The Related-Change removed the frag_index expected from a node response from the full_path included in log messages. This made sense because the expected frag_index is not necessarily the index that was actually received from the node. However, it would be useful to include the actual received frag_index in log messages. This patch also: - makes _full_path a module level function - renames unique_index to be resp_frag_index to aid understanding of the various indexes we deal with during reconstruction. Change-Id: Ic932835b3c1ed51a8456fce775fb59445fcb834b Related-Change: I8096202f5f8d91296963f7a409a29d57fa7828e4	2017-03-17 09:18:17 +00:00
Kota Tsuyuzaki	a2f4046624	Small fixes for ec duplication To address Alistair's comment at https://review.openstack.org/#/c/219165. This includes: - Fix reconstructor log message to avoid redundant frag index info - Fix incorrect FabricatedRing setting to have ec_k + ec_m replicas - Use policy.ec_n_unique_fragments for testing frag index election - Plus some various minor cleanup and docs additions Huge refactoring around TestECMixin at the test/unit/proxy/test_server.py is in https://review.openstack.org/#/c/440466/ to clarify the change. Co-Authored-By: Alistair Coles <alistairncoles@gmail.com> Change-Id: I8096202f5f8d91296963f7a409a29d57fa7828e4	2017-03-16 21:59:56 -07:00
Alistair Coles	54fe738a0e	Add assertions to test_reconstructor test_get_response Tighten up test to verify that 404 response results in a None return from reconstructor _get_response. Merge this test case into the test_get_response method to make use of the do_test() infrastructure. Add check for 503 response status. Also, use assertFalse to verify empty log lines, since a failure will then result in any log lines being shown in failure message. Related-Change: Iba86b495a14c15fc6eca8bf8a7df7d110256b0af Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com> Change-Id: Ia83517e6d4c2f5eeb136abd4c04ddab639d40b9e	2017-03-08 10:31:02 +00:00
Jenkins	7f953ce0b9	Merge "Follow up the reconstructor test coverage"	2017-03-07 15:10:03 +00:00
Jenkins	30524392f8	Merge "Follow up on reconstructor handoffs_only"	2017-03-07 06:45:09 +00:00
Kota Tsuyuzaki	0e44770991	Follow up the reconstructor test coverage This is follow up for https://review.openstack.org/#/c/436522/ I'd like to use same assertion if it goes the same path. Both Exception and Timeout will be in the exception log starts with "Trying to GET". "Timeout" is an extra word appeared in the log. And more, this adds assertions for the return value from the get_response for error cases which should be as None. Change-Id: Iba86b495a14c15fc6eca8bf8a7df7d110256b0af	2017-03-05 18:02:09 -08:00
Jenkins	cf1c44dff0	Merge "Fixups for EC frag duplication tests"	2017-03-03 23:08:34 +00:00
Jenkins	3891721d59	Merge "Cleanup reconstructor tests"	2017-03-03 21:28:32 +00:00
Kota Tsuyuzaki	54347f92ed	Cleanup reconstructor tests Fixes: * assertTrue(xxxx in yyyyy) -> assertIn(xxxx, yyyy) * assertTrue(xxxx > yyyy) -> assertGreater(xxxx, yyyy) Change-Id: I353ec389f9abed3427951cd473d7c3ebcbca1669	2017-03-03 00:57:13 -08:00
Mahati Chamarthy	96f8b957ee	Increase test coverage for reconstructor Some part of the test coverage was omitted in related change and some has been missing. This change fixes it. Change-Id: I403b493bd8e59f6bcb586b4263a8e8c267728505 Related-Change-Id: I69e4c4baee64fd2192cbf5836b0803db1cc71705	2017-02-28 00:54:11 +05:30
Jenkins	1f36b5dd16	Merge "EC Fragment Duplication - Foundational Global EC Cluster Support"	2017-02-26 06:26:08 +00:00
Alistair Coles	e4972f5ac7	Fixups for EC frag duplication tests Follow up for related change: - fix typos - use common helper methods - refactor some tests to reduce duplicate code Related-Change: Idd155401982a2c48110c30b480966a863f6bd305 Change-Id: I2f91a2f31e4c1b11f3d685fa8166c1a25eb87429	2017-02-25 20:40:04 -08:00
Kota Tsuyuzaki	40ba7f6172	EC Fragment Duplication - Foundational Global EC Cluster Support This patch enables efficient PUT/GET for global distributed cluster[1]. Problem: Erasure coding has the capability to decrease the amount of actual stored data to less than the replicated model. For example, ec_k=6, ec_m=3 parameter can be 1.5x of the original data which is smaller than 3x replicated. However, unlike replication, erasure coding requires availability of at least some ec_k fragments of the total ec_k + ec_m fragments to service read (e.g. 6 of 9 in the case above). As such, if we stored the EC object into a swift cluster on 2 geographically distributed data centers which have the same volume of disks, it is likely the fragments will be stored evenly (about 4 and 5) so we still need to access a faraway data center to decode the original object. In addition, if one of the data centers was lost in a disaster, the stored objects will be lost forever, and we have to cry a lot. To ensure highly durable storage, you would think of making more parity fragments (e.g. ec_k=6, ec_m=10), unfortunately this causes significant performance degradation due to the cost of mathematical calculation for erasure coding encode/decode. How this resolves the problem: EC Fragment Duplication extends on the initial solution to add more fragments from which to rebuild an object similar to the solution described above. The difference is making copies of encoded fragments. With experimental results[1][2], employing small ec_k and ec_m shows enough performance to store/retrieve objects. On PUT: - Encode incoming object with small ec_k and ec_m <- faster! - Make duplicated copies of the encoded fragments. The # of copies are determined by 'ec_duplication_factor' in swift.conf - Store all fragments in Swift Global EC Cluster The duplicated fragments increase pressure on existing requirements when decoding objects in service to a read request. 
All fragments are stored with their X-Object-Sysmeta-Ec-Frag-Index. In this change, the X-Object-Sysmeta-Ec-Frag-Index represents the actual fragment index encoded by PyECLib, there will be duplicates. Anytime we must decode the original object data, we must only consider the ec_k fragments as unique according to their X-Object-Sysmeta-Ec-Frag-Index. On decode no duplicate X-Object-Sysmeta-Ec-Frag-Index may be used when decoding an object, duplicate X-Object-Sysmeta-Ec-Frag-Index should be expected and avoided if possible. On GET: This patch includes the following changes: - Change GET Path to sort primary nodes grouping as subsets, so that each subset will include unique fragments - Change Reconstructor to be more aware of possibly duplicate fragments For example, with this change, a policy could be configured such that swift.conf: ec_num_data_fragments = 2 ec_num_parity_fragments = 1 ec_duplication_factor = 2 (object ring must have 6 replicas) At Object-Server: node index (from object ring): 0 1 2 3 4 5 <- keep node index for reconstruct decision X-Object-Sysmeta-Ec-Frag-Index: 0 1 2 0 1 2 <- each object keeps actual fragment index for backend (PyEClib) Additional improvements to Global EC Cluster Support will require features such as Composite Rings, and more efficient fragment rebalance/reconstruction. 1: http://goo.gl/IYiNPk (Swift Design Spec Repository) 2: http://goo.gl/frgj6w (Slide Share for OpenStack Summit Tokyo) Doc-Impact Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com> Change-Id: Idd155401982a2c48110c30b480966a863f6bd305	2017-02-22 10:56:13 -08:00
Jenkins	1f3dd83f41	Merge "Remove per-device reconstruction stats"	2017-02-20 21:55:27 +00:00
Tim Burke	8973ceb31a	Remove per-device reconstruction stats Now that we're shuffling parts before going through them, those stats no longer make sense -- device completion would always be 100%. Also, always use delete_partition for cleanup, so we have one place to make improvements. This means we'll properly clean up non-numeric directories. Also also, put more I/O in the tpool in delete_partition. Change-Id: Ie06bb16c130d46ccf887c8fcb252b8d018072d68 Related-Change: I69e4c4baee64fd2192cbf5836b0803db1cc71705	2017-02-20 16:22:45 +00:00
Jenkins	cdd72dd34f	Merge "Deprecate broken handoffs_first in favor of handoffs_only"	2017-02-15 03:54:49 +00:00
Kota Tsuyuzaki	600db4841e	Follow up on reconstructor handoffs_only This is a follow-up for https://review.openstack.org/#/c/425493 This patch includes: - Add more tests on the configuration with handoffs_first and handoffs_only - Remove unnecessary space in a warning log line. (2 places) - Change test conf from True/False to "True"/"False" (string) because in the conf dict, those value should be string. Co-Authored-By: Janie Richling <jrichli@us.ibm.com> Change-Id: Ida90c32d16481a15fa68c9fdb380932526c366f6	2017-02-14 18:21:58 -08:00
Clay Gerrard	da557011ec	Deprecate broken handoffs_first in favor of handoffs_only The handoffs_first mode in the replicator has the useful behavior of processing all handoff parts across all disks until there aren't any handoffs anymore on the node [1] and then it seemingly tries to drop back into normal operation. In practice I've only ever heard of handoffs_first used while rebalancing and turned off as soon as the rebalance finishes - it's not recommended to run with handoffs_first mode turned on and it emits a warning on startup if option is enabled. The handoffs_first mode on the reconstructor doesn't work - it was prioritizing handoffs per-part [2] - which is really unfortunate because in the reconstructor during a rebalance it's often much more attractive from an efficiency disk/network perspective to revert a partition from a handoff than it is to rebuild an entire partition from another primary using the other EC fragments in the cluster. This change deprecates handoffs_first in favor of handoffs_only in the reconstructor which is far more useful - and just like handoffs_first mode in the replicator - it gives the operator the option of forcing the consistency engine to focus on rebalance. The handoffs_only behavior is somewhat consistent with the replicator's handoffs_first option (any error on any handoff in the replicator will make it essentially handoff only forever) but the option does what you want and is named correctly in the reconstructor. For consistency with the replicator the reconstructor will mostly honor the handoffs_first option, but if you set handoffs_only in the config it always takes precedence. Having handoffs_first in your config always results in a warning, but if handoffs_only is not set and handoffs_first is true the reconstructor will assume you need handoffs_only and behaves as such. 
When running in handoffs_only mode the reconstructor will start to log a warning every cycle if you leave it running in handoffs_only after it finishes reverting handoffs. However you should be monitoring on-disk partitions and disable the option as soon as the cluster finishes the full rebalance cycle. 1. Ia324728d42c606e2f9e7d29b4ab5fcbff6e47aea fixed replicator handoffs_first "mode" 2. Unlike replication each partition in a EC policy can have a different kind of job per frag_index, but the cardinality of jobs is typically only one (either sync or revert) unless there's been a bunch of errors during write and then handoffs partitions maybe hold a number of different fragments. Known-Issues: handoffs_only is not documented outside of the example config, see lp bug #1626290 Closes-Bug: #1653018 Change-Id: Idde4b6cf92fab6c45f2c0c2733277701eb436898	2017-02-13 21:13:29 -08:00
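
Note on commit c4751d0d55 (--override-devices): the worker arithmetic described in that message can be sketched as below. This is only an illustration under stated assumptions, not Swift's code: the function names assign_devices_old and assign_devices_new are hypothetical, the round-robin split is inferred from the d1/d3/d5 and d2/d4/d6 example in the message, and workers is assumed to be at least 1.

```python
import math


def assign_devices_old(all_devices, override_devices, workers):
    """Old behavior as described in the commit message (hypothetical helper).

    The number of disks per worker is derived from ALL local devices,
    even when only a subset is being processed.
    """
    per_worker = math.ceil(len(all_devices) / workers)
    n_workers = math.ceil(len(override_devices) / per_worker)
    # Round-robin the selected devices across the (few) workers.
    return [override_devices[i::n_workers] for i in range(n_workers)]


def assign_devices_new(override_devices, workers):
    """New behavior as described in the commit message (hypothetical helper).

    Use as many workers as allowed, up to one per selected device.
    """
    n_workers = min(workers, len(override_devices))
    return [override_devices[i::n_workers] for i in range(n_workers)]


if __name__ == "__main__":
    all_devs = ["d%d" % i for i in range(1, 25)]   # 24 disks, d1..d24
    subset = all_devs[:6]                          # --override-devices=d1,...,d6
    print(assign_devices_old(all_devs, subset, 8))
    print(assign_devices_new(subset, 8))
```

With 24 disks, reconstructor_workers = 8 and six override devices, the first function yields two workers of three disks each and the second yields six workers of one disk each, which matches the numbers given in the commit message.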


84 Commits
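
Note on commit 40ba7f6172 (EC Fragment Duplication): the node-index to fragment-index table in that message (node index 0 1 2 3 4 5, frag index 0 1 2 0 1 2) is consistent with a simple modulo rule, sketched below. The modulo rule is an assumption inferred from that one example, and both function names are hypothetical; this is not PyECLib's or Swift's API.

```python
def frag_index_for_node(node_index, ec_k, ec_m, duplication_factor):
    """Map a ring node index to the backend fragment index (hypothetical).

    Assumes duplicates are laid out so that node_index modulo the number
    of unique fragments gives the fragment index, as in the commit's table.
    """
    n_unique = ec_k + ec_m
    n_replicas = n_unique * duplication_factor
    if not 0 <= node_index < n_replicas:
        raise ValueError("node_index %r outside ring of %d replicas"
                         % (node_index, n_replicas))
    return node_index % n_unique


def select_unique_fragments(node_frag_pairs, ec_k):
    """Pick ec_k responses with distinct fragment indexes (hypothetical).

    node_frag_pairs is an iterable of (node_index, frag_index) tuples.
    Returns {frag_index: node_index}, or None when fewer than ec_k
    distinct fragment indexes are available.
    """
    chosen = {}
    for node_index, frag_index in node_frag_pairs:
        # Skip duplicates: only one copy of each fragment helps decoding.
        if frag_index not in chosen:
            chosen[frag_index] = node_index
        if len(chosen) == ec_k:
            return chosen
    return None


if __name__ == "__main__":
    # ec_num_data_fragments = 2, ec_num_parity_fragments = 1,
    # ec_duplication_factor = 2, so the object ring has 6 replicas.
    print([frag_index_for_node(n, 2, 1, 2) for n in range(6)])
```

The second helper reflects the message's point that, on decode, only fragments with distinct X-Object-Sysmeta-Ec-Frag-Index values count toward the ec_k needed.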