The new ring_ip option will be used when object services find their own
devices in rings, defaulting to the bind_ip.
Notably, this allows services to be containerized while servers_per_port
is enabled:
* For the object-server, the ring_ip should be set to the host ip and
will be used to discover which ports need binding. Sockets will still
be bound to the bind_ip (likely 0.0.0.0), with the assumption that the
host will publish ports 1:1.
* For the replicator and reconstructor, the ring_ip will be used to
discover which devices should be replicated. While bind_ip could
previously be used for this, it would have required a separate config
from the object-server.
Also rename the object daemons' bind_ip attribute to ring_ip so that it's
more obvious wherever we're using the IP for ring lookups instead of
socket binding.
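For illustration, a containerized object-server might combine the two
options like this in object-server.conf (the host IP is just a placeholder):
[DEFAULT]
bind_ip = 0.0.0.0
# host IP used for ring lookups; ports are assumed to be published 1:1
ring_ip = 192.0.2.10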
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Change-Id: I1c9bb8086994f7930acd8cda8f56e766938c2218
This new backend ratelimiting middleware is a fairly blunt tool:
ratelimiting is per device and
applied independently in each worker, but this at least provides
some limit to disk IO on backend servers.
GET, HEAD, PUT, POST, DELETE, UPDATE and REPLICATE methods may be
rate-limited.
Only requests with a path starting with '<device>/<partition>', where
<partition> can be cast to an integer, will be rate-limited. Other
requests, including, for example, recon requests with paths such as
'recon/version', are unconditionally forwarded to the next app in the
pipeline.
OPTIONS and SSYNC methods are not rate-limited. Note that
SSYNC sub-requests are passed directly to the object server app
and will not pass through this middleware.
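A minimal sketch of how the middleware might be wired into
object-server.conf (the filter name and option shown are assumptions;
check the sample config shipped with this change):
[pipeline:main]
pipeline = backend_ratelimit object-server
[filter:backend_ratelimit]
use = egg:swift#backend_ratelimit
# 0 means unlimited; a positive value enables per-device ratelimiting
requests_per_device_per_second = 50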
Change-Id: I78b59a081698a6bff0d74cbac7525e28f7b5d7c1
- Drop log level for successful rsyncs to debug; ops don't usually care.
- Add an option to skip "send" lines entirely -- in a large cluster,
during a meaningful expansion, there's too much information getting
logged; it's just wasting disk space.
Note that we already have similar filtering for directory creation;
that's been present since the initial commit of Swift code.
Drive-by: make it a little more clear that more than one suffix was
likely replicated when logging about success.
Change-Id: I02ba67e77e3378b2c2c8c682d5d230d31cd1bfa9
Previously, object updates that could not be sent immediately due to
per-container/bucket ratelimiting [1] would be skipped and re-tried
during the next updater cycle. There could potentially be a period of
time at the end of a cycle when the updater slept, having completed a
sweep of the on-disk async pending files, despite having skipped
updates during the cycle. Skipped updates would then be read from disk
again during the next cycle.
With this change the updater will defer skipped updates to an
in-memory queue (up to a configurable maximum number) until the sweep
of async pending files has completed, and then trickle out deferred
updates until the cycle's interval expires. This increases the useful
work done in the current cycle and reduces the amount of repeated disk
IO during the next cycle.
The deferrals queue is bounded in size and will evict least recently
read updates in order to accept more recently read updates. This
reduces the probability that a deferred update has been made obsolete
by newer on-disk async pending files while waiting in the deferrals
queue.
The deferrals queue is implemented as a collection of per-bucket
queues so that updates can be drained from the queues in the order
that buckets cease to be ratelimited.
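For illustration only, the in-memory bound might be tuned in the updater
config; the option name and value here are assumptions, so check the
sample object-server.conf for the actual knob:
[object-updater]
# upper bound on skipped updates held in memory until the sweep completes
max_deferred_updates = 10000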
[1] Related-Change: Idef25cd6026b02c1b5c10a9816c8c6cbe505e7ed
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Change-Id: I95e58df9f15c5f9d552b8f4c4989a474f52262f4
Throw our stream of async_pendings through a hash ring; if the virtual
bucket gets hot just start leaving the updates on the floor and move on.
It's off by default, and if you use it you're probably going to leave a
bunch of async updates pointed at a small set of containers in the queue
for the next sweep every sweep (so maybe turn it off at some point).
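A hedged sketch of enabling it; the option name is an assumption
introduced here for illustration, so check the sample config for the
real knob:
[object-updater]
# 0 (assumed default) leaves per-container ratelimiting disabled
max_objects_per_container_per_second = 10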
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: Idef25cd6026b02c1b5c10a9816c8c6cbe505e7ed
Previously the ssync Sender would attempt to revert all objects in a
partition within a single SSYNC request. With this change the
reconstructor daemon option max_objects_per_revert can be used to limit
the number of objects reverted inside a single SSYNC request for revert
type jobs i.e. when reverting handoff partitions.
If more than max_objects_per_revert are available, the remaining objects
will remain in the sender partition and will not be reverted until the
next call to ssync.Sender, which would currently be the next time the
reconstructor visits that handoff partition.
Note that the option only applies to handoff revert jobs, not to sync
jobs.
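For example, an operator might cap revert-job SSYNC transfers like this
(the value is arbitrary):
[object-reconstructor]
max_objects_per_revert = 1000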
Change-Id: If81760c80a4692212e3774e73af5ce37c02e8aff
The nondurable_purge_delay option was introduced in [1] to prevent the
reconstructor removing non-durable data files on handoffs that were
about to be made durable. The DiskFileManager commit_window option has
since been introduced [2] which specifies a similar time window during
which non-durable data files should not be removed. The commit_window
option can be re-used by the reconstructor, making the
nondurable_purge_delay option redundant.
The nondurable_purge_delay option has not been available in any tagged
release and is therefore removed with no backwards compatibility.
[1] Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
[2] Related-Change: I5f3318a44af64b77a63713e6ff8d0fd3b6144f13
Change-Id: I1589a7517b7375fcc21472e2d514f26986bf5079
DiskFileManager will remove any stale files during
cleanup_ondisk_files(): these include tombstones and nondurable EC
data fragments whose timestamps are older than reclaim_age. It can
usually be safely assumed that a non-durable data fragment older than
reclaim_age is not going to become durable. However, if an agent PUTs
objects with specified older X-Timestamps (for example the reconciler
or container-sync) then there is a window of time during which the
object server has written an old non-durable data file but has not yet
committed it to make it durable.
Previously, if another process (for example the reconstructor) called
cleanup_ondisk_files during this window then the non-durable data file
would be removed. The subsequent attempt to commit the data file would
then result in a traceback due to there no longer being a data file to
rename, and of course the data file is lost.
This patch modifies cleanup_ondisk_files to not remove old, otherwise
stale, non-durable data files that were only written to disk in the
preceding 'commit_window' seconds. 'commit_window' is configurable for
the object server and defaults to 60.0 seconds.
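For example, in object-server.conf (section placement shown is
illustrative; 60.0 is the stated default):
[DEFAULT]
commit_window = 60.0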
Closes-Bug: #1936508
Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
Change-Id: I5f3318a44af64b77a63713e6ff8d0fd3b6144f13
Previously the reconstructor would quarantine isolated durable
fragments that were more than reclaim_age old. This patch adds a
quarantine_age option for the reconstructor which defaults to
reclaim_age but can be used to configure the age that a fragment must
reach before quarantining.
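For example, to quarantine isolated durable fragments sooner than
reclaim_age (the value is arbitrary):
[object-reconstructor]
quarantine_age = 172800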
Change-Id: I867f3ea0cf60620c576da0c1f2c65cec2cf19aa0
It is possible for the current and next part power locations to
both have existing tombstones with different inodes when the
relinker tries to relink. This can be caused, for example, by
concurrent reconciler DELETEs that specify the same timestamp.
The relinker previously failed to relink and reported an error when
encountering this situation. With this patch the relinker will
tolerate an existing tombstone with the same filename but different
inode in the next part power location.
Since [1] the relinker had special case handling for EEXIST errors
caused by a different inode tombstone already existing in the next
partition power location: the relinker would check to see if the
existing next part power tombstone linked to a tombstone in a previous
part power (i.e. < current part power) location, and if so tolerate
the EEXIST.
This special case handling is no longer necessary because the relinker
will now tolerate an EEXIST when linking a tombstone provided the two
files have the same timestamp. There is therefore no need to search
previous part power locations for a tombstone that does link with the
next part power location.
The link_check_limit is no longer used but the --link-check-limit
command line option is still allowed (although ignored) for backwards
compatibility.
[1] Related-Change-Id: If9beb9efabdad64e81d92708f862146d5fafb16c
Change-Id: I07ffee3b4ba6c7ff6c206beaf6b8f746fe365c2b
Closes-Bug: #1934142
When objects are freshly uploaded, they may take a little time
to appear in container listings, producing false positives.
Because we needed to test this, we also reworked/added the tests
and fixed some issues, including adding an EC fragment (thanks
to Alistair's code).
Closes-Bug: 1925782
Change-Id: Ieafa72a496328f7a487ca7062da6253994a5a07d
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
The reconstructor may revert a non-durable datafile on a handoff
concurrently with an object server PUT that is about to make the
datafile durable. This could previously lead to the reconstructor
deleting the recently written datafile before the object-server
attempts to rename it to a durable datafile, and consequently a
traceback in the object server.
The reconstructor will now only remove reverted nondurable datafiles
that are older (according to mtime) than a period set by a new
nondurable_purge_delay option (defaults to 60 seconds). More recent
nondurable datafiles may be made durable or will remain on the handoff
until a subsequent reconstructor cycle.
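For example (60 is the stated default; the value is otherwise arbitrary):
[object-reconstructor]
nondurable_purge_delay = 60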
Change-Id: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
If the reconstructor finds a fragment that appears to be stale then it
will now quarantine the fragment. Fragments are considered stale if
insufficient fragments at the same timestamp can be found to rebuild
missing fragments, and the number found is less than or equal to a new
reconstructor 'quarantine_threshold' config option.
Before quarantining a fragment the reconstructor will attempt to fetch
fragments from handoff nodes in addition to the usual primary nodes.
The handoff requests are limited by a new 'request_node_count'
config option.
'quarantine_threshold' defaults to zero, i.e. no fragments will be
quarantined. 'request_node_count' defaults to '2 * replicas'.
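A hedged example of enabling quarantining (values are illustrative):
[object-reconstructor]
# quarantine when this many or fewer fragments at a timestamp are found
quarantine_threshold = 1
# how many nodes (primaries plus handoffs) to query before quarantining
request_node_count = 2 * replicas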
Closes-Bug: 1655608
Change-Id: I08e1200291833dea3deba32cdb364baa99dc2816
Add a new option, workers, that works more or less like the same option
from background daemons. Disks will be distributed across N worker
sub-processes so we can make the best use of the I/O available.
While we're at it, log final stats at warning if there were errors.
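A hedged sketch, assuming the option lives in the conf file's [DEFAULT]
section like other settings (section placement is an assumption):
[DEFAULT]
workers = 4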
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I039d2b8861f69a64bd9d2cdf68f1f534c236b2ba
If a previous partition power increase failed to cleanup all files in
their old partition locations, then during the next partition power
increase the relinker may find the same file to relink in more than
one source partition. This currently leads to an error log due to the
second relink attempt getting an EEXIST error.
With this patch, when an EEXIST is raised, the relinker will attempt
to create/verify a link from older partition power locations to the
next part power location, and if such a link is found then suppress
the error log.
During the relink step, if an alternative link is verified and if a
file is found that is neither linked to the next partition power
location nor in the current part power location, then the file is
removed during the relink step. That prevents the same EEXIST occurring
again during the cleanup step when it may no longer be possible to
verify that an alternative link exists.
For example, consider identical filenames in the N+1th, Nth and N-1th
partition power locations, with the N+1th being linked to the Nth:
- During relink, the Nth location is visited and its link is
verified. Then the N-1th location is visited and an EEXIST error
is encountered, but the new check verifies that a link exists to
the Nth location, which is OK.
- During cleanup the locations are visited in the same order, but
files are removed so that the Nth location file no longer exists
when the N-1th location is visited. If the N-1th location still
has a conflicting file then existence of an alternative link to
the Nth location can no longer be verified, so an error would be
raised. Therefore, the N-1th location file must be removed during
relink.
The error is only suppressed for tombstones. The number of partition
power locations that the relinker will look back over may be configured
using the link_check_limit option in a conf file or --link-check-limit
on the command line, and defaults to 2.
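For example, to look back further than the default of 2 (section
placement is assumed):
[DEFAULT]
link_check_limit = 3
or pass --link-check-limit 3 on the command line.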
Closes-Bug: 1921718
Change-Id: If9beb9efabdad64e81d92708f862146d5fafb16c
Sure, you could use stuff like ionice or cgroups to limit relinker I/O,
but sometimes a nice simple blunt instrument is handy.
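A minimal sketch, assuming the blunt instrument is a files_per_second
option in the relinker's conf (the option name is an assumption
introduced here for illustration):
[DEFAULT]
# 0 would mean unlimited
files_per_second = 100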
Change-Id: I7fe29c7913a9e09bdf7a787ccad8bba2c77cf995
Swap in the standard logger options in place of --logfile. Keep --device
as a CLI-only option. Everything else is pretty standard stuff that
ought to be in [DEFAULT].
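For illustration, a minimal conf might look like this (the file layout
and option names are assumptions; --device stays on the command line):
[DEFAULT]
log_facility = LOG_LOCAL0
log_level = INFO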
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I32f979f068592eaac39dcc6807b3114caeaaa814
Swift operators may find it useful to operate on each object in their
cluster in some way. This commit provides them a way to hook into the
object auditor with a simple, clearly-defined boundary so that they
can iterate over their objects without additional disk IO.
For example, a cluster operator may want to ensure a semantic
consistency with all SLO segments accounted in their manifests,
or locate objects that aren't in container listings. Now that Swift
has encryption support, this could be used to locate unencrypted
objects. The list goes on.
This commit makes the auditor locate, via entry points, the watchers
named in its config file.
A watcher is a class with at least these four methods:
__init__(self, conf, logger, **kwargs)
start(self, audit_type, **kwargs)
see_object(self, object_metadata, data_file_path, **kwargs)
end(self, **kwargs)
The auditor will call watcher.start(audit_type) at the start of an
audit pass, watcher.see_object(...) for each object audited, and
watcher.end() at the end of an audit pass. All method arguments are
passed as keyword args.
This version of the API is implemented in the context of the
auditor itself, without spawning any additional processes.
If the plugins are not working well -- hang, crash, or leak --
it's easier to debug them when there's no additional complication
of processes that run by themselves.
In addition, we include a reference implementation of a plugin for
the watcher API, as a help to plugin writers.
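A minimal sketch of a watcher following the four-method API above; the
class name is hypothetical, and it would be registered via a setuptools
entry point and named in the auditor's config file:
class CountingWatcher(object):
    def __init__(self, conf, logger, **kwargs):
        self.logger = logger
        self.objects_seen = 0

    def start(self, audit_type, **kwargs):
        # called at the start of an audit pass
        self.objects_seen = 0

    def see_object(self, object_metadata, data_file_path, **kwargs):
        # called once per audited object; no additional disk IO required
        self.objects_seen += 1

    def end(self, **kwargs):
        # called at the end of an audit pass
        self.logger.info('watched %d objects', self.objects_seen)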
Change-Id: I1be1faec53b2cdfaabf927598f1460e23c206b0a
No version of eventlet that I'm aware of has any sort of support for
eventlet.wsgi.WRITE_TIMEOUT; I don't know why we've been setting that.
On the other hand, the socket_timeout argument for eventlet.wsgi.Server
has been supported for a while -- since 0.14 in 2013.
Drive-by: Fix up handling of sub-second client_timeouts.
Change-Id: I1dca3c3a51a83c9d5212ee5a0ad2ba1343c68cf9
Related-Change: I1d4d028ac5e864084a9b7537b140229cb235c7a3
Related-Change: I433c97df99193ec31c863038b9b6fd20bb3705b8
When upgrading from liberasurecode<=1.5.0, you may want to continue
writing legacy CRCs until all nodes are upgraded and capable of reading
fragments with zlib CRCs.
Starting in liberasurecode>=1.6.2, we can use the environment variable
LIBERASURECODE_WRITE_LEGACY_CRC to control whether we write zlib or
legacy CRCs, but for many operators it's easier to manage swift configs
than environment variables. Add a new option, write_legacy_ec_crc, to the
proxy-server app and object-reconstructor; if set to true, ensure legacy
frags are written.
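For example, during the upgrade window an operator might set the option
wherever a proxy app is instantiated and on the reconstructor (sections
shown are illustrative):
[app:proxy-server]
write_legacy_ec_crc = true
[object-reconstructor]
write_legacy_ec_crc = true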
Note that more daemons instantiate proxy-server apps than just the
proxy-server. The complete set of impacted daemons should be:
* proxy-server
* object-reconstructor
* container-reconciler
* any users of internal-client.conf
UpgradeImpact
=============
To ensure a smooth liberasurecode upgrade:
1. Determine whether your cluster writes legacy or zlib CRCs. Depending
on the order in which shared libraries are loaded, your servers may
already be reading and writing zlib CRCs, even with old
liberasurecode. In that case, no special action is required and
WRITING LEGACY CRCS DURING THE UPGRADE WILL CAUSE AN OUTAGE.
Just upgrade liberasurecode normally. See the closed bug for more
information and a script to determine which CRC is used.
2. On all nodes, ensure Swift is upgraded to a version that includes
write_legacy_ec_crc support and write_legacy_ec_crc is enabled on
all daemons.
3. On each node, upgrade liberasurecode and restart Swift services.
Because of (2), they will continue writing legacy CRCs which will
still be readable by nodes that have not yet upgraded.
4. Once all nodes are upgraded, remove the write_legacy_ec_crc option
from all configs across all nodes. After restarting daemons, they
will write zlib CRCs which will also be readable by all nodes.
Change-Id: Iff71069f808623453c0ff36b798559015e604c7d
Related-Bug: #1666320
Closes-Bug: #1886088
Depends-On: https://review.opendev.org/#/c/738959/
Previously, the replication_server setting could take one of three
states:
* If unspecified, the server would handle all available methods.
* If "true", "yes", "on", etc. it would only handle replication
methods (REPLICATE, SSYNC).
* If any other value (including blank), it would only handle
non-replication methods.
However, because SSYNC tunnels PUTs, POSTs, and DELETEs through
the same object-server app that's responding to SSYNC, setting
`replication_server = true` would break the protocol. This has
been the case ever since ssync was introduced.
Now, get rid of that second state -- operators can still set
`replication_server = false` as a principle-of-least-privilege guard
to ensure proxy-servers can't make replication requests, but replication
servers will be able to serve all traffic. This will allow replication
servers to be used as general internal-to-the-cluster endpoints, leaving
non-replication servers to handle client-driven traffic.
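For example, a client-facing object server can keep the
least-privilege guard, while a dedicated replication server simply
omits the option and serves all methods:
[app:object-server]
replication_server = false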
Closes-Bug: #1446873
Change-Id: Ica2b41a52d11cb10c94fa8ad780a201318c4fc87
If we move it to constraints it's more globally accessible in our code,
but more importantly it's more obvious to ops that everything breaks if
you try to mis-configure different values per-service.
Change-Id: Ib8f7d08bc48da12be5671abe91a17ae2b49ecfee
Note that keystone wants to stick some UTF-8 encoded bytes into
memcached, but we want to store it as JSON... or something?
Also, make sure we can hit memcache for containers with invalid UTF-8.
Although maybe it'd be better to catch that before we ever try memcache?
Change-Id: I1fbe133c8ec73ef6644ecfcbb1931ddef94e0400
To prepare for the object-expirer's general task queue feature [1],
this patch makes it possible to configure the object-expirer in
object-server.conf.
Object-expirer.conf can still be used in the same manner as before, but
is deprecated. If a node has both an object-server.conf with an
"object-expirer" section and an object-expirer.conf, only
object-server.conf is used. Object-expirer.conf is used only if no
object-server.conf has an "object-expirer" section.
There are two differences between the "object-expirer.conf" style and
the "object-server.conf" style.
The first difference is the default value of `dequeue_from_legacy`,
which selects the task queue mode. In "object-expirer.conf" style, the
default mode is the legacy queue. In "object-server.conf" style, the
default mode is the general queue. For now, general mode is effectively
a no-op, because the general task queue is not implemented yet.
The second difference is the internal client config. In
"object-expirer.conf" style, the internal client's config file is the
object-expirer.conf itself. In "object-server.conf" style, the internal
client's config file is a separate file.
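A hedged example of the new style in object-server.conf; the internal
client option shown is an assumption, so check the sample config:
[object-expirer]
# opt back in to the legacy task queue during the transition
dequeue_from_legacy = true
internal_client_conf_path = /etc/swift/internal-client.conf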
[1]: https://review.openstack.org/#/c/517389/
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Change-Id: Ib21568f9b9d8547da87a99d65ae73a550e9c3230
Add the log_msg_template option in proxy-server.conf and log_format in
a/c/o-server.conf. It is a string parsable by Python's format()
function. Some fields containing user data might be anonymized by using
log_anonymization_method and log_anonymization_salt.
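For example (the fields shown are illustrative of the format() style,
not an exhaustive list):
[app:proxy-server]
log_msg_template = {client_ip} {remote_addr} {method} {path} {status_int}
log_anonymization_method = md5
log_anonymization_salt = my_salt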
Change-Id: I29e30ef45fe3f8a026e7897127ffae08a6a80cd9
Change the behavior of the EC reconstructor to perform a fragment
rebuild to a handoff node when a primary peer responds with 507 to the
REPLICATE request.
Each primary node in an EC ring will sync with exactly three primary
peers: in addition to the left & right nodes we now select a third node
from the far side of the ring. If any of these partners responds
unmounted the reconstructor will rebuild its fragments to a handoff
node with the appropriate index.
To prevent ssync (which is uninterruptible) receiving a 409 (Conflict)
we must give the remote handoff node the correct backend_index for the
fragments it will receive. In the common case we will use
deterministically different handoffs for each fragment index to prevent
multiple unmounted primary disks from forcing a single handoff node to
hold more than one rebuilt fragment.
Handoff nodes will continue to attempt to revert rebuilt handoff
fragments to the appropriate primary until that primary is remounted or
the ring is rebalanced. After a rebalance of EC rings (potentially removing
unmounted/failed devices), it's most IO efficient to run in
handoffs_only mode to avoid unnecessary rebuilds.
Closes-Bug: #1510342
Change-Id: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec
Users can configure the KEEPIDLE time for TCP connection sockets.
The default value is the old value of 600.
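For illustration (the option name is an assumption; 600 is the stated
default, matching the old behaviour):
[DEFAULT]
keep_idle = 600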
Change-Id: Ib7fb166deb8a87ae4e97ba0671048b1ec079a2ef
Closes-Bug: 1759606
The object updater now supports two configuration settings:
"concurrency" and "updater_workers". The latter controls how many
worker processes are spawned, while the former controls how many
concurrent container updates are performed by each worker
process. This should speed the processing of async_pendings.
There is a change to the semantics of the configuration
options. Previously, "concurrency" controlled the number of worker
processes spawned, and "updater_workers" did not exist. I switched the
meanings for consistency with other configuration options. In the
object reconstructor, object replicator, object server, object
expirer, container replicator, container server, account replicator,
account server, and account reaper, "concurrency" refers to the number
of concurrent tasks performed within one process (for reference, the
container updater and object auditor use "concurrency" to mean number
of processes).
On upgrade, a node configured with concurrency=N will still handle
async updates N-at-a-time, but will do so using only one process
instead of N.
UpgradeImpact:
If you have a config file like this:
[object-updater]
concurrency = <N>
and you want to take advantage of faster updates, then do this:
[object-updater]
concurrency = 8 # the default; you can omit this line
updater_workers = <N>
If you want updates to be processed exactly as before, do this:
[object-updater]
concurrency = 1
updater_workers = <N>
Change-Id: I17e18088e61f664e1b9942d66423666d0cae1689
Previously, these headers had to be added by operators to their
object-server.conf when enabling swift3 middleware. Since s3api
is now imported into swift we should go ahead and add these headers
by default too.
Change-Id: Ib82e175096716e42aecdab48f01f079e09da6a1d
Signed-off-by: Thiago da Silva <thiago@redhat.com>
This attempts to import the openstack/swift3 package into the swift
upstream repository and namespace. This is mostly a straightforward
port, except for the following items.
1. Rename swift3 namespace to swift.common.middleware.s3api
1.1 Rename also some conflicted class names (e.g. Request/Response)
2. Port unittests to test/unit/s3api dir to be able to run on the gate.
3. Port functests to test/functional/s3api and setup in-process testing
4. Port docs to doc dir, then address the namespace change.
5. Use get_logger() instead of global logger instance
6. Avoid global conf instance
Plus various minor fixes along the way (e.g. packages, dependencies,
deprecated things).
The details and patch references in the work on feature/s3api are listed
at https://trello.com/b/ZloaZ23t/s3api (completed board)
Note that, because this is just a port, no new features have been
developed since the last swift3 release; in future work, Swift upstream
may continue to work on the remaining items for further improvements and
the best possible Amazon S3 compatibility. Please read the new docs for
your deployment and keep track of what may change in future releases.
Change-Id: Ib803ea89cfee9a53c429606149159dd136c036fd
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Add a multiprocess mode to the object replicator. Setting the
"replicator_workers" setting to a positive value N will result in the
replicator using up to N worker processes to perform replication
tasks.
At most one worker per disk will be spawned, so one can set
replicator_workers=99999999 to always get one worker per disk
regardless of the number of disks in each node. This is the same
behavior that the object reconstructor has.
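For example (the value is arbitrary):
[object-replicator]
replicator_workers = 4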
Worker process logs will have a bit of information prepended so
operators can tell which messages came from which worker. It looks
like this:
[worker 1/2 pid=16529] 154/154 (100.00%) partitions replicated in 1.02s (150.87/sec, 0s remaining)
The prefix is "[worker M/N pid=P] ", where M is the worker's index, N
is the total number of workers, and P is the process ID. Every message
from the replicator's logger will have the prefix; this includes
messages from down in diskfile, but does not include things printed to
stdout or stderr.
Drive-by fix: don't dump recon stats when replicating only certain
policies. When running the object replicator with replicator_workers >
0 and "--policies=X,Y,Z", the replicator would update recon stats
after running. Since it only ran on a subset of objects, it should not
update recon, much like it doesn't update recon when run with
--devices or --partitions.
Change-Id: I6802a9ad9f1f9b9dafb99d8b095af0fdbf174dc5
The object updater has five different stats, but its logging only told
you two of them (successes and failures), and it only told you after
finishing all the async_pendings for a device. If you have a cluster
that's been sick and has millions upon millions of async_pendings
laying around, then your object-updaters are frustratingly
silent. I've seen one cluster with around 8 million async_pendings per
disk where the object-updaters only emitted stats every 12 hours.
Yes, if you have StatsD logging set up properly, you can go look at
your graphs and get real-time feedback on what it's doing. If you
don't have that, all you get is a frustrating silence.
Now, the object updater tells you all of its stats (successes,
failures, quarantines due to bad pickles, unlinks, and errors), and it
tells you incremental progress every five minutes. The logging at the
end of a pass remains and has been expanded to also include all stats.
Also included is a small change to what counts as an error: unmounted
drives no longer do. The goal is that only abnormal things count as
errors, like permission problems, malformed filenames, and so
on. These are things that should never happen, but if they do, may
require operator intervention. Drives fail, so logging an error upon
encountering an unmounted drive is not useful.
Change-Id: Idbddd507f0b633d14dffb7a9834fce93a10359ab
This commit replaces the boolean replication_one_per_device with an
integer replication_concurrency_per_device. The new configuration
parameter is passed to utils.lock_path(), which now accepts as an
argument a limit for the number of locks that can be acquired for a
specific path.
Instead of trying to lock path/.lock, utils.lock_path() now tries to lock
files path/.lock-X, where X is in the range (0, N), N being the limit for
the number of locks allowed for the path. The default value of limit is
set to 1.
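For example, to allow two concurrent replication locks per device
(section placement is illustrative; the stated default limit is 1):
[app:object-server]
replication_concurrency_per_device = 2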
Change-Id: I3c3193344c7a57a8a4fc7932d1b10e702efd3572