The new ring_ip option will be used when object services find their own
devices in rings, defaulting to the bind_ip.
Notably, this allows services to be containerized while servers_per_port
is enabled:
* For the object-server, the ring_ip should be set to the host ip and
will be used to discover which ports need binding. Sockets will still
be bound to the bind_ip (likely 0.0.0.0), with the assumption that the
host will publish ports 1:1.
* For the replicator and reconstructor, the ring_ip will be used to
discover which devices should be replicated. While bind_ip could
previously be used for this, it would have required a separate config
from the object-server.
Also rename the object daemons' bind_ip attribute to ring_ip so that it's
more obvious wherever we're using the IP for ring lookups instead of
socket binding.
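For illustration, a containerized object-server might combine the two
options like this in object-server.conf (the host IP is just a placeholder):
[DEFAULT]
bind_ip = 0.0.0.0
# host IP used for ring lookups; ports are assumed to be published 1:1
ring_ip = 192.0.2.10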
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Change-Id: I1c9bb8086994f7930acd8cda8f56e766938c2218
This new backend ratelimiting middleware is a fairly blunt tool:
ratelimiting is per device and
applied independently in each worker, but this at least provides
some limit to disk IO on backend servers.
GET, HEAD, PUT, POST, DELETE, UPDATE and REPLICATE methods may be
rate-limited.
Only requests with a path starting with '<device>/<partition>', where
<partition> can be cast to an integer, will be rate-limited. Other
requests, including, for example, recon requests with paths such as
'recon/version', are unconditionally forwarded to the next app in the
pipeline.
OPTIONS and SSYNC methods are not rate-limited. Note that
SSYNC sub-requests are passed directly to the object server app
and will not pass through this middleware.
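A minimal sketch of how the middleware might be wired into
object-server.conf (the filter name and option shown are assumptions;
check the sample config shipped with this change):
[pipeline:main]
pipeline = backend_ratelimit object-server
[filter:backend_ratelimit]
use = egg:swift#backend_ratelimit
# 0 means unlimited; a positive value enables per-device ratelimiting
requests_per_device_per_second = 50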
Change-Id: I78b59a081698a6bff0d74cbac7525e28f7b5d7c1
- Drop log level for successful rsyncs to debug; ops don't usually care.
- Add an option to skip "send" lines entirely -- in a large cluster,
during a meaningful expansion, there's too much information getting
logged; it's just wasting disk space.
Note that we already have similar filtering for directory creation;
that's been present since the initial commit of Swift code.
Drive-by: make it a little more clear that more than one suffix was
likely replicated when logging about success.
Change-Id: I02ba67e77e3378b2c2c8c682d5d230d31cd1bfa9
Previously, object updates that could not be sent immediately due to
per-container/bucket ratelimiting [1] would be skipped and re-tried
during the next updater cycle. There could potentially be a period of
time at the end of a cycle when the updater slept, having completed a
sweep of the on-disk async pending files, despite having skipped
updates during the cycle. Skipped updates would then be read from disk
again during the next cycle.
With this change the updater will defer skipped updates to an
in-memory queue (up to a configurable maximum number) until the sweep
of async pending files has completed, and then trickle out deferred
updates until the cycle's interval expires. This increases the useful
work done in the current cycle and reduces the amount of repeated disk
IO during the next cycle.
The deferrals queue is bounded in size and will evict least recently
read updates in order to accept more recently read updates. This
reduces the probability that a deferred update has been made obsolete
by newer on-disk async pending files while waiting in the deferrals
queue.
The deferrals queue is implemented as a collection of per-bucket
queues so that updates can be drained from the queues in the order
that buckets cease to be ratelimited.
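For illustration only, the in-memory bound might be tuned in the updater
config; the option name and value here are assumptions, so check the
sample object-server.conf for the actual knob:
[object-updater]
# upper bound on skipped updates held in memory until the sweep completes
max_deferred_updates = 10000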
[1] Related-Change: Idef25cd6026b02c1b5c10a9816c8c6cbe505e7ed
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Change-Id: I95e58df9f15c5f9d552b8f4c4989a474f52262f4
Throw our stream of async_pendings through a hash ring; if the virtual
bucket gets hot just start leaving the updates on the floor and move on.
It's off by default, and if you use it you're probably going to leave a
bunch of async updates pointed at a small set of containers in the queue
for the next sweep every sweep (so maybe turn it off at some point).
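A hedged sketch of enabling it; the option name is an assumption
introduced here for illustration, so check the sample config for the
real knob:
[object-updater]
# 0 (assumed default) leaves per-container ratelimiting disabled
max_objects_per_container_per_second = 10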
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: Idef25cd6026b02c1b5c10a9816c8c6cbe505e7ed
Previously the ssync Sender would attempt to revert all objects in a
partition within a single SSYNC request. With this change the
reconstructor daemon option max_objects_per_revert can be used to limit
the number of objects reverted inside a single SSYNC request for revert
type jobs i.e. when reverting handoff partitions.
If more than max_objects_per_revert are available, the remaining objects
will remain in the sender partition and will not be reverted until the
next call to ssync.Sender, which would currently be the next time the
reconstructor visits that handoff partition.
Note that the option only applies to handoff revert jobs, not to sync
jobs.
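For example, an operator might cap revert-job SSYNC transfers like this
(the value is arbitrary):
[object-reconstructor]
max_objects_per_revert = 1000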
Change-Id: If81760c80a4692212e3774e73af5ce37c02e8aff
The nondurable_purge_delay option was introduced in [1] to prevent the
reconstructor removing non-durable data files on handoffs that were
about to be made durable. The DiskFileManager commit_window option has
since been introduced [2] which specifies a similar time window during
which non-durable data files should not be removed. The commit_window
option can be re-used by the reconstructor, making the
nondurable_purge_delay option redundant.
The nondurable_purge_delay option has not been available in any tagged
release and is therefore removed with no backwards compatibility.
[1] Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
[2] Related-Change: I5f3318a44af64b77a63713e6ff8d0fd3b6144f13
Change-Id: I1589a7517b7375fcc21472e2d514f26986bf5079
DiskFileManager will remove any stale files during
cleanup_ondisk_files(): these include tombstones and nondurable EC
data fragments whose timestamps are older than reclaim_age. It can
usually be safely assumed that a non-durable data fragment older than
reclaim_age is not going to become durable. However, if an agent PUTs
objects with specified older X-Timestamps (for example the reconciler
or container-sync) then there is a window of time during which the
object server has written an old non-durable data file but has not yet
committed it to make it durable.
Previously, if another process (for example the reconstructor) called
cleanup_ondisk_files during this window then the non-durable data file
would be removed. The subsequent attempt to commit the data file would
then result in a traceback due to there no longer being a data file to
rename, and of course the data file is lost.
This patch modifies cleanup_ondisk_files to not remove old, otherwise
stale, non-durable data files that were only written to disk in the
preceding 'commit_window' seconds. 'commit_window' is configurable for
the object server and defaults to 60.0 seconds.
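For example, in object-server.conf (section placement shown is
illustrative; 60.0 is the stated default):
[DEFAULT]
commit_window = 60.0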
Closes-Bug: #1936508
Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
Change-Id: I5f3318a44af64b77a63713e6ff8d0fd3b6144f13
Previously the reconstructor would quarantine isolated durable
fragments that were more than reclaim_age old. This patch adds a
quarantine_age option for the reconstructor which defaults to
reclaim_age but can be used to configure the age that a fragment must
reach before quarantining.
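For example, to quarantine isolated durable fragments sooner than
reclaim_age (the value is arbitrary):
[object-reconstructor]
quarantine_age = 172800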
Change-Id: I867f3ea0cf60620c576da0c1f2c65cec2cf19aa0
It is possible for the current and next part power locations to
both have existing tombstones with different inodes when the
relinker tries to relink. This can be caused, for example, by
concurrent reconciler DELETEs that specify the same timestamp.
The relinker previously failed to relink and reported an error when
encountering this situation. With this patch the relinker will
tolerate an existing tombstone with the same filename but different
inode in the next part power location.
Since [1] the relinker had special case handling for EEXIST errors
caused by a different inode tombstone already existing in the next
partition power location: the relinker would check to see if the
existing next part power tombstone linked to a tombstone in a previous
part power (i.e. < current part power) location, and if so tolerate
the EEXIST.
This special case handling is no longer necessary because the relinker
will now tolerate an EEXIST when linking a tombstone provided the two
files have the same timestamp. There is therefore no need to search
previous part power locations for a tombstone that does link with the
next part power location.
The link_check_limit is no longer used but the --link-check-limit
command line option is still allowed (although ignored) for backwards
compatibility.
[1] Related-Change-Id: If9beb9efabdad64e81d92708f862146d5fafb16c
Change-Id: I07ffee3b4ba6c7ff6c206beaf6b8f746fe365c2b
Closes-Bug: #1934142
When objects are freshly uploaded, they may take a little time
to appear in container listings, producing false positives.
Because we needed to test this, we also reworked/added the tests
and fixed some issues, including adding an EC fragment (thanks
to Alistair's code).
Closes-Bug: 1925782
Change-Id: Ieafa72a496328f7a487ca7062da6253994a5a07d
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
The reconstructor may revert a non-durable datafile on a handoff
concurrently with an object server PUT that is about to make the
datafile durable. This could previously lead to the reconstructor
deleting the recently written datafile before the object-server
attempts to rename it to a durable datafile, and consequently a
traceback in the object server.
The reconstructor will now only remove reverted nondurable datafiles
that are older (according to mtime) than a period set by a new
nondurable_purge_delay option (defaults to 60 seconds). More recent
nondurable datafiles may be made durable or will remain on the handoff
until a subsequent reconstructor cycle.
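For example (60 is the stated default; the value is otherwise arbitrary):
[object-reconstructor]
nondurable_purge_delay = 60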
Change-Id: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
If the reconstructor finds a fragment that appears to be stale then it
will now quarantine the fragment. Fragments are considered stale if
insufficient fragments at the same timestamp can be found to rebuild
missing fragments, and the number found is less than or equal to a new
reconstructor 'quarantine_threshold' config option.
Before quarantining a fragment the reconstructor will attempt to fetch
fragments from handoff nodes in addition to the usual primary nodes.
The handoff requests are limited by a new 'request_node_count'
config option.
'quarantine_threshold' defaults to zero, i.e. no fragments will be
quarantined. 'request_node_count' defaults to '2 * replicas'.
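A hedged example of enabling quarantining (values are illustrative):
[object-reconstructor]
# quarantine when this many or fewer fragments at a timestamp are found
quarantine_threshold = 1
# how many nodes (primaries plus handoffs) to query before quarantining
request_node_count = 2 * replicas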
Closes-Bug: 1655608
Change-Id: I08e1200291833dea3deba32cdb364baa99dc2816
Add a new option, workers, that works more or less like the same option
from background daemons. Disks will be distributed across N worker
sub-processes so we can make the best use of the I/O available.
While we're at it, log final stats at warning if there were errors.
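A hedged sketch, assuming the option lives in the conf file's [DEFAULT]
section like other settings (section placement is an assumption):
[DEFAULT]
workers = 4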
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I039d2b8861f69a64bd9d2cdf68f1f534c236b2ba
If a previous partition power increase failed to cleanup all files in
their old partition locations, then during the next partition power
increase the relinker may find the same file to relink in more than
one source partition. This currently leads to an error log due to the
second relink attempt getting an EEXIST error.
With this patch, when an EEXIST is raised, the relinker will attempt
to create/verify a link from older partition power locations to the
next part power location, and if such a link is found then suppress
the error log.
During the relink step, if an alternative link is verified and if a
file is found that is neither linked to the next partition power
location nor in the current part power location, then the file is
removed during the relink step. That prevents the same EEXIST occurring
again during the cleanup step when it may no longer be possible to
verify that an alternative link exists.
For example, consider identical filenames in the N+1th, Nth and N-1th
partition power locations, with the N+1th being linked to the Nth:
- During relink, the Nth location is visited and its link is
verified. Then the N-1th location is visited and an EEXIST error
is encountered, but the new check verifies that a link exists to
the Nth location, which is OK.
- During cleanup the locations are visited in the same order, but
files are removed so that the Nth location file no longer exists
when the N-1th location is visited. If the N-1th location still
has a conflicting file then existence of an alternative link to
the Nth location can no longer be verified, so an error would be
raised. Therefore, the N-1th location file must be removed during
relink.
The error is only suppressed for tombstones. The number of partition
power locations that the relinker will look back over may be configured
using the link_check_limit option in a conf file or --link-check-limit
on the command line, and defaults to 2.
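For example, to look back further than the default of 2 (section
placement is assumed):
[DEFAULT]
link_check_limit = 3
or pass --link-check-limit 3 on the command line.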
Closes-Bug: 1921718
Change-Id: If9beb9efabdad64e81d92708f862146d5fafb16c
Sure, you could use stuff like ionice or cgroups to limit relinker I/O,
but sometimes a nice simple blunt instrument is handy.
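A minimal sketch, assuming the blunt instrument is a files_per_second
option in the relinker's conf (the option name is an assumption
introduced here for illustration):
[DEFAULT]
# 0 would mean unlimited
files_per_second = 100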
Change-Id: I7fe29c7913a9e09bdf7a787ccad8bba2c77cf995
Swap in the standard logger options in place of --logfile. Keep --device
as a CLI-only option. Everything else is pretty standard stuff that
ought to be in [DEFAULT].
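For illustration, a minimal conf might look like this (the file layout
and option names are assumptions; --device stays on the command line):
[DEFAULT]
log_facility = LOG_LOCAL0
log_level = INFO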
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I32f979f068592eaac39dcc6807b3114caeaaa814
Swift operators may find it useful to operate on each object in their
cluster in some way. This commit provides them a way to hook into the
object auditor with a simple, clearly-defined boundary so that they
can iterate over their objects without additional disk IO.
For example, a cluster operator may want to ensure a semantic
consistency with all SLO segments accounted in their manifests,
or locate objects that aren't in container listings. Now that Swift
has encryption support, this could be used to locate unencrypted
objects. The list goes on.
This commit makes the auditor locate, via entry points, the watchers
named in its config file.
A watcher is a class with at least these four methods:
__init__(self, conf, logger, **kwargs)
start(self, audit_type, **kwargs)
see_object(self, object_metadata, data_file_path, **kwargs)
end(self, **kwargs)
The auditor will call watcher.start(audit_type) at the start of an
audit pass, watcher.see_object(...) for each object audited, and
watcher.end() at the end of an audit pass. All method arguments are
passed as keyword args.
This version of the API is implemented in the context of the
auditor itself, without spawning any additional processes.
If the plugins are not working well -- hang, crash, or leak --
it's easier to debug them when there's no additional complication
of processes that run by themselves.
In addition, we include a reference implementation of a plugin for
the watcher API, as a help to plugin writers.
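A minimal sketch of a watcher following the four-method API above; the
class name is hypothetical, and it would be registered via a setuptools
entry point and named in the auditor's config file:
class CountingWatcher(object):
    def __init__(self, conf, logger, **kwargs):
        self.logger = logger
        self.objects_seen = 0

    def start(self, audit_type, **kwargs):
        # called at the start of an audit pass
        self.objects_seen = 0

    def see_object(self, object_metadata, data_file_path, **kwargs):
        # called once per audited object; no additional disk IO required
        self.objects_seen += 1

    def end(self, **kwargs):
        # called at the end of an audit pass
        self.logger.info('watched %d objects', self.objects_seen)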
Change-Id: I1be1faec53b2cdfaabf927598f1460e23c206b0a
No version of eventlet that I'm aware of has any sort of support for
eventlet.wsgi.WRITE_TIMEOUT; I don't know why we've been setting that.
On the other hand, the socket_timeout argument for eventlet.wsgi.Server
has been supported for a while -- since 0.14 in 2013.
Drive-by: Fix up handling of sub-second client_timeouts.
Change-Id: I1dca3c3a51a83c9d5212ee5a0ad2ba1343c68cf9
Related-Change: I1d4d028ac5e864084a9b7537b140229cb235c7a3
Related-Change: I433c97df99193ec31c863038b9b6fd20bb3705b8
When upgrading from liberasurecode<=1.5.0, you may want to continue
writing legacy CRCs until all nodes are upgraded and capable of reading
fragments with zlib CRCs.
Starting in liberasurecode>=1.6.2, we can use the environment variable
LIBERASURECODE_WRITE_LEGACY_CRC to control whether we write zlib or
legacy CRCs, but for many operators it's easier to manage swift configs
than environment variables. Add a new option, write_legacy_ec_crc, to the
proxy-server app and object-reconstructor; if set to true, ensure legacy
frags are written.
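For example, during the upgrade window an operator might set the option
wherever a proxy app is instantiated and on the reconstructor (sections
shown are illustrative):
[app:proxy-server]
write_legacy_ec_crc = true
[object-reconstructor]
write_legacy_ec_crc = true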
Note that more daemons instantiate proxy-server apps than just the
proxy-server. The complete set of impacted daemons should be:
* proxy-server
* object-reconstructor
* container-reconciler
* any users of internal-client.conf
UpgradeImpact
=============
To ensure a smooth liberasurecode upgrade:
1. Determine whether your cluster writes legacy or zlib CRCs. Depending
on the order in which shared libraries are loaded, your servers may
already be reading and writing zlib CRCs, even with old
liberasurecode. In that case, no special action is required and
WRITING LEGACY CRCS DURING THE UPGRADE WILL CAUSE AN OUTAGE.
Just upgrade liberasurecode normally. See the closed bug for more
information and a script to determine which CRC is used.
2. On all nodes, ensure Swift is upgraded to a version that includes
write_legacy_ec_crc support and write_legacy_ec_crc is enabled on
all daemons.
3. On each node, upgrade liberasurecode and restart Swift services.
Because of (2), they will continue writing legacy CRCs which will
still be readable by nodes that have not yet upgraded.
4. Once all nodes are upgraded, remove the write_legacy_ec_crc option
from all configs across all nodes. After restarting daemons, they
will write zlib CRCs which will also be readable by all nodes.
Change-Id: Iff71069f808623453c0ff36b798559015e604c7d
Related-Bug: #1666320
Closes-Bug: #1886088
Depends-On: https://review.opendev.org/#/c/738959/
Previously, the replication_server setting could take one of three
states:
* If unspecified, the server would handle all available methods.
* If "true", "yes", "on", etc. it would only handle replication
methods (REPLICATE, SSYNC).
* If any other value (including blank), it would only handle
non-replication methods.
However, because SSYNC tunnels PUTs, POSTs, and DELETEs through
the same object-server app that's responding to SSYNC, setting
`replication_server = true` would break the protocol. This has
been the case ever since ssync was introduced.
Now, get rid of that second state -- operators can still set
`replication_server = false` as a principle-of-least-privilege guard
to ensure proxy-servers can't make replication requests, but replication
servers will be able to serve all traffic. This will allow replication
servers to be used as general internal-to-the-cluster endpoints, leaving
non-replication servers to handle client-driven traffic.
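For example, a client-facing object server can keep the
least-privilege guard, while a dedicated replication server simply
omits the option and serves all methods:
[app:object-server]
replication_server = false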
Closes-Bug: #1446873
Change-Id: Ica2b41a52d11cb10c94fa8ad780a201318c4fc87
If we move it to constraints it's more globally accessible in our code,
but more importantly it's more obvious to ops that everything breaks if
you try to mis-configure different values per-service.
Change-Id: Ib8f7d08bc48da12be5671abe91a17ae2b49ecfee
Note that keystone wants to stick some UTF-8 encoded bytes into
memcached, but we want to store it as JSON... or something?
Also, make sure we can hit memcache for containers with invalid UTF-8.
Although maybe it'd be better to catch that before we ever try memcache?
Change-Id: I1fbe133c8ec73ef6644ecfcbb1931ddef94e0400
To prepare for the object-expirer's general task queue feature [1],
this patch makes it possible to configure the object-expirer in
object-server.conf.
Object-expirer.conf can still be used in the same manner as before, but
is deprecated. If a node has both an object-server.conf with an
"object-expirer" section and an object-expirer.conf, only
object-server.conf is used. Object-expirer.conf is used only if no
object-server.conf has an "object-expirer" section.
There are two differences between the "object-expirer.conf" style and
the "object-server.conf" style.
The first difference is the default value of `dequeue_from_legacy`,
which selects the task queue mode. In "object-expirer.conf" style, the
default mode is the legacy queue. In "object-server.conf" style, the
default mode is the general queue. For now, general mode is effectively
a no-op, because the general task queue is not implemented yet.
The second difference is the internal client config. In
"object-expirer.conf" style, the internal client's config file is the
object-expirer.conf itself. In "object-server.conf" style, the internal
client's config file is a separate file.
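A hedged example of the new style in object-server.conf; the internal
client option shown is an assumption, so check the sample config:
[object-expirer]
# opt back in to the legacy task queue during the transition
dequeue_from_legacy = true
internal_client_conf_path = /etc/swift/internal-client.conf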
[1]: https://review.openstack.org/#/c/517389/
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Change-Id: Ib21568f9b9d8547da87a99d65ae73a550e9c3230
Add the log_msg_template option in proxy-server.conf and log_format in
a/c/o-server.conf. It is a string parsable by Python's format()
function. Some fields containing user data might be anonymized by using
log_anonymization_method and log_anonymization_salt.
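For example (the fields shown are illustrative of the format() style,
not an exhaustive list):
[app:proxy-server]
log_msg_template = {client_ip} {remote_addr} {method} {path} {status_int}
log_anonymization_method = md5
log_anonymization_salt = my_salt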
Change-Id: I29e30ef45fe3f8a026e7897127ffae08a6a80cd9
Change the behavior of the EC reconstructor to perform a fragment
rebuild to a handoff node when a primary peer responds with 507 to the
REPLICATE request.
Each primary node in an EC ring will sync with exactly three primary
peers: in addition to the left & right nodes we now select a third node
from the far side of the ring. If any of these partners responds
unmounted the reconstructor will rebuild its fragments to a handoff
node with the appropriate index.
To prevent ssync (which is uninterruptible) receiving a 409 (Conflict)
we must give the remote handoff node the correct backend_index for the
fragments it will receive. In the common case we will use
deterministically different handoffs for each fragment index to prevent
multiple unmounted primary disks from forcing a single handoff node to
hold more than one rebuilt fragment.
Handoff nodes will continue to attempt to revert rebuilt handoff
fragments to the appropriate primary until that primary is remounted or
the ring is rebalanced. After a rebalance of EC rings (potentially removing
unmounted/failed devices), it's most IO efficient to run in
handoffs_only mode to avoid unnecessary rebuilds.
Closes-Bug: #1510342
Change-Id: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec
Users can configure the KEEPIDLE time for TCP connection sockets.
The default value is the old value of 600.
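For illustration (the option name is an assumption; 600 is the stated
default, matching the old behaviour):
[DEFAULT]
keep_idle = 600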
Change-Id: Ib7fb166deb8a87ae4e97ba0671048b1ec079a2ef
Closes-Bug: 1759606
The object updater now supports two configuration settings:
"concurrency" and "updater_workers". The latter controls how many
worker processes are spawned, while the former controls how many
concurrent container updates are performed by each worker
process. This should speed the processing of async_pendings.
There is a change to the semantics of the configuration
options. Previously, "concurrency" controlled the number of worker
processes spawned, and "updater_workers" did not exist. I switched the
meanings for consistency with other configuration options. In the
object reconstructor, object replicator, object server, object
expirer, container replicator, container server, account replicator,
account server, and account reaper, "concurrency" refers to the number
of concurrent tasks performed within one process (for reference, the
container updater and object auditor use "concurrency" to mean number
of processes).
On upgrade, a node configured with concurrency=N will still handle
async updates N-at-a-time, but will do so using only one process
instead of N.
UpgradeImpact:
If you have a config file like this:
[object-updater]
concurrency = <N>
and you want to take advantage of faster updates, then do this:
[object-updater]
concurrency = 8 # the default; you can omit this line
updater_workers = <N>
If you want updates to be processed exactly as before, do this:
[object-updater]
concurrency = 1
updater_workers = <N>
Change-Id: I17e18088e61f664e1b9942d66423666d0cae1689
Previously, these headers had to be added by operators to their
object-server.conf when enabling swift3 middleware. Since s3api
is now imported into swift we should go ahead and add these headers
by default too.
Change-Id: Ib82e175096716e42aecdab48f01f079e09da6a1d
Signed-off-by: Thiago da Silva <thiago@redhat.com>
This attempts to import the openstack/swift3 package into the swift
upstream repository and namespace. This is mostly a straightforward
port, except for the following items.
1. Rename swift3 namespace to swift.common.middleware.s3api
1.1 Rename also some conflicted class names (e.g. Request/Response)
2. Port unittests to test/unit/s3api dir to be able to run on the gate.
3. Port functests to test/functional/s3api and setup in-process testing
4. Port docs to doc dir, then address the namespace change.
5. Use get_logger() instead of global logger instance
6. Avoid global conf instance
Plus various minor fixes along the way (e.g. packages, dependencies,
deprecated things).
The details and patch references in the work on feature/s3api are listed
at https://trello.com/b/ZloaZ23t/s3api (completed board)
Note that, because this is just a port, no new features have been
developed since the last swift3 release; in future work, Swift upstream
may continue to work on the remaining items for further improvements and
the best possible Amazon S3 compatibility. Please read the new docs for
your deployment and keep track of what may change in future releases.
Change-Id: Ib803ea89cfee9a53c429606149159dd136c036fd
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Add a multiprocess mode to the object replicator. Setting the
"replicator_workers" setting to a positive value N will result in the
replicator using up to N worker processes to perform replication
tasks.
At most one worker per disk will be spawned, so one can set
replicator_workers=99999999 to always get one worker per disk
regardless of the number of disks in each node. This is the same
behavior that the object reconstructor has.
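For example (the value is arbitrary):
[object-replicator]
replicator_workers = 4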
Worker process logs will have a bit of information prepended so
operators can tell which messages came from which worker. It looks
like this:
[worker 1/2 pid=16529] 154/154 (100.00%) partitions replicated in 1.02s (150.87/sec, 0s remaining)
The prefix is "[worker M/N pid=P] ", where M is the worker's index, N
is the total number of workers, and P is the process ID. Every message
from the replicator's logger will have the prefix; this includes
messages from down in diskfile, but does not include things printed to
stdout or stderr.
Drive-by fix: don't dump recon stats when replicating only certain
policies. When running the object replicator with replicator_workers >
0 and "--policies=X,Y,Z", the replicator would update recon stats
after running. Since it only ran on a subset of objects, it should not
update recon, much like it doesn't update recon when run with
--devices or --partitions.
Change-Id: I6802a9ad9f1f9b9dafb99d8b095af0fdbf174dc5
The object updater has five different stats, but its logging only told
you two of them (successes and failures), and it only told you after
finishing all the async_pendings for a device. If you have a cluster
that's been sick and has millions upon millions of async_pendings
laying around, then your object-updaters are frustratingly
silent. I've seen one cluster with around 8 million async_pendings per
disk where the object-updaters only emitted stats every 12 hours.
Yes, if you have StatsD logging set up properly, you can go look at
your graphs and get real-time feedback on what it's doing. If you
don't have that, all you get is a frustrating silence.
Now, the object updater tells you all of its stats (successes,
failures, quarantines due to bad pickles, unlinks, and errors), and it
tells you incremental progress every five minutes. The logging at the
end of a pass remains and has been expanded to also include all stats.
Also included is a small change to what counts as an error: unmounted
drives no longer do. The goal is that only abnormal things count as
errors, like permission problems, malformed filenames, and so
on. These are things that should never happen, but if they do, may
require operator intervention. Drives fail, so logging an error upon
encountering an unmounted drive is not useful.
Change-Id: Idbddd507f0b633d14dffb7a9834fce93a10359ab
This commit replaces the boolean replication_one_per_device with an
integer replication_concurrency_per_device. The new configuration
parameter is passed to utils.lock_path(), which now accepts as an
argument a limit for the number of locks that can be acquired for a
specific path.
Instead of trying to lock path/.lock, utils.lock_path() now tries to lock
files path/.lock-X, where X is in the range (0, N), N being the limit for
the number of locks allowed for the path. The default value of limit is
set to 1.
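For example, to allow two concurrent replication locks per device
(section placement is illustrative; the stated default limit is 1):
[app:object-server]
replication_concurrency_per_device = 2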
Change-Id: I3c3193344c7a57a8a4fc7932d1b10e702efd3572