Most daemons have a "go as fast as you can, then sleep for 30 seconds"
strategy toward resource utilization; the object-updater and
object-auditor, however, have some "X_per_second" options that give
operators much better control over how they spend their I/O budget.
This change extends that pattern to the account-replicator,
container-replicator, and container-sharder, which have been known to
peg CPUs when they're not I/O limited.
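For example, an operator might cap a database replicator like this
(databases_per_second is my recollection of the new option name; the
value is illustrative), and similarly for the account-replicator and
container-sharder:
    [container-replicator]
    databases_per_second = 50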
Partial-Bug: #1784753
Change-Id: Ib7f2497794fa2f384a1a6ab500b657c624426384
As a follow-up to the related change, mention the new
cors_expose_headers option (and other proxy-server.conf
options) in the CORS doc.
Add a test for the CORS options being loaded into the
proxy server.
Improve CORS comments in docs.
Change-Id: I647d8f9e9cbd98de05443638628414b1e87d1a76
Related-Change: I5ca90a052f27c98a514a96ee2299bfa1b6d46334
Users can configure the TCP keepalive idle time (KEEPIDLE) for sockets.
The default remains the previous hard-coded value of 600 seconds.
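A minimal example, assuming the option is named keep_idle and is set in
the server's [DEFAULT] section (name and placement are my assumption):
    [DEFAULT]
    keep_idle = 600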
Change-Id: Ib7fb166deb8a87ae4e97ba0671048b1ec079a2ef
Closes-Bug: #1759606
The object updater now supports two configuration settings:
"concurrency" and "updater_workers". The latter controls how many
worker processes are spawned, while the former controls how many
concurrent container updates are performed by each worker
process. This should speed the processing of async_pendings.
There is a change to the semantics of the configuration
options. Previously, "concurrency" controlled the number of worker
processes spawned, and "updater_workers" did not exist. I switched the
meanings for consistency with other configuration options. In the
object reconstructor, object replicator, object server, object
expirer, container replicator, container server, account replicator,
account server, and account reaper, "concurrency" refers to the number
of concurrent tasks performed within one process (for reference, the
container updater and object auditor use "concurrency" to mean number
of processes).
On upgrade, a node configured with concurrency=N will still handle
async updates N-at-a-time, but will do so using only one process
instead of N.
UpgradeImpact:
If you have a config file like this:
[object-updater]
concurrency = <N>
and you want to take advantage of faster updates, then do this:
[object-updater]
concurrency = 8 # the default; you can omit this line
updater_workers = <N>
If you want updates to be processed exactly as before, do this:
[object-updater]
concurrency = 1
updater_workers = <N>
Change-Id: I17e18088e61f664e1b9942d66423666d0cae1689
Previously, operators had to add these headers to their
object-server.conf when enabling the swift3 middleware. Since s3api
has now been imported into Swift, we should go ahead and add these
headers by default too.
Change-Id: Ib82e175096716e42aecdab48f01f079e09da6a1d
Signed-off-by: Thiago da Silva <thiago@redhat.com>
Add a multiprocess mode to the object replicator. Setting the
"replicator_workers" setting to a positive value N will result in the
replicator using up to N worker processes to perform replication
tasks.
At most one worker per disk will be spawned, so one can set
replicator_workers=99999999 to always get one worker per disk
regardless of the number of disks in each node. This is the same
behavior that the object reconstructor has.
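For example, to use up to four replication worker processes (value
illustrative):
    [object-replicator]
    replicator_workers = 4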
Worker process logs will have a bit of information prepended so
operators can tell which messages came from which worker. It looks
like this:
[worker 1/2 pid=16529] 154/154 (100.00%) partitions replicated in 1.02s (150.87/sec, 0s remaining)
The prefix is "[worker M/N pid=P] ", where M is the worker's index, N
is the total number of workers, and P is the process ID. Every message
from the replicator's logger will have the prefix; this includes
messages from down in diskfile, but does not include things printed to
stdout or stderr.
Drive-by fix: don't dump recon stats when replicating only certain
policies. When running the object replicator with replicator_workers >
0 and "--policies=X,Y,Z", the replicator would update recon stats
after running. Since it only ran on a subset of objects, it should not
update recon, much like it doesn't update recon when run with
--devices or --partitions.
Change-Id: I6802a9ad9f1f9b9dafb99d8b095af0fdbf174dc5
The object updater has five different stats, but its logging only told
you two of them (successes and failures), and it only told you after
finishing all the async_pendings for a device. If you have a cluster
that's been sick and has millions upon millions of async_pendings
lying around, then your object-updaters are frustratingly
silent. I've seen one cluster with around 8 million async_pendings per
disk where the object-updaters only emitted stats every 12 hours.
Yes, if you have StatsD logging set up properly, you can go look at
your graphs and get real-time feedback on what it's doing. If you
don't have that, all you get is a frustrating silence.
Now, the object updater tells you all of its stats (successes,
failures, quarantines due to bad pickles, unlinks, and errors), and it
tells you incremental progress every five minutes. The logging at the
end of a pass remains and has been expanded to also include all stats.
Also included is a small change to what counts as an error: unmounted
drives no longer do. The goal is that only abnormal things count as
errors, like permission problems, malformed filenames, and so
on. These are things that should never happen, but if they do, may
require operator intervention. Drives fail, so logging an error upon
encountering an unmounted drive is not useful.
Change-Id: Idbddd507f0b633d14dffb7a9834fce93a10359ab
This commit replaces the boolean replication_one_per_device with an
integer replication_concurrency_per_device. The new configuration
parameter is passed to utils.lock_path(), which now accepts as an
argument a limit on the number of locks that can be acquired for a
specific path.
Instead of trying to lock path/.lock, utils.lock_path() now tries to
lock files path/.lock-X, where X is in the range [0, N), N being the
limit on the number of locks allowed for the path. The default limit
is 1.
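For example, to allow up to four concurrent replication locks per
device (the [app:object-server] placement is my assumption; value
illustrative):
    [app:object-server]
    replication_concurrency_per_device = 4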
Change-Id: I3c3193344c7a57a8a4fc7932d1b10e702efd3572
The option was deprecated and we discussed this topic at the Denver PTG
for the Queens cycle. The main motivation for this work is that the
deprecated post_as_copy option and its gate block future symlink work.
Change-Id: I411893db1565864ed5beb6ae75c38b982a574476
This change adds a new Strategy concept to the daemon module similar to
how we manage WSGI workers. We need to leverage multiple python
processes to get the concurrency properties we need. More workers will
rebalance much faster on dense chassis with many devices.
Currently the default is still only one process, and no workers. Set
reconstructor_workers in the [object-reconstructor] section to some
whole number <= the number of devices on a node to get that many
reconstructor workers.
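For example, to run four reconstructor worker processes (value
illustrative):
    [object-reconstructor]
    reconstructor_workers = 4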
Each worker will operate on a different subset of disks.
Once mode works as before, but tends to update the recon dumps a
little more often.
If you change the rings, the strategy will shut down workers and spawn
new ones.
You can kill the worker pids and the daemon strategy will respawn them.
New per-disk reconstructor stats are dumped to recon under the
object_reconstruction_per_disk key. To maintain legacy compatibility
and replication monitoring based on cycle times, they are aggregated
every stats_interval (default 5 mins).
Change-Id: I28925a37f3985c9082b5a06e76af4dc3ec813abe
Previously it was hard to navigate to a particular config section in
the deployment guide, and not possible to provide a link directly to
one section.
This patch makes each config section a heading so that it appears in
navigation tables and can be easily linked to. A list of config
sections is also added at the start of each server section.
Change-Id: Iecb0637fde521600a9163fa66b3dbdc176a71dff
Related-Bug: #1626290
When deleting objects in a multi-region Swift deployment with write
affinity configured, users always get a 404 when deleting an object
before it has been replicated to the appropriate nodes.
This patch adds a config item 'write_affinity_handoff_delete_count' so
that operators can define how many local handoff nodes Swift should
send requests to in order to get more candidates for the final
response, or by default just leave it to Swift to calculate the
appropriate number.
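For example, to send DELETEs to two additional local handoff nodes (the
[app:proxy-server] placement is my assumption; value illustrative):
    [app:proxy-server]
    write_affinity_handoff_delete_count = 2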
Change-Id: Ic4ef82e4fc1a91c85bdbc6bf41705a76f16d1341
Closes-Bug: #1503161
If you're running servers_per_port > 0 and threads_per_disk = 0 (as it
should be with servers_per_port on), each object-server process will
have 20 IO threads waiting around to service eventlet.tpool
calls. This is far too many; with servers_per_port, there's no real
benefit to having so many IO threads.
This commit makes it so that, when servers_per_port > 0, each object
server defaults to having one main thread and one IO thread.
Also, eventlet's tpool size is now configurable via the object-server
config file. If a tpool size is set, that's what we'll use regardless
of servers_per_port. This allows operators with an excess of threads
to remove some regardless of servers_per_port.
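A minimal sketch, assuming the new option is named
eventlet_tpool_num_threads and lives in the object-server config (name
and placement are my assumption; value illustrative):
    [app:object-server]
    eventlet_tpool_num_threads = 1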
Change-Id: I8f8914b7e70f2510393eb7c5e6be9708631ac027
Closes-Bug: 1554233
The container and object updaters sleep "slowdown" (default 0.01)
seconds after every processed container/object. Because the time.sleep
call adds overhead, use ratelimit_sleep from common.utils instead, the
same as in the auditor.
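To illustrate, the existing option still sets the pace - a slowdown of
0.01 previously meant a 0.01 second sleep after each item, and is
presumably translated into an equivalent per-second rate for
ratelimit_sleep (the exact conversion is my assumption):
    [object-updater]
    slowdown = 0.01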
Change-Id: I362aa0f13c78ad03ce1f76ee0257b0646f981212
- add proxy server per policy config as an optional
step in the configuration of a policy, with link to
the deployment guide
- add reverse link from deployment guide per-policy
config doc section to storage policies docs
Drive-by: fix an incorrect test comment
Change-Id: Ib95310193270a63c9d1e321c6e7de240e00b387f
Related-Change: I3f718f425f525baa80045ba067950c752bcaaefc
Instead, link to the middleware list and auth overview, as well as
referring readers to proxy-server.conf-sample
TempAuth-related content that was previously in the deployment guide has
been moved to TempAuth's own docs, which have been cleaned up a bit.
Change-Id: I00070bb09294362c069f7ee9426ac570bc1b3ddb
This is an alternative approach to that proposed in [1].
Adds support for optional per-policy config sections
to be added in proxy-server.conf. This is highly desirable
to allow per-policy affinity options to be set for use with
duplicated EC policies [2] and composite rings [3].
Certain options found in per-policy conf sections will
override their equivalents that may be set in the
[app:proxy-server] section. Currently the options
handled that way are:
sorting_method
read_affinity
write_affinity
write_affinity_node_count
For example:
[proxy-server:policy:0]
sorting_method = affinity
read_affinity = r1=100
write_affinity = r1
write_affinity_node_count = 1 * replicas
The corresponding attributes of the proxy-server Application
are now available from instances of an OverrideConf object
that is obtained from Application.get_policy_options(policy).
[1] Related-Change: I9104fc789ba85ab3ab5ccd34096125b482821389
[2] Related-Change: Idd155401982a2c48110c30b480966a863f6bd305
[3] Related-Change: I0d8928b55020592f8e75321d1f7678688301d797
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
Change-Id: I3f718f425f525baa80045ba067950c752bcaaefc
Add entries for these options in the deployment guide and
make the text in proxy-server.conf-sample and man page
consistent.
Change-Id: I5854ddb3e5864ddbeaf9ac2c930bfafdb47517c3
An operator offering a web UX to its customers might want to allow web
browsers to access some headers by default (eg: X-Storage-Policy,
X-Container-Read, ...). This commit adds a new proxy-server setting
that allows some headers to be added cluster-wide to the CORS
Access-Control-Expose-Headers header.
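For example, to expose those two headers to browsers cluster-wide (the
[app:proxy-server] placement is my assumption; values illustrative):
    [app:proxy-server]
    cors_expose_headers = X-Storage-Policy, X-Container-Read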
Change-Id: I5ca90a052f27c98a514a96ee2299bfa1b6d46334
The reclaim_age is a DiskFile option; it doesn't make sense for two
different object services or nodes to use different values.
I also drive-by clean up the reclaim_age plumbing from get_hashes to
cleanup_ondisk_files, since cleanup_ondisk_files is a method on the
Manager and has access to the configured reclaim_age. This fixes a bug
where finalize_put wouldn't use the [DEFAULT]/object-server configured
reclaim_age - which is normally benign but leads to weird behavior on
DELETE requests with a really small reclaim_age.
There are a couple of places in the replicator and reconstructor that
reach into their manager to borrow the reclaim_age when emptying out
the aborted PUTs that failed to clean up their files in tmp - but that
timeout doesn't really need to be coupled with reclaim_age and that
method could just as reasonably have been implemented on the Manager.
UpgradeImpact: Previously the reclaim_age was documented to be
configurable in various object-* services config sections, but that did
not work correctly unless you also configured the option for the
object-server because of REPLICATE request rehash cleanup. All object
services must use the same reclaim_age. If you require a non-default
reclaim age it should be set in the [DEFAULT] section. If there are
different non-default values, the greater should be used for all object
services and configured only in the [DEFAULT] section.
If you specify a reclaim_age value in any object-related config, you
should move it to *only* the [DEFAULT] section before you upgrade. If
you configure a reclaim_age less than your consistency window, you are
likely to be eaten by a Grue.
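For example, after the upgrade a non-default value would live only
here (604800 seconds, i.e. one week, is the usual default; value
illustrative):
    [DEFAULT]
    reclaim_age = 604800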
Closes-Bug: #1626296
Change-Id: I2b9189941ac29f6e3be69f76ff1c416315270916
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
The goal is to allow the scheduling priority and the I/O scheduling
class and priority of a daemon/server to be modified via configuration.
The settings are optional; the default keeps the current behaviour.
Use case:
Prioritize the object-server over the object-auditor, because all user
requests need to be served in peak hours and auditing can wait.
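A minimal sketch of that use case, assuming the options are named
nice_priority, ionice_class and ionice_priority (values illustrative) -
deprioritizing the auditor effectively prioritizes the object-server:
    [object-auditor]
    nice_priority = 10
    ionice_class = IOPRIO_CLASS_IDLE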
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
DocImpact
Change-Id: I1018a18f4706daabdb84574ffd9a58d831e68396
The deployment guide does not talk about regions. Also, it does not
specify that regions and zones need to be ints.
This patch adds a brief description of regions and changes the numbers
to ints. It also adds region to the document that talks about the ring
data structure.
Change-Id: I04ce42fb3e5c1f08e7f7ff6be23482cee8bdeb71
Partial-Bug: #1583551
In the Swift deployment guide, the region is missing from the syntax
for adding a new device to the swift-ring-builder.
This patch adds the region to the syntax.
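For example, adding a device with an explicit region (values
illustrative):
    swift-ring-builder object.builder add r1z1-192.168.1.10:6200/sdb1 100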
Change-Id: I43e247c92d461efd530c0f82ca3daddcb9e2ba5b
Closes-Bug: #1584127
Change the recommended ports for Swift services from 6000-6002 to the
unused ports 6200-6202, so they do not conflict with X-Windows or other
services.
Updated SAIO docs.
DocImpact
Closes-Bug: #1521339
Change-Id: Ie1c778b159792c8e259e2a54cb86051686ac9d18
This patch removes the threads_per_disk setting. It was already
deprecated and defaulted to 0, which effectively meant not using a
per-disk thread pool at all. Users are encouraged to use
servers_per_port instead.
DocImpact
Change-Id: Ie76be5c8a74d60a1330627caace19e06d1b9383c
Add the ability to set the fallocate_reserve value as a percentage.
This happens automatically when adding a '%' at the end of the value.
Having the ability to set a percentage of free space rather than a byte
value is useful, especially when drive sizes are heterogeneous.
The default for fallocate_reserve has been adjusted to 1%; having
fallocate_reserve set seems sensible for all deployments, and
percentages are far safer to default to than byte values (across drives
of any size).
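For example, in the server's [DEFAULT] section (the placement is my
assumption):
    [DEFAULT]
    fallocate_reserve = 1%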
Tests added for using fallocate_reserve as a percentage.
Duplicate tests for fallocate_reserve have been removed.
Docs updated to reflect the fallocate_reserve change.
Change-Id: I4aea613a708205c917e81d6b2861396655e73238
DiskFile already fills in the _ondisk_info attribute when it tries to open
a diskfile - even if the DiskFile's fileset is not valid or deleted.
During this process the rsync tempfiles would be discovered and logged,
but no one would attempt to clean them up - even if they were really old.
Instead of logging and ignoring unexpected files when validating a
DiskFile fileset, we'll add unexpected files to the unexpected key in
the _ondisk_info attribute.
With a little bit of re-organization in the auditor's object_audit method
to get things into a single return path we can add an unconditional check
for unexpected files and remove those that are "old enough".
Since the replicator will kill any rsync processes that run longer
than the configured rsync_timeout, we know that any rsync tempfiles
older than this can be deleted.
Split unlink_older_than in common.utils into two functions to allow an
explicit list of previously discovered paths to be passed in, avoiding
an extra listdir. Since the getmtime handling already ignores OSError,
there's less concern about a race condition where a previously
discovered unexpected file is reaped by rsync while we're attempting
to clean it up.
Update some docs on the new config option.
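A minimal sketch, assuming the new option is the object-auditor's
rsync_tempfile_timeout (name and placement are my assumption; 'auto'
would derive the timeout from the replicator's rsync_timeout):
    [object-auditor]
    rsync_tempfile_timeout = auto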
Closes-Bug: #1554005
Change-Id: Id67681cb77f605e3491b8afcb9c69d769e154283