For various reasons, an operator might want to use specific nameservers
instead of the system's ones to resolve CNAMEs in cname_lookup. This patch
creates a new configuration variable, nameservers, which accepts a
comma-separated list of nameservers. If not specified or empty, the
system's nameservers are used as before.
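For example, an illustrative filter section (values are placeholders):
[filter:cname_lookup]
use = egg:swift#cname_lookup
storage_domain = example.com
# resolve CNAMEs with these resolvers instead of the system ones
nameservers = 127.0.0.1, 10.0.0.2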
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Change-Id: I34219e6ab7e45678c1a80ff76a1ac0730c64ddde
This is an alternative approach to that proposed in [1]
Adds support for optional per-policy config sections
to be added in proxy-server.conf. This is highly desirable
to allow per-policy affinity options to be set for use with
duplicated EC policies [2] and composite rings [3].
Certain options found in per-policy conf sections will
override their equivalents that may be set in the
[app:proxy-server] section. Currently the options
handled that way are:
sorting_method
read_affinity
write_affinity
write_affinity_node_count
For example:
[proxy-server:policy:0]
sorting_method = affinity
read_affinity = r1=100
write_affinity = r1
write_affinity_node_count = 1 * replicas
The corresponding attributes of the proxy-server Application
are now available from instances of an OverrideConf object
that is obtained from Application.get_policy_options(policy).
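As a rough illustration of the override semantics (not Swift's actual
implementation; the names below are invented), a per-policy section
shadows the global value and anything not overridden falls back to the
[app:proxy-server] setting:
GLOBAL_CONF = {'sorting_method': 'shuffle', 'read_affinity': ''}
PER_POLICY_CONF = {0: {'sorting_method': 'affinity', 'read_affinity': 'r1=100'}}

def get_policy_option(policy_index, name):
    # use the per-policy override when present, else the global setting
    return PER_POLICY_CONF.get(policy_index, {}).get(name, GLOBAL_CONF[name])

print(get_policy_option(0, 'sorting_method'))  # 'affinity' (overridden)
print(get_policy_option(1, 'sorting_method'))  # 'shuffle' (global default)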
[1] Related-Change: I9104fc789ba85ab3ab5ccd34096125b482821389
[2] Related-Change: Idd155401982a2c48110c30b480966a863f6bd305
[3] Related-Change: I0d8928b55020592f8e75321d1f7678688301d797
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
Change-Id: I3f718f425f525baa80045ba067950c752bcaaefc
Add entries for these options in the deployment guide and
make the text in proxy-server.conf-sample and man page
consistent.
Change-Id: I5854ddb3e5864ddbeaf9ac2c930bfafdb47517c3
An operator offering a web UX to its customers might want to allow web
browsers to access some headers by default (e.g. X-Storage-Policy,
X-Container-Read, ...). This commit adds a new proxy-server setting that
allows a list of headers to be added cluster-wide to the CORS
Access-Control-Expose-Headers response header.
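For illustration, assuming the new option is named cors_expose_headers
(check proxy-server.conf-sample for the exact name):
[app:proxy-server]
use = egg:swift#proxy
# exposed to browsers cluster-wide via Access-Control-Expose-Headers
cors_expose_headers = X-Storage-Policy, X-Container-Read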
Change-Id: I5ca90a052f27c98a514a96ee2299bfa1b6d46334
This patch enables efficient PUT/GET for globally distributed clusters [1].
Problem:
Erasure coding can decrease the amount of actually stored data compared
to the replicated model. For example, with ec_k=6, ec_m=3 the stored data
can be 1.5x the original size, which is smaller than the 3x of replication.
However, unlike replication, erasure coding requires at least ec_k of the
total ec_k + ec_m fragments to be available to service a read (e.g. 6 of
9 in the case above). As such, if we store an EC object in a swift
cluster spanning 2 geographically distributed data centers with the same
volume of disks, the fragments will likely be spread roughly evenly
(about 4 and 5), so we still need to reach the faraway data center to
decode the original object. In addition, if one of the data centers is
lost in a disaster, the stored objects are lost forever, and we have to
cry a lot. To ensure highly durable storage, you might think of adding
*more* parity fragments (e.g. ec_k=6, ec_m=10); unfortunately this causes
*significant* performance degradation due to the cost of the mathematical
calculations for erasure coding encode/decode.
How this resolves the problem:
EC Fragment Duplication extends the initial solution by adding *more*
fragments from which to rebuild an object, similar to the solution
described above. The difference is that it makes *copies* of the encoded
fragments. Experimental results [1][2] show that employing a small ec_k
and ec_m gives sufficient performance to store/retrieve objects.
On PUT:
- Encode the incoming object with small ec_k and ec_m <- faster!
- Make duplicated copies of the encoded fragments. The # of copies
  is determined by 'ec_duplication_factor' in swift.conf
- Store all fragments in Swift Global EC Cluster
The duplicated fragments increase pressure on the existing requirements
for decoding objects when servicing a read request. All fragments are
stored with their X-Object-Sysmeta-Ec-Frag-Index. With this change, the
X-Object-Sysmeta-Ec-Frag-Index represents the actual fragment index
encoded by PyECLib, so there *will* be duplicates. Whenever we must
decode the original object data, we may only use ec_k fragments that are
unique according to their X-Object-Sysmeta-Ec-Frag-Index; duplicate
indexes should be expected and avoided when selecting fragments for
decode.
On GET:
This patch includes the following changes:
- Change the GET path to sort primary nodes into subsets, so that
  each subset includes only unique fragments
- Change the reconstructor to be more aware of possibly duplicated
  fragments
For example, with this change a policy could be configured in swift.conf
as:
ec_num_data_fragments = 2
ec_num_parity_fragments = 1
ec_duplication_factor = 2
(object ring must have 6 replicas)
At the object-server:
node index (from object ring):  0 1 2 3 4 5  <- kept for reconstruct
                                                decisions
X-Object-Sysmeta-Ec-Frag-Index: 0 1 2 0 1 2  <- actual fragment index
                                                for the backend (PyECLib)
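A minimal sketch (not Swift code) of the mapping shown above, assuming
the actual fragment index is simply the node index modulo the number of
unique fragments:
ec_k, ec_m, ec_duplication_factor = 2, 1, 2
unique_fragments = ec_k + ec_m                             # 3
ring_replicas = unique_fragments * ec_duplication_factor  # 6
for node_index in range(ring_replicas):
    frag_index = node_index % unique_fragments
    print(node_index, frag_index)  # 0->0 1->1 2->2 3->0 4->1 5->2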
Additional improvements to Global EC Cluster Support will require
features such as Composite Rings, and more efficient fragment
rebalance/reconstruction.
1: http://goo.gl/IYiNPk (Swift Design Spec Repository)
2: http://goo.gl/frgj6w (Slide Share for OpenStack Summit Tokyo)
DocImpact
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: Idd155401982a2c48110c30b480966a863f6bd305
The domain_remap middleware can work together with the cname_lookup
middleware. The latter accepts a list of domains for storage_domain. To
be consistent, domain_remap should have the same behavior.
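For example (illustrative domains):
[filter:domain_remap]
use = egg:swift#domain_remap
storage_domain = example.com, alt.example.com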
Closes-Bug: #1664647
Change-Id: Iacc6619968cc7c677bf63e0b8d101a20c86ce599
The handoffs_first mode in the replicator has the useful behavior of
processing all handoff parts across all disks until there aren't any
handoffs left on the node [1], and then it seemingly tries to drop back
into normal operation. In practice I've only ever heard of handoffs_first
being used while rebalancing and turned off as soon as the rebalance
finishes - running with handoffs_first mode turned on is not recommended,
and a warning is emitted on startup if the option is enabled.
The handoffs_first mode in the reconstructor doesn't work - it was
prioritizing handoffs *per-part* [2] - which is really unfortunate
because during a rebalance it's often *much* more attractive, from a
disk/network efficiency perspective, for the reconstructor to revert a
partition from a handoff than to rebuild an entire partition from another
primary using the other EC fragments in the cluster.
This change deprecates handoffs_first in favor of handoffs_only in the
reconstructor, which is far more useful - and, just like handoffs_first
mode in the replicator, it gives the operator the option of forcing the
consistency engine to focus on rebalance. The handoffs_only behavior is
somewhat consistent with the replicator's handoffs_first option (any
error on any handoff in the replicator will make it essentially handoff
only forever), but the option does what you want and is named correctly
in the reconstructor.
For consistency with the replicator, the reconstructor will mostly honor
the handoffs_first option, but if you set handoffs_only in the config it
always takes precedence. Having handoffs_first in your config always
results in a warning, but if handoffs_only is not set and handoffs_first
is true, the reconstructor will assume you want handoffs_only and behave
as such.
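For example, during a rebalance an operator might temporarily set
(illustrative snippet):
[object-reconstructor]
# focus only on reverting handoffs; remove or set to false once the
# rebalance has finished
handoffs_only = true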
When running in handoffs_only mode the reconstructor will start to log a
warning every cycle if you leave it running after it has finished
reverting handoffs. You should be monitoring on-disk partitions and
disable the option as soon as the cluster finishes the full rebalance
cycle.
1. Ia324728d42c606e2f9e7d29b4ab5fcbff6e47aea fixed replicator
handoffs_first "mode"
2. Unlike replication, each partition in an EC policy can have a
different kind of job per frag_index, but the cardinality of jobs is
typically only one (either sync or revert) unless there have been a bunch
of errors during writes, in which case handoff partitions may hold a
number of different fragments.
Known-Issues:
handoffs_only is not documented outside of the example config, see lp
bug #1626290
Closes-Bug: #1653018
Change-Id: Idde4b6cf92fab6c45f2c0c2733277701eb436898
The reclaim_age is a DiskFile option; it doesn't make sense for two
different object services or nodes to use different values.
As a drive-by, this also cleans up the reclaim_age plumbing from
get_hashes to cleanup_ondisk_files, since the latter is a method on the
Manager and has access to the configured reclaim_age. This fixes a bug
where finalize_put wouldn't use the [DEFAULT]/object-server configured
reclaim_age - which is normally benign but leads to weird behavior on
DELETE requests with a really small reclaim_age.
There are a couple of places in the replicator and reconstructor that
reach into their manager to borrow the reclaim_age when emptying out
aborted PUTs that failed to clean up their files in tmp - but that
timeout doesn't really need to be coupled with reclaim_age, and that
method could just as reasonably have been implemented on the Manager.
UpgradeImpact: Previously the reclaim_age was documented as configurable
in the various object-* service config sections, but that did not work
correctly unless you also configured the option for the object-server,
because of REPLICATE request rehash cleanup. All object services must use
the same reclaim_age. If you require a non-default reclaim age, it should
be set in the [DEFAULT] section. If there are different non-default
values, the greater one should be used for all object services and
configured only in the [DEFAULT] section.
If you specify a reclaim_age value in any object-related config, you
should move it to *only* the [DEFAULT] section before you upgrade. If you
configure a reclaim_age less than your consistency window, you are likely
to be eaten by a Grue.
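For example, in object-server.conf (illustrative value; one week in
seconds):
[DEFAULT]
# must be the same for all object services on the node, and larger than
# your consistency window
reclaim_age = 604800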
Closes-Bug: #1626296
Change-Id: I2b9189941ac29f6e3be69f76ff1c416315270916
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Before creating a static large object, we must verify that all of the
referenced segments exist. Previously, this was done sequentially; due
to latency between proxy and object nodes, clients had to be careful to
either keep their segment count low or use very long (minute+) timeouts.
We mitigate this somewhat by enforcing a hard limit on segment count,
but even then, HEADing a thousand segments (the default limit) with an
average latency of (say) 100ms will require more than a minute and a
half.
Further, the nested-SLO approach requires multiple requests from the
client -- as a result, Swift3 is in the position of enforcing a lower
limit than S3's 10,000 (which will break some clients) or requiring that
clients have timeouts on the order of 15-20 minutes (!).
Now, we'll perform the segment HEADs in parallel, with a concurrency
factor set by the operator. This is very similar to (and builds upon)
the parallel-bulk-delete work. By default, two HEAD requests will be
allowed at a time.
As a side-effect, we'll also only ever HEAD a path once per manifest.
Previously, if a manifest alternated between two paths repeatedly (for
instance, because the user wanted to splice together various ranges from
two sub-SLOs), then each entry in the manifest would trigger a fresh
HEAD request.
Upgrade Consideration
=====================
If operators would like to preserve the prior (single-threaded) SLO
creation behavior, they must add the following line to their
[filter:slo] proxy config section:
concurrency = 1
This may be done prior to upgrading Swift.
UpgradeImpact
Closes-Bug: #1637133
Related-Change: I128374d74a4cef7a479b221fd15eec785cc4694a
Change-Id: I567949567ecdbd94fa06d1dd5d3cdab0d97207b6
Fixes this problem:
* swift-drive-audit needs to be run by root, because only root has
  permission to umount
* swift-object servers typically run as the swift user
* if swift-drive-audit is run by root, /var/cache/swift/drive.recon is
  owned by root, with mode 0o600
* the recon middleware (inside swift-object-server) can't read this cache
  file: swift-object: Error reading recon cache file
This patch adds a "user" option to the drive-audit config file. The recon
cache is chowned to this user.
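For example (illustrative snippet, assuming the object services run as
the swift user):
[drive-audit]
# the recon cache file will be chowned to this user
user = swift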
Change-Id: Ibf20543ee690b7c5a37fabd1540fd5c0c7b638c9
This came to light because someone ran Tempest against a standard
installation of RDO, which helpfully terminates SSL for Swift in a
pre-configured load balancer. In such a case, staticweb has no way to
know what scheme to use and guesses wrong, causing Tempest to fail.
Related upstream bug:
https://bugs.launchpad.net/mos/+bug/1537071
Change-Id: Ie15cf2aff4f7e6bcf68b67ae733c77bb9353587a
Closes-Bug: 1572011
Prior to the Mitaka release the install guides showed services
(including Swift) being in a default Keystone domain which existed by
default and had id=default. This domain id is reflected in the
proxy-server.conf-sample authtoken options and is also shown in the man
page and auth docs.
The Mitaka install guide shows a domain with *name* default being
created, with a random UUID assigned, in which services are created. This
has caused confusion (see the discussion on the linked bug report).
This patch does not change the sample options but does
add to the comments in order to emphasize that a user
may need to alter the options to match their Keystone
configuration.
Change-Id: I17bfcdbd983402eeb561bb704b8b1f1e27547c7d
Partial-Bug: #1604674
The goal is to allow the scheduling priority and the I/O scheduling
class and priority of a daemon/server to be modified via configuration.
The settings are optional; the defaults keep the current behaviour.
Use case:
Prioritize the object-server over the object-auditor, because all user
requests need to be served in peak hours while auditing can wait.
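For illustration, assuming option names along the lines of nice_priority,
ionice_class and ionice_priority (check the sample configs for the exact
names and accepted values):
[object-server]
nice_priority = 0
ionice_class = IOPRIO_CLASS_BE
ionice_priority = 0

[object-auditor]
# lower CPU and I/O priority so audits yield to user requests
nice_priority = 10
ionice_class = IOPRIO_CLASS_IDLE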
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
DocImpact
Change-Id: I1018a18f4706daabdb84574ffd9a58d831e68396
Include a note in container-sync docs pointing to specific
configuration needed to be compatible with encryption.
Also remove the sample encryption root secret from
proxy-server.conf-sample and in-process test setup. Remove encryption
middleware from the default proxy pipeline.
Change-Id: Ibceac485813f3ac819a53e644995749735592a55
Adds encryption middlewares.
All object servers and proxy servers should be upgraded before
introducing encryption middleware.
Encryption middleware should first be introduced with its
disable_encryption option set to True. Once all proxies have the
encryption middleware installed, this option may be set to False (the
default).
Increases constraints.py:MAX_HEADER_COUNT by 4 to allow for
headers generated by encryption-related middleware.
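For illustration only (the middleware and option names below are assumed
from the sample config, not defined by this commit message):
[pipeline:main]
pipeline = catch_errors gatekeeper proxy-logging cache tempauth copy slo dlo versioned_writes keymaster encryption proxy-logging proxy-server

[filter:keymaster]
use = egg:swift#keymaster
encryption_root_secret = <base64-encoded secret>

[filter:encryption]
use = egg:swift#encryption
# leave True until every proxy has the middleware installed
disable_encryption = True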
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Christian Cachin <cca@zurich.ibm.com>
Co-Authored-By: Mahati Chamarthy <mahati.chamarthy@gmail.com>
Co-Authored-By: Peter Chng <pchng@ca.ibm.com>
Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Jonathan Hinson <jlhinson@us.ibm.com>
Co-Authored-By: Hamdi Roumani <roumani@ca.ibm.com>
UpgradeImpact
Change-Id: Ie6db22697ceb1021baaa6bddcf8e41ae3acb5376
If you have 2 swift regions served by the same keystone,
then the client cannot get the correct URL for the swift endpoint
without specifying a region_name.
Closes-Bug: 1587088
Change-Id: Iaab883386e125c3ca6b9554389e63df17267a135
Before, server-side deletes of static large objects could take a long
time to complete since the proxy would wait for a response to each
segment DELETE before starting the next DELETE request.
Now, operators can configure a concurrency factor for the slo and bulk
middlewares to allow up to N concurrent DELETE requests. By default, two
DELETE requests will be allowed at a time.
Note that objects and containers are now deleted in separate passes, to
reduce the likelihood of 409 Conflict responses when deleting
containers.
Upgrade Consideration
=====================
If operators have enabled the bulk or slo middlewares and would like to
preserve the prior (single-threaded) DELETE behavior, they must add the
following line to their [filter:slo] and [filter:bulk] proxy config
sections:
delete_concurrency = 1
This may be done prior to upgrading Swift.
UpgradeImpact
Closes-Bug: 1524454
Change-Id: I128374d74a4cef7a479b221fd15eec785cc4694a
Rewrite the server side copy and 'object post as copy' features as
middleware to simplify the PUT method in the object controller code. COPY
is no longer a verb implemented as a public method in the Proxy
application.
The server side copy middleware is inserted to the left of the dlo, slo
and versioned_writes middlewares in the proxy server pipeline. As a
result, the dlo and slo copy_hooks are no longer required. SLO manifests
are now validated when copied, so when copying a manifest to another
account the referenced segments must be readable in that account for the
manifest copy to succeed (previously this validation was not made,
meaning the manifest was copied but could be unusable if the segments
were not readable).
With this change, there should be no change in functionality or existing
behavior. This is asserted with (almost) no changes required to existing
functional tests.
Some notes (for operators):
* The middleware is required and is auto-inserted before the slo, dlo
  and versioned_writes middlewares (see the pipeline sketch below)
* Turning off server side copy is not configurable.
* object_post_as_copy is no longer a configurable option of the proxy
  server but of this middleware. However, for a smooth upgrade, a config
  option set in the proxy server app is also read.
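A simplified pipeline sketch (illustrative; most other middlewares
omitted), showing copy placed to the left of slo, dlo and
versioned_writes:
[pipeline:main]
pipeline = catch_errors proxy-logging cache tempauth copy slo dlo versioned_writes proxy-logging proxy-server

[filter:copy]
use = egg:swift#copy
# read here going forward; a value set in [app:proxy-server] is still
# honored for smooth upgrades
object_post_as_copy = true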
DocImpact: Introducing server side copy as middleware
Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Change-Id: Ic96a92e938589a2f6add35a40741fd062f1c29eb
Signed-off-by: Prashanth Pai <ppai@redhat.com>
Signed-off-by: Thiago da Silva <thiago@redhat.com>
Change the recommended ports for Swift services from 6000-6002 to the
unused ports 6200-6202, so that they do not conflict with X-Windows or
other services.
Updated SAIO docs.
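For example, an object-server.conf would move to the new recommendation
(illustrative; container and account servers move to 6201 and 6202
respectively):
[DEFAULT]
bind_port = 6200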
DocImpact
Closes-Bug: #1521339
Change-Id: Ie1c778b159792c8e259e2a54cb86051686ac9d18
This patch removes the threads_per_disk setting. It was already
deprecated and was set to 0 by default, which effectively meant no
per-disk thread pool was used at all. Users are encouraged to use
servers_per_port instead.
DocImpact
Change-Id: Ie76be5c8a74d60a1330627caace19e06d1b9383c
Add the ability to set the fallocate_reserve value as a percentage. This
happens automatically when a '%' is added at the end of the value.
Being able to reserve a % of free space rather than a byte value is
useful, especially when drive sizes are heterogeneous.
The default for fallocate_reserve has been adjusted to 1%; having
fallocate_reserve set seems sensible for all deploys, and percentages are
far safer defaults than byte values (across drives of any size).
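For example, in an object-server.conf (the new default expressed
explicitly):
[DEFAULT]
# reserve 1% of each disk; a plain byte value such as 10737418240 is
# still accepted
fallocate_reserve = 1%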
Tests added for using fallocate_reserve as a percentage.
Duplicate tests for fallocate_reserve have been removed.
Docs updated to reflect the fallocate_reserve change.
Change-Id: I4aea613a708205c917e81d6b2861396655e73238
DiskFile already fills in the _ondisk_info attribute when it tries to
open a diskfile - even if the DiskFile's fileset is not valid or deleted.
During this process rsync tempfiles would be discovered and logged, but
no one would attempt to clean them up - even if they were really old.
Instead of logging and ignoring unexpected files when validating a
DiskFile fileset, we'll add unexpected files to the unexpected key in the
_ondisk_info attribute.
With a little bit of re-organization in the auditor's object_audit method
to get things into a single return path, we can add an unconditional
check for unexpected files and remove those that are "old enough".
Since the replicator will kill any rsync processes that are running longer
than the configured rsync_timeout we know that any rsync tempfiles older
than this can be deleted.
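For reference, an illustrative replicator setting (900s is the usual
sample value):
[object-replicator]
# rsync invocations running longer than this are killed, so tempfiles
# older than this are safe to reap
rsync_timeout = 900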
Split unlink_older_than in common.utils into two functions to allow an
explicit list of previously discovered paths to be passed in, avoiding an
extra listdir. Since the getmtime handling already ignores OSError,
there's less concern about a race condition where a previously discovered
unexpected file is reaped by rsync while we're attempting to clean it up.
Updates the docs for the new config option.
Closes-Bug: #1554005
Change-Id: Id67681cb77f605e3491b8afcb9c69d769e154283
Updates docs to remove warnings that container sync only
works with object_post_as_copy=True. Since commit e91de49
container sync will also sync POST updates when using
object_post_as_copy=False.
Change-Id: I5cc3cc6e8f9ba2fef6f896f2b11d2a4e06825f7f
Swift now uses the SSYNC verb instead of the old REPLICATION verb for
the ssync protocol. This patch replaces REPLICATION with SSYNC throughout
the docs and fixes a few words of explanation.
Change-Id: I1253210d4f49749e7d425d6252dd262b650d9548
This change adds 2 new parameters to enable and control concurrent GETs
in swift: 'concurrent_gets' and 'concurrency_timeout'.
'concurrent_gets' allows you to turn concurrent GETs on or off. When on,
the GET/HEAD concurrency is set to the replica count, or to ndata in the
case of EC HEADs. The proxy will then serve only the first valid source
to respond. This applies to all account, container and object GETs except
for EC; for EC, only HEAD requests are affected.
It achieves this by changing the request sending mechanism to use a
GreenAsyncPile and green threads, with a timeout between each request.
'concurrency_timeout' is related to concurrent_gets and is the amount of
time to wait before firing the next thread. A value of 0 fires all
requests at the same time (fully concurrent); another value staggers the
firing, giving each node a shorter chance to respond before the next
request is fired. This value is a float and should be somewhere between 0
and node_timeout. The default is conn_timeout, meaning that by default
the firing is staggered.
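For example (illustrative values):
[app:proxy-server]
use = egg:swift#proxy
concurrent_gets = on
# wait 0.5s before firing the next GET/HEAD to another node
concurrency_timeout = 0.5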
DocImpact
Implements: blueprint concurrent-reads
Change-Id: I789d39472ec48b22415ff9d9821b1eefab7da867
With the X-Timestamp validation added in commit e619411, end users
could upload objects with
X-Timestamp: 9999999999.99999_ffffffffffffffff
(the maximum value) and Swift would be unable to delete them.
Now, inbound X-Timestamp headers will be moved to
X-Backend-Inbound-X-Timestamp, effectively rendering them harmless.
The primary reason to allow X-Timestamp before was to prevent
Last-Modified changes for objects coming from either:
* container_sync or
* a migration from another storage system.
To enable the former use-case, the container_sync middleware will now
translate X-Backend-Inbound-X-Timestamp headers back to X-Timestamp
after verifying the request.
Additionally, a new option is added to the gatekeeper filter config:
# shunt_inbound_x_timestamp = true
To enable the latter use-case (or any other use-case not mentioned), set
this to false.
Upgrade Consideration
=====================
If your cluster workload requires that clients be allowed to specify
objects' X-Timestamp values, disable the shunt_inbound_x_timestamp
option before upgrading.
UpgradeImpact
Change-Id: I8799d5eb2ae9d795ba358bb422f69c70ee8ebd2c