143 Commits

Author SHA1 Message Date
Zuul
b05b27c0b6 Merge "Add note about rsync_bwlimit suffixes" 2022-08-30 22:53:02 +00:00
Zuul
24acc6e56b Merge "Add backend rate limiting middleware" 2022-08-30 07:18:57 +00:00
Tim Burke
a9177a4b9d Add note about rsync_bwlimit suffixes
Change-Id: I019451e118d3bd7263a52cf4bf354d0d0d2b4607
2022-08-26 08:54:06 -07:00
Zuul
73b2730f71 Merge "Add ring_ip option to object services" 2022-06-06 21:04:48 +00:00
Clay Gerrard
12bc79bf01 Add ring_ip option to object services
Object services will use this when finding their own devices in rings;
it defaults to the bind_ip.

Notably, this allows services to be containerized while servers_per_port
is enabled:

* For the object-server, the ring_ip should be set to the host ip and
  will be used to discover which ports need binding. Sockets will still
  be bound to the bind_ip (likely 0.0.0.0), with the assumption that the
  host will publish ports 1:1.

* For the replicator and reconstructor, the ring_ip will be used to
  discover which devices should be replicated. While bind_ip could
  previously be used for this, it would have required a separate config
  from the object-server.

Also rename the object daemon's bind_ip attribute to ring_ip so that it's
more obvious wherever we're using the IP for ring lookups instead of
socket binding.
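
A minimal object-server.conf DEFAULT sketch for the containerized case
described above; the address and port count are illustrative:

    [DEFAULT]
    # Bind inside the container; the host publishes ports 1:1.
    bind_ip = 0.0.0.0
    # Use the host IP when looking up this node's devices in the ring.
    ring_ip = 192.0.2.10
    servers_per_port = 4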

Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Change-Id: I1c9bb8086994f7930acd8cda8f56e766938c2218
2022-06-02 16:31:29 -05:00
Zuul
d1f2e82556 Merge "replicator: Log rsync file transfers less" 2022-05-27 18:32:46 +00:00
Alistair Coles
ccaf49a00c Add backend rate limiting middleware
This is a fairly blunt tool: ratelimiting is per device and
applied independently in each worker, but this at least provides
some limit to disk IO on backend servers.

GET, HEAD, PUT, POST, DELETE, UPDATE and REPLICATE methods may be
rate-limited.

Only requests with a path starting with '<device>/<partition>', where
<partition> can be cast to an integer, will be rate-limited. Other
requests, including, for example, recon requests with paths such as
'recon/version', are unconditionally forwarded to the next app in the
pipeline.

OPTIONS and SSYNC methods are not rate-limited. Note that
SSYNC sub-requests are passed directly to the object server app
and will not pass through this middleware.
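
As a rough sketch, the middleware would sit in the object-server pipeline
along these lines; the filter name and option shown are assumptions based
on the description above:

    [pipeline:main]
    pipeline = healthcheck recon backend_ratelimit object-server

    [filter:backend_ratelimit]
    use = egg:swift#backend_ratelimit
    # Assumed option name: per-device request rate, applied in each worker.
    requests_per_device_per_second = 50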

Change-Id: I78b59a081698a6bff0d74cbac7525e28f7b5d7c1
2022-05-20 14:40:00 +01:00
Tim Burke
7e69176817 replicator: Log rsync file transfers less
- Drop log level for successful rsyncs to debug; ops don't usually care.
- Add an option to skip "send" lines entirely -- in a large cluster,
  during a meaningful expansion, there's too much information getting
  logged; it's just wasting disk space.

Note that we already have similar filtering for directory creation;
that's been present since the initial commit of Swift code.

Drive-by: make it a little more clear that more than one suffix was
likely replicated when logging about success.
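
A sketch of the corresponding replicator settings, assuming the new option
is named log_rsync_transfers:

    [object-replicator]
    # Successful rsyncs now log at debug; also suppress per-file "send" lines.
    log_rsync_transfers = false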

Change-Id: I02ba67e77e3378b2c2c8c682d5d230d31cd1bfa9
2022-04-28 12:35:00 -07:00
Tim Burke
043e0163ed Clarify that rsync_io_timeout is also used for contimeout
Change-Id: I5e4a270add2a625e6d5cb0ae9468313ddc88a81b
2022-04-28 10:07:50 -07:00
Alistair Coles
51da2543ca object-updater: defer ratelimited updates
Previously, objects updates that could not be sent immediately due to
per-container/bucket ratelimiting [1] would be skipped and re-tried
during the next updater cycle. There could potentially be a period of
time at the end of a cycle when the updater slept, having completed a
sweep of the on-disk async pending files, despite having skipped
updates during the cycle. Skipped updates would then be read from disk
again during the next cycle.

With this change the updater will defer skipped updates to an
in-memory queue (up to a configurable maximum number) until the sweep
of async pending files has completed, and then trickle out deferred
updates until the cycle's interval expires. This increases the useful
work done in the current cycle and reduces the amount of repeated disk
IO during the next cycle.

The deferrals queue is bounded in size and will evict least recently
read updates in order to accept more recently read updates. This
reduces the probability that a deferred update has been made obsolete
by newer on-disk async pending files while waiting in the deferrals
queue.

The deferrals queue is implemented as a collection of per-bucket
queues so that updates can be drained from the queues in the order
that buckets cease to be ratelimited.
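
A sketch of the updater config, assuming the deferral queue bound is exposed
as an option named max_deferred_updates:

    [object-updater]
    # Assumed option name: upper bound on skipped updates held in memory
    # until the sweep of async pending files completes.
    max_deferred_updates = 10000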

[1] Related-Change: Idef25cd6026b02c1b5c10a9816c8c6cbe505e7ed

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Change-Id: I95e58df9f15c5f9d552b8f4c4989a474f52262f4
2022-02-21 10:56:23 +00:00
Clay Gerrard
de88862981 Finer grained ratelimit for update
Throw our stream of async_pendings through a hash ring; if the virtual
bucket gets hot just start leaving the updates on the floor and move on.

It's off by default; if you use it, you're probably going to leave a
bunch of async updates pointed at a small set of containers in the queue
for the next sweep, every sweep (so maybe turn it off at some point).
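
A sketch of how this might be enabled, assuming the option is named
max_objects_per_container_per_second (0 keeps it off):

    [object-updater]
    # Assumed option name: per-container/bucket update rate limit.
    max_objects_per_container_per_second = 5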

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: Idef25cd6026b02c1b5c10a9816c8c6cbe505e7ed
2022-01-06 12:47:09 -08:00
Alistair Coles
8ee631ccee reconstructor: restrict max objects per revert job
Previously the ssync Sender would attempt to revert all objects in a
partition within a single SSYNC request. With this change the
reconstructor daemon option max_objects_per_revert can be used to limit
the number of objects reverted inside a single SSYNC request for revert
type jobs i.e. when reverting handoff partitions.

If more than max_objects_per_revert are available, the remaining objects
will remain in the sender partition and will not be reverted until the
next call to ssync.Sender, which would currently be the next time the
reconstructor visits that handoff partition.

Note that the option only applies to handoff revert jobs, not to sync
jobs.
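
For example (the value is illustrative):

    [object-reconstructor]
    # Cap the number of objects reverted in a single SSYNC request for
    # handoff revert jobs; sync jobs are unaffected.
    max_objects_per_revert = 500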

Change-Id: If81760c80a4692212e3774e73af5ce37c02e8aff
2021-12-03 12:43:23 +00:00
Alistair Coles
2696a79f09 reconstructor: retire nondurable_purge_delay option
The nondurable_purge_delay option was introduced in [1] to prevent the
reconstructor removing non-durable data files on handoffs that were
about to be made durable. The DiskFileManager commit_window option has
since been introduced [2] which specifies a similar time window during
which non-durable data files should not be removed. The commit_window
option can be re-used by the reconstructor, making the
nondurable_purge_delay option redundant.

The nondurable_purge_delay option has not been available in any tagged
release and is therefore removed with no backwards compatibility.

[1] Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
[2] Related-Change: I5f3318a44af64b77a63713e6ff8d0fd3b6144f13
Change-Id: I1589a7517b7375fcc21472e2d514f26986bf5079
2021-07-19 21:18:06 +01:00
Alistair Coles
bbaed18e9b diskfile: don't remove recently written non-durables
DiskFileManager will remove any stale files during
cleanup_ondisk_files(): these include tombstones and nondurable EC
data fragments whose timestamps are older than reclaim_age. It can
usually be safely assumed that a non-durable data fragment older than
reclaim_age is not going to become durable. However, if an agent PUTs
objects with specified older X-Timestamps (for example the reconciler
or container-sync) then there is a window of time during which the
object server has written an old non-durable data file but has not yet
committed it to make it durable.

Previously, if another process (for example the reconstructor) called
cleanup_ondisk_files during this window then the non-durable data file
would be removed. The subsequent attempt to commit the data file would
then result in a traceback due to there no longer being a data file to
rename, and of course the data file is lost.

This patch modifies cleanup_ondisk_files to not remove old, otherwise
stale, non-durable data files that were only written to disk in the
preceding 'commit_window' seconds. 'commit_window' is configurable for
the object server and defaults to 60.0 seconds.
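
A sketch of the new setting, assuming it lives in the object-server app
section:

    [app:object-server]
    # Non-durable data files younger than this many seconds are not
    # reclaimed, leaving time for them to be committed as durable.
    commit_window = 60.0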

Closes-Bug: #1936508
Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
Change-Id: I5f3318a44af64b77a63713e6ff8d0fd3b6144f13
2021-07-19 21:18:02 +01:00
Alistair Coles
2fd5b87dc5 reconstructor: make quarantine delay configurable
Previously the reconstructor would quarantine isolated durable
fragments that were more than reclaim_age old. This patch adds a
quarantine_age option for the reconstructor which defaults to
reclaim_age but can be used to configure the age that a fragment must
reach before quarantining.
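
For example (value illustrative; the option defaults to reclaim_age):

    [object-reconstructor]
    # Minimum age, in seconds, before an isolated durable fragment
    # may be quarantined.
    quarantine_age = 604800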

Change-Id: I867f3ea0cf60620c576da0c1f2c65cec2cf19aa0
2021-07-06 16:41:08 +01:00
Zuul
653daf73ed Merge "relinker: tolerate existing tombstone with same timestamp" 2021-07-02 22:59:34 +00:00
Zuul
2efd4316a6 Merge "Make dark data watcher ignore the newly updated objects" 2021-07-02 20:52:55 +00:00
Alistair Coles
574897ae27 relinker: tolerate existing tombstone with same timestamp
It is possible for the current and next part power locations to
both have existing tombstones with different inodes when the
relinker tries to relink. This can be caused, for example, by
concurrent reconciler DELETEs that specify the same timestamp.

The relinker previously failed to relink and reported an error when
encountering this situation. With this patch the relinker will
tolerate an existing tombstone with the same filename but different
inode in the next part power location.

Since [1] the relinker had special case handling for EEXIST errors
caused by a different inode tombstone already existing in the next
partition power location: the relinker would check to see if the
existing next part power tombstone linked to a tombstone in a previous
part power (i.e. < current part power) location, and if so tolerate
the EEXIST.

This special case handling is no longer necessary because the relinker
will now tolerate an EEXIST when linking a tombstone provided the two
files have the same timestamp. There is therefore no need to search
previous part power locations for a tombstone that does link with the
next part power location.

The link_check_limit is no longer used but the --link-check-limit
command line option is still allowed (although ignored) for backwards
compatibility.

[1] Related-Change-Id: If9beb9efabdad64e81d92708f862146d5fafb16c

Change-Id: I07ffee3b4ba6c7ff6c206beaf6b8f746fe365c2b
Closes-Bug: #1934142
2021-07-02 12:14:47 +01:00
Pete Zaitcev
95e0316451 Make dark data watcher ignore the newly updated objects
When objects are freshly uploaded, they may take a little time
to appear in container listings, producing false positives.

Because we needed to test this, we also reworked/added the tests
and fixed some issues, including adding an EC fragment (thanks
to Alistair's code).

Closes-Bug: 1925782
Change-Id: Ieafa72a496328f7a487ca7062da6253994a5a07d
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
2021-06-30 16:38:57 -05:00
Alistair Coles
2934818d60 reconstructor: Delay purging reverted non-durable datafiles
The reconstructor may revert a non-durable datafile on a handoff
concurrently with an object server PUT that is about to make the
datafile durable.  This could previously lead to the reconstructor
deleting the recently written datafile before the object-server
attempts to rename it to a durable datafile, and consequently a
traceback in the object server.

The reconstructor will now only remove reverted nondurable datafiles
that are older (according to mtime) than a period set by a new
nondurable_purge_delay option (defaults to 60 seconds). More recent
nondurable datafiles may be made durable or will remain on the handoff
until a subsequent reconstructor cycle.

Change-Id: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
2021-06-24 09:33:06 +01:00
Zuul
b3def185c6 Merge "Allow floats for all intervals" 2021-05-25 18:15:56 +00:00
Zuul
5ec3826246 Merge "Quarantine stale EC fragments after checking handoffs" 2021-05-11 22:44:16 +00:00
Alistair Coles
46ea3aeae8 Quarantine stale EC fragments after checking handoffs
If the reconstructor finds a fragment that appears to be stale then it
will now quarantine the fragment.  Fragments are considered stale if
insufficient fragments at the same timestamp can be found to rebuild
missing fragments, and the number found is less than or equal to a new
reconstructor 'quarantine_threshold' config option.

Before quarantining a fragment the reconstructor will attempt to fetch
fragments from handoff nodes in addition to the usual primary nodes.
The handoff requests are limited by a new 'request_node_count'
config option.

'quarantine_threshold' defaults to zero, i.e. no fragments will be
quarantined. 'request_node_count' defaults to '2 * replicas'.
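
A sketch of a reconstructor section enabling this behaviour (values
illustrative):

    [object-reconstructor]
    # Quarantine an isolated fragment only if the number of matching
    # fragments found is no more than this threshold.
    quarantine_threshold = 1
    # How many primary + handoff nodes to query before quarantining.
    request_node_count = 2 * replicas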

Closes-Bug: 1655608

Change-Id: I08e1200291833dea3deba32cdb364baa99dc2816
2021-05-10 20:45:17 +01:00
Matthew Oliver
4ce907a4ae relinker: Add /recon/relinker endpoint and drop progress stats
To further improve stats capturing for the relinker, drop partition
progress into a new relinker.recon recon cache and add a new recon endpoint:

  GET /recon/relinker

To get live relinking progress data:

  $ curl http://127.0.0.3:6030/recon/relinker |python -mjson.tool
  {
      "devices": {
          "sdb3": {
              "parts_done": 523,
              "policies": {
                  "1": {
                      "next_part_power": 11,
                      "start_time": 1618998724.845616,
                      "stats": {
                          "errors": 0,
                          "files": 1630,
                          "hash_dirs": 1630,
                          "linked": 1630,
                          "policies": 1,
                          "removed": 0
                      },
                      "timestamp": 1618998730.24672,
                      "total_parts": 1029,
                      "total_time": 5.400741815567017
                  }},
              "start_time": 1618998724.845946,
              "stats": {
                  "errors": 0,
                  "files": 836,
                  "hash_dirs": 836,
                  "linked": 836,
                  "removed": 0
              },
              "timestamp": 1618998730.24672,
              "total_parts": 523,
              "total_time": 5.400741815567017
          },
          "sdb7": {
              "parts_done": 506,
              "policies": {
                  "1": {
                      "next_part_power": 11,
                      "part_power": 10,
                      "parts_done": 506,
                      "start_time": 1618998724.845616,
                      "stats": {
                          "errors": 0,
                          "files": 794,
                          "hash_dirs": 794,
                          "linked": 794,
                          "removed": 0
                      },
                      "step": "relink",
                      "timestamp": 1618998730.166175,
                      "total_parts": 506,
                      "total_time": 5.320528984069824
                  }
              },
              "start_time": 1618998724.845616,
              "stats": {
                  "errors": 0,
                  "files": 794,
                  "hash_dirs": 794,
                  "linked": 794,
                  "removed": 0
              },
              "timestamp": 1618998730.166175,
              "total_parts": 506,
              "total_time": 5.320528984069824
          }
      },
      "workers": {
          "100": {
              "drives": ["sda1"],
              "return_code": 0,
              "timestamp": 1618998730.166175}
      }}

Also, add a constant DEFAULT_RECON_CACHE_PATH to help fix failing tests
by mocking recon_cache_path, so that errors are not logged due
to dump_recon_cache exceptions.

Mock recon_cache_path more widely and assert no error logs more
widely.

Change-Id: I625147dadd44f008a7c48eb5d6ac1c54c4c0ef05
2021-05-10 16:13:32 +01:00
Tim Burke
c374a7a851 Allow floats for all intervals
Change-Id: I91e9bc02d94fe7ea6e89307305705c383087845a
2021-05-05 15:30:21 -07:00
Tim Burke
abfa6bee72 relinker: Parallelize per disk
Add a new option, workers, that works more or less like the same option
from background daemons. Disks will be distributed across N worker
sub-processes so we can make the best use of the I/O available.

While we're at it, log final stats at warning if there were errors.
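
A sketch of the setting; the section name is an assumption and the worker
count is illustrative:

    [object-relinker]
    # Distribute disks across this many worker sub-processes.
    workers = 4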

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I039d2b8861f69a64bd9d2cdf68f1f534c236b2ba
2021-04-05 12:15:56 -07:00
Alistair Coles
3bdd01cf4a relinker: retry links from older part powers
If a previous partition power increase failed to cleanup all files in
their old partition locations, then during the next partition power
increase the relinker may find the same file to relink in more than
one source partition. This currently leads to an error log due to the
second relink attempt getting an EEXIST error.

With this patch, when an EEXIST is raised, the relinker will attempt
to create/verify a link from older partition power locations to the
next part power location, and if such a link is found then suppress
the error log.

During the relink step, if an alternative link is verified and if a
file is found that is neither linked to the next partition power
location nor in the current part power location, then the file is
removed during the relink step. That prevents the same EEXIST occurring
again during the cleanup step when it may no longer be possible to
verify that an alternative link exists.

For example, consider identical filenames in the N+1th, Nth and N-1th
partition power locations, with the N+1th being linked to the Nth:

  - During relink, the Nth location is visited and its link is
    verified. Then the N-1th location is visited and an EEXIST error
    is encountered, but the new check verifies that a link exists to
    the Nth location, which is OK.

  - During cleanup the locations are visited in the same order, but
    files are removed so that the Nth location file no longer exists
    when the N-1th location is visited. If the N-1th location still
    has a conflicting file then existence of an alternative link to
    the Nth location can no longer be verified, so an error would be
    raised. Therefore, the N-1th location file must be removed during
    relink.

The error is only suppressed for tombstones. The number of partition
power locations that the relinker will look back over may be configured
using the link_check_limit option in a conf file or --link-check-limit
on the command line, and defaults to 2.

Closes-Bug: 1921718
Change-Id: If9beb9efabdad64e81d92708f862146d5fafb16c
2021-04-01 18:56:57 +01:00
Tim Burke
53c0fc3403 relinker: Add option to ratelimit relinking
Sure, you could use stuff like ionice or cgroups to limit relinker I/O,
but sometimes a nice simple blunt instrument is handy.
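
A sketch, assuming the rate limit is exposed as files_per_second and lives
in the relinker's config section:

    [object-relinker]
    # Assumed option name: limit the rate at which files are relinked;
    # 0 means unlimited.
    files_per_second = 100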

Change-Id: I7fe29c7913a9e09bdf7a787ccad8bba2c77cf995
2021-02-11 11:31:39 -08:00
Tim Burke
1b7dd34d38 relinker: Allow conf files for configuration
Swap out the standard logger stuff in place of --logfile. Keep --device
as a CLI-only option. Everything else is pretty standard stuff that
ought to be in [DEFAULT].

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I32f979f068592eaac39dcc6807b3114caeaaa814
2021-02-08 14:39:27 -08:00
Samuel Merritt
b971280907 Let developers/operators add watchers to object audit
Swift operators may find it useful to operate on each object in their
cluster in some way. This commit provides them a way to hook into the
object auditor with a simple, clearly-defined boundary so that they
can iterate over their objects without additional disk IO.

For example, a cluster operator may want to ensure semantic
consistency, with all SLO segments accounted for in their manifests,
or locate objects that aren't in container listings. Now that Swift
has encryption support, this could be used to locate unencrypted
objects. The list goes on.

This commit makes the auditor locate, via entry points, the watchers
named in its config file.

A watcher is a class with at least these four methods:

   __init__(self, conf, logger, **kwargs)

   start(self, audit_type, **kwargs)

   see_object(self, object_metadata, data_file_path, **kwargs)

   end(self, **kwargs)

The auditor will call watcher.start(audit_type) at the start of an
audit pass, watcher.see_object(...) for each object audited, and
watcher.end() at the end of an audit pass. All method arguments are
passed as keyword args.
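
A minimal watcher sketch following the four-method API described above; the
class and its behaviour are purely illustrative:

    class ObjectCountWatcher(object):
        def __init__(self, conf, logger, **kwargs):
            self.logger = logger
            self.seen = 0

        def start(self, audit_type, **kwargs):
            # Called once at the start of an audit pass.
            self.seen = 0

        def see_object(self, object_metadata, data_file_path, **kwargs):
            # Called for each audited object; no additional disk IO needed.
            self.seen += 1

        def end(self, **kwargs):
            self.logger.info('Watcher saw %d objects this pass', self.seen)

Such a class would then be exposed via an entry point and named in the
object-auditor's watchers setting in its config file.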

This version of the API is implemented in the context of the
auditor itself, without spawning any additional processes.
If the plugins are not working well -- hang, crash, or leak --
it's easier to debug them when there's no additional complication
of processes that run by themselves.

In addition, we include a reference implementation of a plugin for
the watcher API, as a help to plugin writers.

Change-Id: I1be1faec53b2cdfaabf927598f1460e23c206b0a
2020-12-26 17:16:14 -06:00
Tim Burke
918ab8543e Use socket_timeout kwarg instead of useless eventlet.wsgi.WRITE_TIMEOUT
No version of eventlet that I'm aware of has any sort of support for
eventlet.wsgi.WRITE_TIMEOUT; I don't know why we've been setting that.
On the other hand, the socket_timeout argument for eventlet.wsgi.Server
has been supported for a while -- since 0.14 in 2013.

Drive-by: Fix up handling of sub-second client_timeouts.

Change-Id: I1dca3c3a51a83c9d5212ee5a0ad2ba1343c68cf9
Related-Change: I1d4d028ac5e864084a9b7537b140229cb235c7a3
Related-Change: I433c97df99193ec31c863038b9b6fd20bb3705b8
2020-11-11 14:23:40 -08:00
Zuul
b9a404b4d1 Merge "ec: Add an option to write fragments with legacy crc" 2020-11-02 23:03:49 +00:00
Clay Gerrard
b05ad82959 Add tasks_per_second option to expirer
This allows operators to throttle expirers as needed.
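
For example (value illustrative):

    [object-expirer]
    # Throttle how many expiration tasks are processed per second.
    tasks_per_second = 50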

Partial-Bug: #1784753
Change-Id: If75dabb431bddd4ad6100e41395bb6c31a4ce569
2020-10-23 10:24:52 -05:00
Tim Burke
599f63e762 ec: Add an option to write fragments with legacy crc
When upgrading from liberasurecode<=1.5.0, you may want to continue
writing legacy CRCs until all nodes are upgraded and capable of reading
fragments with zlib CRCs.

Starting in liberasurecode>=1.6.2, we can use the environment variable
LIBERASURECODE_WRITE_LEGACY_CRC to control whether we write zlib or
legacy CRCs, but for many operators it's easier to manage swift configs
than environment variables. Add a new option, write_legacy_ec_crc, to the
proxy-server app and object-reconstructor; if set to true, ensure legacy
frags are written.

Note that more daemons instantiate proxy-server apps than just the
proxy-server. The complete set of impacted daemons should be:

  * proxy-server
  * object-reconstructor
  * container-reconciler
  * any users of internal-client.conf
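
During the upgrade window, the option would be set along these lines
(section names correspond to the daemons listed above):

    [app:proxy-server]
    # Remove once all nodes run upgraded liberasurecode (step 4 below).
    write_legacy_ec_crc = true

    [object-reconstructor]
    write_legacy_ec_crc = true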

UpgradeImpact
=============
To ensure a smooth liberasurecode upgrade:

 1. Determine whether your cluster writes legacy or zlib CRCs. Depending
    on the order in which shared libraries are loaded, your servers may
    already be reading and writing zlib CRCs, even with old
    liberasurecode. In that case, no special action is required and
    WRITING LEGACY CRCS DURING THE UPGRADE WILL CAUSE AN OUTAGE.
    Just upgrade liberasurecode normally. See the closed bug for more
    information and a script to determine which CRC is used.
 2. On all nodes, ensure Swift is upgraded to a version that includes
    write_legacy_ec_crc support and write_legacy_ec_crc is enabled on
    all daemons.
 3. On each node, upgrade liberasurecode and restart Swift services.
    Because of (2), they will continue writing legacy CRCs which will
    still be readable by nodes that have not yet upgraded.
 4. Once all nodes are upgraded, remove the write_legacy_ec_crc option
    from all configs across all nodes. After restarting daemons, they
    will write zlib CRCs which will also be readable by all nodes.

Change-Id: Iff71069f808623453c0ff36b798559015e604c7d
Related-Bug: #1666320
Closes-Bug: #1886088
Depends-On: https://review.opendev.org/#/c/738959/
2020-09-30 16:49:59 -07:00
Tim Burke
9eb81f6e69 Allow replication servers to handle all request methods
Previously, the replication_server setting could take one of three
states:

 * If unspecified, the server would handle all available methods.
 * If "true", "yes", "on", etc. it would only handle replication
   methods (REPLICATE, SSYNC).
 * If any other value (including blank), it would only handle
   non-replication methods.

However, because SSYNC tunnels PUTs, POSTs, and DELETEs through
the same object-server app that's responding to SSYNC, setting
`replication_server = true` would break the protocol. This has
been the case ever since ssync was introduced.

Now, get rid of that second state -- operators can still set
`replication_server = false` as a principle-of-least-privilege guard
to ensure proxy-servers can't make replication requests, but replication
servers will be able to serve all traffic. This will allow replication
servers to be used as general internal-to-the-cluster endpoints, leaving
non-replication servers to handle client-driven traffic.
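
For example, an object server that only takes client-driven traffic from
the proxy might keep the least-privilege guard:

    [DEFAULT]
    # Refuse replication methods (REPLICATE, SSYNC) on this server.
    # Replication-dedicated servers can leave this unset and will now
    # serve all methods.
    replication_server = false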

Closes-Bug: #1446873
Change-Id: Ica2b41a52d11cb10c94fa8ad780a201318c4fc87
2020-07-23 09:11:07 -07:00
Clay Gerrard
4601548dab Deprecate per-service auto_create_account_prefix
If we move it to constraints it's more globally accessible in our code,
but more importantly it's more obvious to ops that everything breaks if
you try to mis-configure different values per-service.

Change-Id: Ib8f7d08bc48da12be5671abe91a17ae2b49ecfee
2020-01-05 09:53:30 -06:00
Tim Burke
39a54fecdc py3: add swift-dsvm-functional-py3 job
Note that keystone wants to stick some UTF-8 encoded bytes into
memcached, but we want to store it as JSON... or something?

Also, make sure we can hit memcache for containers with invalid UTF-8.
Although maybe it'd be better to catch that before we ever try memcache?

Change-Id: I1fbe133c8ec73ef6644ecfcbb1931ddef94e0400
2019-06-21 22:31:18 -07:00
Clay Gerrard
34bd4f7fa3 Clarify usage of dequeue_from_legacy option
Change-Id: Iae9aa7a91b9afc19cb8613b5bc31de463b853dde
2019-05-05 03:20:34 +00:00
Kazuhiro MIYAHARA
443f029a58 Enable to configure object-expirer in object-server.conf
To prepare for object-expirer's general task queue feature [1],
this patch makes it possible to configure the object-expirer in
object-server.conf. Object-expirer.conf can still be used in the same
manner as before, but is deprecated.

If a node has both an object-server.conf with an "object-expirer"
section and an object-expirer.conf, only object-server.conf is used.
Object-expirer.conf is used only if no object-server.conf has an
"object-expirer" section.

There are two differences between "object-expirer.conf" style and
"object-server.conf" style.

The first difference is the `dequeue_from_legacy` default value.
`dequeue_from_legacy` defines task queue mode. In "object-expirer.conf"
style, the default mode is legacy queue. In "object-server.conf" style,
the default mode is general queue. But general mode means no-op mode
for now, because general task queue is not implemented yet.

The second difference is the internal client config. In
"object-expirer.conf" style, the internal client's config file is the
object-expirer.conf itself. In "object-server.conf" style, the internal
client is configured in a separate file.
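
A sketch of the new style, adding an "object-expirer" section to
object-server.conf (options shown are illustrative):

    [object-expirer]
    # In this style the default is general task queue mode; set this
    # explicitly to keep draining the legacy expirer queue.
    dequeue_from_legacy = true
    interval = 300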

[1]: https://review.openstack.org/#/c/517389/

Co-Authored-By: Matthew Oliver <matt@oliver.net.au>

Change-Id: Ib21568f9b9d8547da87a99d65ae73a550e9c3230
2019-05-04 15:45:02 +00:00
Gilles Biannic
a4cc353375 Make log format for requests configurable
Add the log_msg_template option in proxy-server.conf and log_format in
a/c/o-server.conf. It is a string parsable by Python's format()
function. Some fields containing user data might be anonymized by using
log_anonymization_method and log_anonymization_salt.
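
A rough proxy-server.conf sketch; the template placeholders shown are
illustrative examples of format() fields, and the salt is a placeholder:

    [app:proxy-server]
    # Placeholder names shown for illustration only.
    log_msg_template = {client_ip} {method} {path} {status_int}
    log_anonymization_method = md5
    log_anonymization_salt = changeme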

Change-Id: I29e30ef45fe3f8a026e7897127ffae08a6a80cd9
2019-05-02 17:43:25 -06:00
Clay Gerrard
ea8e545a27 Rebuild frags for unmounted disks
Change the behavior of the EC reconstructor to perform a fragment
rebuild to a handoff node when a primary peer responds with 507 to the
REPLICATE request.

Each primary node in an EC ring will sync with exactly three primary
peers, in addition to the left & right nodes we now select a third node
from the far side of the ring.  If any of these partners respond
unmounted, the reconstructor will rebuild its fragments to a handoff
node with the appropriate index.

To prevent ssync (which is uninterruptible) receiving a 409 (Conflict)
we must give the remote handoff node the correct backend_index for the
fragments it will receive.  In the common case we will use
deterministically different handoffs for each fragment index to prevent
multiple unmounted primary disks from forcing a single handoff node to
hold more than one rebuilt fragment.

Handoff nodes will continue to attempt to revert rebuilt handoff
fragments to the appropriate primary until it is remounted or
rebalanced.  After a rebalance of EC rings (potentially removing
unmounted/failed devices), it's most IO efficient to run in
handoffs_only mode to avoid unnecessary rebuilds.
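
For that post-rebalance scenario, the reconstructor can temporarily be run
with the following (remember to turn it back off afterwards):

    [object-reconstructor]
    # Temporarily skip sync jobs and only revert handoff partitions.
    handoffs_only = True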

Closes-Bug: #1510342

Change-Id: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec
2019-02-08 18:04:55 +00:00
FatemaKhalid
cfeb32c66b Adding keep_idle config value to socket
Users can configure the KEEPIDLE time for sockets in TCP connections.
The default value is the old value, which is 600.
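
For example:

    [DEFAULT]
    # TCP keepidle, in seconds, for sockets accepted by the server.
    keep_idle = 600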

Change-Id: Ib7fb166deb8a87ae4e97ba0671048b1ec079a2ef
Closes-Bug:1759606
2018-09-15 01:30:53 +02:00
Samuel Merritt
d5c532a94e object-updater: add concurrent updates
The object updater now supports two configuration settings:
"concurrency" and "updater_workers". The latter controls how many
worker processes are spawned, while the former controls how many
concurrent container updates are performed by each worker
process. This should speed the processing of async_pendings.

There is a change to the semantics of the configuration
options. Previously, "concurrency" controlled the number of worker
processes spawned, and "updater_workers" did not exist. I switched the
meanings for consistency with other configuration options. In the
object reconstructor, object replicator, object server, object
expirer, container replicator, container server, account replicator,
account server, and account reaper, "concurrency" refers to the number
of concurrent tasks performed within one process (for reference, the
container updater and object auditor use "concurrency" to mean number
of processes).

On upgrade, a node configured with concurrency=N will still handle
async updates N-at-a-time, but will do so using only one process
instead of N.

UpgradeImpact:

If you have a config file like this:

    [object-updater]
    concurrency = <N>

and you want to take advantage of faster updates, then do this:

    [object-updater]
    concurrency = 8  # the default; you can omit this line
    updater_workers = <N>

If you want updates to be processed exactly as before, do this:

    [object-updater]
    concurrency = 1
    updater_workers = <N>

Change-Id: I17e18088e61f664e1b9942d66423666d0cae1689
2018-06-13 17:39:34 -07:00
Thiago da Silva
36dbd38e48 Add s3api headers to allowed_headers by default
Previously, these headers had to be added by operators to their
object-server.conf when enabling swift3 middleware. Since s3api
is now imported into swift we should go ahead and add these headers
by default too.

Change-Id: Ib82e175096716e42aecdab48f01f079e09da6a1d
Signed-off-by: Thiago da Silva <thiago@redhat.com>
2018-05-29 16:02:50 -04:00
Zuul
3313392462 Merge "Import swift3 into swift repo as s3api middleware" 2018-04-30 16:00:56 +00:00
Kota Tsuyuzaki
636b922f3b Import swift3 into swift repo as s3api middleware
This attempts to import the openstack/swift3 package into the swift upstream
repository and namespace. This is mostly a straightforward port, except for
the following items.

1. Rename swift3 namespace to swift.common.middleware.s3api
1.1 Also rename some conflicting class names (e.g. Request/Response)

2. Port unittests to test/unit/s3api dir to be able to run on the gate.

3. Port functests to test/functional/s3api and setup in-process testing

4. Port docs to doc dir, then address the namespace change.

5. Use get_logger() instead of global logger instance

6. Avoid global conf instance

Also fix various minor issues in those steps (e.g. packages, dependencies,
  deprecated things).

The details and patch references in the work on feature/s3api are listed
at https://trello.com/b/ZloaZ23t/s3api (completed board)

Note that, because this is just a port, no new features have been developed
since the last swift3 release; in future work, Swift upstream may continue
working on the remaining items for further improvements and the best
compatibility with Amazon S3. Please read the new docs for your deployment
and keep track of what will change in future releases.

Change-Id: Ib803ea89cfee9a53c429606149159dd136c036fd
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
2018-04-27 15:53:57 +09:00
Samuel Merritt
c28004deb0 Multiprocess object replicator
Add a multiprocess mode to the object replicator. Setting the
"replicator_workers" setting to a positive value N will result in the
replicator using up to N worker processes to perform replication
tasks.

At most one worker per disk will be spawned, so one can set
replicator_workers=99999999 to always get one worker per disk
regardless of the number of disks in each node. This is the same
behavior that the object reconstructor has.
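
For example (worker count illustrative):

    [object-replicator]
    # Spawn up to this many worker processes, at most one per disk.
    replicator_workers = 4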

Worker process logs will have a bit of information prepended so
operators can tell which messages came from which worker. It looks
like this:

  [worker 1/2 pid=16529] 154/154 (100.00%) partitions replicated in 1.02s (150.87/sec, 0s remaining)

The prefix is "[worker M/N pid=P] ", where M is the worker's index, N
is the total number of workers, and P is the process ID. Every message
from the replicator's logger will have the prefix; this includes
messages from down in diskfile, but does not include things printed to
stdout or stderr.

Drive-by fix: don't dump recon stats when replicating only certain
policies. When running the object replicator with replicator_workers >
0 and "--policies=X,Y,Z", the replicator would update recon stats
after running. Since it only ran on a subset of objects, it should not
update recon, much like it doesn't update recon when run with
--devices or --partitions.

Change-Id: I6802a9ad9f1f9b9dafb99d8b095af0fdbf174dc5
2018-04-24 04:05:08 +00:00
Samuel Merritt
f64c00b00a Improve object-updater's stats logging
The object updater has five different stats, but its logging only told
you two of them (successes and failures), and it only told you after
finishing all the async_pendings for a device. If you have a cluster
that's been sick and has millions upon millions of async_pendings
laying around, then your object-updaters are frustratingly
silent. I've seen one cluster with around 8 million async_pendings per
disk where the object-updaters only emitted stats every 12 hours.

Yes, if you have StatsD logging set up properly, you can go look at
your graphs and get real-time feedback on what it's doing. If you
don't have that, all you get is a frustrating silence.

Now, the object updater tells you all of its stats (successes,
failures, quarantines due to bad pickles, unlinks, and errors), and it
tells you incremental progress every five minutes. The logging at the
end of a pass remains and has been expanded to also include all stats.

Also included is a small change to what counts as an error: unmounted
drives no longer do. The goal is that only abnormal things count as
errors, like permission problems, malformed filenames, and so
on. These are things that should never happen, but if they do, may
require operator intervention. Drives fail, so logging an error upon
encountering an unmounted drive is not useful.

Change-Id: Idbddd507f0b633d14dffb7a9834fce93a10359ab
2018-01-17 13:59:23 -08:00
Romain LE DISEZ
e199192cae Replace replication_one_per_device by custom count
This commit replaces the boolean replication_one_per_device with an
integer, replication_concurrency_per_device. The new configuration
parameter is passed to utils.lock_path(), which now accepts as an
argument a limit for the number of locks that can be acquired for a
specific path.

Instead of trying to lock path/.lock, utils.lock_path() now tries to lock
files path/.lock-X, where X is in the range (0, N), N being the limit for
the number of locks allowed for the path. The default value of limit is
set to 1.
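
A sketch of the object-server setting that replaces the old boolean:

    [app:object-server]
    # Maximum number of concurrent replication requests per device;
    # replaces replication_one_per_device.
    replication_concurrency_per_device = 1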

Change-Id: I3c3193344c7a57a8a4fc7932d1b10e702efd3572
2017-10-24 16:17:41 +01:00
shangxiaobj
c93c0c0c6e [Trivialfix]Fix typos in swift
Fix typos found in swift.

Change-Id: I52fad1a4882cec4456f22174b46d54e42ec66d97
2017-08-04 07:50:10 +00:00