BaseDiskFileWriter will track md5 and expose upload_size, etag via the
chunks_finished method.
The BaseDiskFileReader already tracks the md5/etag via _iter_etag; for
parity we add a _chunks_etag to the BaseDiskFileWriter.
Instead of returning the upload_size and hexdigest on every call to
write, we return the tuple from chunks_finished.
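As a rough usage sketch (a toy stand-in for illustration, not Swift's
actual BaseDiskFileWriter), the tracking described above amounts to:

    import hashlib

    class SketchWriter(object):
        """Toy illustration of the described md5/size tracking; an
        assumption-simplified stand-in, not Swift's real code."""

        def __init__(self):
            self._md5 = hashlib.md5()
            self._upload_size = 0

        def write(self, chunk):
            # no longer returns (upload_size, hexdigest) on every call
            self._md5.update(chunk)
            self._upload_size += len(chunk)

        def chunks_finished(self):
            return self._upload_size, self._md5.hexdigest()

    w = SketchWriter()
    for chunk in (b'hello ', b'world'):
        w.write(chunk)
    upload_size, etag = w.chunks_finished()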
Change-Id: I26c58719cff5fde941d0248c250a0204e0379ae5
DiskFile already exposes a reader method that creates the DiskFileReader
instance. Add a writer method for parity.
DiskFile currently only provides a context manager create - that will
open and close the DiskFileWriter. Add explicit open and close methods
to support more flexibility in how callers manage life-cycle on their
DiskFileWriter instances.
DiskFile confusingly manages some state for DiskFileWriter (e.g. fd,
tmppath, use_linkat). Encapsulate the DiskFileWriter state to improve
readability and reduce coupling (e.g. put_succeeded).
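A minimal sketch of the intended life-cycle split, assuming much
simplified stand-ins rather than the real DiskFile/DiskFileWriter
classes:

    import contextlib
    import os
    import tempfile

    class SketchDiskFileWriter(object):
        def open(self):
            # the writer, not the DiskFile, owns fd/tmppath state
            self._fd, self._tmppath = tempfile.mkstemp()
            return self

        def close(self):
            os.close(self._fd)
            os.unlink(self._tmppath)

    class SketchDiskFile(object):
        def writer(self):
            # parity with the existing DiskFile.reader() factory
            return SketchDiskFileWriter()

        @contextlib.contextmanager
        def create(self):
            # the existing context manager becomes a thin wrapper
            w = self.writer().open()
            try:
                yield w
            finally:
                w.close()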
Change-Id: If18e0041680470a9c57a08e9ea9327acba8593df
...which helps us better differentiate in log messages between a drive
that's not mounted and one that's not a directory. We were already doing
that a bit in diskfile.py, and it seems like a useful distinction; let's
do it more.
While we're at it, remove some log translations.
Related-Change: I941ffbc568ebfa5964d49964dc20c382a5e2ec2a
Related-Change: I3362a6ebff423016bb367b4b6b322bb41ae08764
Change-Id: Ife0d34f9482adb4524d1ab1fe6c335c6b287c2fd
Partial-Bug: 1674543
When we're pushing data to a remote node using ssync, we end up
walking the entire partition's directory tree. We were already
removing reclaimable (i.e. old) tombstones and non-durable EC data
files plus their containing hash dirs, but we were leaving the suffix
dirs around for future removal, and we weren't cleaning up partition
dirs at all. Now we remove as much of the directory structure as we
can, even up to the partition dir, as soon as we observe that it's
empty.
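A hedged sketch of that pruning (an assumption-simplified helper, not
the actual ssync/diskfile code):

    import errno
    import os

    def prune_empty_dirs(hash_dir, partition_dir):
        """Remove now-empty directories from hash_dir upward, stopping at
        (and including) partition_dir, or as soon as a dir isn't empty."""
        path = hash_dir
        while True:
            try:
                os.rmdir(path)  # only succeeds if the directory is empty
            except OSError as e:
                if e.errno in (errno.ENOTEMPTY, errno.ENOENT):
                    return      # in use (or already gone); stop here
                raise
            if path == partition_dir:
                return
            path = os.path.dirname(path)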
Change-Id: I2849a757519a30684646f3a6f4467c21e9281707
Closes-Bug: 1706321
auditor.py currently relies on POLICY[0] object_audit_location_generator()
to yield an AuditLocation for all policies on the object-server.
The changes in this patch are:
- object_audit_location_generator() yields AuditLocation only for the
requested policy
- audit_all_objects() calls object_audit_location_generator() for each
policy (see the sketch below)
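A hedged sketch of the resulting audit loop; the helper names and the
per-policy datadir layout ('objects', 'objects-1', ...) are illustrative
assumptions, not the exact Swift signatures:

    import os

    def audit_locations_for_policy(devices_dir, datadir):
        """Toy per-policy location generator."""
        for device in sorted(os.listdir(devices_dir)):
            policy_dir = os.path.join(devices_dir, device, datadir)
            if not os.path.isdir(policy_dir):
                continue
            for partition in os.listdir(policy_dir):
                yield os.path.join(policy_dir, partition)

    def audit_all_objects(devices_dir, policy_datadirs):
        # loop over every policy's datadir instead of only the
        # policy-0 layout
        for datadir in policy_datadirs:
            for location in audit_locations_for_policy(devices_dir,
                                                       datadir):
                pass  # audit this AuditLocation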
Closes-Bug: 1534212
Change-Id: Ida92ba0a5e1e486a4f7132c6539460b38c34ec87
Bring under test
- test/unit/cli/test_dispersion_report.py
- test/unit/cli/test_info.py and
- test/unit/cli/test_relinker.py
I've verified that swift-*-info (at least) behave reasonably under
py3, even swift-object-info when there's non-utf8 metadata on the
data/meta file.
Change-Id: Ifed4b8059337c395e56f5e9f8d939c34fe4ff8dd
There's a window between when we list the files on disk and when we actually
try to open the .data file where another process could delete it. That should
raise a DiskFileNotExist error, not an IOError.
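A sketch of the described translation (assuming a simplified stand-in
for swift.common.exceptions.DiskFileNotExist):

    import errno

    class DiskFileNotExist(Exception):
        """Stand-in for swift.common.exceptions.DiskFileNotExist."""

    def open_data_file(path):
        # a .data file deleted between the listdir and the open should
        # surface as DiskFileNotExist, not as a bare IOError
        try:
            return open(path, 'rb')
        except IOError as e:
            if e.errno == errno.ENOENT:
                raise DiskFileNotExist(path)
            raise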
Change-Id: I1b43fef35949cb6f71997874e4e6b7646eeec8fe
Closes-Bug: 1712662
Since Python 2.7, unittest in the standard library has included multiple
facilities for skipping tests, via decorators as well as an exception.
Switch to that directly, rather than importing nose.
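For reference, the standard-library facilities in question (a small
illustrative example, not taken from Swift's tests):

    import sys
    import unittest

    class ExampleTests(unittest.TestCase):

        @unittest.skip('always skipped')
        def test_skip_by_decorator(self):
            pass

        @unittest.skipIf(sys.platform == 'win32', 'POSIX-only behaviour')
        def test_skip_conditionally(self):
            pass

        def test_skip_by_exception(self):
            raise unittest.SkipTest('decided at runtime')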
Change-Id: I4009033473ea24f0d0faed3670db844f40051f30
Currently, our integrity checking for objects is pretty weak when it
comes to object metadata. If the extended attributes on a .data or
.meta file get corrupted in such a way that we can still unpickle it,
we don't have anything that detects that.
This could be especially bad with encrypted etags; if the encrypted
etag (X-Object-Sysmeta-Crypto-Etag or whatever it is) gets some bits
flipped, then we'll cheerfully decrypt the cipherjunk into plainjunk,
then send it to the client. Net effect is that the client sees a GET
response with an ETag that doesn't match the MD5 of the object *and*
Swift has no way of detecting and quarantining this object.
Note that, with an unencrypted object, if the ETag metadatum gets
mangled, then the object will be quarantined by the object server or
auditor, whichever notices first.
As part of this commit, I also ripped out some mocking of
getxattr/setxattr in tests. It appears to be there to allow unit tests
to run on systems where /tmp doesn't support xattrs. However, since
the mock is keyed off of inode number and inode numbers get re-used,
there's lots of leakage between different test runs. On a real FS,
unlinking a file and then creating a new one of the same name will
also reset the xattrs; this isn't the case with the mock.
The mock was pretty old; Ubuntu 12.04 and up all support xattrs in
/tmp, and recent Red Hat / CentOS releases do too. The xattr mock was
added in 2011; maybe it was to support Ubuntu Lucid Lynx?
Bonus: now you can pause a test with the debugger, inspect its files
in /tmp, and actually see the xattrs along with the data.
Since this patch now uses a real filesystem for testing filesystem
operations, tests are skipped if the underlying filesystem does not
support setting xattrs (e.g. tmpfs, or more than 4k of xattrs on ext4).
References to "/tmp" have been replaced with calls to
tempfile.gettempdir(). This will allow setting the TMPDIR envvar in
test setup and getting an XFS filesystem instead of ext4 or tmpfs.
THIS PATCH SIGNIFICANTLY CHANGES TESTING ENVIRONMENTS
With this patch, every test environment will require TMPDIR to be
using a filesystem that supports at least 4k of extended attributes.
Neither ext4 nor tmpfs supports this. XFS is recommended.
So why all the SkipTests? Why not simply raise an error? We still need
the tests to run on the base image for OpenStack's CI system. Since
we were previously mocking out xattr, there wasn't a problem, but we
also weren't actually testing anything. This patch adds functionality
to validate xattr data, so we need to drop the mock.
`test.unit.skip_if_no_xattrs()` is also imported into `test.functional`
so that functional tests can import it from the functional test
namespace.
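A hedged sketch of such a helper (assuming the third-party xattr module;
the real test.unit helper may differ in detail):

    import unittest
    from tempfile import NamedTemporaryFile, gettempdir

    import xattr  # assumption: the third-party module Swift uses

    def skip_if_no_xattrs():
        with NamedTemporaryFile() as tf:
            try:
                # 4k of xattr data is what the tests need; ext4 and
                # tmpfs can't hold it, XFS can
                xattr.setxattr(tf.name, 'user.test.key1', b'0' * 4096)
            except (IOError, OSError) as e:
                raise unittest.SkipTest(
                    'Filesystem at %s does not support 4k of xattrs: %s'
                    % (gettempdir(), e))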
The related OpenStack CI infrastructure changes are made in
https://review.openstack.org/#/c/394600/.
Co-Authored-By: John Dickinson <me@not.mn>
Change-Id: I98a37c0d451f4960b7a12f648e4405c6c6716808
We added check_drive to the account/container servers to unify how all
the storage wsgi servers treat device dirs/mounts. This pushes that
unification down into the consistency engine.
Drive-by:
* use FakeLogger less
* clean up some repetition in the probe utility for device re-"mounting"
Related-Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
Change-Id: I941ffbc568ebfa5964d49964dc20c382a5e2ec2a
use assertRaises where applicable; use the with_tempdir
decorator
Related-Change: I3c3193344c7a57a8a4fc7932d1b10e702efd3572
Change-Id: Ie83584fc22a5ec6e0a568e39c90caf577da29497
This commit replaces the boolean replication_one_per_device with an
integer replication_concurrency_per_device. The new configuration
parameter is passed to utils.lock_path(), which now accepts as an
argument a limit for the number of locks that can be acquired for a
specific path.
Instead of trying to lock path/.lock, utils.lock_path() now tries to lock
files path/.lock-X, where X is in the range [0, N), N being the limit for
the number of locks allowed for the path. The default value of the limit
is 1.
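A hedged sketch of the described behaviour (a simplified analogue of
utils.lock_path, not the real implementation, which blocks with a
timeout rather than failing immediately):

    import contextlib
    import errno
    import fcntl
    import os

    @contextlib.contextmanager
    def lock_path_sketch(path, limit=1):
        fd = None
        try:
            for i in range(limit):
                candidate = os.open(os.path.join(path, '.lock-%d' % i),
                                    os.O_WRONLY | os.O_CREAT)
                try:
                    fcntl.flock(candidate, fcntl.LOCK_EX | fcntl.LOCK_NB)
                    fd = candidate
                    break  # got slot i
                except (IOError, OSError) as e:
                    os.close(candidate)
                    if e.errno not in (errno.EAGAIN, errno.EACCES):
                        raise
            if fd is None:
                raise IOError('all %d lock slots under %s are held'
                              % (limit, path))
            yield
        finally:
            if fd is not None:
                os.close(fd)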
Change-Id: I3c3193344c7a57a8a4fc7932d1b10e702efd3572
You can't modify the X-Static-Large-Object metadata with a POST; an
object being an SLO is a property of the .data file. Revert the change
from 4500ff which attempts to correctly handle X-Static-Large-Object
metadata on a POST, but is subject to a race if the most recent SLO
.data isn't available during the POST. Instead, this change adjusts the
reading of metadata such that the X-Static-Large-Object metadata is
always preserved from the datafile's metadata and bleeds through
any .meta.
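A hedged sketch of the adjusted read path (an assumption-simplified
merge, not the exact diskfile code):

    def merge_metadata_sketch(datafile_meta, metafile_meta):
        # .meta contents win, except X-Static-Large-Object, which always
        # comes from the .data file because being an SLO is a property
        # of the .data file
        merged = dict(metafile_meta)
        if 'X-Static-Large-Object' in datafile_meta:
            merged['X-Static-Large-Object'] = \
                datafile_meta['X-Static-Large-Object']
        else:
            merged.pop('X-Static-Large-Object', None)
        return merged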
Closes-bug: #1453807
Closes-bug: #1634723
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
Change-Id: Ie48a38442559229a2993443ab0a04dc84717ca59
For test purposes (e.g. SAIO probe tests), even if mount_check is False,
still require check_dir for account/container server storage when real
mount points are not used.
This behavior is consistent with the object-server's checks in diskfile.
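A hedged sketch of the intended check (an assumption-simplified analogue
of the constraints helpers, not the real code):

    import os

    def check_storage_sketch(root, drive, mount_check):
        path = os.path.join(root, drive)
        if mount_check:
            # production: the device dir must be a real mount point
            return os.path.ismount(path)
        # test setups (e.g. SAIO): no mount required, but the directory
        # must at least exist
        return os.path.isdir(path)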
Co-Author: Clay Gerrard <clay.gerrard@gmail.com>
Related lp bug #1693005
Related-Change-Id: I344f9daaa038c6946be11e1cf8c4ef104a09e68b
Depends-On: I52c4ecb70b1ae47e613ba243da5a4d94e5adedf2
Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
This patch adds methods to increase the partition power of an existing
object ring without downtime for users, using a 3-step process. Data
won't be moved to other nodes; objects using the new increased partition
power will be located on the same device and are hardlinked to avoid
data movement.
1. A new setting "next_part_power" will be added to the rings; once
the proxy server has reloaded the rings it will send this value to the
object servers on any write operation. Object servers will then create a
hard link in the new location to the original DiskFile object. Already
existing data will be relinked to the new locations by a new tool, using
hard links (sketched after this list).
2. The actual partition power itself will be increased. Servers will now
use the new partition power for reads and writes. Hard links in the old
object locations that are no longer required must now be removed by the
relinker tool; the relinker tool reads the next_part_power setting
to find object locations that need to be cleaned up.
3. The "next_part_power" flag will be removed.
This mostly implements the spec in [1]; however it's not using an
"epoch" as described there. The idea of the epoch was to store data
using different partition powers in their own namespace to avoid
conflicts with auditors and replicators as well as being able to abort
such an operation and just remove the new tree. This would require heavy
changes to the on-disk data layout, and other object-server
implementations would be required to adopt this scheme too.
Instead the object-replicator is now aware that there is a partition
power increase in progress and will skip replication of data in that
storage policy; the relinker tool should simply be run and afterwards
the partition power increased. This shouldn't take much time (it's only
walking the filesystem and hard-linking); the impact should therefore be
low. The relinker should be run on all storage nodes at the same time in
parallel to decrease the required time (though this is not mandatory).
Failures during relinking should not affect cluster operations -
relinking can even be aborted manually and restarted later.
Auditors do not quarantine objects written to a path with a different
partition power and therefore work as before (though in the worst case
they read each object twice before the no-longer-needed hard links are
removed).
Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
[1] https://specs.openstack.org/openstack/swift-specs/specs/in_progress/
increasing_partition_power.html
Change-Id: I7d6371a04f5c1c4adbb8733a71f3c177ee5448bb
Following OpenStack Style Guidelines:
[1] http://docs.openstack.org/developer/hacking/#unit-tests-and-assertraises
[H203] Unit test assertions tend to give better messages for more specific
assertions. As a result, assertIsNone(...) is preferred over
assertEqual(None, ...) and assertIs(..., None)
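For example (illustrative only, not taken from Swift's test suite):

    import unittest

    class ExampleTest(unittest.TestCase):
        def test_missing_header(self):
            headers = {}
            # preferred: the failure message reports the actual value
            self.assertIsNone(headers.get('X-Delete-At'))
            # discouraged by H203:
            #   self.assertEqual(None, headers.get('X-Delete-At'))
            #   self.assertIs(headers.get('X-Delete-At'), None)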
Change-Id: If4db8872c4f5705c1fff017c4891626e9ce4d1e4
Often, we want the current timestamp. May as well improve the ergonomics
a bit and provide a class method for it.
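A minimal sketch of the added convenience (assuming a much-simplified
Timestamp; the real class in swift.common.utils does far more):

    import time

    class TimestampSketch(object):
        def __init__(self, timestamp):
            self.timestamp = float(timestamp)

        @classmethod
        def now(cls):
            return cls(time.time())

    ts = TimestampSketch.now()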
Change-Id: I3581c635c094a8c4339e9b770331a03eab704074
Fix a situation where SSYNC would fail to replicate a valid object
because the datafile contains expired X-Delete-At information while a
metafile contains no X-Delete-At information. Example:
- 1454619054.02968.data => contains X-Delete-At: 1454619654
- 1454619056.04876.meta => does not contain X-Delete-At info
In this situation, the replicator tries to PUT the datafile, and then
to POST the metadata. Previously, if the receiver had the datafile but
the current time was greater than the X-Delete-At, it considered the
object expired and requested no updates from the sender, so the metafile
was never synced. If the receiver did not have the datafile then it did
request updates from the sender, but the ssync PUT subrequest was
refused if the current time was greater than the X-Delete-At (the
object is expired). If the datafile was transferred, the ssync POST
subrequest failed because the object did not exist (expired).
This commit allows the PUT and POST to work so that the object can be
replicated, by enabling the receiving object server to open expired
diskfiles when handling replication requests.
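A hedged sketch of the relaxed check (an assumption-simplified version
of the object server's expiry test):

    def is_hidden_by_expiry(metadata, now, is_replication_request):
        """Expired objects are normally hidden, but replication traffic
        is let through so the .meta can still be synced."""
        try:
            x_delete_at = int(metadata['X-Delete-At'])
        except (KeyError, ValueError):
            return False
        return x_delete_at <= now and not is_replication_request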
Closes-Bug: #1683689
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I919994ead2b20dbb6c5671c208823e8b7f513715
EC object metadata can currently have a mixture of bytestrings and
unicode. The ssync_sender.send_put() method raises a
UnicodeDecodeError when it attempts to concatenate the metadata
values, if any bytestring has non-ascii characters.
The root cause of this issue is that the object server uses unicode
for the keys of some object metadata items that are received in the
footer of an EC PUT request, whereas all other object metadata keys
and values are persisted as bytestrings.
This patch fixes the bug by changing the diskfile write_metadata()
function to encode all unicode metadata keys and values as utf8
encoded bytes before writing to disk. To cope with existing objects
that have a mixture of unicode and bytestring metadata, the diskfile
read_metadata() function is also changed so that all returned unicode
metadata keys and values are utf8 encoded. This ensures that
ssync_sender.send_put() (and any other caller of diskfile
read_metadata) only reads bytestrings from object metadata.
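A hedged sketch of the normalisation (assuming the six compatibility
library; the real diskfile helpers differ in detail):

    import six  # assumption: used only to keep this sketch py2/py3 neutral

    def encode_metadata_sketch(metadata):
        def utf8(value):
            if isinstance(value, six.text_type):
                return value.encode('utf8')
            return value
        return dict((utf8(k), utf8(v)) for k, v in metadata.items())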
Closes-Bug: #1678018
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: Ic23c55754ee142f6f5388dcda592a3afc9845c39
IMHO we shouldn't ever trust the invalidations file so much that we try
to skip a listdir when creating a hashes.pkl for the first time. Looking
back on the related patch and its related patches, there may be some
subtle races.
This just makes some assertions to help demonstrate we should maintain
the invariant of setting hashes to valid via listdir.
Change-Id: I767e34a405de7911e9596e038e58a9a29f57a8f8
Related-Change-Id: I08c8cf09282f737103e580c1f57923b399abe58c
Some public functions in the diskfile manager expect or return full
file paths, which implies a filesystem diskfile implementation.
To make it easier to plug alternate diskfile implementations, patch
functions to take more generic arguments.
This commit changes DiskFileManager _get_hashes() arguments from:
- partition_path, recalculate=None, do_listdir=False
to:
- device, partition, policy, recalculate=None, do_listdir=False
Callers are modified accordingly, in diskfile.py, reconstructor.py,
and replicator.py
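A hedged before/after view of a call site (the job dict keys shown here
are illustrative assumptions, not verbatim caller code):

    # before: the caller had to know about the on-disk layout
    hashed, hashes = diskfile_mgr._get_hashes(
        job['path'], do_listdir=do_listdir)

    # after: the caller speaks in device/partition/policy terms
    hashed, hashes = diskfile_mgr._get_hashes(
        job['device'], job['partition'], job['policy'],
        do_listdir=do_listdir)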
Change-Id: I8e2d7075572e466ae2fa5ebef5e31d87eed90fec
Some public functions in the diskfile manager expect or return full
file paths, which implies a filesystem diskfile implementation.
To make it easier to plug alternate diskfile implementations, patch
functions to take more generic arguments.
This commit changes DiskFileManager yield_hashes() returned values
from:
- object_path, object_hash, timestamps
to:
- object_hash, timestamps
object_path was not used by any caller.
Change-Id: I914fb1ec8ce7c9c26d22e1d07f03bd03f4504176
DiskFileMixin and DiskFileManagerMixin have almost
identical setUp() and tearDown() methods, and both
inherit BaseDiskFileTestMixin, so this moves the common
code into the abstract superclass.
Also remove repeated declarations of vars in
test_diskfile.py:run_quarantine_invalids
and a duplicate qualified import in obj/test_server.py
Change-Id: Id99ba151c7802c3b61e483a7e766bf6f2b2fe3df
Suffix hash invalidations in hashes.invalid can be lost when two
concurrent calls to get_hashes consolidate the hashes of a new
partition with no hashes.pkl:
- suffix S has been invalidated and is listed in hashes.invalid
- process X calls get_hashes when there is no existing hashes.pkl
- process X removes the hashes.invalid file in consolidate_hashes
- process X calculates the hash of suffix S (note, process X has
not yet written hashes.pkl)
- process Y invalidates suffix S, appends S to hashes.invalid, so the
hash of suffix S *should* be recalculated at some point
- process Z calls get_hashes->consolidate_hashes, deletes hashes.invalid
because there is still no hashes.pkl
- process Z fails
- process X writes hashes.pkl with stale hash value for suffix S
- the invalidation of S that was made by process Y is lost
The solution is to never remove hashes.invalid during consolidate_hashes
without first recording any invalid suffixes in hashes and writing hashes
to disk in hashes.pkl. This is already the behaviour when hashes.pkl
exists. The cost of an additional write to hashes.pkl, introduced by this
patch, is only incurred once, when get_hashes first encounters a
partition with no hashes.pkl.
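A hedged sketch of that invariant (an assumption-heavy simplification of
consolidate_hashes; the real code holds a partition lock and handles
more cases):

    import os
    import pickle

    def consolidate_hashes_sketch(partition_dir, hashes):
        invalidations_file = os.path.join(partition_dir, 'hashes.invalid')
        pkl_file = os.path.join(partition_dir, 'hashes.pkl')
        try:
            with open(invalidations_file) as fp:
                for line in fp:
                    suffix = line.strip()
                    if suffix:
                        hashes[suffix] = None  # mark for rehash
        except IOError:
            pass  # no invalidations recorded
        # write down what we know *before* discarding the journal, even
        # if this is the first hashes.pkl for the partition
        with open(pkl_file + '.tmp', 'wb') as fp:
            pickle.dump(hashes, fp)
        os.rename(pkl_file + '.tmp', pkl_file)
        with open(invalidations_file, 'wb'):
            pass  # truncate only after hashes.pkl is durable on disk
        return hashes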
Related-Change: Ia43ec2cf7ab715ec37f0044625a10aeb6420f6e3
Change-Id: I08c8cf09282f737103e580c1f57923b399abe58c
...and refactor two extremely similar tests to use a single
helper method - the only parameterization being the existence
or not of hashes.pkl at the start of the test.
Change-Id: I601218a9a031e7fc77bc53ea735e89700ec1647d
Related-Change: Ia43ec2cf7ab715ec37f0044625a10aeb6420f6e3
mtime and force_rewrite have a *long* tangled history starting back in
lp bug #1089140 that's been carried through many refactors.
Using force_rewrite on errors reading from the pickle has always been a
read-modify-write race; but maybe less bad than the infinite recursion
bug it fixed?
Using getmtime has always had somewhat dubious resolution for race
detection - the only way to be sure the content of the file is the same
as when we read it without locking is to open the file up and check.
Unfortunately, the on-disk data wasn't rich enough to disambiguate when
the on-disk state it represents may have changed (e.g. when an
invalidation for a suffix currently being hashed is consolidated, or if
all hashes are invalid, as after an error reading the hashes.pkl) - so
we also add a key with a timestamp for race detection and write down
whether the dictionary has any valid suffix hashes.
Along the way, we accidentally fix a serious performance regression with
hash invalidations...
We currently rehash all invalid suffixes twice on REPLICATE calls.
First we consolidate hashes, marking all invalid suffixes as None,
and then perform the first suffix rehashing.
Then, *every time*, immediately after finishing that first pass, we
throw all the work we just did on the floor and rehash ALL the invalid
suffixes *again*, because the race detector erroneously notices the
hashes.pkl file has been "modified while we were hashing".
But we're not in a race. We took the mtime before calling consolidate
hashes, and consolidate hashes modified the pickle when it wrote back the
invalid suffixes.
FWIW, since consolidate hashes operates under the directory lock it
can't race - but we don't want suffix rehashing to hold the directory
lock that long, so we use optimistic locking - i.e. we optimistically
perform the rehash w/o a lock and write back the update iff it hasn't
changed since read; if it has, we retry the whole operation.
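Schematically, the optimistic loop looks like this (a hedged sketch with
callables standing in for the real consolidate/rehash/write steps):

    def get_hashes_optimistic_sketch(read_state, rehash,
                                     write_if_unchanged):
        while True:
            token, hashes = read_state()   # consolidate under the dir lock
            hashes = rehash(hashes)        # the expensive part, unlocked
            if write_if_unchanged(token, hashes):
                return hashes              # nothing moved under us
            # state changed since we read it: retry the whole operation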
UpgradeImpact:
If you upgrade and need to roll back - delete all hashes.pkl:
rm /srv/node*/*/object*/*/hashes.pkl
Anything of significance achieved here was blatantly plagiarised from
the work of others:
Co-Author: Pavel Kvasnička <pavel.kvasnicka@firma.seznam.cz>
Related-Change-Id: I64cadb1a3feb4d819d545137eecfc295389794f0
Co-Author: Alistair Coles <alistair.coles@hpe.com>
Related-Change-Id: I8f6bb89beaaca3beec2e6063299189f52e9eee51
Related-Change-Id: I08c8cf09282f737103e580c1f57923b399abe58c
Change-Id: Ia43ec2cf7ab715ec37f0044625a10aeb6420f6e3
The assumption that we don't need to write an entry in the invalidations
file when the hashes.pkl does not exist turned out to be a premature
optimization and also wrong.
Primarily we should recognize that the creation of hashes.pkl is the
first thing that happens in a partition when it lands on a new primary.
The code should be optimized toward the assumption of the most common
disk state.
Also, in this case the extra stat calls to check whether the hashes.pkl
exists were not only un-optimized - they also introduced a race.
Consider the common case:
  proc 1                        | proc 2
  ------------------------------|---------------------------
  a) read then truncate journal |
  b) do work                    | c) append to journal
  d) apply "a" to index         |
The index written at "d" may not (yet) reflect the entry written by proc
2 at "c"; however, it's clearly in the journal, so it's easy to see we're
safe.
Adding in the extra stat call for the index existence check increases
the state which can affect correctness.
  proc 1                        | proc 2
  ------------------------------|---------------------------
  a) no index, truncate journal |
  b) do work                    | b) iff index exists
                                | c) append to journal
  d) apply (or create) index    |
If step "c" doesn't happen because the index does not yet exist - the
update is clearly lost.
In our case we'd skip marking a suffix as invalid when the hashes.pkl
does not exist because we know "the next time we rehash" we'll have to
os.listdir suffixes anyway. But if another process is *currently*
rehashing (and has already done its os.listdir), instead we've just
dropped an invalidation on the floor.
Don't do that.
Instead - write down the invalidation. The running rehash is welcome to
proceed on outdated information - as long as the next pass will grab and
hash the new suffix.
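A hedged sketch of the resulting rule (a simplified analogue of
diskfile's invalidate_hash):

    import os

    def invalidate_hash_sketch(partition_dir, suffix):
        invalidations_file = os.path.join(partition_dir, 'hashes.invalid')
        # no "does hashes.pkl exist?" check any more - that was the race
        with open(invalidations_file, 'ab') as fp:
            fp.write(suffix.encode('ascii') + b'\n')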
Known-Issue(s):
If the suffix already exists there's an even chance the running rehash
will hash in the very update for which we want to invalidate the suffix,
but that's OK; it's idempotent.
Co-Author: Pavel Kvasnička <pavel.kvasnicka@firma.seznam.cz>
Co-Author: Alistair Coles <alistair.coles@hpe.com>
Co-Author: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
Related-Change-Id: I64cadb1a3feb4d819d545137eecfc295389794f0
Change-Id: I2b48238d9d684e831d9777a7b18f91a3cef57cd1
Closes-Bug: #1651530
Trivial follow-up to the related change to verify the logger warning
with both starting conditions.
Related-Change: If712e4431322df5c3e84808ab2d815fd06c76426
Change-Id: Id8e87a1d27d27d8a5e77172396105946a6804336
REPLICATE calls where everything is in sync and no suffixes have been
invalidated are supposed to be pretty common and fairly cheap. If an
invalidations file is empty, there's no need to perform a truncating
write operation which will presumably at some point have to be flushed.
Everyone owes Pavel a quarter for the one billion fewer write ops in prod
... and Matt a nickel for the 20ms of not sleeping we get back every
unittest run.
As a drive-by, I remove a crufty, confusing OSError exception handler
around an open call using O_CREAT in a directory that it either just
created or just opened a file from - it wasn't going to raise ENOENT.
Similarly, rather than lose sleep trying to reason about all the crazy
exceptions that actually *could* pop anywhere in this method, I instead
improve the logging where any such exception would be caught. This way
we get the details we need to focus only on errors that actually happen.
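The short-circuit amounts to something like this (a hedged sketch, not
the literal diskfile code):

    import os

    def maybe_consolidate_sketch(invalidations_file):
        if os.path.getsize(invalidations_file) == 0:
            # nothing has been invalidated: skip the truncating write
            # that would otherwise have to be flushed
            return False
        # ... otherwise fold the listed suffixes into hashes.pkl and
        # truncate the invalidations file
        return True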
Author: Pavel Kvasnička <pavel.kvasnicka@firma.seznam.cz>
Co-Author: Matthew Oliver <matt@oliver.net.au>
Related-Change-Id: I64cadb1a3feb4d819d545137eecfc295389794f0
Change-Id: If712e4431322df5c3e84808ab2d815fd06c76426
An existing test in diskfile established prior art for a pattern to
create a diskfile with a different suffix - I'd like to make use of it
in new tests in multiple unrelated change sets.
Also add a test to demonstrate some existing robustness and prevent
regression.
Author: Pavel Kvasnička <pavel.kvasnicka@firma.seznam.cz>
Co-Author: Alistair Coles <alistair.coles@hpe.com>
Related-Change-Id: I64cadb1a3feb4d819d545137eecfc295389794f0
Change-Id: I3a661fae5c7cfeb2dbcdb7f46941f55244d0b9ad
The reclaim_age is a DiskFile option; it doesn't make sense for two
different object services or nodes to use different values.
As a drive-by, I also clean up the reclaim_age plumbing from get_hashes
to cleanup_ondisk_files, since it's a method on the Manager and has
access to the configured reclaim_age. This fixes a bug where finalize_put
wouldn't use the [DEFAULT]/object-server configured reclaim_age - which
is normally benign but leads to weird behavior on DELETE requests with a
really small reclaim_age.
There are a couple of places in the replicator and reconstructor that
reach into their manager to borrow the reclaim_age when emptying out
the aborted PUTs that failed to clean up their files in tmp - but that
timeout doesn't really need to be coupled with reclaim_age, and that
method could just as reasonably have been implemented on the Manager.
UpgradeImpact: Previously the reclaim_age was documented to be
configurable in various object-* services config sections, but that did
not work correctly unless you also configured the option for the
object-server because of REPLICATE request rehash cleanup. All object
services must use the same reclaim_age. If you require a non-default
reclaim age it should be set in the [DEFAULT] section. If there are
different non-default values, the greater should be used for all object
services and configured only in the [DEFAULT] section.
If you specify a reclaim_age value in any object-related config you
should move it to *only* the [DEFAULT] section before you upgrade. If
you configure a reclaim_age less than your consistency window you are
likely to be eaten by a Grue.
Closes-Bug: #1626296
Change-Id: I2b9189941ac29f6e3be69f76ff1c416315270916
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
These are just tests to demonstrate the current behavior - and provide
context for the behavioral differences/consistency in the related change.
See the related change for context.
The related change slightly modifies the outcome of the
reclaim/REPLICATE test.
The related change has no effect on the config handling.
The related change greatly simplifies the reclaim_age option
cleanup_ondisk_files plumbing tests.
Related-Change-Id: I2b9189941ac29f6e3be69f76ff1c416315270916
Co-Author: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I5b5f90bb898a335e6336f043710a05a44e3b810f