164 Commits

Author SHA1 Message Date
Jenkins
83b62b4f39 Merge "Add Timestamp.now() helper" 2017-07-18 03:27:50 +00:00
Jenkins
e94b383655 Merge "Add support to increase object ring partition power" 2017-07-05 14:40:42 +00:00
Christian Schwede
e1140666d6 Add support to increase object ring partition power
This patch adds methods to increase the partition power of an existing
object ring without downtime for users, using a 3-step process. Data
won't be moved to other nodes; objects using the new, increased
partition power will be located on the same device and are hard-linked
to avoid data movement.

1. A new setting "next_part_power" will be added to the rings, and once
the proxy server has reloaded the rings it will send this value to the
object servers on any write operation. Object servers will then create a
hard link in the new location to the original DiskFile object. Existing
data will be relinked into the new locations, again using hard links, by
a new relinker tool.

2. The actual partition power itself will be increased. Servers will now
use the new partition power to read from and write to. Hard links in
the old object locations that are no longer required are then removed
by the relinker tool; it reads the next_part_power setting to find
object locations that need to be cleaned up.

3. The "next_part_power" flag will be removed.

This mostly implements the spec in [1]; however, it does not use an
"epoch" as described there. The idea of the epoch was to store data
using different partition powers in their own namespaces, to avoid
conflicts with auditors and replicators, and to make it possible to
abort such an operation and simply remove the new tree. That would
require a substantial change to the on-disk data layout, and other
object-server implementations would be required to adopt the scheme too.

Instead, the object-replicator is now aware that a partition power
increase is in progress and will skip replication of data in that
storage policy; the relinker tool is simply run, and afterwards the
partition power is increased. This shouldn't take much time (it's only
walking the filesystem and hardlinking), so the impact should be low.
The relinker should be run on all storage nodes at the same time, in
parallel, to decrease the required time (though this is not mandatory).
Failures during relinking should not affect cluster operations -
relinking can even be aborted manually and restarted later.

Auditors do not quarantine objects written to a path with a different
partition power, and therefore keep working as before (though in the
worst case they read each object twice before the no-longer-needed hard
links are removed).

Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>

[1] https://specs.openstack.org/openstack/swift-specs/specs/in_progress/
increasing_partition_power.html

Change-Id: I7d6371a04f5c1c4adbb8733a71f3c177ee5448bb
2017-06-15 15:08:48 -07:00
lingyongxu
ee9458a250 Using assertIsNone() instead of assertEqual(None)
Following OpenStack Style Guidelines:
[1] http://docs.openstack.org/developer/hacking/#unit-tests-and-assertraises
[H203] Unit test assertions tend to give better messages for more specific
assertions. As a result, assertIsNone(...) is preferred over
assertEqual(None, ...) and assertIs(..., None)
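
A minimal illustration of the rule:

    import unittest

    class ExampleTest(unittest.TestCase):
        def test_missing_key(self):
            result = {}.get('missing')   # evaluates to None
            self.assertIsNone(result)    # preferred per H203
            # discouraged equivalents:
            # self.assertEqual(None, result)
            # self.assertIs(result, None)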

Change-Id: If4db8872c4f5705c1fff017c4891626e9ce4d1e4
2017-06-07 14:05:53 +08:00
Jenkins
45b17d89c7 Merge "Fix SSYNC failing to replicate unexpired object" 2017-05-31 22:49:55 +00:00
Tim Burke
85d6cd30be Add Timestamp.now() helper
Often, we want the current timestamp. May as well improve the ergonomics
a bit and provide a class method for it.
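
A minimal sketch of the helper (assuming Timestamp wraps a float of
seconds since the epoch, as swift.common.utils.Timestamp does):

    import time

    class Timestamp(object):
        def __init__(self, timestamp):
            self.timestamp = float(timestamp)

        @classmethod
        def now(cls):
            # reads better at call sites than Timestamp(time.time())
            return cls(time.time())

    ts = Timestamp.now()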

Change-Id: I3581c635c094a8c4339e9b770331a03eab704074
2017-04-27 14:19:00 -07:00
Romain LE DISEZ
38d35797df Fix SSYNC failing to replicate unexpired object
Fix a situation where SSYNC would fail to replicate a valid object
because the datafile contains expired X-Delete-At information while a
metafile contains no X-Delete-At information. Example:
 - 1454619054.02968.data => contains X-Delete-At: 1454619654
 - 1454619056.04876.meta => does not contain X-Delete-At info

In this situation, the replicator tries to PUT the datafile, and then
to POST the metadata. Previously, if the receiver had the datafile but
the current time was greater than the X-Delete-At, it considered the
object to be expired and requested no updates from the sender, so the
metafile was never synced. If the receiver did not have the datafile,
it did request updates from the sender, but the ssync PUT subrequest
was refused if the current time was greater than the X-Delete-At (the
object is expired). If the datafile was transferred, the ssync POST
subrequest failed because the object did not exist (expired).

This commit allows PUT and POST to work so that the object can be
replicated, by enabling the receiver object server to open expired
diskfiles when handling replication requests.
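
A hedged sketch of the idea (names illustrative, not Swift's actual
API): the expiry check is skipped when the request comes from
replication.

    import time

    class DiskFileExpired(Exception):
        pass

    def open_diskfile(metadata, is_replication_request=False):
        delete_at = int(metadata.get('X-Delete-At', 0))
        if delete_at and delete_at <= time.time() \
                and not is_replication_request:
            # client-facing requests still treat the object as gone
            raise DiskFileExpired()
        return metadata  # stand-in for the opened diskfile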

Closes-Bug: #1683689
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I919994ead2b20dbb6c5671c208823e8b7f513715
2017-04-26 11:29:40 +01:00
Jenkins
cce5482bd8 Merge "Fix encoding issue in ssync_sender.send_put()" 2017-04-19 19:37:38 +00:00
Jenkins
88bca22549 Merge "Follow up tests for get_hashes regression" 2017-04-19 17:32:54 +00:00
Romain LE DISEZ
091157fc7f Fix encoding issue in ssync_sender.send_put()
EC object metadata can currently have a mixture of bytestrings and
unicode.  The ssync_sender.send_put() method raises a
UnicodeDecodeError when it attempts to concatenate the metadata
values, if any bytestring has non-ascii characters.

The root cause of this issue is that the object server uses unicode
for the keys of some object metadata items that are received in the
footer of an EC PUT request, whereas all other object metadata keys
and values are persisted as bytestrings.

This patch fixes the bug by changing diskfile write_metadata()
function to encode all unicode metadata keys and values as utf8
encoded bytes before writing to disk. To cope with existing objects
that have a mixture of unicode and bytestring metadata, the diskfile
read_metadata() function is also changed so that all returned unicode
metadata keys and values are utf8 encoded. This ensures that
ssync_sender.send_put() (and any other caller of diskfile
read_metadata) only reads bytestrings from object metadata.
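
A minimal sketch of the normalization (Python 2 era, matching the
commit; illustrative rather than the exact patch):

    def encode_metadata(metadata):
        def utf8(value):
            if isinstance(value, unicode):  # Python 2 text type
                return value.encode('utf8')
            return value
        return dict((utf8(k), utf8(v)) for k, v in metadata.items())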

Closes-Bug: #1678018
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: Ic23c55754ee142f6f5388dcda592a3afc9845c39
2017-04-19 18:05:52 +01:00
Clay Gerrard
b41f47f0e0 Follow up tests for get_hashes regression
IMHO we shouldn't ever trust the invalidations file so much that we try
to skip a listdir when creating a hashes.pkl for the first time.
Looking back on the related patch and its related patches, there may be
some subtle races.

This just makes some assertions to help demonstrate we should maintain
the invariant of setting hashes to valid via listdir.

Change-Id: I767e34a405de7911e9596e038e58a9a29f57a8f8
Related-Change-Id: I08c8cf09282f737103e580c1f57923b399abe58c
2017-04-19 12:03:15 +01:00
Jenkins
a22208043f Merge "Modify _get_hashes() arguments to be more generic" 2017-04-10 22:50:11 +00:00
Jenkins
b3e69acb43 Merge "Fix race when consolidating new partition" 2017-04-08 00:55:23 +00:00
Alexandre Lécuyer
95905b0174 Modify _get_hashes() arguments to be more generic
Some public functions in the diskfile manager expect or return full
file paths, which ties them to a filesystem-based diskfile
implementation. To make it easier to plug in alternate diskfile
implementations, patch these functions to take more generic arguments.

This commit changes DiskFileManager _get_hashes() arguments from:
  - partition_path, recalculate=None, do_listdir=False
to:
  - device, partition, policy, recalculate=None, do_listdir=False

Callers are modified accordingly, in diskfile.py, reconstructor.py,
and replicator.py
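
A hedged sketch of the shape of the change (names and path layout are
simplified, not the actual patch):

    import os

    class DiskFileManager(object):
        def __init__(self, devices):
            self.devices = devices

        def _get_hashes(self, device, partition, policy,
                        recalculate=None, do_listdir=False):
            # Only the manager knows how (device, partition, policy)
            # maps to a path; alternate diskfile implementations may
            # not use filesystem paths at all.
            path = os.path.join(self.devices, device,
                                'objects-%s' % policy, str(partition))
            return self._get_hashes_for_path(path, recalculate,
                                             do_listdir)

        def _get_hashes_for_path(self, partition_path,
                                 recalculate, do_listdir):
            return True, {}  # stand-in for the path-based version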

Change-Id: I8e2d7075572e466ae2fa5ebef5e31d87eed90fec
2017-03-29 14:57:40 +02:00
Alexandre Lécuyer
cff7455a68 Remove unused returned value object_path from yield_hashes()
Some public functions in the diskfile manager expect or return full
file paths, which ties them to a filesystem-based diskfile
implementation. To make it easier to plug in alternate diskfile
implementations, patch these functions to take more generic arguments.

This commit changes DiskFileManager yield_hashes() returned values
from:
        - object_path, object_hash, timestamps
to:
        - object_hash, timestamps

object_path was not used by any caller.

Change-Id: I914fb1ec8ce7c9c26d22e1d07f03bd03f4504176
2017-03-27 17:18:28 +02:00
Alistair Coles
3b83bd42a6 Remove duplicate code in test_diskfile.py
DiskFileMixin and DiskFileManagerMixin have almost
identical setUp() and tearDown() methods, and both
inherit BaseDiskFileTestMixin, so this moves the common
code into the abstract superclass.

Also remove repeated declaration of vars in
test_diskfile.py:run_quarantine_invalids
and a duplicated qualified import in obj/test_server.py

Change-Id: Id99ba151c7802c3b61e483a7e766bf6f2b2fe3df
2017-03-21 11:37:33 +00:00
Alistair Coles
52a23ddb3c Fix race when consolidating new partition
Suffix hash invalidations in hashes.invalid can be lost when two
concurrent calls to get_hashes consolidate the hashes of a new
partition with no hashes.pkl:

- suffix S has been invalidated and is listed in hashes.invalid
- process X calls get_hashes when there is no existing hashes.pkl
- process X removes the hashes.invalid file in consolidate_hashes
- process X calculates the hash of suffix S (note, process X has
  not yet written hashes.pkl)
- process Y invalidates suffix S, appends S to hashes.invalid, so the
  hash of suffix S *should* be recalculated at some point
- process Z calls get_hashes->consolidate_hashes, deletes hashes.invalid
  because there is still no hashes.pkl
- process Z fails
- process X writes hashes.pkl with stale hash value for suffix S
- the invalidation of S that was made by process Y is lost

The solution is to never remove hashes.invalid during consolidate_hashes
without first recording any invalid suffixes in hashes and writing hashes
to disk in hashes.pkl. This is already the behaviour when hashes.pkl
exists. The cost of an additional write to hashes.pkl, introduced by this
patch, is only incurred once, when get_hashes first encounters a
partition with no hashes.pkl.
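
A hedged, much-simplified sketch of the resulting invariant (no
locking shown; the real consolidate_hashes runs under a partition
lock):

    import os, pickle

    def consolidate_hashes(partition_dir):
        hashes_file = os.path.join(partition_dir, 'hashes.pkl')
        invalidations_file = os.path.join(partition_dir, 'hashes.invalid')
        hashes = {}
        if os.path.exists(hashes_file):
            with open(hashes_file, 'rb') as f:
                hashes = pickle.load(f)
        if os.path.exists(invalidations_file):
            with open(invalidations_file) as f:
                for line in f:
                    suffix = line.strip()
                    if suffix:
                        hashes[suffix] = None  # needs recalculation
            # Persist first: a concurrent invalidation can then never
            # be lost by truncating the journal before the index is
            # written.
            with open(hashes_file, 'wb') as f:
                pickle.dump(hashes, f)
            with open(invalidations_file, 'w'):
                pass  # truncate only after hashes.pkl is on disk
        return hashes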

Related-Change: Ia43ec2cf7ab715ec37f0044625a10aeb6420f6e3

Change-Id: I08c8cf09282f737103e580c1f57923b399abe58c
2017-03-20 12:49:57 +00:00
Alistair Coles
6249945a4f Fix misleading hash invalidations test comments
...and refactor two extremely similar tests to use a single
helper method - the only parameterization being the existence
or not of hashes.pkl at the start of the test.

Change-Id: I601218a9a031e7fc77bc53ea735e89700ec1647d
Related-Change: Ia43ec2cf7ab715ec37f0044625a10aeb6420f6e3
2017-02-09 09:49:32 +00:00
Clay Gerrard
aa71d7e77b Better optimistic lock in get_hashes
mtime and force_rewrite have a *long* tangled history starting back in
lp bug #1089140 that's been carried through many refactors.

Using force_rewrite on errors reading from the pickle has always been a
read-modify-write race; but maybe less bad than the infinite recursion
bug it fixed?

Using getmtime has always had somewhat dubious resolution for race
detection - the only way to be sure the content of the file is the same
as when we read it without locking is to open the file up and check.

Unfortunately, the ondisk data wasn't rich enough to disambiguate when
the represented ondisk state may have changed (e.g. when an invalidation
for a suffix currently being hashed is consolidated, or if all hashes
are invalid, like after an error reading the hashes.pkl) - so we also
add a key with a timestamp for race detection, and write down whether
the dictionary has any valid suffix hashes.

Along the way, we accidentally fix a serious performance regression with
hash invalidations...

We currently rehash all invalid suffixes twice on REPLICATE calls.

First we consolidate hashes, marking all invalid suffixes as None,
and then perform the first suffix rehashing.

And then, *every time*, as soon as we get done with the first pass, we
throw all that work we just did on the floor and rehash ALL the invalid
suffixes *again*, because the race detector erroneously notices the
hashes.pkl file has been "modified while we were hashing".

But we're not in a race.  We took the mtime before calling consolidate
hashes, and consolidate hashes modified the pickle when it wrote back the
invalid suffixes.

FWIW, since consolidate hashes operates under a directory lock it can't
race - but we don't want suffix rehashing to hold the directory lock
that long, so we use optimistic locking - i.e. we optimistically perform
the rehash without a lock and write back the update iff it hasn't
changed since the read; if it has, we retry the whole operation.
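
A hedged, generic sketch of the optimistic-lock pattern (in-memory
stand-ins; the real version works on hashes.pkl and uses a timestamp
key for race detection):

    import threading

    _lock = threading.Lock()
    _state = {'hashes': {}, 'key': 0}  # 'key' stands in for the
                                       # race-detection timestamp

    def get_hashes(rehash):
        while True:
            snapshot_key = _state['key']
            hashes = rehash(dict(_state['hashes']))  # slow; no lock held
            with _lock:
                if _state['key'] == snapshot_key:  # unchanged since read
                    _state['hashes'] = hashes
                    _state['key'] += 1
                    return hashes
            # state changed while we were hashing - retry on fresh input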

UpgradeImpact:

If you upgrade and need to rollback - delete all hashes.pkl:

    rm /srv/node*/*/object*/*/hashes.pkl

Anything of significance achieved here was blatantly plagiarised from
the work of others:

Co-Author: Pavel Kvasnička <pavel.kvasnicka@firma.seznam.cz>
Related-Change-Id: I64cadb1a3feb4d819d545137eecfc295389794f0
Co-Author: Alistair Coles <alistair.coles@hpe.com>
Related-Change-Id: I8f6bb89beaaca3beec2e6063299189f52e9eee51
Related-Change-Id: I08c8cf09282f737103e580c1f57923b399abe58c

Change-Id: Ia43ec2cf7ab715ec37f0044625a10aeb6420f6e3
2017-01-31 22:14:28 +00:00
Clay Gerrard
442cc1d16d Fix race in new partitions detecting new/invalid suffixes.
The assumption that we don't need to write an entry in the invalidations
file when the hashes.pkl does not exist turned out to be a premature
optimization and also wrong.

Primarily we should recognize the creation of hashes.pkl is the first
thing that happens in a part when it lands on a new primary.  The code
should be optimized toward the assumption of the most common disk state.

Also, in this case the extra stat calls to check whether hashes.pkl
exists were not only un-optimized, but introduced a race.

Consider the common case:

proc 1                         | proc 2
-------------------------------|---------------------------
a) read then truncate journal  |
b) do work                     | c) append to journal
d) apply "a" to index          |

The index written at "d" may not (yet) reflect the entry written by proc
2 at "c"; however, it's clearly in the journal so it's easy to see we're
safe.

Adding in the extra stat call for the index existence check increases
the state which can affect correctness.

proc 1                        | proc 2
------------------------------|---------------------------
a) no index, truncate journal |
b) do work                    | b) iff index exists
                              | c) append to journal
d) apply (or create) index    |

If step "c" doesn't happen because the index does not yet exist - the
update is clearly lost.

In our case we'd skip marking a suffix as invalid when the hashes.pkl
does not exist because we know "the next time we rehash" we'll have to
os.listdir the suffixes anyway.  But if another process is *currently*
rehashing (and has already done its os.listdir), we've just dropped an
invalidation on the floor.

Don't do that.

Instead - write down the invalidation.  The running rehash is welcome to
proceed on outdated information - as long as the next pass will grab and
hash the new suffix.
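
A hedged sketch of the fixed behavior (illustrative; the real
invalidate_hash also serializes writers with a lock):

    import os

    def invalidate_hash(suffix_dir):
        partition_dir = os.path.dirname(suffix_dir)
        suffix = os.path.basename(suffix_dir)
        invalidations_file = os.path.join(partition_dir, 'hashes.invalid')
        # No "does hashes.pkl exist?" stat: it was an extra syscall
        # and, as described above, a window for losing the invalidation.
        with open(invalidations_file, 'a') as f:
            f.write(suffix + '\n')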

Known-Issue(s):

If the suffix already exists there's an even chance the running rehash
will hash in the very update for which we want to invalidate the suffix,
but that's OK; it's idempotent.

Co-Author: Pavel Kvasnička <pavel.kvasnicka@firma.seznam.cz>
Co-Author: Alistair Coles <alistair.coles@hpe.com>
Co-Author: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
Related-Change-Id: I64cadb1a3feb4d819d545137eecfc295389794f0
Change-Id: I2b48238d9d684e831d9777a7b18f91a3cef57cd1
Closes-Bug: #1651530
2017-01-23 16:09:43 +00:00
Jenkins
ffd099c26a Merge "Simplify get_different_suffix_df args" 2017-01-23 15:21:41 +00:00
Jenkins
4970277232 Merge "Optimize noop case for suffix rehash" 2017-01-21 01:28:00 +00:00
Jenkins
a1c88b906b Merge "Extract test pattern to helper" 2017-01-21 01:26:43 +00:00
Tim Burke
c33f2b49dd Simplify get_different_suffix_df args
Change-Id: I18a775c5b61c43c112f3658f9c27c7d4149ebbef
Related-Change: I3a661fae5c7cfeb2dbcdb7f46941f55244d0b9ad
2017-01-20 19:41:15 +00:00
Jenkins
f4f48f3501 Merge "Move documented reclaim_age option to correct location" 2017-01-16 16:35:52 +00:00
Jenkins
69b93dd011 Merge "Test current reclaim_age handling" 2017-01-14 00:58:12 +00:00
Clay Gerrard
95c7c0109b Optimize noop case for suffix rehash
REPLICATE calls where everything is in sync and no suffixes have been
invalidated are supposed to be pretty common and fairly cheap.  If an
invalidations file is empty, there's no need to perform a truncating
write that will presumably, at some point, have to be flushed.
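
A hedged sketch of the noop path (illustrative):

    import os

    def maybe_truncate(invalidations_file):
        # An empty journal needs no write; truncating it anyway would
        # dirty the inode and eventually force a flush for nothing.
        if os.path.getsize(invalidations_file) > 0:
            with open(invalidations_file, 'wb'):
                pass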

Everyone owes Pavel a quarter for the one billion fewer write ops in prod

... and Matt a nickel for the 20ms of sleep removed from every unittest run

As a drive-by I remove a crufty, confusing OSError exception handler
around an open call that would be using O_CREAT in a directory that it
either just created or opened a file from - it wasn't going to raise
ENOENT.  Similarly, rather than lose sleep trying to reason about all
the crazy exceptions that actually *could* pop anywhere in this method,
I instead improve the logging where any such exception would be caught.
This way we can get the details we need to focus on only the errors
that actually happen.

Author: Pavel Kvasnička <pavel.kvasnicka@firma.seznam.cz>
Co-Author: Matthew Oliver <matt@oliver.net.au>

Related-Change-Id: I64cadb1a3feb4d819d545137eecfc295389794f0

Change-Id: If712e4431322df5c3e84808ab2d815fd06c76426
2017-01-13 09:53:22 +00:00
Clay Gerrard
e772cf95c6 Extract test pattern to helper
An existing test in diskfile established prior art for a pattern to
create a diskfile with a different suffix - I'd like to make use of it
in new tests in multiple unrelated change sets.

Also add a test to demonstrate some existing robustness and prevent
regression.

Author: Pavel Kvasnička <pavel.kvasnicka@firma.seznam.cz>
Co-Author: Alistair Coles <alistair.coles@hpe.com>
Related-Change-Id: I64cadb1a3feb4d819d545137eecfc295389794f0
Change-Id: I3a661fae5c7cfeb2dbcdb7f46941f55244d0b9ad
2017-01-13 01:52:25 -08:00
Mahati Chamarthy
69f7be99a6 Move documented reclaim_age option to correct location
The reclaim_age is a DiskFile option; it doesn't make sense for two
different object services or nodes to use different values.

As a drive-by, I also clean up the reclaim_age plumbing from get_hashes
to cleanup_ondisk_files, since it's a method on the Manager and has
access to the configured reclaim_age.  This fixes a bug where
finalize_put wouldn't use the [DEFAULT]/object-server configured
reclaim_age - which is normally benign but leads to weird behavior on
DELETE requests with a really small reclaim_age.

There's a couple of places in the replicator and reconstructor that
reach into their manager to borrow the reclaim_age when emptying out
the aborted PUTs that failed to cleanup their files in tmp - but that
timeout doesn't really need to be coupled with reclaim_age and that
method could have just as reasonably been implemented on the Manager.

UpgradeImpact: Previously the reclaim_age was documented to be
configurable in various object-* services config sections, but that did
not work correctly unless you also configured the option for the
object-server because of REPLICATE request rehash cleanup.  All object
services must use the same reclaim_age.  If you require a non-default
reclaim age it should be set in the [DEFAULT] section.  If there are
different non-default values, the greater should be used for all object
services and configured only in the [DEFAULT] section.

If you specify a reclaim_age value in any object-related config, you
should move it to *only* the [DEFAULT] section before you upgrade.  If
you configure a reclaim_age less than your consistency window, you are
likely to be eaten by a Grue.
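
For example, the recommended placement in object-server.conf (a hedged
illustration; 604800 seconds is the one-week default):

    [DEFAULT]
    # one value for ALL object services on this node
    reclaim_age = 604800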

Closes-Bug: #1626296

Change-Id: I2b9189941ac29f6e3be69f76ff1c416315270916
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
2017-01-13 03:10:47 +00:00
Alistair Coles
1a8085fc41 Test current reclaim_age handling
These are just tests to demonstrate the current behavior - and provide
context for behavioral differences/consistency in the related change.

See the related change for context.

The related change slightly modifies the outcome of the
reclaim/REPLICATE test.

The related change has no effect on the config handling.

The related change greatly simplifies the reclaim_age option's
cleanup_ondisk_files plumbing tests.

Related-Change-Id: I2b9189941ac29f6e3be69f76ff1c416315270916
Co-Author: Clay Gerrard <clay.gerrard@gmail.com>

Change-Id: I5b5f90bb898a335e6336f043710a05a44e3b810f
2017-01-12 19:08:50 -08:00
Cao Xuan Hoang
a67bb2e249 Removes unnecessary utf-8 encoding
The following files declare a utf-8 encoding that is never used, so we
can remove it entirely:

test/functional/test_access_control.py
test/unit/common/middleware/crypto/test_keymaster.py
test/unit/obj/test_diskfile.py

Change-Id: I00adc968872ebe9f9c0619a4e393e048c7c1a91e
2016-12-22 10:49:56 +07:00
Cao Xuan Hoang
d4da920d9b Use assertGreater(len(x), 0) instead of assertTrue(len(x) > 0)
assertGreater provides a nicer error message if it fails.

Change-Id: I5b045042b5991280a5b6a12ccde09fa733a19e26
2016-12-08 15:45:24 +07:00
Pavel Kvasnička
8ac432fff3 Fixed regression in consolidate_hashes
This occurs when a new file is stored to a new suffix in a non-empty
partition: the suffix is added to the invalidations file but not to the
hashes pickle file. When this partition is replicated, replication of
the suffix completes only on the first and every 10th run of the
replicator. Rsync runs on each new suffix because the destination does
not return a hash for the new suffix, even though the suffix content is
in the same state on both ends. This bug was introduced in 2.7.0.

Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Change-Id: Ie2700f6e6171f2ecfa7d07b0f18b79e90cbf1c8a
Closes-Bug: #1634967
2016-11-25 11:40:48 +00:00
Jenkins
2b116cffd3 Merge "Include debug message for rsync tempfiles" 2016-11-11 21:20:59 +00:00
Christian Hugo
b69caeae7d Include debug message for rsync tempfiles
The warning message for rsync tempfiles was removed in the related
change.  However, because our regex match might produce a false
positive, it may still be useful to log a debug message.  Instead of
silently ignoring rsync tempfiles, when running in debug we note the
file and how we classified it - but still no warning is emitted.

I also consolidate our use of the regex for rsync tempfiles into the
diskfile module, and move the negative test for the warning logger
message next to the positive test.

Change-Id: Idb2a1a76aa275c9c2e9bad8aceea913b8f5b1c71
Related-Change: I5a5d6e24710e4880776b32edcbc07021acf77676
2016-11-11 11:47:19 +00:00
Alistair Coles
2a75091c58 Make ECDiskFileReader check fragment metadata
This patch makes the ECDiskFileReader check the validity of EC
fragment metadata as it reads chunks from disk and quarantine a
diskfile with bad metadata. This in turn means that both the object
auditor and a proxy GET request will cause bad EC fragments to be
quarantined.

This change is motivated by bug 1631144, which may result in corrupt EC
fragments being written to disk that nevertheless appear valid to the
object auditor's md5 hash and content-length checks.

NotImplemented:

 * perform metadata check when a read starts on any frag_size
   boundary, not just at zero

Related-Bug: #1631144
Closes-Bug: #1633647

Change-Id: Ifa6a7f8aaca94c7d39f4aeb9d4fa3f59c4f6ee13
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
2016-11-01 13:11:02 -07:00
Christian Hugo
2bd8d050fb Suppress unexpected file warnings for rsync temp files
Do not log an unexpected-file warning for rsync temp files when parsing
of the timestamp fails.  If the file passes a regex test, suppress the
logger warning, but still return it as an unexpected file from
get_ondisk_files.
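
A hedged sketch of the classification (the exact pattern is
illustrative; rsync temp files look like ".<name>.<6 random chars>"):

    import re

    RE_RSYNC_TEMPFILE = re.compile(r'^\..*\.([a-zA-Z0-9_]){6}$')

    def is_rsync_tempfile(filename):
        return RE_RSYNC_TEMPFILE.match(filename) is not None

    assert is_rsync_tempfile('.1471512345.67890.data.Ab3dEf')
    assert not is_rsync_tempfile('1471512345.67890.data')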

Closes-Bug: 1616504

Change-Id: I5a5d6e24710e4880776b32edcbc07021acf77676
2016-10-31 12:34:41 -07:00
Jenkins
20c143e7a3 Merge "Throttle update_auditor_status calls" 2016-10-13 04:11:05 +00:00
Christian Schwede
77f5b20124 Throttle update_auditor_status calls
If there are quite a few nearly empty partitions per disk you might see
some write load even if your cluster is unused. The auditor will update
the status file after every partition, and this might happen multiple
times within a second if there is not much data stored yet.

This patch throttles updates, and will only write out an updated status
if the file was last updated more than a minute ago.
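
A hedged sketch of the throttle (names illustrative):

    import time

    _last_status_update = 0

    def maybe_update_auditor_status(status_path, status):
        global _last_status_update
        now = time.time()
        if now - _last_status_update < 60:
            return False  # throttled: written less than a minute ago
        with open(status_path, 'w') as f:
            f.write(status)
        _last_status_update = now
        return True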

Closes-Bug: 1631352
Change-Id: Ib61ec9cd945e6b2d28756f6ca47801674a7e6060
2016-10-12 07:13:43 +00:00
Alistair Coles
b13b49a27c EC - eliminate .durable files
Instead of using a separate .durable file to indicate
the durable status of a .data file, rename the .data
to include a durable marker in the filename. This saves
one inode for every EC fragment archive.

An EC policy PUT will, as before, first rename a temp
file to:

   <timestamp>#<frag_index>.data

but now, when the object is committed, that file will be
renamed:

   <timestamp>#<frag_index>#d.data

with the '#d' suffix marking the data file as durable.
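
A hedged sketch of parsing the new-style filename, durable or not
(simplified from the actual on-disk naming):

    def parse_ec_data_filename(filename):
        stem, ext = filename.rsplit('.', 1)
        assert ext == 'data'
        parts = stem.split('#')
        timestamp = parts[0]
        frag_index = int(parts[1])
        durable = len(parts) > 2 and parts[2] == 'd'
        return timestamp, frag_index, durable

    assert parse_ec_data_filename('1471512345.17333#3#d.data') == \
        ('1471512345.17333', 3, True)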

Diskfile suffix hashing returns the same result when the
new durable-data filename or the legacy durable file is
found in an object directory. A fragment archive that has
been created on an upgraded object server will therefore
appear to be in the same state, as far as the consistency
engine is concerned, as the same fragment archive created
on an older object server.

Since legacy .durable files will still exist in deployed
clusters, many of the unit tests scenarios have been
duplicated for both new durable-data filenames and legacy
durable files.

Change-Id: I6f1f62d47be0b0ac7919888c77480a636f11f607
2016-10-10 18:11:02 +01:00
Jenkins
af98608c14 Merge "Delete old tombstones" 2016-09-22 21:29:10 +00:00
Mahati Chamarthy
81d4673674 Delete old tombstones
- Call invalidate_hash in auditor for reclaimable tombstones
- assert changed auditor behavior with a unit test
- driveby test: assert get_hashes behavior with a unit test

Co-Authored-By: Pete Zaitcev <zaitcev@redhat.com>
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
Closes-Bug: #1301728
Change-Id: I3e99dc702d55a7424c6482969e03cb4afac854a4
2016-09-21 14:09:53 -07:00
Alistair Coles
44a861787a Enable object server to return non-durable data
This patch improves EC GET response handling:

- The proxy no longer requires all object servers to have a
  durable file for the fragment archive that they return in
  response to a GET. The proxy will now be satisfied if just
  one object server has a durable file at the same timestamp
  as fragments from other object servers.

  This means that the proxy can now successfully GET an
  object that had missing durable files when it was PUT.

- The proxy will now ensure that it has a quorum of *unique*
  fragment indexes from object servers before considering a
  GET to be successful.

- The proxy is now able to fetch multiple fragment archives
  having different indexes from the same node. This enables
  the proxy to successfully GET an object that has some
  fragments that have landed on the same node, for example
  after a rebalance.

This new behavior is facilitated by an exchange of new
headers on a GET request and response between the proxy and
object servers.

An object server now includes with a GET (or HEAD) response:

- X-Backend-Fragments: the value of this describes all
  fragment archive indexes that the server has for the
  object by encoding a map of the form: timestamp -> <list
  of fragment indexes>

- X-Backend-Durable-Timestamp: the value of this is the
  internal form of the timestamp of the newest durable file
  that was found, if any.

- X-Backend-Data-Timestamp: the value of this is the
  internal form of the timestamp of the data file that was
  used to construct the diskfile.

A proxy server now includes with a GET request:

- X-Backend-Fragment-Preferences: the value of this
  describes the proxy's current preference with respect to
  those fragments that it would have object servers
  return. It encodes a list of timestamp, and for each
  timestamp a list of fragment indexes that the proxy does
  NOT require (because it already has them).

  The presence of a X-Backend-Fragment-Preferences header
  (even one with an empty list as its value) will cause the
  object server to search for the most appropriate fragment
  to return, disregarding the existence or not of any
  durable file. The object server assumes that the proxy
  knows best.
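
An illustrative sketch of the exchange (the value encodings shown are
assumptions, not necessarily the exact wire format):

    import json

    # proxy -> object server: "at this timestamp I already have
    # fragment indexes 1 and 3; prefer anything else"
    request_headers = {
        'X-Backend-Fragment-Preferences': json.dumps(
            [{'timestamp': '1471512345.17333', 'exclude': [1, 3]}]),
    }

    # object server -> proxy: everything it has, plus durable/data times
    response_headers = {
        'X-Backend-Fragments': json.dumps({'1471512345.17333': [0, 2]}),
        'X-Backend-Durable-Timestamp': '1471512345.17333',
        'X-Backend-Data-Timestamp': '1471512345.17333',
    }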

Closes-Bug: 1469094
Closes-Bug: 1484598

Change-Id: I2310981fd1c4622ff5d1a739cbcc59637ffe3fc3
Co-Authored-By: Paul Luse <paul.e.luse@intel.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
2016-09-16 11:40:14 +01:00
Prashanth Pai
773edb4a5d Make object creation more atomic in Linux
Linux 3.11 introduced O_TMPFILE as a flag to the open() sys call. This
enables users to get an fd to an unnamed temporary file. As it's
unnamed, it does not require the caller to devise unique names. It is
also not accessible through any path. Hence, file creation is race-free.

This file is initially unreachable. It is then populated with data
(write), metadata (fsetxattr) and fsync'd before being atomically linked
into the filesystem in a fully formed state using the linkat() sys call.
Only after a successful linkat() will the object file be available for
reference.
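
A hedged sketch of the pattern (Python 3 on Linux; names illustrative):

    import os

    def atomic_create(dirpath, destpath, data):
        # open an unnamed file in dirpath: no name, no tmp path, no races
        fd = os.open(dirpath, os.O_TMPFILE | os.O_WRONLY, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)
            # linkat() the anonymous fd into the namespace, fully
            # formed; unlike rename(), this fails if destpath exists
            os.link('/proc/self/fd/%d' % fd, destpath)
        finally:
            os.close(fd)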

Caveats
* Unlike os.rename(), linkat() cannot overwrite the destination path if
  it already exists. If the path exists, we unlink and try again.
* XFS support for O_TMPFILE was only added in Linux 3.15.
* If client disconnects during object upload, although there is no
  incomplete/stale file on disk, the object directory would persist
  and is not cleaned up immediately.

Change-Id: I8402439fab3aba5d7af449b5e465f89332f606ec
Signed-off-by: Prashanth Pai <ppai@redhat.com>
2016-08-24 14:56:00 +05:30
Jenkins
79be80f126 Merge "pickle_async_update should create tmp_dir" 2016-07-05 18:40:08 +00:00
Alistair Coles
3ad003cf51 Enable middleware to set metadata on object POST
Adds a new form of system metadata for objects.

Sysmeta cannot be updated by an object POST because
that would cause all existing sysmeta to be deleted.
Crypto middleware will want to add 'system' metadata
to object metadata on PUTs and POSTs, but it is OK
for this metadata to be replaced en masse on every
POST.

This patch introduces x-object-transient-sysmeta-*
that is persisted by object servers and returned
in GET and HEAD responses, just like user metadata,
without polluting the x-object-meta-* namespace.
All headers in this namespace will be filtered
inbound and outbound by the gatekeeper, so cannot
be set or read by clients.
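
An illustrative example of the namespace in use (header names and
values hypothetical):

    # persisted and returned like user metadata, but stripped from
    # client traffic by the gatekeeper middleware
    headers = {
        'X-Object-Transient-Sysmeta-Crypto-Meta': '<middleware value>',
        'X-Object-Meta-Color': 'blue',  # ordinary user metadata
    }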

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Janie Richling <jrichli@us.ibm.com>

Change-Id: I5075493329935ba6790543fc82ea6e039704811d
2016-06-28 11:00:33 +01:00
Or Ozeri
da4a59f8e2 pickle_async_update should create tmp_dir
While creating a probe test for the expirer daemon, I found
the following error scenario:

1. Introduce a new object server. Initially it doesn't have a tmp_dir.
2. Have the object-replicator replicate some objects, one of them
    with an expiration (X-Delete-At).
3. Send a DELETE request for the expired object.

While beginning to process the DELETE request, the fresh
object server still doesn't have a tmp_dir created.
Since the object has an old expiration value, the object server
will first call "delete_at_update", before creating a tombstone.
delete_at_update then must create an async_pending,
which will lead to an IO error, since tmp_dir doesn't exist.

As noted above, I have witnessed this in practice in the probe test I
wrote at https://review.openstack.org/#/c/326903/.

This patch changes pickle_async_update behavior to create
tmp_dir, in case it doesn't exist.
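
A hedged sketch of the fix (illustrative):

    import os

    def pickle_async_update(device_path, write_pickle):
        tmp_dir = os.path.join(device_path, 'tmp')
        # a freshly added object server has no tmp_dir yet; create it
        os.makedirs(tmp_dir, exist_ok=True)
        write_pickle(tmp_dir)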

Change-Id: I88b0e5f75a2a28d6880694ff327ac2763c816d24
2016-06-16 11:14:14 +03:00
Jenkins
ffef6105cd Merge "Rename hash_cleanup_listdir tests" 2016-05-11 21:20:28 +00:00
Jenkins
f66898ae00 Merge "Remove ThreadPool class" 2016-05-11 01:37:40 +00:00
Alistair Coles
ba1a568f81 Rename hash_cleanup_listdir tests
hash_cleanup_listdir was removed in [1]; this patch renames all
references to it in test_diskfile to refer to the cleanup_ondisk_files
method that is now tested directly.

Also remove the final references to the now non-existent
function in a few comments.

[1] I0b96dfde32b4c666eebda6e88228516dd693ef92

Change-Id: I1e151799fc2774de9a1af092afff875af24a630c
Related-Bug: #1550569
2016-05-09 11:10:21 +01:00