15 Commits

Samuel Merritt
728b4ba140 Add checksum to object extended attributes
Currently, our integrity checking for objects is pretty weak when it
comes to object metadata. If the extended attributes on a .data or
.meta file get corrupted in such a way that we can still unpickle it,
we don't have anything that detects that.

This could be especially bad with encrypted etags; if the encrypted
etag (X-Object-Sysmeta-Crypto-Etag or whatever it is) gets some bits
flipped, then we'll cheerfully decrypt the cipherjunk into plainjunk,
then send it to the client. Net effect is that the client sees a GET
response with an ETag that doesn't match the MD5 of the object *and*
Swift has no way of detecting and quarantining this object.

Note that, with an unencrypted object, if the ETag metadatum gets
mangled, then the object will be quarantined by the object server or
auditor, whichever notices first.
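A minimal sketch of the idea, with illustrative key names (Swift's pickled metadata really lives under 'user.swift.metadata', but the checksum key and layout here are assumptions, not the patch's actual code):

    import hashlib
    import pickle

    import xattr

    METADATA_KEY = 'user.swift.metadata'            # real Swift key
    CHECKSUM_KEY = 'user.swift.metadata_checksum'   # illustrative key name

    def write_metadata(fd, metadata):
        # Store an md5 of the serialized metadata alongside it, so a
        # later read can detect bit-flips in the xattr payload.
        blob = pickle.dumps(metadata, protocol=2)
        xattr.setxattr(fd, METADATA_KEY, blob)
        xattr.setxattr(fd, CHECKSUM_KEY,
                       hashlib.md5(blob).hexdigest().encode('ascii'))

    def read_metadata(fd):
        blob = xattr.getxattr(fd, METADATA_KEY)
        expected = xattr.getxattr(fd, CHECKSUM_KEY).decode('ascii')
        if hashlib.md5(blob).hexdigest() != expected:
            # a mismatch is grounds for quarantining the diskfile
            raise IOError('metadata checksum mismatch')
        return pickle.loads(blob)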

As part of this commit, I also ripped out some mocking of
getxattr/setxattr in tests. It appears to be there to allow unit tests
to run on systems where /tmp doesn't support xattrs. However, since
the mock is keyed off of inode number and inode numbers get re-used,
there's lots of leakage between different test runs. On a real FS,
unlinking a file and then creating a new one of the same name will
also reset the xattrs; this isn't the case with the mock.

The mock was pretty old; Ubuntu 12.04 and up all support xattrs in
/tmp, and recent Red Hat / CentOS releases do too. The xattr mock was
added in 2011; maybe it was to support Ubuntu Lucid Lynx?

Bonus: now you can pause a test with the debugger, inspect its files
in /tmp, and actually see the xattrs along with the data.

Since this patch now uses a real filesystem for testing filesystem
operations, tests are skipped if the underlying filesystem does not
support setting xattrs (e.g. tmpfs, or ext4 with more than 4k of xattrs).

References to "/tmp" have been replaced with calls to
tempfile.gettempdir(). This will allow setting the TMPDIR envvar in
test setup and getting an XFS filesystem instead of ext4 or tmpfs.
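For example, a test harness could point the tests at an XFS scratch directory like this (the path is illustrative):

    import os
    import tempfile

    os.environ['TMPDIR'] = '/mnt/xfs-scratch'  # illustrative XFS mount
    tempfile.tempdir = None        # clear any cached value
    print(tempfile.gettempdir())   # now returns /mnt/xfs-scratch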

THIS PATCH SIGNIFICANTLY CHANGES TESTING ENVIRONMENTS

With this patch, every test environment will require TMPDIR to point
to a filesystem that supports at least 4k of extended attributes.
Neither ext4 nor tmpfs supports this. XFS is recommended.

So why all the SkipTests? Why not simply raise an error? We still need
the tests to run on the base image for OpenStack's CI system. Since
we were previously mocking out xattr, there wasn't a problem, but we
also weren't actually testing anything. This patch adds functionality
to validate xattr data, so we need to drop the mock.

`test.unit.skip_if_no_xattrs()` is also imported into `test.functional`
so that functional tests can import it from the functional test
namespace.
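A sketch of how such a guard can probe for xattr support (simplified; the real helper's details may differ):

    import tempfile
    import unittest

    import xattr

    def skip_if_no_xattrs():
        # Try to set one xattr on a scratch file in TMPDIR; skip the
        # calling test if the filesystem refuses.
        with tempfile.NamedTemporaryFile() as f:
            try:
                xattr.setxattr(f.fileno(), 'user.test.probe', b'1')
            except (IOError, OSError):
                raise unittest.SkipTest(
                    'xattrs not supported in %s' % tempfile.gettempdir())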

The related OpenStack CI infrastructure changes are made in
https://review.openstack.org/#/c/394600/.

Co-Authored-By: John Dickinson <me@not.mn>

Change-Id: I98a37c0d451f4960b7a12f648e4405c6c6716808
2017-11-03 13:30:05 -04:00
Romain LE DISEZ
e199192cae Replace replication_one_per_device by custom count
This commit replaces boolean replication_one_per_device by an integer
replication_concurrency_per_device. The new configuration parameter is
passed to utils.lock_path(), which now accepts as an argument a limit for
the number of locks that can be acquired for a specific path.

Instead of trying to lock path/.lock, utils.lock_path() now tries to lock
files path/.lock-X, where X is in the range (0, N), N being the limit for
the number of locks allowed for the path. The default value of limit is
set to 1.
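A simplified sketch of the slot scheme (the real utils.lock_path() also retries until a timeout; this version makes a single non-blocking pass):

    import errno
    import fcntl
    import os
    from contextlib import contextmanager

    @contextmanager
    def lock_path(path, limit=1):
        # Holding any one of .lock-0 .. .lock-<limit-1> counts as
        # holding a lock on the path, so up to `limit` holders coexist.
        for i in range(limit):
            fd = os.open(os.path.join(path, '.lock-%d' % i),
                         os.O_WRONLY | os.O_CREAT)
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except IOError:
                os.close(fd)
                continue
            try:
                yield fd
                return
            finally:
                os.close(fd)
        raise IOError(errno.EAGAIN, 'no free lock slot under %s' % path)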

Change-Id: I3c3193344c7a57a8a4fc7932d1b10e702efd3572
2017-10-24 16:17:41 +01:00
Romain LE DISEZ
69df458254 Allow to rebuild a fragment of an expired object
When a fragment of an expired object was missing, the reconstructor
ssync job would send a DELETE sub-request. This leads to a situation
where, for the same object and timestamp, some nodes have a data
file while others have a tombstone file.

This patch forces the reconstructor to reconstruct a data file, even
for expired objects. DELETE requests are only sent for tombstoned
objects.

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Closes-Bug: #1652323
Change-Id: I7f90b732c3268cb852b64f17555c631d668044a8
2017-08-04 23:05:08 +02:00
Jenkins
45b17d89c7 Merge "Fix SSYNC failing to replicate unexpired object" 2017-05-31 22:49:55 +00:00
Pete Zaitcev
5dfc3a75fb Open-code eventlet.listen()
Recently our gate started blowing up intermittently with a strange
case of mixed-up ports. Sometimes a functional test tries to
authorize on a port that's clearly an object server port, and
the like. As it turns out, eventlet developers added an unavoidable
SO_REUSEPORT into listen(), which makes listen(("localhost", 0))
reuse ports.

There's an issue about it:
 https://github.com/eventlet/eventlet/issues/411

This patch is working around the problem while eventlet people
consider the issue.
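The workaround amounts to building the listening socket by hand so that no SO_REUSEPORT gets set; roughly (using the stdlib socket API for illustration):

    import socket

    def listen(addr, backlog=50):
        # Equivalent of eventlet.listen() minus the implicit
        # SO_REUSEPORT, so two test servers can no longer end up
        # sharing one port.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(addr)
        sock.listen(backlog)
        return sock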

Change-Id: I67522909f96495a6a30e1acdb79835dce2189549
2017-05-11 01:39:14 -06:00
Romain LE DISEZ
38d35797df Fix SSYNC failing to replicate unexpired object
Fix a situation where SSYNC would fail to replicate a valid object because
the datafile contains an expired X-Delete-At value while a metafile
contains no X-Delete-At information. Example:
 - 1454619054.02968.data => contains X-Delete-At: 1454619654
 - 1454619056.04876.meta => does not contain X-Delete-At info

In this situation, the replicator tries to PUT the datafile, and then
to POST the metadata. Previously, if the receiver has the datafile but
current time is greater than the X-Delete-At, then it considers it to
be expired and requests no updates from the sender, so the metafile is
never synced. If the receiver does not have the datafile then it does
request updates from the sender, but the ssync PUT subrequest is
refused if the current time is greater than the X-Delete-At (the
object is expired). If the datafile is transferred, the ssync POST
subrequest fails because the object does not exist (expired).

This commit allows PUT and POST to work so that the object can be
replicated, by enabling the receiver object server to open expired
diskfiles when handling replication requests.
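The receiver-side rule can be sketched like this (simplified; real ssync subrequests are marked with an X-Backend-Replication header):

    import time

    def may_open(metadata, headers, now=None):
        # Expired diskfiles may still be opened when the request is
        # part of replication, so ssync PUT/POST subrequests succeed.
        now = time.time() if now is None else now
        delete_at = int(metadata.get('X-Delete-At', 0) or 0)
        expired = bool(delete_at) and delete_at <= now
        return not expired or 'X-Backend-Replication' in headers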

Closes-Bug: #1683689
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I919994ead2b20dbb6c5671c208823e8b7f513715
2017-04-26 11:29:40 +01:00
Kota Tsuyuzaki
b09360d447 Fix stats calculation in object-reconstructor
This patch fixes the object-reconstructor to calculate device_count
as the total number of local devices across all policies. Previously
Swift counted devices per policy, so reconstruction_device_count,
which is the number of devices Swift actually needs to reconstruct,
was reported as the sum of the per-policy counts.

With this patch, Swift first gathers all local devices across all
policies, and then collects parts for each device as before. This
makes the remaining jobs/disks percentages reported in the
stats_line output accurate.
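In outline, the device count becomes the size of a union across policies (the helper below is an illustration, not Swift's actual code):

    def count_local_devices(policies, my_replication_ips, my_port):
        # Collect unique local device names across ALL policies,
        # instead of summing a per-policy count.
        devices = set()
        for policy in policies:
            for dev in policy.object_ring.devs:
                if dev and dev['replication_ip'] in my_replication_ips \
                        and dev['replication_port'] == my_port:
                    devices.add(dev['device'])
        return len(devices)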

To enable this change, this patch also touches the object replicator
to get a DiskFileManager via the DiskFileRouter class so that
DiskFileManager instances are policy specific. Currently the same
replication policy DiskFileManager class is always used, but this
change future-proofs the replicator for possible other DiskFileManager
implementations.

The change also gives the ObjectReplicator a _df_router variable,
making it consistent with the ObjectReconstructor, and allowing a
common way for ssync.Sender to access DiskFileManager instances via
its daemon's _df_router instance.
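The lookup pattern now shared by both daemons (conf, logger and policy stand in for the daemon's own attributes):

    from swift.obj.diskfile import DiskFileRouter

    def get_manager(conf, logger, policy):
        # The router hands back a policy-specific DiskFileManager, so
        # ssync code only needs its daemon's _df_router attribute.
        df_router = DiskFileRouter(conf, logger)
        return df_router[policy]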

Also, remove the use of FakeReplicator from the ssync test suite. It
was not necessary and risked masking divergence between ssync and the
replicator and reconstructor daemon implementations.

Co-Author: Alistair Coles <alistair.coles@hpe.com>

Closes-Bug: #1488608
Change-Id: Ic7a4c932b59158d21a5fb4de9ed3ed57f249d068
2016-12-12 21:26:54 -08:00
Alistair Coles
2a75091c58 Make ECDiskFileReader check fragment metadata
This patch makes the ECDiskFileReader check the validity of EC
fragment metadata as it reads chunks from disk and quarantine a
diskfile with bad metadata. This in turn means that both the object
auditor and a proxy GET request will cause bad EC fragments to be
quarantined.
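The check leans on the EC backend to validate each fragment's embedded metadata; roughly (a simplified sketch, given a policy's pyeclib driver, not the reader's actual code):

    from pyeclib.ec_iface import ECDriverError

    def frag_is_valid(pyeclib_driver, frag):
        # Ask the EC backend to parse the fragment's embedded
        # metadata; a parse failure means the fragment on disk is
        # corrupt and the diskfile should be quarantined.
        try:
            pyeclib_driver.get_metadata(frag)
        except ECDriverError:
            return False
        return True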

This change is motivated by bug 1631144, which may result in corrupt
EC fragments being written to disk that nevertheless appear valid to
the object auditor's md5 hash and content-length checks.

Not implemented in this patch:

 * perform metadata check when a read starts on any frag_size
   boundary, not just at zero

Related-Bug: #1631144
Closes-Bug: #1633647

Change-Id: Ifa6a7f8aaca94c7d39f4aeb9d4fa3f59c4f6ee13
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
2016-11-01 13:11:02 -07:00
Jenkins
264e728364 Merge "Prevent ssync writing bad fragment data to diskfile" 2016-10-14 23:29:29 +00:00
Alistair Coles
3218f8b064 Prevent ssync writing bad fragment data to diskfile
Previously, if a reconstructor sync type job failed to provide
sufficient bytes from a reconstructed fragment body iterator to match
the content-length that the ssync sender had already sent to the ssync
receiver, the sender would still proceed to send the next
subrequest. The ssync receiver might then write the start of the next
subrequest to the partially complete diskfile for the previous
subrequest (including writing subrequest headers to that diskfile)
until it has received content-length bytes.

Since a reconstructor ssync job does not send an ETag header (it
cannot, because it does not know the ETag of a reconstructed fragment
until it has been sent), the receiving object server does not
detect the "bad" data written to the fragment diskfile and, worse,
will label it with an ETag that matches the md5 sum of the bad
data. The bad fragment file will therefore appear good to the auditor.

There is no easy way for the ssync sender to communicate a lack of
source data to the receiver other than by disconnecting the
session. So this patch adds a check in the ssync sender that the sent
byte count is equal to the sent Content-Length header value for each
subrequest, and disconnect if a mismatch is detected.
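In outline (names simplified, not the sender's actual code):

    def send_body(connection, body_iter, content_length):
        # Count what actually gets sent; a shortfall means the source
        # data ran dry, and the only safe response is to disconnect.
        bytes_sent = 0
        for chunk in body_iter:
            connection.send(chunk)
            bytes_sent += len(chunk)
        if bytes_sent != content_length:
            raise ValueError(
                'sent %d bytes but Content-Length was %d; aborting'
                % (bytes_sent, content_length))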

The disconnect prevents the receiver finalizing the bad diskfile, but
also prevents subsequent fragments in the ssync job from being sync'd
until the next cycle.

Closes-Bug: #1631144
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>

Change-Id: I54068906efdb9cd58fcdc6eae7c2163ea92afb9d
2016-10-13 17:15:10 +01:00
Alistair Coles
b13b49a27c EC - eliminate .durable files
Instead of using a separate .durable file to indicate
the durable status of a .data file, rename the .data
to include a durable marker in the filename. This saves
one inode for every EC fragment archive.

An EC policy PUT will, as before, first rename a temp
file to:

   <timestamp>#<frag_index>.data

but now, when the object is committed, that file will be
renamed:

   <timestamp>#<frag_index>#d.data

with the '#d' suffix marking the data file as durable.
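For illustration, the new names parse like this (a sketch, not Swift's actual parser; timestamp offset suffixes are ignored):

    import re

    # e.g. '1454619054.02968#2#d.data' -> ('1454619054.02968', 2, True)
    EC_DATA_RE = re.compile(r'^([\d.]+)#(\d+)(#d)?\.data$')

    def parse_ec_data_filename(filename):
        match = EC_DATA_RE.match(filename)
        if not match:
            raise ValueError('not an EC .data filename: %r' % filename)
        timestamp, frag_index, durable = match.groups()
        return timestamp, int(frag_index), durable is not None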

Diskfile suffix hashing returns the same result when the
new durable-data filename or the legacy durable file is
found in an object directory. A fragment archive that has
been created on an upgraded object server will therefore
appear to be in the same state, as far as the consistency
engine is concerned, as the same fragment archive created
on an older object server.

Since legacy .durable files will still exist in deployed
clusters, many of the unit tests scenarios have been
duplicated for both new durable-data filenames and legacy
durable files.

Change-Id: I6f1f62d47be0b0ac7919888c77480a636f11f607
2016-10-10 18:11:02 +01:00
Victor Stinner
5677a04c8f Python 3: Fix usage of dict methods
* Replace "c = dict(a.items() + b.items())" with
  "c = dict(a); c.update(b)". It works on Python 2 and Python 3, and
  it may be a little bit more efficient on Python 2 (no need to
  create a temporary list of items).
* Replace "dict.values() + dict.values()" with
  "list(dict.values()) + list(dict.values())": on Python 3,
  dict.values() is a view, which does not support the + operator.
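Concretely, under both interpreters:

    a = {'x': 1}
    b = {'y': 2}

    # Python 2 only: dict(a.items() + b.items()) -- items() returns
    # lists there. Portable on Python 2 and 3:
    c = dict(a)
    c.update(b)            # {'x': 1, 'y': 2}

    # Portable concatenation of values (values() is a view on Python 3):
    vals = list(a.values()) + list(b.values())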

Change-Id: Id5a65628fe2fb7a02c713b758fcaa81154db28a0
2016-07-25 15:14:06 +02:00
Alistair Coles
d456d9e934 Don't ssync data when only a durable is missing
If an EC diskfile is missing its .durable file (for example
due to a partial PUT failure) then the ssync missing check
will fail to open the file and will consider it
missing. This can result in possible reconstruction of the
fragment archive (for a sync job) and definite transmission
of the fragment archive (for sync and revert jobs), which is
wasteful.

This patch makes the ssync receiver inspect the diskfile
state after attempting to open it, and if fragments exist at
the timestamp of the sender's diskfile, but a .durable file
is missing, then the receiver will commit the diskfile at
the sender's timestamp. As a result, there is no longer any
need to send a fragment archive.

Change-Id: I4766864fcc0a3553976e8fd85bbb2fc782f04abd
2016-03-04 15:39:52 +00:00
Alistair Coles
e91de49d68 Update container on fast-POST
This patch makes a number of changes to enable content-type
metadata to be updated when using the fast-POST mode of
operation, as proposed in the associated spec [1].

* the object server and diskfile are modified to allow
  content-type to be updated by a POST and the updated value
  to be stored in .meta files.

* the object server accepts PUTs and DELETEs with older
  timestamps than existing .meta files. This is to be
  consistent with replication that will leave a later .meta
  file in place when replicating a .data file.

* the diskfile interface is modified to provide accessor
  methods for the content-type and its timestamp.

* the naming of .meta files is modified to encode two
  timestamps when the .meta file contains a content-type value
  that was set prior to the latest metadata update; this
  enables consistency to be achieved when rsync is used for
  replication.

* ssync is modified to sync meta files when content-type
  differs between local and remote copies of objects.

* the object server issues container updates when handling
  POST requests, notifying the container server of the current
  immutable metadata (etag, size, hash, swift_bytes),
  content-type with their respective timestamps, and the
  mutable metadata timestamp.

* the container server maintains the most recently reported
  values for immutable metadata, content-type and mutable
  metadata, each with their respective timestamps, in a single
  db row.

* new probe tests verify that replication achieves eventual
  consistency of containers and objects after discrete updates
  to content-type and mutable metadata, and that container-sync
  syncs objects after fast-post updates.

[1] spec change-id: I60688efc3df692d3a39557114dca8c5490f7837e
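The two-timestamp .meta naming builds on a utility that encodes multiple timestamps into one filename-safe string; for instance (values are illustrative):

    from swift.common.utils import Timestamp, encode_timestamps

    t_meta = Timestamp('1454619056.04876')   # time of the latest POST
    t_ctype = Timestamp('1454619055.00000')  # when content-type was set

    # One string encoding both timestamps (offsets appended to the
    # primary timestamp), suitable for use in a .meta filename.
    encoded = encode_timestamps(t_meta, t_ctype)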

Change-Id: Ia597cd460bb5fd40aa92e886e3e18a7542603d01
2016-03-03 14:25:10 +00:00
Alistair Coles
6858510b59 Re-organise ssync tests
We have some tests that exercise both the sender and receiver,
but are spread across test_ssync_sender.py and test_ssync_receiver.py.
This creates a new module test_ssync.py and moves the end-to-end tests
into it.

Change-Id: Iea3e9932734924453f7241432afda90abbc75c06
2015-11-05 14:50:28 +00:00