This is done for the same reason that SLO was pulled out into middleware: DLO then
automatically picks up proxy-level enhancements such as automatic retry
of GETs on a broken connection and the multi-ring storage policy work.
The proxy will automatically insert the dlo middleware at an
appropriate place in the pipeline the same way it does with the
gatekeeper middleware. Clusters will still support DLOs after upgrade
even with an old config file that doesn't mention dlo at all.
Includes support for reading config values from the proxy server's
config section so that upgraded clusters continue to work as before.
Bonus fix: resolve 'after' vs. 'after_fn' in proxy's required filters
list. Having two was confusing, so I kept the more-general one.
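For illustration, an entry in the required_filters list using the retained
'after_fn' form might look roughly like this (the callback signature and the
placement rules shown are assumptions, not the exact code):

    # Hedged sketch: 'after_fn' receives the existing pipeline and returns the
    # filter names the auto-inserted filter must come after.
    required_filters = [
        {'name': 'catch_errors'},
        {'name': 'gatekeeper',
         'after_fn': lambda pipeline: ['catch_errors']},
        {'name': 'dlo',
         'after_fn': lambda pipeline: ['catch_errors', 'gatekeeper']},
    ]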
DocImpact
blueprint multi-ring-large-objects
Change-Id: Ib3b3830c246816dd549fc74be98b4bc651e7bace
From Pete Zaitcev's work on PBes (https://review.openstack.org/47713):
Fixes up a problem with double mocking of Request.__init__, which is
present in stock code as well.
Change-Id: Ibf8103425404b03aaab9a4e97a3a62d0fb8dbb14
* Introduce a new privileged account header: X-Account-Access-Control
* Introduce JSON-based version 2 ACL syntax -- see below for discussion
* Implement account ACL authorization in TempAuth
X-Account-Access-Control Header
-------------------------------
Accounts now have a new privileged header to represent ACLs or any other
form of account-level access control. The value of the header is an opaque
string to be interpreted by the auth system, but it must be a JSON-encoded
dictionary. A reference implementation is given in TempAuth, with the
knowledge that historically other auth systems often use TempAuth as a
starting point.
The reference implementation describes three levels of account access:
"admin", "read-write", and "read-only". Adding new access control
features in a future patch (e.g. "write-only" account access) will
automatically be forward- and backward-compatible, due to the JSON
dictionary header format.
The privileged X-Account-Access-Control header may only be read or written
by a user with "swift_owner" status, traditionally the account owner but
now also any user on the "admin" ACL.
Access Levels:
Read-only access is intended to indicate to the auth system that this
list of identities can read everything (except privileged headers) in
the account. Specifically, a user with read-only account access can get
a list of containers in the account, list the contents of any container,
retrieve any object, and see the (non-privileged) headers of the
account, any container, or any object.
Read-write access is intended to indicate to the auth system that this
list of identities can read or write (or create) any container. A user
with read-write account access can create new containers, set any
unprivileged container headers, overwrite objects, delete containers,
etc. A read-write user can NOT set account headers (or perform any
PUT/POST/DELETE requests on the account).
Admin access is intended to indicate to the auth system that this list of
identities has "swift_owner" privileges. A user with admin account access
can do anything the account owner can, including setting account headers
and any privileged headers -- and thus changing the value of
X-Account-Access-Control and thereby granting read-only, read-write, or
admin access to other users.
The auth system is responsible for making decisions based on this header,
if it chooses to support its use. Therefore the above access level
descriptions are necessarily advisory only for other auth systems.
When setting the value of the header, callers are urged to use the new
format_acl() method, described below.
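As an illustration only (the keys are defined by the auth system; core Swift
treats the value as opaque JSON), a header granting the three TempAuth access
levels might be built like this:

    import json

    # Illustrative value for the privileged header; the user names are made up.
    acl = {
        "admin": ["alice"],
        "read-write": ["bob"],
        "read-only": ["anna", "carol"],
    }
    headers = {"X-Account-Access-Control": json.dumps(acl)}
    # A swift_owner would send these headers on a POST to the account.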
New ACL Format
--------------
The account ACLs introduce a new format for ACLs, rather than reusing the
existing format from X-Container-Read/X-Container-Write. There are several
reasons for this:
* Container ACL format does not support Unicode
* Container ACLs have a different structure than account ACLs
+ account ACLs have no concept of referrers or rlistings
+ accounts have additional "admin" access level
+ account access levels are structured as admin > rw > ro, which seems more
appropriate for how people access accounts, rather than reusing
container ACLs' orthogonal read and write access
In addition, the container ACL syntax is a bit arbitrary and highly custom,
so instead of parsing additional custom syntax, I'd rather propose a next
version and introduce a means for migration. The V2 ACL syntax has the
following benefits:
* JSON is a well-known standard syntax with parsers in all languages
* no artificial value restrictions (you can grant access to a user named
".rlistings" if you want)
* forward and backward compatibility: you may have extraneous keys, but
your attempt to parse the header won't raise an exception
I've introduced hooks in parse_acl and format_acl which currently default
to the old V1 syntax but tolerate the V2 syntax and can easily be flipped
to default to V2. I'm not changing the default or adding code to rewrite
V1 ACLs to V2, because this patch has suffered a lot of scope creep already,
but this seems like a sensible milestone in the migration.
TempAuth Account ACL Implementation
-----------------------------------
As stated above, core Swift is responsible for privileging the
X-Account-Access-Control header (making it only accessible to swift_owners),
for translating it to -sysmeta-* headers to trigger persistence by the
account server, and for including the header in the responses to requests
by privileged users. Core Swift puts no expectation on the *content* of
this header. Auth systems (including TempAuth) are responsible for
defining the content of the header and taking action based on it.
In addition to the changes described above, this patch defines a format
to be used by TempAuth for these headers in the common.middleware.acl
module, in the methods format_v2_acl() and parse_v2_acl(). This patch
also teaches TempAuth to take action based on the header contents. TempAuth
now sets swift_owner=True if the user is on the Admin ACL, authorizes
GET/HEAD/OPTIONS requests if the user is on any ACL, authorizes
PUT/POST/DELETE requests if the user is on the admin or read-write ACL, etc.
Note that the action of setting swift_owner=True triggers core Swift to
add or strip the privileged headers from the responses. Core Swift (not
the auth system) is responsible for that.
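A rough sketch of that decision logic (an approximation, not TempAuth's
actual authorize code):

    # Hedged sketch of TempAuth's account-ACL checks; 'acl' is the parsed JSON
    # dict and 'user_id' identifies the requesting user. It ignores the
    # account- vs container-level distinction described above (read-write
    # users cannot modify the account itself).
    def authorize_account_request(req, acl, user_id):
        if user_id in acl.get('admin', []):
            req.environ['swift_owner'] = True   # core Swift then exposes privileged headers
            return True
        if req.method in ('GET', 'HEAD', 'OPTIONS'):
            return (user_id in acl.get('read-write', []) or
                    user_id in acl.get('read-only', []))
        if req.method in ('PUT', 'POST', 'DELETE'):
            return user_id in acl.get('read-write', [])
        return False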
DocImpact: Documentation for the new ACL usage and format appears in
summary form in doc/source/overview_auth.rst, and in more detail in
swift/common/middleware/tempauth.py in the TempAuth class docstring.
I leave it to the Swift doc team to determine whether more is needed.
Change-Id: I836a99eaaa6bb0e92dc03e1ca46a474522e6e826
To help debug problems with the proxy-server app setup for each class,
we add a debug logger so that on failure we can see what requests were
actually sent.
Additionally, we move the set_http_connect() call to the same relative
position as the one before it for clarity of intent.
Related bug 1271962
Change-Id: Idc301c06e114b11c358ee4fbc0b2b70ec7743091
The reason for this is that including the origin in the get_info calls causes an
infinite loop. The way that code was written, it relies on GETorHEAD_base to
populate the data; the only problem is that the HEAD call is wrapped by
cors_validation, which calls get_info, and round and round we go. In my opinion,
get_info should be refactored to not work this way (relying on another call to do
stuff behind the scenes so that your data magically shows up), because it seems
pretty prone to breaking. But I'll let somebody else do that :).
Fixes bug 1270039
Change-Id: Idad3cedd965e0d5fb068b062fe8fef301c87b75a
This way, with zero additional effort, SLO will support enhancements
to object storage and retrieval, such as:
* automatic resume of GETs on broken connection (today)
* storage policies (in the near future)
* erasure-coded object segments (in the far future)
This also lets SLOs work with other sorts of hypothetical third-party
middleware, for example object compression or encryption.
Getting COPY to work here is sort of a hack; the proxy's object
controller now checks for "swift.copy_response_hook" in the request's
environment and feeds the GET response (the source of the new object's
data) through it. This lets a COPY of a SLO manifest actually combine
the segments instead of merely copying the manifest document.
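Roughly, a middleware registers a callable in the WSGI environment and the
proxy's object controller passes the copy-source GET response through it; a
hedged sketch (the real interface may differ):

    # Hedged sketch (the real interface may differ): e.g. SLO would replace a
    # manifest body with the concatenated segment data before the proxy
    # re-uploads it as the destination object.
    def slo_copy_hook(source_resp):
        return expand_manifest_if_needed(source_resp)   # hypothetical helper

    env['swift.copy_response_hook'] = slo_copy_hook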
Updated ObjectController to expect a response's app_iter to be an
iterable, not just an iterator. (PEP 333 says "When called by the
server, the application object must return an iterable yielding zero
or more strings." ObjectController was just being too strict.) This
way, SLO can re-use the same response-generation logic for GET and
COPY requests.
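PEP 333 only requires an iterable, not an iterator; a minimal illustration of
the difference:

    # A generator is an iterator: it can be consumed only once.
    def body_iterator():
        yield 'chunk1'
        yield 'chunk2'

    # An iterable object can hand out a fresh iterator on each __iter__() call,
    # which lets the same response-generation logic serve both GET and COPY.
    class SegmentedBody(object):
        def __init__(self, segments):
            self.segments = segments

        def __iter__(self):
            return iter(self.segments)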
Added a (sort of hokey) mechanism to allow middlewares to close
incompletely-consumed app iterators without triggering a warning. SLO
does this when it realizes it's performed a ranged GET on a manifest;
it closes the iterable, removes the range, and retries the
request. Without this change, the proxy logs would get 'Client
disconnected on read' in them.
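In outline, the retry looks something like this (helper names here are
illustrative; close_if_possible() stands in for the new close-without-warning
mechanism):

    # Hedged sketch of SLO's ranged-GET retry; the manifest check is hypothetical.
    resp_iter = self._app_call(req.environ)
    if self._response_is_slo_manifest():
        close_if_possible(resp_iter)           # no 'Client disconnected on read' log
        req.environ.pop('HTTP_RANGE', None)    # drop the range and re-fetch the whole manifest
        resp_iter = self._app_call(req.environ)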
DocImpact
blueprint multi-ring-large-objects
Change-Id: Ic11662eb5c7176fbf422a6fc87a569928d6f85a1
The copy source must be of the form container/object.
This patch prevents the server from returning
an internal server error when the user provides
a path without a container.
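The validation amounts to something like this sketch (the exact status code
and message are the proxy's choice):

    from swift.common.swob import HTTPPreconditionFailed

    # The copy source must name both a container and an object.
    source = req.headers.get('X-Copy-From', '')
    parts = source.lstrip('/').split('/', 1)
    if len(parts) != 2 or not all(parts):
        # 412 is illustrative of "reject early instead of crashing later"
        return HTTPPreconditionFailed(
            request=req,
            body='X-Copy-From header must be of the form <container>/<object>')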
Fixes: bug #1255049
Change-Id: I1a85c98d9b3a78bad40b8ceba9088cf323042412
Middleware or core features may need to store metadata
against accounts or containers. This patch adds a
generic mechanism for system metadata to be persisted
in backend databases, without polluting the user
metadata namespace, by using the reserved header
namespace x-<server_type>-sysmeta-*.
The first modification is that backend servers persist
system metadata headers alongside user metadata and
other system state.
For accounts and containers, system metadata in PUT
and POST requests is treated in a similar way to user
metadata. System metadata is not yet supported for
object requests.
Secondly, changes in the proxy controllers ensure that
headers in the system metadata namespace will pass through
in requests to backend servers.
Thirdly, system metadata returned from backend servers
in GET or HEAD responses is added to the cached info
dict, which middleware can access.
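For example, a middleware might persist and later read back a piece of
container system metadata along these lines (the 'sysmeta' key in the cached
info dict is an assumption about its layout):

    from swift.proxy.controllers.base import get_container_info

    # Setting: send a header in the reserved namespace on a PUT/POST.
    req.headers['X-Container-Sysmeta-Example'] = 'some value'   # illustrative name

    # Reading: the persisted value comes back via the cached info dict.
    info = get_container_info(req.environ, self.app)
    value = info.get('sysmeta', {}).get('example')   # 'sysmeta' key is assumed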
Finally, a gatekeeper middleware module is provided
which filters all system metadata headers from requests
and responses by removing headers with names starting
x-account-sysmeta-, x-container-sysmeta-. The gatekeeper
also removes headers starting x-object-sysmeta- in
anticipation of future support for system metadata being
set for objects. This prevents clients from writing or
reading system metadata.
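A minimal sketch of that filtering (the real gatekeeper middleware handles
both request and response headers and more):

    import re

    # Strip any header in the reserved system-metadata namespace.
    SYSMETA_PREFIX_RE = re.compile(r'^x-(account|container|object)-sysmeta-', re.I)

    def strip_sysmeta(headers):
        return [(name, value) for name, value in headers
                if not SYSMETA_PREFIX_RE.match(name)]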
The required_filters list in swift/proxy/server.py is
modified to include the gatekeeper middleware so that
if the gatekeeper has not been configured in the
pipeline then it will be automatically inserted close
to the start of the pipeline.
blueprint cluster-federation
Change-Id: I80b8b14243cc59505f8c584920f8f527646b5f45
On GET, the proxy will go search the primary nodes plus some number of
handoffs for the account/container/object before giving up and
returning a 404. That number is, by default, twice the ring's replica
count. This was fine if your ring had an integral number of replicas,
but could lead to some slightly-odd behavior if you have fractional
replicas.
For example, imagine that you have 3.49 replicas in your object ring;
perhaps you're migrating a cluster from 3 replicas to 4, and you're
being smart and doing it a bit at a time.
On an object GET where all the primary nodes returned 404, the proxy would
then compute 2 * 3.49 = 6.98, round it up to 7, and go look at 7
handoff nodes. This is sort of weird; the intent was to look at 6
handoffs for objects with 3 replicas and 8 handoffs for objects with
4, but the effect is 7 for everybody.
You also get little latency cliffs as you scale up replica counts. If,
instead of 3.49, you had 3.51 replicas, then the proxy would look at 8
handoff nodes in every case [ceil(2 * 3.51) = 8], so there'd be a
small-but-noticeable jump in the time it takes to produce a 404.
The fix is to compute the number of handoffs based on the number of
primary nodes for the partition, not the ring's replica count. This
gets rid of the little latency cliffs and makes the behavior more like
what you get with integral replica counts.
If your ring has an integral number of replicas, there's no behavior
change here.
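Concretely, the computation changes along these lines (a sketch; only
get_part_nodes() is assumed from the ring API):

    # Before: number of handoffs to try = ceil(2 * ring.replica_count)
    #   e.g. 3.49 replicas -> ceil(6.98) = 7 handoffs for every object
    # After: base it on the partition's actual primary node count
    primary_nodes = ring.get_part_nodes(partition)
    handoffs_to_try = 2 * len(primary_nodes)   # 6 for 3-replica parts, 8 for 4-replica parts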
Change-Id: I50538941e571135299fd6b86ecd9dc780cf649f5
The Last-Modified header in the Response didn't have a suitable
value: it carried only the integer part of the object's timestamp.
As a result, an If-[Un]Modified-Since header populated with that
Last-Modified value is always earlier than the object's timestamp,
so the content always appears newer than the value of these
conditional headers.
The patched code returns math.ceil() of the object's timestamp
in the Last-Modified header, so a subsequent conditional request
works correctly.
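A sketch of the fix (the exact formatting call in swob may differ):

    import math
    import time

    # Round the timestamp up so Last-Modified is never earlier than the
    # object's actual timestamp once truncated to whole seconds.
    ts = float(metadata['X-Timestamp'])
    last_modified = time.strftime('%a, %d %b %Y %H:%M:%S GMT',
                                  time.gmtime(math.ceil(ts)))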
Closes-Bug: #1248818
Change-Id: I1ece7d008551bf989da74d23f0ed6307c45c5436
If you create a container with a non-ASCII name, and then make another
container with X-Versions-Location: first-cøntåîner, *and* you're
serializing stuff in memcache as json (the default), when the proxy
tries to make a versioned object, it will crash.
The fix is to make sure that get_container_info() always returns strs,
not unicodes.
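In other words, anything coming back from the JSON decoder gets encoded before
being handed to callers; roughly (Python 2 era code, like the rest of Swift at
the time):

    # json gives back unicode; callers (and memcache round-trips) want str.
    def to_str(value):
        if isinstance(value, unicode):
            return value.encode('utf-8')
        if isinstance(value, dict):
            return dict((to_str(k), to_str(v)) for k, v in value.items())
        if isinstance(value, list):
            return [to_str(v) for v in value]
        return value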
The long-term fix would be to get rid of simplejson entirely, as its
decoder can't make up its mind whether JSON strings should be Python
strs or unicodes, and that makes it really really easy to write bugs
like this.
Change-Id: Ib20ea5fb884484a4246d7a21a9f1e2ffd82eb04f
Remove the useless "start index" argument (= 0) from these files, since its
default value is already 0, to make the code cleaner.
Fixes bug #1259750
Change-Id: I52afac28a3248895bb1c012a5934d39e7c2cc5a9
The proxy server was calling swob.Request.path_info_pop() prior to
instantiating a controller so that req.path_info was just /a/c/o (sans
/v1). The version got moved over into SCRIPT_NAME.
This led to some unfortunate behavior when trying to re-use a request
from middleware. Something like this:
    # Imagine we're a WSGIContext object here.
    #
    # To start, SCRIPT_NAME = '' and PATH_INFO = '/v1/a/c/o'
    resp_iter = self._app_call(env, start_response)
    # Now SCRIPT_NAME = '/v1' and PATH_INFO = '/a/c/o'
    if something_special in self._response_headers:
        env['REQUEST_METHOD'] = 'GET'
        env.pop('HTTP_RANGE', None)
        # 404 SURPRISE! The proxy calls path_info_pop() again,
        # and now SCRIPT_NAME = '/v1/a' and PATH_INFO = '/c/o', so this
        # gets treated as a container request. Yikes.
        resp_iter = self._app_call(env, start_response)
Now we just leave SCRIPT_NAME and PATH_INFO alone. To make life easy
for everyone who does want just /a/c/o, I defined
swob.Request.swift_entity_path, which just strips off the /v1.
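So for a typical Swift API request the two now differ only by the version
prefix; illustratively:

    # With SCRIPT_NAME and PATH_INFO left untouched:
    #   req.path              -> '/v1/AUTH_test/c/o'
    #   req.swift_entity_path -> '/AUTH_test/c/o'   (the /v1 stripped off)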
Note that there's still one call to path_info_pop() in tempauth, but
that's only for requests going to /auth, so it won't affect Swift API
requests. It might be a good idea to remove that one later, but let's
do one thing at a time.
Change-Id: I87557a11c01f3f3889b610578cda6ba7d3933e7a
The added sleep makes this test pass on my saio. I have not heard of
it failing for anyone else, but I figured I'd post this up just in
case someone does have the same problem and this fixes it for them.
Change-Id: Ia0bb09d36d0b531ade7c6a6034bbe31dd6c90a98
Swift can now optionally be configured to allow requests to '/info',
providing information about the swift cluster. Additionally,
HMAC-signed requests to
'/info?swiftinfo_sig=<sign>&swiftinfo_expires=<expires>' can be
configured, allowing privileged access to more sensitive information
not meant to be public.
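For illustration, a client might produce such a signed query string like this,
assuming the signature is an HMAC-SHA1 over the method, expiry, and path in
the same spirit as tempurl (the exact message layout is an assumption):

    import hmac
    import time
    from hashlib import sha1

    key = 'cluster-admin-secret'              # illustrative; configured server-side
    expires = int(time.time() + 60)
    message = 'GET\n%d\n/info' % expires      # assumed message layout
    sig = hmac.new(key, message, sha1).hexdigest()
    url = '/info?swiftinfo_sig=%s&swiftinfo_expires=%d' % (sig, expires)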
DocImpact
Change-Id: I2379360fbfe3d9e9e8b25f1dc34517d199574495
Implements: blueprint capabilities
Closes-Bug: #1245694
If a source times out on read, try another one of the sources with a
modified range. A lot of code had to be moved around to get this
working, but it should all make sense.
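The core idea is a sketch like the following (not the proxy's actual code
paths; the helper is hypothetical):

    from swift.common.exceptions import ChunkReadTimeout

    try:
        chunk = source.read(self.app.object_chunk_size)
    except ChunkReadTimeout:
        # Ask a different node for only the bytes not yet sent to the client.
        req.headers['Range'] = 'bytes=%d-' % bytes_sent_so_far
        source = self._find_another_source(req)   # hypothetical helper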
Change-Id: Ieaf045690a8823927a6f38098a95b37a4d4adb70
Allow the proxy to respond to many types of requests as soon as it has a
quorum. This can help speed up responses (without changing the results),
especially when one node is acting up.
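For replicated data a quorum is just a majority of the storage nodes;
illustratively:

    # With replicated data, a majority of backends is enough to answer the
    # client without waiting on the slowest node.
    def quorum_size(replica_count):
        return replica_count // 2 + 1

    quorum_size(3)   # -> 2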
I had to fix a few unit tests that no longer match the backend http requests
made by our proxy.
Change-Id: Ieb070dc3019e217e717b96154a7a809409bf40a5
By default, Python 2.*'s standard library "socket" module performs 8K
writes. For 10ge networks, with large MTUs (typically 9,000), this is
not optimal. We tie the default buffer size to the client_chunk_size
parameter for the proxy server, and to the network_chunk_size for the
object server.
One might be tempted to ask, isn't there a way to set this value on a
per-request basis? This author was unable to find a reference to the
_fileobject in the context of WSGI. By the time a request passes to a
WSGI object's __call__ method, the "wfile" attribute of the
req.environ['eventlet.input'] (Input) object has been set to None, and
the "rfile" attribute is the object wrapping the socket for reading,
not writing.
One might also be tempted to ask, why not just override the
wsgi.HttpProtocol's "wbufsize" class attribute instead? Until
eventlet/wsgi.py is fixed, we can't set wsgi.HttpProtocol.wbufsize to
anything but zero (the default, see Python's SocketServer.py,
StreamRequestHandler class), since Eventlet does not ensure the socket
_fileobject's flush() method is called after Eventlet invokes a
write() method on the same. NOTE: wbufsize (a class attribute of
StreamRequestHandler originally, not to be confused with the standard
library's socket._fileobject._wbufsize class attribute) is used for
the bufsize parameter of the connection object's makefile() method. As
a result, the socket's _fileobject code uses that value to set both
_rbufsize and _wbufsize. While that would allow us to transmit in 64KB
chunks, it also means that write() and writelines() method calls on the
socket _fileobject are only transmitted once 64KB have been
accumulated, or a flush() is called.
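For illustration, one way to get that effect in Python 2 is to raise the
buffer size used when sockets are wrapped in file objects; this is a sketch of
the idea, not necessarily the exact mechanism this patch uses:

    import socket

    # The stdlib default is 8 KB writes; tying the file-object buffer to the
    # server chunk size (64 KB here) lets writes go out in larger pieces on
    # large-MTU networks.
    socket._fileobject.default_bufsize = 65536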
As for performance improvement:
Run        8KB      64KB
  0      8.101     6.367
  1      7.892     6.216
  2      7.732     6.246
  3      7.594     6.229
  4      7.594     6.292
  5      7.555     6.230
  6      7.575     6.270
  7      7.528     6.278
  8      7.547     6.304
  9      7.550     6.313
Average  7.667     6.275   (difference 1.3923, an 18.16% improvement)
Run using the following after adjusting the test value for obj_len to
1 GB:
nosetests -v --nocapture --nologcapture \
test/unit/proxy/test_server.py:TestProxyObjectPerformance.test_GET_debug_large_file
Change-Id: I4dd93acc3376e9960fbdcdcae00c6d002e545894
Signed-off-by: Peter Portante <peter.portante@redhat.com>
Refactor on-disk knowledge out of the object server by pushing the
async update pickle creation to the new DiskFileManager class (name is
not the best, so suggestions welcome), along with the REPLICATOR
method logic. We also move the mount checking and thread pool storage
to the new ondisk.Devices object, which then also becomes the new home
of the audit_location_generator method.
For the object server, a new setup() method is now called at the end
of the controller's construction, and the _diskfile() method has been
renamed to get_diskfile(), to allow implementation specific behavior.
We then hide the need for the REST API layer to know how and where
quarantining needs to be performed. There are now two places where it is
checked internally: on open(), where we verify the content-length,
name, and x-timestamp metadata; and in the reader on close(), where the
etag metadata is checked if the entire file was read.
We add a reader class to allow implementations to isolate the WSGI
handling code for that specific environment (it is used nowhere else
in the REST APIs). This simplifies the caller's code to a single
"with" statement once the file is open, avoiding multiple points where
close needs to be called.
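Under the new layering, a caller's read path looks roughly like this (method
names follow the comparison table below; exact signatures aren't guaranteed):

    # Hedged sketch of the intended usage pattern for the new API.
    df = diskfile_manager.get_diskfile(device, partition, account, container, obj)
    with df.open():                     # verifies name, content-length and x-timestamp
        metadata = df.read_metadata()
        reader = df.reader()            # WSGI-style iterable; checks the etag on close()
        for chunk in reader:
            send_to_client(chunk)       # hypothetical consumer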
For a full historical comparison, including the usage patterns, see:
https://gist.github.com/portante/5488238
 (as of master, 2b639f5, Merge
 "Fix 500 from account-quota        This Commit
 middleware")
--------------------------------+------------------------------------
                                | DiskFileManager(conf)
                                |   Methods:
                                |     .pickle_async_update()
                                |     .get_diskfile()
                                |     .get_hashes()
                                |   Attributes:
                                |     .devices
                                |     .logger
                                |     .disk_chunk_size
                                |     .keep_cache_size
                                |     .bytes_per_sync
                                |
DiskFile(a,c,o,keep_data_fp=)   | DiskFile(a,c,o)
  Methods:                      |   Methods:
   *.__iter__()                 |
    .close(verify_file=)        |
    .is_deleted()               |
    .is_expired()               |
    .quarantine()               |
    .get_data_file_size()       |
                                |     .open()
                                |     .read_metadata()
    .create()                   |     .create()
                                |     .write_metadata()
    .delete()                   |     .delete()
  Attributes:                   |   Attributes:
    .quarantined_dir            |
    .keep_cache                 |
    .metadata                   |
                                | *DiskFileReader()
                                |   Methods:
                                |     .__iter__()
                                |     .close()
                                |   Attributes:
                                |    +.was_quarantined
                                |
DiskWriter()                    | DiskFileWriter()
  Methods:                      |   Methods:
    .write()                    |     .write()
    .put()                      |     .put()

* Note that the old DiskFile class implements all the methods necessary
  for a WSGI app iterator; in this commit, the DiskReader() object
  returned by the DiskFileOpened.reader() method implements all the
  methods necessary for a WSGI app iterator.
+ Note that if the auditor is refactored to not use the DiskFile class
  (see https://review.openstack.org/44787), then we don't need the
  was_quarantined attribute.
A reference "in-memory" object server implementation of a backend
DiskFile class is provided in swift/obj/mem_server.py and
swift/obj/mem_diskfile.py.
One can also reference
https://github.com/portante/gluster-swift/commits/diskfile for the
proposed integration with the gluster-swift code based on these
changes.
Change-Id: I44e153fdb405a5743e9c05349008f94136764916
Signed-off-by: Peter Portante <peter.portante@redhat.com>
This reverts commit 7760f41c3ce436cb23b4b8425db3749a3da33d32
Change-Id: I95e57a2563784a8cd5e995cc826afeac0eadbe62
Signed-off-by: Peter Portante <peter.portante@redhat.com>
If a client were in the middle of an object GET request and then
disconnected, the proxy would wait a while (default 60s) and then time
out the connection. As part of the teardown for this, the proxy would
attempt to close the connection to the object server, then drain any
associated buffers. However, this didn't work particularly well,
resulting in the proxy reading the entire remainder of the object for
no gain.
Now, the proxy closes the connection hard, by calling .close() on the
underlying socket._socket object. This is different from calling
.close() on a socket._socketobject object, which is what you get back
from socket.socket() and similar methods. Calling .close() on a
socket._socketobject simply decrements a reference counter on the
socket._socket, which has been observed in the past to result in
socket leaks when something holds onto a reference. However, calling
.close() on a socket._socket actually closes the socket regardless of
who else has a reference to it.
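Concretely, the difference looks like this (CPython 2.x socket internals):

    import socket

    s = socket.socket()   # a socket._socketobject wrapper around a socket._socket
    s.close()             # only drops the wrapper's reference to the real socket
    s._sock.close()       # closes the underlying socket._socket immediately,
                          # no matter who else still holds a reference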
I had to delete a test assertion that said the object server never got
SIGPIPE after a GET w/X-Newest. Well, you get a SIGPIPE when you write
to a closed socket, and now the proxy is actually closing the sockets
early, so now you *do* get a SIGPIPE.
closes-bug: 1174660
Note that this will cause a regression on bug 1037337; unfortunately,
the cure is worse than the disease, so out it goes.
Change-Id: I9c7a2e7fdb8b4232e53ea96f86b50e8d34c27221
In 1.8.0 (Grizzly), your proxy logs would indicate which middleware
was responsible for an internal request, e.g. TU for tempurl or BD for
bulk delete. At some point, those all turned into GET_INFO, which does
not give you any idea which specific middleware was responsible, only
that it came from a get_account_info/get_container_info call.
This commit puts it back to how it was in 1.8.0. Also, the
new-since-1.8.0 function get_object_info() got swift_source plumbing
added to it, so source tracking for the quota middlewares'
get_object_info() calls will happen now too.
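So middleware info lookups are expected to pass their source tag along these
lines ('TU' being tempurl's tag, as above; the import path reflects where
these helpers live in the proxy):

    from swift.proxy.controllers.base import get_object_info

    # 'TU' is the tempurl tag that shows up in the proxy access log.
    info = get_object_info(req.environ, self.app, swift_source='TU')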
Note that due to the new-since-1.8.0 in-environment caching of
account/container info, you may not see as many lines in the proxy log
as you would with 1.8.0. This is because there are actually fewer
internal requests being made.
Change-Id: I2b2ff7823c612dc7ed7f268da979c4500bbbe911
Fixes object versioning when object name and/or version
container name contain multibyte utf-8 characters.
When object names containing non-ASCII characters
are PUT multiple times into a container with an
x-versions-location set, subsequent DELETE of the
object results in a 500 response status.
When the versions container name contains
non-ASCII characters, the first delete of an object
succeeds but fails to restore the previous version of the
object, so a second delete incorrectly returns 404.
Fixes bug 1229142
Change-Id: I425440f76b8328f8e119d390bfa4c7022181e89e