arguments are not locale decoded into Unicode

When openstackclient runs under Python2 and passes command line
arguments to a subcommand, it fails to pass them as text (i.e.
Unicode). Instead it passes them as binary data encoded using the
current locale's encoding.

An easy way to see this is trying to pass a username with a non-ASCII
character.

% openstack user delete ñew
No user with a name or ID of 'ñew' exists.

What occurs internally is that the user data, when retrieved, is
properly represented in a Unicode object. However, the username passed
from the command line is still a str object encoded in the locale's
encoding (typically UTF-8). A string comparison is attempted between
the encoded data from the command line and the Unicode text found in
the user representation. This seldom ends well: either the comparison
fails to match or a codec error is raised.
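
The mismatch can be reproduced outside the client. What follows is a
minimal sketch, not code from this patch. It assumes UTF-8 as the
locale encoding, the variable names are illustrative, and Python 3
bytes stands in for the Python 2 str that arrives in argv.

```python
# -*- coding: utf-8 -*-
# Illustrative only: bytes stands in for the Python 2 str found in argv.
stored_name = u"ñew"                     # text, as held in the user data
argv_name = stored_name.encode("utf-8")  # encoded bytes from the command line

# Encoded bytes and text are different things and do not compare equal.
names_match_raw = (argv_name == stored_name)

# Decoding with the encoding that produced the bytes restores the text.
names_match_decoded = (argv_name.decode("utf-8") == stored_name)
```

Only the decoded comparison matches, which is why the lookup of the
user in the example above comes back empty.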

There is a hard and fast rule: all text data must be stored in Unicode
objects, and the conversion from binary encoded text to Unicode must
occur as close to the I/O boundary as possible. Python3 enforces this
behavior automatically, but in Python2 it is the programmer's job to do
so.

In the past there have been attempts to fix problems deep inside
internal code by attempting to decode from UTF-8. There are two
problems with this approach. First, internal code has no way to
accurately know what encoding was used to encode the binary data. This
is why it needs to be decoded as close to the I/O source as possible,
because that is the best place to know the actual encoding. Guessing
UTF-8 is at best a heuristic. Second, there must be a canonical
representation for data "inside" the program. You don't want dozens of
individual modules, classes, methods, etc. performing conversions;
instead they should be able to assume the format in which text is
represented, and that format must be Unicode. This is another reason
to decode as close to the I/O boundary as possible.
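
To illustrate the first problem, here is a small sketch (not part of
this patch) of what can happen when internal code guesses UTF-8. It
assumes, purely for illustration, that the data was encoded under an
ISO-8859-1 locale.

```python
# -*- coding: utf-8 -*-
# Illustrative only: the same name encoded under two different locales.
name = u"ñew"
latin1_bytes = name.encode("iso-8859-1")
utf8_bytes = name.encode("utf-8")

# Internal code that guesses UTF-8 cannot decode the ISO-8859-1 bytes.
try:
    latin1_bytes.decode("utf-8")
    guess_failed = False
except UnicodeDecodeError:
    guess_failed = True

# Decoding with the encoding that was actually used recovers the text.
recovered = latin1_bytes.decode("iso-8859-1")
```

Only code near the I/O boundary is in a position to know which of the
two encodings applies.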

In Python3 the argv strings are decoded from the locale's encoding by
the interpreter. By the time any Python3 code sees the argv strings
they will be Unicode. However, in Python2 explicit code must be added
to decode the argv strings into Unicode.

The conversion of sys.argv into Unicode only occurs when argv is not
passed to OpenStackShell.run(). If a caller of OpenStackShell.run()
supplies their own argv it is their responsibility to ensure they are
passing actual text objects. Consider this a requirement of the API.
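
A caller that builds its own argv could meet this requirement with
something like the sketch below. The helper ensure_text_argv() is
hypothetical and not part of openstackclient, and the fallback to
UTF-8 when the locale reports no encoding is an assumption of this
sketch.

```python
import locale


def ensure_text_argv(argv, encoding=None):
    """Hypothetical helper: decode any byte strings in argv to text."""
    if encoding is None:
        # Assumption: fall back to UTF-8 if the locale reports nothing.
        encoding = locale.getpreferredencoding() or "utf-8"
    # On Python 2, bytes is str, so every str argument gets decoded.
    return [arg.decode(encoding) if isinstance(arg, bytes) else arg
            for arg in argv]
```

The result could then be handed to OpenStackShell().run() in place of
the raw list.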

Note: This patch does not contain a unit test to exercise the behavior
because it is difficult to construct a test that depends on command
invocation from a shell. The general structure of the unit tests is to
pass fake argv into OpenStackShell.run() as if it came from a
shell. Because the new code only operates when argv is not passed and
defaults to sys.argv, it conflicts with the unit test design.

Change-Id: I779d260744728eae8455ff9dedb6e5c09c165559
Closes-Bug: 1603494
Signed-off-by: John Dennis <jdennis@redhat.com>
John Dennis 2016-07-15 14:46:29 -04:00
parent 7a667d700f
commit 756d2fac67


@@ -18,7 +18,9 @@
 import argparse
 import getpass
+import locale
 import logging
+import six
 import sys
 import traceback
@@ -474,8 +476,17 @@ class OpenStackShell(app.App):
             tcmd.run(targs)
-def main(argv=sys.argv[1:]):
+def main(argv=None):
+    if argv is None:
+        argv = sys.argv[1:]
+        if six.PY2:
+            # Emulate Py3, decode argv into Unicode based on locale so that
+            # commands always see arguments as text instead of binary data
+            encoding = locale.getpreferredencoding()
+            if encoding:
+                argv = map(lambda arg: arg.decode(encoding), argv)
     return OpenStackShell().run(argv)
 if __name__ == "__main__":
-    sys.exit(main(sys.argv[1:]))
+    sys.exit(main())