
01/09/09

Permalink 12:22:51 pm, by fumanchu, 369 words, English (US)
Categories: Python

assertNoDiff

I recently had to test output that consisted of a long list of dicts against an expected set. After too many long debugging sessions with copious print statements and lots of hand-comparison, I finally got smart and switched to using Python's builtin difflib to give me just the parts I was interested in (the wrong parts).

With difflib and a little pprint magic, a failing test now looks like this:

Traceback (most recent call last):
  File "C:\Python25\lib\site-packages\app\test\util.py", line 237, in tearDown
    self.assertNoDiff(a, b, "Expected", "Received")
  File "C:\Python25\lib\site-packages\app\test\util.py", line 382, in failIfDiff
    raise self.failureException, msg
AssertionError:
--- Expected

+++ Received

@@ -13,4 +13,3 @@

 {'call': 'getuser101',
 'output': {'first_name': 'Georg',
            'gender': u'Male',
            'last_name': 'Handel',
            ...}}
 {'call': 'getuser1',
 'output': None}
 {'call': 'getuser101',
 'output': {'first_name': 'Georg',
            'gender': u'Male',
            'last_name': 'Handel',
            ...}}
-{'call': 'getuser101',
 'output': {'first_name': 'Georg',
            'gender': u'Male',
            'last_name': 'Handel',
            ...}}

...and I can now easily see that the "Received" data is missing the last dict in the "Expected" list. Here's the code (not exactly what I committed at work, but I think this is even better):

import difflib
from pprint import pformat


class DiffTestCaseMixin(object):

    def get_diff_msg(self, first, second,
                     fromfile='First', tofile='Second'):
        """Return a unified diff between first and second."""
        # Force inputs to lists of lines for diffing.
        # Use pformat instead of str or repr to output dicts and such
        # in a stable order for comparison.
        if isinstance(first, (tuple, list)):
            first = [pformat(d) for d in first]
        else:
            # pformat a dict (or any other single object) whole;
            # iterating a dict would yield only its keys.
            first = pformat(first).splitlines()

        if isinstance(second, (tuple, list)):
            second = [pformat(d) for d in second]
        else:
            second = pformat(second).splitlines()

        diff = difflib.unified_diff(
            first, second, fromfile=fromfile, tofile=tofile)
        # Add line endings.
        return ''.join([d + '\n' for d in diff])

    def failIfDiff(self, first, second, fromfile='First', tofile='Second'):
        """If not first == second, fail with a unified diff."""
        if not first == second:
            msg = self.get_diff_msg(first, second, fromfile, tofile)
            raise self.failureException, msg

    assertNoDiff = failIfDiff

The get_diff_msg function is broken out to allow a test method to call self.fail(msg), where 'msg' might be the join'ed output of several diffs.
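Here's a minimal sketch of the mixin in use (the test class and get_user_calls are hypothetical, just to show the shape):

import unittest

class UserCallTests(DiffTestCaseMixin, unittest.TestCase):

    def test_user_calls(self):
        expected = [{'call': 'getuser1', 'output': None},
                    {'call': 'getuser101',
                     'output': {'first_name': 'Georg',
                                'last_name': 'Handel'}}]
        received = get_user_calls()  # hypothetical code under test
        # On mismatch, this fails with a unified diff of the two lists.
        self.assertNoDiff(expected, received, 'Expected', 'Received')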

Happy testing!

01/05/09

Permalink 02:09:53 am, by fumanchu, 313 words, English (US)
Categories: General

99 Bottles...

1) Copy this list into your blog, with instructions.
2) Bold all the drinks you’ve imbibed.
3) Cross out any items that you won’t touch.
4) Post a comment here and link to your results.

OR

If you don’t have a blog, just count the ones you’ve tried and post the number in the comments section.

  1. Manhattan Cocktail
  2. Kopi Luwak (Weasel Coffee)
  3. French / Swiss Absinthe
  4. Rootbeer
  5. Gin Martini
  6. Sauternes
  7. Whole Milk
  8. Tequila (100% Agave)
  9. XO Cognac
  10. Espresso
  11. Spring Water (directly from the spring) Soda water, too!
  12. Gin & Tonic
  13. Mead
  14. Westvleteren 12 (Yellow Cap) Trappist Ale
  15. Chateau d’Yquem
  16. Budweiser
  17. Maraschino Liqueur
  18. Mojito
  19. Orgeat
  20. Grand Marnier
  21. Mai Tai (original)
  22. Ice Wine (Canadian)
  23. Red Bull
  24. Fresh Squeezed Orange Juice Once a week!
  25. Bubble Tea
  26. Tokaji
  27. Chicory
  28. Islay Scotch
  29. Pusser’s Navy Rum shudder
  30. Fernet Branca
  31. Fresh Pressed Apple Cider
  32. Bourbon
  33. Australian Shiraz
  34. Buckley’s Cough Syrup
  35. Orange Bitters
  36. Margarita (classic recipe)
  37. Molasses & Milk
  38. Chimay Blue
  39. Wine of Pines (Tepache)
  40. Green Tea
  41. Daiginjo Sake Of course
  42. From the department of redundancy department: Chai Tea
  43. Vodka (chilled, straight)
  44. Coca-Cola
  45. Zombie (Beachcomber recipe)
  46. Barley Wine
  47. Brewed Chocolate (Xocolatl) Does Abuelita count?
  48. Pisco Sour
  49. Lemonade
  50. Speyside Single Malt
  51. Jamaican Blue Mountain Coffee
  52. Champagne (Vintage)
  53. Rosé (French)
  54. Bellini
  55. Caipirinha
  56. White Zinfandel (Blush)
  57. Coconut Water
  58. Cerveza Corona es lo mejor...
  59. Cafe au Lait
  60. Ice Tea
  61. Pedro Ximenez Sherry
  62. Vintage Port
  63. Hot Chocolate
  64. German Riesling
  65. Pina Colada
  66. El Dorado 15 Year Rum
  67. Chartreuse
  68. Greek Wine
  69. Negroni
  70. Jägermeister
  71. Chicha
  72. Guinness
  73. Rhum Agricole
  74. Palm Wine
  75. Soju
  76. Ceylon Tea (High Grown)
  77. Belgian Lambic
  78. Mongolian Airag
  79. Doogh, Lassi or Ayran
  80. Sugarcane Juice
  81. Ramos Gin Fizz
  82. Singapore Sling
  83. Mint Julep
  84. Old Fashioned
  85. Perique
  86. Jenever (Holland Gin)
  87. Chocolate Milkshake
  88. Traditional Italian Barolo
  89. Pulque
  90. Natural Sparkling Water
  91. Cuban Rum
  92. Asti Spumante
  93. Irish Whiskey
  94. Château Margaux
  95. Two Buck Chuck
  96. Screech
  97. Akvavit
  98. Rye Whisky
  99. German Weissbier
  100. Daiquiri (classic)
    and I'll add:
  101. Rompope

10/26/08

Permalink 08:48:51 pm, by fumanchu, 102 words, English (US)
Categories: IT, Python

The Timeless Way of Software

I just finished Chris Alexander's The Timeless Way of Building and I only have one question with regard to software development: why do we laud the patterns and ignore the call to context? In other words: modularity is the enemy of usable software. It also happens to be the enemy of efficient and readable software. If I see one more networking package or ORM with One Abstraction To Rule Them All, I am going to scream. You and I are really good at abstractions. We are freaks. Most people have a hard time with them. Try not to proliferate them unnecessarily.

09/29/08

Permalink 11:25:31 pm, by fumanchu, 1658 words, English (US)
Categories: IT

The Direct Attribute Configuration Pattern


Most programs, especially libraries and frameworks, need "configuration". But exactly how to implement that is a murky subject, mostly because the boundary between "configuration" and "code" is itself ill-defined.

Context

So let's try to define it. The first thing you might notice is that the dictionary definition of "configuration", an arrangement of parts, is quite different from what you typically find in a modern "configuration file". For example, take a typical Apache httpd.conf file. It does contain several directives which identify components: LoadModule, for example. But far outweighing these are directives which set attributes, usually on an object or on the system as a whole. Directives like "Listen 80", "ThreadsPerChild 250", and "LogLevel debug", even though they could be implemented via arrangements of pieces, probably aren't. Instead, the values are most likely implemented as permanent cell variables which never appear, or move, or disappear, but instead only change in value. Even the LoadModule directive doesn't really arrange any pieces within a space; it merely identifies and includes them in an abstract set of "loaded modules". One might argue that the Location context directive deals with arrangements of URL's, but those aren't really arranged; they simply exist. You can't rearrange /path/to/resource to be above /path. No, the dictionary definition of "configuration" as "arrangement" is a holdover from our mostly-hardware past, where even the most dynamic "configuration system" still required moving cards and jumpers around in physical space.

There are some notable exceptions, of course, but the vast majority of software on the market today that is "configurable" consists of a fairly static set of objects, plus a formalized means of tweaking a subset of attributes of those objects. The most common exception to this, the "plugin", is also rarely arranged with respect to other plugins or components; instead, it is merely "turned on" or included. I believe this tendency is due to a natural human limitation: we just don't reason about graphs and networks very well yet, at least not nearly as well as we reason about vectors (of instructions) and sets. We feel good when working on serial problems, and bad when working on parallel ones. As Chris Alexander said:

There is little purpose, then, in saying: It would be better if this force did not exist. For if it does exist anyway, designs based on such wishful thinking will fail.

Conventional approaches and their problems

So then, let's discuss ways to implement this kind of "configuration". Again, let's look at Apache's httpd.conf: here we find almost a DSL, in that http_config.h defines functions to tokenize and parse a config file into another representation, a config vector. Then that intermediate structure is transformed into the actual used values like, say, request_rec->server->keep_alive_timeout.

Or take a typical postgresql.conf file. The entries therein are translated (via the ConfigureNamesBool array) to their internal variable names, and set globally. For example, check_function_bodies is implemented as an extern in guc.h. When a block of code needs to switch on the value of check_function_bodies, it #includes that header and reads the global value directly.

These designs carry with them several problems:

  1. The set of configurable attributes is fixed when the program is compiled.
  2. The set of configurable attributes must essentially be declared twice; once with an internal name, and again with an external name.
  3. Often, these names are different (quite often only by CamelCase for one and names_with_underscores for the other!), increasing the cognitive load for anyone dealing in both.
  4. Just as often, the types of the internal and external representations are different. The config file, for example, may allow specifying an attribute as "On" or "Off", but these are translated to the internal values 1 and 0. This mapping also increases cognitive load (I would argue by more than double).
  5. The namespace of attributes is flat, often only a single level. This can make searching for the correct directive name more difficult than it would be in a hierarchical namespace, which also promotes browsing of related attributes.
  6. If an intermediate structure is used as a scaffold, then "configuration variables" are declared together in one place, but read independently throughout the code base. In order to know what parts of a code block are configurable, the developer must search through the "config module" scaffolding, which is isolated from the code in question, and match up external names to internal effects.
  7. In some implementations, the intermediate structure is not a scaffold, but is the final repository of "config values"; the values are never copied onto their "real" referents. This makes it easier to know which parts of a code block are configurable--just grep for calls to config.get! But often, the config.get call is much more expensive than reading local copies; when that happens, performance can drop sharply with lots of config reads (often multiple reads of the same value).
  8. Memory usage is at least double for each configurable value since each one has an intermediate representation whether overridden or not. Quite often, the intermediate structures are retained long after they could have been freed.
  9. Conventional config file parsers often are slower and less strict than the parsers for the general-purpose languages they hide. Config reads are slow and errors are delayed.
  10. Config layers can be a lot of code. In Apache's server package, for example, config.c has more lines of code than any other C module except core.c.

A solution

There is a way to implement "configuration" as we have defined it above (setting values on named attributes) which avoids the above problems. Rather than defining a layer where external names, types, and values get translated to internal names, types, and values in an ad-hoc mapping, we can define a better translation step by obeying 3 very simple constraints:

  1. External names are exactly the same as internal names,
  2. External types are exactly the same as internal types, and
  3. External values are exactly the same as internal values.

For example, if you have an internal "database" object with a "default_encoding" string attribute, the conventional approach might yield a config file entry like:

DatabaseDefaultEncoding: utf8

But if we follow the above constraints, we instead see config entries like this:

database.default_encoding = 'utf8'

We can generalize that to:

(path.to.object).key = value

...and in fact, we can write a simple parser which performs just that mapping. In the simplest implementation, only the set of objects is defined, and the set of keys is open-ended (that is, any attribute of the given object(s) is overridable):

for key, value in config.pairs():
    objname, attrname = key.rsplit(".", 1)
    obj = configurables[objname]
    setattr(obj, attrname, value)
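To make that concrete, here's a minimal runnable sketch (the Database class, the configurables registry, and the use of Python literal syntax for values are assumptions of this example, not any particular implementation):

import ast

class Database(object):
    default_encoding = 'ascii'

database = Database()

# The only closed set: which objects are exposed to config.
# The attributes of each object remain open-ended.
configurables = {'database': database}

def apply_config(text):
    """Set (path.to.object).key = value pairs directly as attributes."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition('=')
        objname, _, attrname = key.strip().rpartition('.')
        # Same names, same types, same values as the host language.
        setattr(configurables[objname], attrname,
                ast.literal_eval(value.strip()))

apply_config("database.default_encoding = 'utf8'")
assert database.default_encoding == 'utf8'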

In contrast to the conventional approach, in the "Direct Attribute Configuration" pattern:

  1. The set of configurable attributes automatically changes as the program changes. Code which reads attributes may evolve independently of the config which writes them.
  2. The set of configurable attributes must only be declared once. If objects are declared instead of individual attributes, that shrinks to "less than once".
  3. The internal names are exactly the same as the external names; no translation required.
  4. The internal types are exactly the same as the external types; less guessing which of (On, 1, -1, True, true, t, y, Yes, etc, etc) is allowed.
  5. The namespace of attributes is hierarchical, promoting understanding via conceptual chunking.
  6. In order to know what parts of a code block are configurable, the developer only needs to remember which few objects are configurable—all the attributes follow.
  7. With no (or little) intermediate config structures, reads are fast.
  8. With no (or little) intermediate config structures, memory consumption is reduced.
  9. Re-using the grammar and parsers of the host language can reduce parsing time and raise syntax errors earlier and more easily.
  10. There's less code.

Other considerations

  • In an implementation of the Direct Attribute Configuration pattern, there is no longer any slippage between internal and external names. Sometimes that translation layer is useful to ease implementation improvements and deprecations. On the other hand, modern dynamic languages make that sort of renaming/refactoring less painful within the code itself: attributes may become properties, functions may morph into callables, and "get attribute" hooks can rewrite most any get/set as needed.
  • Some will also say, "if config is now the same as code, why bother separating them?" The answer lies in another constraint we typically find in configuration files: key-value pairs. All declarations must follow this syntax, wherein a name is mapped to (the result of) an expression. Imperative statements are not allowed; if some are needed, the developer must wrap them in a function and expose it to config. This is the real semantic boundary between config and code.
  • Some will further say, "operators aren't programmers", that the vagaries of syntax for various types (number, string, date, list, etc.) are too much for busy admins and users. I would counter by pointing at any existing config implementation: they all already have varying syntax for the various types, each unique to the whim of the config language designer; one uses commas to separate list items, another uses spaces; one uses ISO dates only, another allows 23 different date formats.
  • If names, types and values are the same for config as for code, then the same authoring tools can often be used for them both. The same tooltips you love when writing code can pop up when writing config.
  • It would be nice to extend the dotted-name format to command-line options as well as config file entries; however, some common option parsers (and even some shells) don't allow dots in option names.
  • Since the set of configurable attributes is open-ended, it's harder to write a "Configuration entries for program X" document for a DAC implementation than a conventional one.
  • The DAC pattern still doesn't address real arrangement-oriented configuration, especially acyclic graph construction.

That's enough for now; feel free to expand in the comments.

09/08/08

Permalink 11:36:06 am, by fumanchu, 132 words, English (US)
Categories: IT, CherryPy, WSGI

Resources are concepts for the client

...not objects on the server. Roy Fielding explains yet again:

Web architects must understand that resources are just consistent mappings from an identifier to some set of views on server-side state. If one view doesn’t suit your needs, then feel free to create a different resource that provides a better view (for any definition of “better”). These views need not have anything to do with how the information is stored on the server, or even what kind of state it ultimately reflects. It just needs to be understandable (and actionable) by the recipient.

I have found this to be the single most-misunderstood aspect of HTTP. Too many people conceive of URI's as just names for files or database objects. They can be so much more.

08/19/08

Permalink 07:40:00 pm, by fumanchu, 342 words, English (US)
Categories: IT

RESTful JSON

Interesting timing on Joe Gregorio's latest foray. Lately, I've been URI-ifying all the JSON calls which etsy.com's PHP layer makes to the back end (partly with the hope that that API would be opened up to the public someday, but that isn't currently a business need). Even though the company is bucking the mainstream quite successfully, the site itself is pretty typical e-commerce. Here's what I ended up with.

Out of 298 URI's (not counting querystring variants):

  • 40 are collections which support POST to add a subordinate item. Some of these are "top-level" object collections, and some are subordinate collections of data pertaining to single "objects" (e.g. /users/{user_id}/images/)
  • 37 are collections which support GET to return aggregated info on all, or a subset of, their members, sometimes with search params passed in the querystring.
  • 47 are traditional "objects" which support GET/PUT/DELETE on a URI of the form: /collection/subcollection/{id}, and tend to map to a database row (although many of those are virtual, being split in practice over several tables).
  • 92 are URI's which GET/PUT/DELETE "object attributes", usually a single scalar each, which tend to map to single database cells. Several of these have side effects when you set/delete them.
  • 34 are of the form /collection/count.
  • 31 are of the form /collection/ids/.
  • 20 are of the form /collection/count_and_limited_ids, which is perhaps a quirk of our architecture; at some point, I'd like to see how splitting these each into 2 calls affects performance.
  • 6 are RPC-style POST URI's which I haven't had time to refactor into real noun-y resources.
  • 2, I'm sad to say, are of the form DELETE /collection/{id}/cache

The URI space for this API is pretty sparse right now--these URI's were simply created to replace an existing RPC-style space of procedure names. And it's essentially a single data point. However, I think it's pretty representative of e-commerce needs for RESTful JSON. One lesson might be that pagination (count and ids) should be addressed in any coordinated protocol effort.

07/25/08

Permalink 01:20:56 am, by fumanchu, 243 words, English (US)
Categories: Python, CherryPy

CherryPy for Python 3000

I'm categorically rejecting the 2to3 approach--for myself anyway. If you think it would help, feel free to:

  1. "upgrade" CP to 2.6, which AFAICT means ensuring it will no longer work in 2.5 or previous versions
  2. turn on the 3k warning
  3. import-and-fix until you don't get any warnings
  4. run-tests-and-fix until you don't get any warnings
  5. run 2to3
  6. import-and-fix until you don't get any errors
  7. run-tests-and-fix until you don't get any errors
  8. wait for bug reports

Me, I'd rather just drop cherrypy/ into 3k and skip steps 1-5.

Changes I had to make so far (http://www.cherrypy.org/changeset/2029):

  • (4) urlparse -> urllib.parse
  • (24) "except ExcA, ExcB:" -> "except (ExcA, ExcB):"
  • (30) "except ExcClass, x:" -> "except ExcClass as x"
  • (22) u"" -> ""
  • (1) BaseHTTPServer -> http.server
  • (1) rfc822 -> email.utils
  • (4) md5.new() -> hashlib.md5()
  • (3) sha.new() -> hashlib.sha1()
  • (3) urllib2 -> urllib
  • (28) StringIO -> io
  • (1) func.func_code -> func.__code__
  • (6) Cookie -> http.cookies
  • (3) ConfigParser -> configparser
  • (1) rfc822._monthnames -> email._parseaddr._monthnames
  • (105) print -> print()
  • (35) httplib -> http.client
  • (22) basestring -> (str, bytes)
  • (12) items() -> list(items())
  • (46) iteritems() -> items()
  • (11) Thread.get/setName -> get/set_name
  • (1) exec "" -> exec("")
  • (1) 0777 -> 0o777
  • (1) Queue -> queue
  • (1) urllib.unquote -> urllib.parse.unquote
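For illustration (these are not lines from the CherryPy changeset itself), a few of those mechanical rewrites in before/after form:

# Python 2:
#     except (IOError, OSError), x:
#         print "failed:", x
#     for k, v in d.iteritems():
#         ...

# Python 3:
d = {'a': 1}
try:
    raise IOError('disk')
except (IOError, OSError) as x:
    print("failed:", x)
for k, v in d.items():
    print(k, v)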

At the moment, I'm a bit blocked importing wsgiserver--we had a nonblocking version of makefile that subclassed the old socket._fileobject class. Looks like the whole socket implementation has changed (and much of it pushed down into C). Not looking forward to reimplementing that.

07/11/08

Permalink 04:16:50 pm, by fumanchu, 228 words, English (US)
Categories: WHELPS

Writing High-Efficiency Large Python Systems--Lesson #3: Banish lazy imports

Lazy imports can be done either explicitly, by moving import statements inside functions (instead of at the global level), or by using tools such as LazyImport from egenix. Here's why they suck:

> fetchall (PgSQL:3227)
--> __fetchOneRow (PgSQL:2804)
----> typecast (PgSQL:874)
... 26703 function calls later ...
----< typecast (PgSQL:944): 
      <mx.DateTime.DateTime object for
       '2005-08-15 00:00:00.00' at 2713120>
    3477.321ms

Yes, folks, that single call took 3.4 seconds to run! That would be shorter if I weren't tracing calls, but...ick. Don't make your first customer wait like this in a high-performance app. The solution if you're stuck with lazy imports in code you don't control is to force them to be imported early:

mx.DateTime.Parser.DateFromString('2001-01-01')

Now that same call:

> fetchall (PgSQL:3227)
--> __fetchOneRow (PgSQL:2804)
----> typecast (PgSQL:874)
... 7 function calls later ...
----< typecast (PgSQL:944): 
      <mx.DateTime.DateTime object for
       '2005-08-15 00:00:00.00' at 27cf360>
    1.270ms

That's 1/3815th the number of function calls and 1/2738th the run time. I am not missing decimal points.

Not only is this time-consuming for the first requestor, but it also lends itself to nasty interactions when a second request starts before the first is done with all the imports. Module import is one of the least thread-safe parts of almost any app, because everyone expects all imports to happen in the main thread at process start.
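A minimal sketch of forcing those imports early, in the main thread (start_server is a hypothetical entry point; the mx.DateTime call is the one shown above):

def warm_up():
    # Trigger lazy imports now, before any worker threads exist.
    import mx.DateTime
    mx.DateTime.Parser.DateFromString('2001-01-01')

if __name__ == '__main__':
    warm_up()
    start_server()  # hypothetical: start accepting requests only now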

I'm trying very hard not to rail at length about WSGI frameworks that expect to start up applications during the first HTTP request...but it's so tempting.

07/03/08

Permalink 05:37:31 pm, by fumanchu, 319 words, English (US)
Categories: WHELPS

Writing High-Efficiency Large Python Systems--Lesson #2: Use nothing but local syslog

You want to log everything, but you'll find that even in the simplest requests with the fastest response times, a simple file-based access log can add 10% to your response time (which usually means ~91% as many requests per second). The fastest substitute we've found for file-based logging in Python is syslog. Here's how easy it is:

import syslog
syslog.syslog(facility | priority, msg)

Nothing's faster, at least nothing that doesn't require telling Operations to compile a new C module on their production servers.

"But wait!" you say, "Python's builtin logging module has a SysLogHandler! Use that!" Well, no. There are two reasons why not. First, because Python's logging module in general is bog-slow--too slow for high-efficiency apps. It can make many function calls just to decide it's not going to log a message. Second, the SysLogHandler in the stdlib uses a UDP socket by default. You can pass it a string for the address (probably '/dev/log') and it will use a UNIX socket just like syslog.syslog, but it'll still do it in Python, not C, and you still have all the logging module overhead.

Here's a SysLogLibHandler if you're stuck with the stdlib logging module:

import logging
import syslog


class SysLogLibHandler(logging.Handler):
    """A logging handler that emits messages to syslog.syslog."""
    # Map logging levels (NOTSET, DEBUG, INFO, WARNING, ERROR,
    # CRITICAL) to syslog priorities.
    priority_map = {
        10: syslog.LOG_NOTICE,
        20: syslog.LOG_NOTICE,
        30: syslog.LOG_WARNING,
        40: syslog.LOG_ERR,
        50: syslog.LOG_CRIT,
        0: syslog.LOG_NOTICE,
        }

    def __init__(self, facility):
        self.facility = facility
        logging.Handler.__init__(self)

    def emit(self, record):
        syslog.syslog(self.facility | self.priority_map[record.levelno],
                      self.format(record))

I suggest using syslog.LOG_LOCAL0 - syslog.LOG_LOCAL7 for the facility arg. If you're writing a server, use one facility for access log messages and a different one for error/debug logs. Then you can configure syslogd to handle them differently (e.g., send them to /var/log/myapp/access.log and /var/log/myapp/error.log).
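A quick usage sketch along those lines (the logger names and messages are illustrative):

import logging
import syslog

access_log = logging.getLogger('myapp.access')
access_log.setLevel(logging.INFO)
access_log.addHandler(SysLogLibHandler(syslog.LOG_LOCAL0))

error_log = logging.getLogger('myapp.error')
error_log.setLevel(logging.WARNING)
error_log.addHandler(SysLogLibHandler(syslog.LOG_LOCAL1))

access_log.info('GET /widgets 200 12ms')
error_log.error('widget frobnicator failed')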

Permalink 05:02:59 pm, by fumanchu, 189 words, English (US)
Categories: WHELPS

Writing High-Efficiency Large Python Systems--Lesson #1: Transactions in tests

Don't write your test suite to create and destroy databases for each run. Instead, make each test method start a transaction and roll it back. We just made that move at work on a DAL project, and the time to run the whole test suite went from 500+ seconds down to around 100. It also allowed us to remove a lot of "undo" code in the tests.

This means ensuring your test helpers always connect to their databases on the same connection (transactions are connection-specific). If you're using a connection pool where leased conns are bound to each thread, this means rewriting tests that start new threads (or leaving them "the old way"; that is, create/drop). It also means that, rather than running slightly different .sql files per test or module, you instead have a base of data and allow each test to add other data as needed. If your rollbacks work, these can't pollute other tests.

Obviously, this is much harder if you're doing integration testing of sharded systems and the like. But for application logic, it'll save you a lot of headache to do this from the start.
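As a minimal sketch of the approach (get_connection is a hypothetical helper, and begin() stands in for whatever your driver uses to open a transaction):

import unittest

class RollbackTestCase(unittest.TestCase):
    """Run each test method inside a transaction, then roll it back."""

    def setUp(self):
        # Every helper in the test must reuse this one connection,
        # since transactions are connection-specific.
        self.conn = get_connection()
        self.conn.begin()

    def tearDown(self):
        # Undo anything the test added on top of the base data.
        self.conn.rollback()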
