Pages: << 1 2 3 4 5 6 7 8 9 10 11 ... 26 >>


Permalink 12:22:48 pm, by fumanchu Email , 238 words   English (US)
Categories: Python

Specifically designed to be readable

Duncan McGreggor writes:

The Twisted source code was specifically designed to be read
(well, the code from the last two years, anyway).

If that were true, then this would not be ('object' graciously donated by me to the Twisted Foundation):

>>> from twisted.web import http
>>> http.HTTPChannel.mro()
[<class 'twisted.web.http.HTTPChannel'>,
 <class 'twisted.protocols.basic.LineReceiver'>,
 <class 'twisted.internet.protocol.Protocol'>,
 <class 'twisted.internet.protocol.BaseProtocol'>,
 <type 'object'>,
 <class twisted.protocols.basic._PauseableMixin at 0x02ABCB70>,
 <class twisted.protocols.policies.TimeoutMixin at 0x02ABC420>,

This wouldn't be true either:

$ grep -R "class I.*" /usr/lib/python2.5/site-packages/twisted | wc -l

Interfaces are great for development of a framework, but suck for development with a framework. That must be an older rev on my nix box; that number's grown to 380 in trunk! Not all of those are Interfaces, but most are.

Here's my personal favorite:

for tran in 'Generic TCP UNIX SSL UDP UNIXDatagram Multicast'.split():
    for side in 'Server Client'.split():
        if tran == "Multicast" and side == "Client":
        base = globals()['_Abstract'+side]
        method = {'Generic': 'With'}.get(tran, tran)
        doc = _doc[side]%vars()
        klass = new.classobj(tran+side, (base,),
                             {'method': method, '__doc__': doc})
        globals()[tran+side] = klass

You've got a tough row to hoe, Twisted devs. Good luck.


Permalink 08:56:58 pm, by fumanchu Email , 335 words   English (US)
Categories: Python, CherryPy

Tracking memory leaks with Dowser

Marius Gedminas just wrote a post on memory leaks. He could have used Dowser to find the leak more easily, I'll bet.

Dowser is a CherryPy application for monitoring and managing object references in your Python program. Because CherryPy runs everything (even the listening HTTP socket) in its own threads, it's a snap to include Dowser in any Python process. Dowser is also very lightweight (because CherryPy is). Here's how I added it to a Twisted project we're using at work:

from twisted.application import service
application = service.Application("My Server")

import cherrypy
from misc import dowser
cherrypy.config.update({'server.socket_port': 8088})
# Windows only

from twisted.internet import reactor
reactor.addSystemEventTrigger('after', 'shutdown', cherrypy.engine.exit)

The lines before 'import cherrypy' already existed and are here just for context (this is a Twisted service.tac module). Let's quickly discuss the new code:

  1. import cherrypy and dowser. You don't have to stick dowser into a 'misc' folder; that's just how I checked it out from svn.
  2. Set the port you want CherryPy to listen on; pick a port your app isn't already using if it's a TCP server.
  3. Mount the dowser root.
  4. Turn off the CherryPy autoreloader, and the Ctrl-C handler if you're on Windows. I should really turn that off by default in CP. :/
  5. Start the engine, which starts listening on the port in a new thread among other things.
  6. Tell Twisted to stop CherryPy when it stops.

Then browse to http://localhost:8088/ and you'll see pretty sparklines of all the objects. Change the URL to http://localhost:8088/?floor=20 to see graphs for only those objects which have 20 or more objects.

Then, just click on the 'TRACE' links to get lots more information about each object. See the Dowser wiki page for more details and screenshots.


Permalink 10:38:58 pm, by fumanchu Email , 53 words   English (US)
Categories: IT, Python

Vellum coming along nicely

First, a great aphorism from Zed's (Vellum book]( (pdf):

Makefiles are the C programmer’s REPL and interpreter.

He also asks himself:

What’s the minimum syntax needed to describe a build specification?

I predict good things based on the presence of that question alone.


Permalink 07:21:10 pm, by fumanchu Email , 42 words   English (US)
Categories: IT, Python

Epic [FAIL]

You have my permission to name your next test framework, library, or script "epic" and bill it as "more full of [FAIL] than any other test thingy".

Oh, and

/me looks in Titus' direction...


Permalink 11:58:01 am, by fumanchu Email , 105 words   English (US)
Categories: IT, Python, Dejavu

LINQ in Python

Chui's counterpoint pines:

There are some interesting ideas raised in LINQ that even Python developers ought to explore and consider adopting in a future Python.

Python had all this before LINQ in Dejavu and now Geniusql, and more pythonically, to boot. Instead of:

var AnIQueryable = from Customer in db.Customers where
    Customer.FirstName.StartsWith("m") select Customer;

you can write:

m_names =
    lambda cust: cust.FirstName.startswith("m"))

and instead of:

var AverageRuns =(from Master in this.db.Masters
    select Master.Runs).Average()

you can write:

avgruns = m: avg(m.Runs))


Permalink 03:49:54 pm, by fumanchu Email , 1195 words   English (US)
Categories: IT

Xtremely Quick Development System

Divmod has a development methodology which they call UQDS. It's billed as lightweight, but I've been using it for 6 months now and find it burdensome. The basic flow of UQDS is: make a ticket, do all work in a branch, get a full review, merge to trunk. In theory, this brings the benefits of peer review and fewer conflicts between developers. In practice, however, I've found the following problems:

  1. Review is great, but is often performed by whomever is "free" rather than by those whom the change most affects. The larger the group affected, the more review happens after the change is merged back to trunk anyway.
  2. Conflicts are not reduced, they're delayed and distributed. One branch still gets merged to trunk before others, and those others often must all forward merge.
  3. The amount of change in a branch grows too large, and often incorporates changes which have little or nothing to do with the original ticket. Comments and whitespace get touched up from one dev to another; small buglets get found and fixed; refactoring happens. Review is more difficult.
  4. Commit history is unreadable. Changesets consist entirely of massive merges from branches and equally massive forward merges.
  5. The review overhead isn't worth the supposed benefits; developers spend too much time in review, and often invent bikeshed problems in order to feel their review time is worthwhile. There is no distinction between the review process for tiny versus massive changesets.
  6. Many problems still leak through the review process, but reverting trunk changesets is harder because they tend to be too large and mix concerns. That is, you often end up reverting 1/2 a changeset, which subversion does not make easy. UQDS tries to avoid reverting multiple changesets at once, but overcorrects by making fractional reverts more common. With Subversion, at least, I'd rather be saddled with the former.
  7. UQDS says, "[developers] can take all the time they need to come up with a good message" when they commit to trunk. In reality, they don't--they just want to finish quickly and move on to the next ticket.

So, here's my answer:

The Xtremely Quick Development System (XQDS)

The goals of XQDS:

  1. Code fast.
  2. Improve the process documentation.

The strategy of XQDS:

  1. Reduce changeset size. This allows everyone to code faster, since they don't need to exert as much mental effort to understand others' changes. It also makes the timeline more readable (and revertible!), since each changeset and its commit message is atomic.
  2. Resolve conflicts immediately.
  3. Apply review resources only where needed.
  4. Make tests easier to run than not run. If your test suite takes so long that you're forced to run it on a remote machine, you've either done something wrong or this methodology is too small for you. Which is fine.
  5. Record all decisions: on tickets where warranted, on trunk changesets regardless.

The flow of XQDS is:

  1. A task is created in an issue tracker.
  2. The task is discussed, on the ticket if possible. If the conversation is long or ephemeral, it may be conducted on IRC or elsewhere; however, at least the final decision(s) should be summarized on the ticket, not on a wiki page, mailing list, or other medium. This step may be ongoing as work is done.
  3. Someone accepts the ticket. If work lapses, others are always free to accept it for themselves.
  4. The worker does work in their local copy of trunk. Branches are created sparingly as needed, and the worker switches their local copy using svn switch.
  5. Work gets merged to trunk as it passes the full test suite in the smallest functional chunks possible. Sometimes "smallest possible" can actually be quite large; however, refactorings, buglets, and doc improvements are committed in their own changesets. In my experience, there's nothing worse than trying to review a changeset that's 5% ticket fix and 95% whitespace changes.
  6. Each commit includes a message that MUST describe the actual change; comments about the reasons or context are desirable but secondary. However, commits that directly influence a ticket always reference the ticket.
  7. Everyone runs svn up on at least a daily basis, and definitely before committing. Conflicts are resolved as needed with each local copy.


  • Don't you lose the benefit of branching? What if Jethro checks in broken code and goes to lunch?

    1. Jethro shouldn't be checking in broken code. You fix this by increasing peer shame, not ignoring it or shifting the detection work onto another developer.
    2. This happens even using UQDS. Review by a single developer doesn't catch all possible broken code.
    3. You can still branch whenever you see fit. You're just not required to branch for each ticket.
  • Don't you lose the benefit of review? Review helps avoid conflict and also teaches the reviewee.

    1. In my experience, review is better with XQDS than UQDS. First, you apply review resources where they are needed. Not every change needs review. This allows developers to invest more energy into reviews which merit it.
    2. You raise the bar for review: everyone is expected to watch the timeline, svn up often, and help resolve conflicts.
    3. Senior developers are allowed to touch up junior developers' code in situ. It's always faster and often much more effective to show improvements than explain them.
    4. People who care about or are particularly skilled about various aspects of code quality are themselves responsible for upkeep. At some point, you can stop trying to teach Jethro how to avoid contention over concurrent shared mutable resources because he's never going to get it. At some point, you stop trying to make the Twisted developer follow your PEP-8 whitespace conventions because she has no incentive to do so; you fix it yourself and get on with life.
  • What if I need to share unfinished code with another developer? Or switch developers mid-feature? Or switch platforms mid-feature? Or switch features mid-developer?

    • Make a branch. XQDS doesn't prohibit this. But it doesn't mandate it like UQDS does. All that Combinator nonsense can be replaced with svn switch and a simple folder rename when you want to put aside some work for a while.
  • Isn't trunk broken more often?

    • Not if you have a good local test suite. If your test suite can't detect broken code, you need to write more functional (end-to-end) tests.
  • UQDS says it improves information flow to managers. Don't you lose that with XQDS?

    • Not at all. On the one hand, managers use the same tools (e.g. Trac timelines) that developers use to monitor progress. But with XQDS, the timeline is actually readable without all those cross merges. Managers can therefore also write Trac SQL queries, for example, which report developer activity based on the changeset activity.
  • Doesn't XQDS require more conflict resolution?

    • Not more, just more immediate. As UQDS notes, this can result in a situation where you feel like you cannot commit your local work because it's now broken due to someone else's changes. But that's a false impression: just svn switch to a new branch and commit your (now broken) changes. You're going to have to resolve the conflict either way; but XQDS allows you to skip making a branch unless you need it.


Permalink 09:05:32 pm, by fumanchu Email , 1036 words   English (US)
Categories: IT

Memcached set invalidation

One of the saving graces of Memcached is its use of a stable, static address space; that is, client X and client Y can each manipulate datum D as long as they both know its key. But because the space of addresses is static and flat (not hierarchical), it also tends to be huge, and sparse. This can make it difficult to perform set operations, such as invalidation of a class of entries.

For example, let's design a Data Access component which sits in front of a database, and accepts requests using an RPC-style interface. It fetches results from the cache where possible; otherwise, it reads from a database and writes the result to the cache before returning it to the caller. Assume we have multiple Data Access (DAL) servers, multiple Memcached servers, and one or more database servers.

A common set-invalidation scenario involves the caching of lists of items. Let's suppose a web server requests get_items_in_cart(cart_id=13, limit=10, offset=30). The DAL server might translate this into a cache key such as get_items_in_cart::13:10:30. So far so good. But that's just a read; when we add writes to the picture, cache coherency starts to become a problem.

When a webserver asks a DAL component to add an item to cart 13, we need to invalidate (or recalculate) get_items_in_cart::13:10:30. However, it should be readily apparent that we need to invalidate not just a single key, but potentially a whole class of keys, get_items_in_cart::id:limit:offset, and that that set of keys could be very large. Let's conservatively guess:

>>> max_items = 1000
>>> sum([(max_items / n) for n in range(5, 21)])

That is, if we restrict the 'limit' argument to the interval [5, 20], and assume a maximum number of items per cart of 1024, we end up with over 1500 potential cache keys per cart for a single function! And if we have a million carts...? There are several ways we could attack this problem:

  1. Restrict the interface. Don't parameterize functions unless absolutely necessary. This would mean, rather than expose get_items_in_cart(cart_id, limit, offset), which has 3 arguments, we might instead always assume a page size of 10 and expose get_paged_items_in_cart(cart_id, page), which would result in only 100 potential cache keys per cart, not 1500. We might even go further and expose get_items_in_cart(cart_id) (1 cache key per cart) and let the caller do its own paging. This makes cache invalidation more performant (fewer keys to manage) at the cost of API simplicity and extensibility (because the API is less flexible). So your programmers will scream, possibly silently, when the product team decides they want to change the number of items on a page.
  2. Use an address registry. This approach introduces a separate directory service; when the result of get_items_in_cart(cart_id, limit, offset) is cached, the address is registered in a central list. When a class of keys needs to be invalidated, that central registry can then iterate over all seen keys for each class quickly and stably (or return the list so the caller can do it). This can reduce performance with the extra network traffic, is not very scalable, and tacks on a new reliability issue (the directory is a single point of failure, which costs even more performance if made redundant). Also, your system architects will probably leave you threatening notes.
  3. Broadcast address hits. This is like #2, but broadcasts address registration to all DAL servers instead of using a central registry. Network traffic is much increased, and reliably updating all participating nodes can be quite a chore. Your network admins will most likely start randomly unplugging your office's LAN and phones, and then surreptitiously plug them back in when you knock on their door.
  4. Hash the addresses. If the consumers of our API tend to call only a few combinations of arguments for our cart function, then although we have a large potential space of cache keys, the set of actual keys might be very small, maybe only 2 or 3 keys! The classic means of reducing a large, sparse array to a small, denser one is hashing. That is, rather than using the key get_items_in_cart::13:10:30, we might hash it to get_items_in_cart::13/x, where 'x' is a number from 1 to 4. If we could come up with a reliable hash, we could then invalidate 4 keys, not 1500. This approach can become very complex, and results in cache keys which aren't very transparent (so getting other components to interact with the system is harder). But it doesn't require any extra network calls or new components; each DAL server does its own hash calculations. Your VP of Engineering may stick you with the bill for all those new engineers with Math Ph.D.'s, though.
  5. Rely on timeouts. If you don't mind web servers serving stale data for get_items_in_cart::13:10:30 for, say, 2 minutes, then tell memcached to expire that entry after 2 minutes, and don't bother trying to invalidate it on write. This is by far the simplest solution when you can swing it. But your system operators or QA team might call you first when they have to wait for one too many timeouts during a 3 A.M. debugging session.
  6. Add a version column to the carts table, and use it when forming cache keys. When you update the row, update the version. This can increase the number of database hits if not done properly, but in this example, your callers most likely have already fetched the entire relevant carts row, and might be able to pass the version in the DAL call. However, don't be surprised when your DBA's tell you the HR DB crashed and they lost your timesheets for the past year.

That's all I can think of at the moment. I think I've got some political goodwill among our other programmers at the moment (in that they don't wish me any specific harm, yet), so I may go with solution #1 for now on my current project. But I've also recently been playing architect, so maybe I'll pick solution #2 and just throw away any threatening notes I leave for myself.


Permalink 07:32:22 pm, by fumanchu Email , 70 words   English (US)
Categories: IT, General

Complexity: the mismatch between needs, does, and wants

ocean has rapidly become my favorite blogger. I even find myself reading nearly every one of his quick links. Here's a gem from a few days ago:

This mismatch between what a person wants, what a tool does, and what a person needs turns out to be very important. It's so important that it has a special name: complexity.


Permalink 11:30:19 am, by fumanchu Email , 1791 words   English (US)
Categories: IT, Python, Dejavu

Memcached indices

It seems lots of people are using memcached to cache both a set of objects (each with their own key), and also various lists of those objects using different keys. For example, a retailer might cache Item objects, but also want to cache the list of Items in a given Category. The SQL before memcached might look like this:

SELECT * FROM Item WHERE CategoryID = 5;

..whereas with memcached mixed in, the I/O tends to look like this (against an empty cache):

get Item:CategoryID=5
set Item:CategoryID=5 1 300 19
get Item:1111
get Item:2222
get Item:3333
get Item:4444
SELECT * FROM Item WHERE ID IN (1111, 2222, 3333, 4444)
set Item:1111 1 300 58
set Item:2222 1 300 58
set Item:3333 1 300 54
set Item:4444 1 300 80

That is, fetch the list of ID's from the cache; if not present, fetch it from the DB and store it in the cache (the "300" in the above examples means, "expire in 300 seconds"). Then iterate over the list of ID's and try to fetch each whole object from cache; if any miss, ask the DB for them (in as few operations as possible) and store them in the cache.

Once both the objects and the list-of-id's are cached, subsequent calls to a hypothetical get_items_by_category function should look like this:

get Item:CategoryID=5
sending key Item:CategoryID=5
get Item:1111
sending key Item:1111
get Item:2222
sending key Item:2222
get Item:3333
sending key Item:3333
get Item:4444
sending key Item:4444

But what happens when you move Item 3333 from Category 5 to Category 6? There are three possibilities:

  1. Your timeouts are reasonable. In this case, you're OK with some ambiguity; specifically, you're OK that all clients will continue to see Item 3333 in the old Category list for up to 300 seconds. If you ask the Item object directly for its Category, you'll get "6", but if you ask for all Items in Category 5, it'll still show up in the "wrong" category. It might even show up in both Category lists, if you refetch Item:Category=6 from the DB before the cached Item:Category=5 list expires.
  2. Your timeouts are unreasonable. You thought 300 seconds was a good expiration time because that's the minimum you could get away with to sufficiently reduce load on your poor single-point-of-failure DB. In this case, you're not comfortable with the ambiguity, so you start adding invalidation calls to clean out the cached lists sooner. Lots of them. Spread out all over your application.
  3. Your timeouts are 0, meaning "never expire". This is like 2, but means your clients will probably never hit the DB on their own--you MUST explicitly add code to update both the cached objects and the cached lists whenever an object changes, or expect an unusable site.

If you're happy with option 1, great! The rest of this discussion probably isn't for you. I'm going to explore three solutions (only one of which I'm happy with) for cases 2 and 3.

Database Indices

So you're not happy with the expiration time of your cached lists, so you've built all that invalidation code. What you may not realize is that you've just reinvented (badly) something databases have had for decades: indices.

An index is usually implemented with a B+-tree, most of the details of which are unimportant for us. What is important is that 1) an index covers a subset of the columns in the table (often a single column), which from now on I'm going to call the index criteria, and 2) each distinct combination of values for the index criteria has its own leaf node in the tree, which contains/points to a list of rows in the table that match that combination. What a mouthful. Another way to say it is that the leaf nodes in the tree look like this for an index over Item.CategoryID:

(2,): [9650, 2304, 22, 50888]
(3,): [323, 3000, 243246, 87346, 6563, 8679]
(5,): [1111, 2222, 3333, 4444]
(6,): [18]

When you ask the database for all Items with a CategoryID of 5, the database traverses the "CategoryID" index tree, finds the leaf node for "5", grabs the list stored at that node, then iterates over the list and yields each full object mentioned therein. This is called an "index scan":

                                            QUERY PLAN
 Index Scan using items_categoryid on Items  (cost=0.00..29.49 rows=609 width=24)
   Index Cond: (CategoryID = 5)

Sound familiar? It's exactly what we're doing by hand with memcached.

Memcached Indices

Okay, it's not exactly like our memcached example. There are some striking differences.

First, a database index is sparse, in the sense that it doesn't contain leaf nodes for every potential index criteria value, just the concrete values in the table. Our memcached indexing is even sparser: so far it only contains leaf nodes (lists of ID's) for each index scan we've run. If we've only asked for Items in Categories 2 and 5, memcached will only contain nodes for Item:CategoryID=2 and Item:CategoryID=5.

Second, a database index is a full tree, with a root node. What we've done so far in memcached is only store leaf nodes. This will bite us in a moment.

Third, a database index is naturally transactional. When you move Item 3333 from Category 5 to 6, you might execute the SQL statement "UPDATE Item SET CategoryID = 6 WHERE ID = 3333;". The database, being the sole arbiter of Truth, can lock the row, read the old value, lock the index leaf nodes, remove the row from the old leaf node and add it to the new one, write the new value to the table, and unlock, all within a fully-logged transaction (although it can be a lot more complicated than that with ranged index locks, log schemes, and page reallocation schemes). Our memcached implementation so far can't do any of that.

Combining the above differences, we get...a real mess. Specifically, we have to find a way to do all of those index updates.

One way would be to invalidate everything. Call flush_all and be done with it. This can work (poorly) if you've partitioned your data well over multiple memcached servers, but Woe Unto Thee if you're storing, say, cached HTML on the same node.

Another, narrower, solution would be to try to delete all known cached lists for the given criteria. One way to do that would be to attempt to maintain the whole index tree in memcached, not just the leaf nodes. This turns out to be a thorny problem because of the transitory nature of memcached data--what happens when you lose an intermediate node in the index tree? or the root node? You'll fail to update the now-unreachable leaf nodes, but clients will still fetch them and get the stale results.

An even narrower solution would be to try to update just two index leaf nodes for each updated object. For example, if we move Item 3333 from Category 5 to 6, we could try to remove the Item from leaf node "5" and add it to leaf node "6". This can work, but requires keeping the old values around for all indexed columns in your application layer, which a lot of modern DALs and ORMs don't do by default.

I was stuck at this point for a couple of days, until I had an epiphany:

Add eagerly, remove lazily

Recall if you will what I said above about the transactional nature of database indices. The DB can remove the row from the old index node(s) and add it to the new index node(s) in one step. We already know we can't do that with memcached, since the "get" and "set" operations aren't atomic anyway (although check-and-set can work around this; see 'cas' in protocol.txt.)

Transactions exist to maintain data integrity; to move The Data from one valid state to another valid state. But don't confuse the technique for the phenomena that technique is trying to prevent. In the case of indices, we use transactions on indices to avoid both 1) reading an object that does not meet the index criteria for the given node, or 2) failing to read an object that does meet the criteria. Databases avoid both scenarios by adding and removing rows from the index atomically.

When dealing with index nodes in memcached, however, the best approach is to separate the two phenomena by adding eagerly and removing lazily:

  • Add eagerly: when an object (row) is flushed from the session, add it to any indices for that class (table). That is, whenever you "UPDATE Item SET CategoryID = 6 WHERE ID = 3333", you also cache the object (you were doing that already anyway), but you also append the ID to the list stored at Item:CategoryID=6.
  • Remove lazily: when you UPDATE you do not try to remove the item from the old index. Instead, you let the next client that reads Item:CategoryID=5 do that when it iterates over the objects in the list. If any objects no longer meet the index node criteria (CategoryID=5), they are removed from the list, and the client sets the revised index node back into the cache.


There are several:

  1. A client might recall the index node but not the objects in its list. For example, a web page might just need hyperlinks to each object, which it tries to generate without recalling the entire set of objects. This technique succeeds in the fact that it doesn't fetch more data than the app asked for, but fails if stale objects are still listed in the index node. The only solutions I can see are to either disallow this sort of index read or to fall back on sane timeouts again.
  2. This doesn't address order, limit, or offset explicitly. Simple order queries should be no problem, but if you want the top N items you must either include the 'order by' columns in your index, or read in all items in the index, filter out the stale objects, sort them, and perform the limit/offset yourself.
  3. Functionally decomposing SQL (or whatever QL your ORM might use) into potential index criteria is not a simple task. If you're using a single table.column = value expression, it maps well, but using joins, boolean operators (like and, or, not) or other arithmetic operators makes it difficult. But then, the major databases have the same issues.
  4. Whatever you all leave in the comments.


I think this is worth a shot. I'm adding it to Dejavu in a branch, but it's complicated enough that it may not be done for a bit. Comments and suggestions on the approach welcome.


Permalink 06:16:18 pm, by fumanchu Email , 28 words   English (US)
Categories: General, Photography

The Dresden Codak Gallery

Just wanted to show off my new framed prints from Can't wait to get a bigger place where they'll really look good!

<< 1 2 3 4 5 6 7 8 9 10 11 ... 26 >>

September 2020
Sun Mon Tue Wed Thu Fri Sat
 << <   > >>
    1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30      


The requested Blog doesn't exist any more!

XML Feeds