« Complexity: the mismatch between needs, does, and wantsThe Dresden Codak Gallery »

Memcached indices

11/12/07

Permalink 11:30:19 am, by fumanchu Email , 1791 words   English (US)
Categories: IT, Python, Dejavu

Memcached indices

It seems lots of people are using memcached to cache both a set of objects (each with their own key), and also various lists of those objects using different keys. For example, a retailer might cache Item objects, but also want to cache the list of Items in a given Category. The SQL before memcached might look like this:

SELECT * FROM Item WHERE CategoryID = 5;

..whereas with memcached mixed in, the I/O tends to look like this (against an empty cache):

get Item:CategoryID=5
END
SELECT ID FROM Item WHERE CategoryID = 5;
set Item:CategoryID=5 1 300 19
1111,2222,3333,4444
STORED
get Item:1111
END
get Item:2222
END
get Item:3333
END
get Item:4444
END
SELECT * FROM Item WHERE ID IN (1111, 2222, 3333, 4444)
set Item:1111 1 300 58
STORED
set Item:2222 1 300 58
STORED
set Item:3333 1 300 54
STORED
set Item:4444 1 300 80
STORED

That is, fetch the list of ID's from the cache; if not present, fetch it from the DB and store it in the cache (the "300" in the above examples means, "expire in 300 seconds"). Then iterate over the list of ID's and try to fetch each whole object from cache; if any miss, ask the DB for them (in as few operations as possible) and store them in the cache.

Once both the objects and the list-of-id's are cached, subsequent calls to a hypothetical get_items_by_category function should look like this:

get Item:CategoryID=5
sending key Item:CategoryID=5
END
get Item:1111
sending key Item:1111
END
get Item:2222
sending key Item:2222
END
get Item:3333
sending key Item:3333
END
get Item:4444
sending key Item:4444
END

But what happens when you move Item 3333 from Category 5 to Category 6? There are three possibilities:

  1. Your timeouts are reasonable. In this case, you're OK with some ambiguity; specifically, you're OK that all clients will continue to see Item 3333 in the old Category list for up to 300 seconds. If you ask the Item object directly for its Category, you'll get "6", but if you ask for all Items in Category 5, it'll still show up in the "wrong" category. It might even show up in both Category lists, if you refetch Item:Category=6 from the DB before the cached Item:Category=5 list expires.
  2. Your timeouts are unreasonable. You thought 300 seconds was a good expiration time because that's the minimum you could get away with to sufficiently reduce load on your poor single-point-of-failure DB. In this case, you're not comfortable with the ambiguity, so you start adding invalidation calls to clean out the cached lists sooner. Lots of them. Spread out all over your application.
  3. Your timeouts are 0, meaning "never expire". This is like 2, but means your clients will probably never hit the DB on their own--you MUST explicitly add code to update both the cached objects and the cached lists whenever an object changes, or expect an unusable site.

If you're happy with option 1, great! The rest of this discussion probably isn't for you. I'm going to explore three solutions (only one of which I'm happy with) for cases 2 and 3.

Database Indices

So you're not happy with the expiration time of your cached lists, so you've built all that invalidation code. What you may not realize is that you've just reinvented (badly) something databases have had for decades: indices.

An index is usually implemented with a B+-tree, most of the details of which are unimportant for us. What is important is that 1) an index covers a subset of the columns in the table (often a single column), which from now on I'm going to call the index criteria, and 2) each distinct combination of values for the index criteria has its own leaf node in the tree, which contains/points to a list of rows in the table that match that combination. What a mouthful. Another way to say it is that the leaf nodes in the tree look like this for an index over Item.CategoryID:

(2,): [9650, 2304, 22, 50888]
(3,): [323, 3000, 243246, 87346, 6563, 8679]
(5,): [1111, 2222, 3333, 4444]
(6,): [18]

When you ask the database for all Items with a CategoryID of 5, the database traverses the "CategoryID" index tree, finds the leaf node for "5", grabs the list stored at that node, then iterates over the list and yields each full object mentioned therein. This is called an "index scan":

# EXPLAIN SELECT * FROM Items WHERE CategoryID = 5;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Index Scan using items_categoryid on Items  (cost=0.00..29.49 rows=609 width=24)
   Index Cond: (CategoryID = 5)

Sound familiar? It's exactly what we're doing by hand with memcached.

Memcached Indices

Okay, it's not exactly like our memcached example. There are some striking differences.

First, a database index is sparse, in the sense that it doesn't contain leaf nodes for every potential index criteria value, just the concrete values in the table. Our memcached indexing is even sparser: so far it only contains leaf nodes (lists of ID's) for each index scan we've run. If we've only asked for Items in Categories 2 and 5, memcached will only contain nodes for Item:CategoryID=2 and Item:CategoryID=5.

Second, a database index is a full tree, with a root node. What we've done so far in memcached is only store leaf nodes. This will bite us in a moment.

Third, a database index is naturally transactional. When you move Item 3333 from Category 5 to 6, you might execute the SQL statement "UPDATE Item SET CategoryID = 6 WHERE ID = 3333;". The database, being the sole arbiter of Truth, can lock the row, read the old value, lock the index leaf nodes, remove the row from the old leaf node and add it to the new one, write the new value to the table, and unlock, all within a fully-logged transaction (although it can be a lot more complicated than that with ranged index locks, log schemes, and page reallocation schemes). Our memcached implementation so far can't do any of that.

Combining the above differences, we get...a real mess. Specifically, we have to find a way to do all of those index updates.

One way would be to invalidate everything. Call flush_all and be done with it. This can work (poorly) if you've partitioned your data well over multiple memcached servers, but Woe Unto Thee if you're storing, say, cached HTML on the same node.

Another, narrower, solution would be to try to delete all known cached lists for the given criteria. One way to do that would be to attempt to maintain the whole index tree in memcached, not just the leaf nodes. This turns out to be a thorny problem because of the transitory nature of memcached data--what happens when you lose an intermediate node in the index tree? or the root node? You'll fail to update the now-unreachable leaf nodes, but clients will still fetch them and get the stale results.

An even narrower solution would be to try to update just two index leaf nodes for each updated object. For example, if we move Item 3333 from Category 5 to 6, we could try to remove the Item from leaf node "5" and add it to leaf node "6". This can work, but requires keeping the old values around for all indexed columns in your application layer, which a lot of modern DALs and ORMs don't do by default.

I was stuck at this point for a couple of days, until I had an epiphany:

Add eagerly, remove lazily

Recall if you will what I said above about the transactional nature of database indices. The DB can remove the row from the old index node(s) and add it to the new index node(s) in one step. We already know we can't do that with memcached, since the "get" and "set" operations aren't atomic anyway (although check-and-set can work around this; see 'cas' in protocol.txt.)

Transactions exist to maintain data integrity; to move The Data from one valid state to another valid state. But don't confuse the technique for the phenomena that technique is trying to prevent. In the case of indices, we use transactions on indices to avoid both 1) reading an object that does not meet the index criteria for the given node, or 2) failing to read an object that does meet the criteria. Databases avoid both scenarios by adding and removing rows from the index atomically.

When dealing with index nodes in memcached, however, the best approach is to separate the two phenomena by adding eagerly and removing lazily:

  • Add eagerly: when an object (row) is flushed from the session, add it to any indices for that class (table). That is, whenever you "UPDATE Item SET CategoryID = 6 WHERE ID = 3333", you also cache the object (you were doing that already anyway), but you also append the ID to the list stored at Item:CategoryID=6.
  • Remove lazily: when you UPDATE you do not try to remove the item from the old index. Instead, you let the next client that reads Item:CategoryID=5 do that when it iterates over the objects in the list. If any objects no longer meet the index node criteria (CategoryID=5), they are removed from the list, and the client sets the revised index node back into the cache.

Shortcomings

There are several:

  1. A client might recall the index node but not the objects in its list. For example, a web page might just need hyperlinks to each object, which it tries to generate without recalling the entire set of objects. This technique succeeds in the fact that it doesn't fetch more data than the app asked for, but fails if stale objects are still listed in the index node. The only solutions I can see are to either disallow this sort of index read or to fall back on sane timeouts again.
  2. This doesn't address order, limit, or offset explicitly. Simple order queries should be no problem, but if you want the top N items you must either include the 'order by' columns in your index, or read in all items in the index, filter out the stale objects, sort them, and perform the limit/offset yourself.
  3. Functionally decomposing SQL (or whatever QL your ORM might use) into potential index criteria is not a simple task. If you're using a single table.column = value expression, it maps well, but using joins, boolean operators (like and, or, not) or other arithmetic operators makes it difficult. But then, the major databases have the same issues.
  4. Whatever you all leave in the comments.

Conclusion

I think this is worth a shot. I'm adding it to Dejavu in a branch, but it's complicated enough that it may not be done for a bit. Comments and suggestions on the approach welcome.

2 comments

Comment from: chenz [Visitor]

hi, could you tell me how to config cherrypy to list all the files in a dir? like http://download.cherrypy.org/

I just can't find it in the FAQ, this should be very simple problem, do not know where the magic is

12/13/07 @ 02:49
Comment from: fumanchu [Member] Email

chenz, ask on the mailing list.

12/13/07 @ 13:11

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.

Please enter the phrase "I am a real human." in the textbox above.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)
September 2017
Sun Mon Tue Wed Thu Fri Sat
 << <   > >>
          1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30

Search

The requested Blog doesn't exist any more!

XML Feeds

powered by b2evolution