Hobby-hacking Eric

2009-10-08

darcs hashed-storage work merged (woo!)

The following is a copy of my recent post to the darcs-users mailing list.

Hi everybody,

So you may have noticed me saying this in a couple of recent threads. Petr Ročkai's hashed-storage work from his 2009 Google Summer of Code project has been merged!

I thought I would take a few moments to give everybody an overview of how this work benefits us, and where we'll be going in the future.

In a nutshell

What does this mean for you? Faster repository-local operations.

Hashed format repositories (with darcs-1 and darcs-2 patches alike) should now be faster to use on a daily basis. We saw the very beginnings of this work in Darcs 2.3.0 with a faster darcs whatsnew. Now these speed improvements cover all repository-local operations.

The next Darcs beta is a couple of months away, but before that, I would like to encourage you to try this out for yourself:

darcs get --lazy http://darcs.net
cd darcs.net
cabal install

For best results, please run darcs optimize --upgrade followed by darcs optimize --pristine. Pay attention over the next couple of weeks when you try a record, amend, revert, unrecord. If we've done our work right, there should be nothing to see. Darcs should be less noticeable, with fewer "Synchronizing pristine" messages and a faster return to the command prompt. We think you'll like it. But please get back to us. Is Darcs faster for you?

If you're particularly interested, I will step through these changes in greater detail at the end of this message. Meanwhile, I would like to step back a little and take stock of how these improvements fit in to the bigger picture.

The road ahead

The hashed storage work is a big step forward and definitely a cause for celebration. I think it is useful to reflect on this progress and consider how it fits in with our progress since darcs 1.0.9:

  • ssh connection sharing (darcs transfer mode)
  • HTTP pipelining
  • lazy repositories
  • the global cache

and now

  • index-based diffing
  • hashed-storage efficiency

We cannot promise that Darcs will magically become fast overnight. But what we can and will do is continue chipping away at it, solving problems one at a time; release by release, a little bit better, a little bit faster every time until one day we can look back and marvel at all the progress we've made.

So Petr's work makes Darcs easier to live with on a day-to-day basis. But that's not enough. Now we need to turn our attention to that crucial first impression; what happens when people try Darcs out for the first time is that they darcs get a repository they want and... then... they... wait...

This is embarrassing, but we can fix it. In fact, we already have started working on the problem. The next version of hashed-storage will likely introduce a notion of "packs" in which the many often very small files that Darcs keeps track of will be concatenated into more substantial "packs" that compress better and reduce the ill effects of latency. My hope is that we will be able to complete the packs work by Darcs 2.5.

There's a lot more progress to be made: smarter patch representations, tuning for large patches, file-to-patch caching for long histories. And that's just performance! For more details about our performance work, please have a look at

http://tinyurl.com/darcs-performance2

If you could do anything to help, benchmark, profile, anything at all, please let us know :-)

The fight continues.

Thank-you!

Petr and Ganesh deserve a huge round of applause. Petr, thanks for thinking up this work, getting it done and pushing it through. Ganesh, thanks for an extremely thorough and thoughtful review. The two of you, thanks for holding on, for tenacious cooperation in the face of adversity.

Thanks also to all the wider Darcs community for all your support, comments, patch reviews.

I'm looking forward to seeing you at the upcoming Darcs hacking sprint. The sprint will take place in Vienna, Austria on the weekend of 14-15 November. Everybody, especially Darcs and Haskell newbies, is welcome to join in. Details on http://wiki.darcs.net/Sprints/2009-11

And if I may take a paragraph to mention this, Darcs needs your support. Every little counts, if you can send patches, review patches, tweak documentation, profile, benchmark, submit bug reports. Barring that, you could also make a contribution to our travel fund via the Software Freedom Conservancy. See http://darcs.net/donations.html for details.

Thanks everybody and enjoy!

Eric

Changes in detail

  • Darcs uses an "index" file to compute working directory and pristine cache diffs. This avoids timestamps going out of synch when you have multiple local branches, which saves a huge and needless slowdown.

  • Hashed storage is more efficient in general. Even if you already have perfect timestamps, the new optimisations should make Darcs faster in general.

  • The new 'darcs optimize --pristine' reduces spurious mismatches on directories.

  • Darcs no longer requires a one second sleep after applying patches.