tail -f judofyr

2013-07-22

cat 09:17:43 UTC
Improving RubyGems

RubyGems, the package manager for Ruby, has a few issues:

1. Having multiple versions of Ruby installed (including JRuby or
   Rubinius) is common, but there's no way to share gems across
   versions. This means you'll have to re-download gems as you switch
   between environments. Minor issue.

2. Requiring any library will load the gemspec for EVERY gem installed:

   $ ls /Users/magnus/.gem/ruby/2.0.0/gems | wc -l
   203
   $ ruby -rbenchmark -e'puts Benchmark.measure { Gem::Specification.to_a }'
   0.130000   0.010000   0.140000 (  0.139032)

   That's right: 130ms is spent loading metadata. And it gets even
   worse with more gems installed. This is noticeable from the command
   line.

There are some other issues that are now solved thanks to Bundler.
However, Bundler has a few issues of its own:

1. Bundler has poor test coverage and a resolver that no one really
   understands (see the toy sketch after this list):

   "it uses a recursive backtracking algorithm to find the dependencies
   and exact details are locked in brains of Yehuda Katz" - gnufied

2. Resolving projects can be slow as Bundler needs to make many network
   connections:

   <whitequark> bundler kills me on this project 
   <whitequark> $ time bundle 
   <whitequark> bla bla SIX MINUTES 
   <whitequark> this project uses 225 gems 
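
For the curious, a toy recursive backtracking resolver (a sketch of
the general idea only, nothing like Bundler's actual code) looks
something like this:

require 'rubygems'  # for Gem::Requirement and Gem::Version

# index:   gem name => { version => [[dep_name, requirement], ...] },
#          newest version listed first.
# demands: a list of [name, requirement] pairs left to satisfy.
def resolve(index, demands, chosen = {})
  return chosen if demands.empty?

  (name, req), *rest = demands
  requirement = Gem::Requirement.new(req)

  if chosen.key?(name)
    # We already picked a version; it must satisfy this requirement too.
    return requirement.satisfied_by?(Gem::Version.new(chosen[name])) &&
           resolve(index, rest, chosen)
  end

  index.fetch(name, {}).each do |version, dependencies|
    next unless requirement.satisfied_by?(Gem::Version.new(version))
    # Try this version; if the recursion fails we fall through to the
    # next candidate -- that's the backtracking.
    result = resolve(index, rest + dependencies, chosen.merge(name => version))
    return result if result
  end

  false
end

index = {
  "rack"      => { "1.5.2" => [], "1.4.5" => [] },
  "rack-test" => { "0.6.2" => [["rack", ">= 1.0"]] },
}
p resolve(index, [["rack-test", ">= 0"]])
# => {"rack-test"=>"0.6.2", "rack"=>"1.5.2"}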

---

What can we do to improve this situation? How would we re-build
RubyGems today? I'm going to focus on one thing: improving the
performance of locally installed gems.

The performance issue with locally installed gems today is that RubyGems
will often load the gemspecs of all installed gems. More specifically,
Gem::Specification extends Enumerable so it's very tempting to write
code like:

Gem::Specification.find { |gem| gem.name == "..." }

RubyGems has to read every file into memory, invoke the Ruby parser,
and build objects (Gem::Specification, Gem::Dependency, Gem::Requirement
and so on). Depending on the number of gems installed this can take
several hundred milliseconds. If we want to speed up startup time, we
need to stop loading every gemspec.

So what are our requirements? What needs to be fast and what can be
slow? I've come up with three use cases for locally installed gems:

1. Binaries. You type `gem install heroku` and get the `heroku` binary.

2. Projects. You use Bundler to ensure that you use the correct
   versions.

3. Tiny scripts. You type `require "library"` to get the latest
   version.

(1) and (2) must load fast. 130ms is very noticeable for command line
tools, and nobody likes to wait for their tests to load. So in these
two cases we can't just naively "find all gems installed on this
system". (3) isn't that important: if startup speed matters you can
always add a Gemfile later on.

Luckily for us, it's pretty simple to make (1) and (2) fast:

1. We can create shims for binaries that contain a flattened list of
   all dependencies (recursively). This can be done at install-time
   (see the sketch after this list).

2. If you have a Gemfile.lock you don't need to load gemspecs for all
   the gems installed on the system, but only for the gems present in
   the lockfile.

The concept is the same: move dependency resolution from run-time to
install-time. Waiting 10s while installing is nothing; waiting 10s
every time you run a test is hell.
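
To make the shim idea concrete, here's a minimal sketch of what an
install-time generated shim for `heroku` could look like (the gem list
and versions are made up for the example):

#!/usr/bin/env ruby
require 'rubygems'

# Hypothetical shim, generated at install-time. The flattened
# dependency list is baked in, so no gemspecs are loaded at run-time.
GEMS = [
  ["heroku", "2.39.4"],
  ["netrc",  "0.7.7"],
  ["excon",  "0.25.1"],
]

GEMS.each do |name, version|
  $LOAD_PATH.unshift File.join(Gem.dir, "gems", "#{name}-#{version}", "lib")
end

load File.join(Gem.dir, "gems", "heroku-2.39.4", "bin", "heroku")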

cat 12:46:31 UTC
Improving RubyGems

Let's focus on another problem: How can we speed up Bundler's resolver?

The problem here is the dependency data. The dependency graph of
RubyGems is quite big, so right now Bundler takes an incremental
approach: it tries to resolve as much as possible using locally
installed gems, and when that fails it starts downloading the
dependency graph for the related gems.

There are two (obvious) ways to improve this: improve the algorithm, or
figure out a way to store the complete dependency graph efficiently.
Let's focus on the latter.

First step: Let's download the dependency graph locally:

require 'rubygems'
require 'open-uri'
require 'fileutils'

sleep_time = 5
num = 250 # the API supports max 250 gems per request
url = "http://rubygems.org/api/v1/dependencies?gems=%s"

# Make sure the output directory exists before we write to it.
FileUtils.mkdir_p("data/full")
specs = Marshal.load(open("specs.4.8"))

puts "Loading specs..."
gems = specs.map { |n, v, p| n }.uniq.sort
puts "Fetching dependency data for #{gems.size} gems"

i = 0
until gems.empty?
  slice = gems.slice!(0, num)
  file = "data/full/deps.#{i}"
  i += 1
  next if File.exist?(file)

  puts "Downloading..."
  data = open(url % slice.join(','))

  puts "Parsing..."
  deps = Marshal.load(data)

  puts "Storing..."
  res = {}

  deps.each do |dep|
    deplist = (res[dep[:name]] ||= [])
    deplist << [dep[:number], dep[:dependencies].sort]
  end

  File.open(file, "wb") do |f|
    f << Marshal.dump(res)
  end

  puts "#{gems.size} left..."
  sleep(sleep_time) unless gems.empty?
end

This script depends on `specs.4.8` (a list of every gem available),
which you can fetch from https://rubygems.org/specs.4.8.gz. It then
creates a bunch of files under data/full that contain Marshalled data
about the versions.

So how big is it?

  $ du -sh data/full
  27M     data/full

Wow, that's surprisingly small. Who said the dependency graph was big?

We can do even better though. Most versions of the same gem have the
same dependencies, yet this stores the same dependency array multiple
times. We can compact this:

# The compact format: one array of unique dependency lists, plus a
# hash mapping gem name => [[version, index into that array], ...].
res = {
  :dependencies => [],
  :gems => {},
}

at_exit do
  puts "Done. Writing..."
  File.open("compact.4.8", "wb") do |f|
    f << Marshal.dump(res)
  end
end

deps = res[:dependencies]
gems = res[:gems]

# Maps a dependency list to its index in `deps`. The first time we see
# a new dependency list, the default block appends it to `deps` and
# records its index.
deps_mapping = Hash.new do |h, k|
  deps << k
  h[k] = deps_mapping.size
end

# Add the empty dependency list first so it gets index == 0
deps_mapping[[]]

# Read each downloaded chunk and rewrite its dependency lists as
# indexes into `deps`.
i = 0
while true
  p i
  file = "data/full/deps.#{i}"
  i += 1
  break unless File.exist?(file)

  data = Marshal.load(File.binread(file))

  data.each do |name, versions|
    gems[name] = versions.map do |number, dependencies|
      [number, deps_mapping[dependencies]]
    end
  end
end

We store dependencies in an array, and refer to a single dependency
list by its index into this array. How small is this?

  $ du -sh compact.4.8*
  13M     compact.4.8
  2.3M    compact.4.8.gz

Wow. 2.3MB. That's nothing! With this file Bundler doesn't have to
contact any server for resolving.

However, we don't want to re-download a 2.3MB file every time you
resolve dependencies. Instead we should have a daily/weekly/monthly
file that contains every single version, and every 15 minutes we
generate a file that only contains the versions released since the
previous full file. Bundler will then first try to resolve using the
full file (which has been cached for a day/week/month), and if that
fails it downloads the patch file with the newest information.
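
To make the format concrete, here's a minimal sketch of how a resolver
could read the compact file; `versions_of` is a hypothetical helper,
not part of Bundler or RubyGems:

# Load the compact graph once at startup.
data = Marshal.load(File.binread("compact.4.8"))
DEPS = data[:dependencies]
GEMS = data[:gems]

# All versions of a gem, with the dependency indexes expanded back
# into full dependency lists.
def versions_of(name)
  (GEMS[name] || []).map do |number, dep_index|
    [number, DEPS[dep_index]]
  end
end

versions_of("rack").each do |number, dependencies|
  puts "rack #{number}: #{dependencies.inspect}"
end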

cat 13:30:13 UTC
Improving RubyGems

<yorickpeterse> I'd say it would be much better to simply not load
Gem::Specification when doing require() calls.

This is another good point: If we already have a Gemfile.lock (or a
binary shim) and don't need to do dependency resolution, why should we
load the gemspec when requiring files? Can't we figure out the path to
the library files when we already know the gem name, version and
platform?

One difficulty is that it's the gem that decides where the library
files are located. By default they are under "lib/", but there are some
gems that put them in other places (most likely: some micro gems place
them under ".").

But yes, if we store the library path somewhere else, it's perfectly
possible to avoid loading the gemspec at all.
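
As a minimal sketch, assuming the require path ("lib" for most gems)
has been recorded at install-time (e.g. in the lockfile or a binary
shim), activating a gem without touching its gemspec could look like
this; `activate` is a hypothetical helper, not a RubyGems API:

require 'rubygems'

def activate(name, version, require_path = "lib")
  # The path is computed from data we already know; no gemspec is
  # parsed or loaded.
  $LOAD_PATH.unshift File.join(Gem.dir, "gems", "#{name}-#{version}", require_path)
end

activate("rack", "1.5.2")
require "rack"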

cat 14:55:49 UTC
Improving RubyGems

One update regarding the compact dependency graph: I forgot about
platforms. RubyGems supports having separate gems for specific
platforms, so the same version can be released for both plain Ruby and
precompiled for Windows. These different platforms should be treated
as completely different "versions".

So I tweaked the downloader script so that it stores the platform as
well. Luckily, this is a pretty small amount of data, and the final
compact.4.8.gz ended up at 2.3MB as well.
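
Roughly, the tweak amounts to something like this in the downloader's
storing step (the dependency API exposes a :platform field; "ruby"
means a plain, non-precompiled gem):

deps.each do |dep|
  deplist = (res[dep[:name]] ||= [])
  # Store the platform next to the version number.
  deplist << [dep[:number], dep[:platform], dep[:dependencies].sort]
end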

Summary: 2.3MB is required for storing the whole dependency graph (with
platform information) as a gzipped, Marshalled hash.

cat 18:33:54 UTC
Improving RubyGems

So I hacked Bundler to use the compressed dependency graph and
got some interesting results.

Before:

  $ time bundle install
  ...
  Your bundle is complete!
  
  real    0m23.705s
  user    0m8.615s
  sys     0m0.164s

(NOTE: I disabled the actual installer. The timings here are only for
resolving and updating Git repos).

After:

  $ time bundle install
  ...
  Your bundle is complete!
  
  real    0m14.39s
  user    0m13.10s
  sys     0m0.31s

Not very impressive. But this is quite a small Gemfile. whitequark
published a huge Gemfile that he's been using in one of his apps:
https://gist.github.com/whitequark/cfe5e5f70ef22a28cab5. It's quite
marvelous: over 100 gems specified in the Gemfile itself, and over 200
gems resolved as a result.

With regular Bundler this takes SEVERAL MINUTES to resolve. It just goes
on forever. Turns out that there's a bug in Bundler that causes big
Gemfiles to fetch dependencies using the old "API" which involves one
request per gem. I've filed a bug in Bundler so hopefully this will be
fixed soon: https://github.com/bundler/bundler/issues/2556

I hacked my way around it (adding an "if true") and managed to resolve
dependencies using the proper API:

  $ time bundle install
  ...
  Your bundle is complete!

  real    0m30.686s
  user    0m9.004s
  sys     0m0.592s

This is plain Bundler (no compressed dependency graph) with no
installation. Quite impressive actually.

With a compressed cached dependency graph (and thus no network activity):

  $ time bundle install
  ...
  Your bundle is complete!

  real    0m16.513s
  user    0m10.034s
  sys     0m0.537s

Disclaimer: This involves a cached file; I've not included the
downloading in the timings. As discussed earlier, a real implementation
would need to fetch a patch file with an updated dependency graph in
order to resolve recently released gems.

---

Conclusions:

1. It's faster with a locally cached dependency graph.
2. But it's not that much faster. Fetching dependency data doesn't
   appear to be the bottleneck.

However, the current resolver has quite a few bugs. Dependency
resolution is quite a difficult problem, and the algorithm currently in
use isn't bug-free. With a locally cached dependency graph we could use
a "dumber" algorithm that makes more "mistakes", but is easier to
understand and improve.

And this dependency graph is also easier to set up for people running
their own gem servers. The dependency API requires a Sinatra app with a
database; the compressed dependency graph can easily be generated when
you generate the other indexes.

Final conclusion: I think Bundler needs a locally cached dependency graph.

cat 20:56:59 UTC
Improving RubyGems

One clarification: I misunderstood how the Bundler resolver works. I
thought it mixed "fetching dependency data" with "resolving versions",
and that it was optimized to fetch as little dependency data as
possible, making it a "smarter" algorithm.

As gnufied mentioned on Twitter: "fetching dependency data" is always
separate from "resolving versions". After the first step is done,
Bundler does indeed have everything required to resolve dependencies.
Having the full
dependency graph cached locally doesn't mean we can simplify the
algorithm.

Even though we might need to improve the resolver algorithm through
other means, I still think a cached dependency graph would improve
Bundler.