

Date: Saturday, 02 Nov 2013 07:00

I shipped GitHub's first user-facing Go app a month ago: the Releases API upload endpoint. It's a really simple, low traffic service to dip our toes in the Go waters. Before I could even think about shipping it though, I had to answer these questions:

  • How can I deploy a Go app?
  • Will it be fast enough?
  • Will I have any visibility into it?

The first two questions are simple enough. I worked with some Ops people on getting Go support in our Boxen and Puppet recipes. Considering how much time this app would spend in network requests, I knew that raw execution speed wasn't going to be a factor. To help answer question 3, I wrote grohl, a combination logging, error reporting, and metrics library.

import "github.com/technoweenie/grohl"

A few months ago, we started using the scrolls Ruby gem for logging on GitHub.com. It's a simple logger that writes out key/value logs:

app=myapp deploy=production fn=trap signal=TERM at=exit status=0

Logs are then indexed, giving us the ability to search them for the first time. The next thing we did was add a unique X-GitHub-Request-Id header to every API request. This same request ID is sent down to internal systems, exception reporters, and auditors. We can use it to trace user problems across the entire system.

I knew my Go app had to be tied into the same systems to give me visibility: our exception tracker, statsd to record metrics into Graphite, and our log index. I wrote grohl to be the single source of truth for the app. Its default behavior is to just log everything, with the expectation that something downstream will process the lines. Relevant lines are indexed, metrics are graphed, and exceptions are reported.

At GitHub, we're not quite there yet. So, grohl exposes both an error reporting interface and a statter interface (designed to work with g2s). Maybe you want to push metrics directly to statsd, or you want to push errors to a custom HTTP endpoint. It's also nice that I can double-check my app's metrics and error reporting without having to spin up external services. They just show up in the development log like anything else.
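For a rough idea of what that looks like from app code, here's a minimal sketch. Treat the specific names (grohl.Data, grohl.Log, grohl.Report) as assumptions based on the key/value format above rather than a verified copy of grohl's documented API:

package main

import (
  "errors"

  "github.com/technoweenie/grohl"
)

func main() {
  // Writes a scrolls-style key/value line, e.g. "fn=upload at=start".
  // (Assumed API: grohl.Data as a map of keys/values, grohl.Log to emit it.)
  grohl.Log(grohl.Data{"fn": "upload", "at": "start"})

  // Errors flow through the same library, so they land in the same log
  // stream and whatever error reporter happens to be plugged in.
  err := errors.New("upload failed")
  grohl.Report(err, grohl.Data{"fn": "upload"})

  // Metrics follow the same pattern: everything is logged by default, and a
  // statter (like g2s) can be swapped in later without changing call sites.
}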

Comments are on reddit.

Author: "rick"
Date: Monday, 21 Oct 2013 07:00

Justinas Stankevičius wrote a post about writing HTTP middleware in Go. Having seen how Rack changed the Ruby web framework landscape, I'm glad Go has simple HTTP server interfaces baked in.

GitHub itself runs as a set of about 15 Rack middleware (depending on the exact environment that it boots in). They are set up in a nice declarative format:

# GitHub app middleware pipeline
use InvalidCookieDropper
use Rack::ContentTypeCleaner
use Rails::Rack::Static unless %w[staging production].include?(Rails.env)

# Enable Rack middleware for capturing (or generating) request id's
use Rack::RequestId

However, Rack actually assembles the objects like this:

InvalidCookieDropper.new(
  Rack::ContentTypeCleaner.new(
    Rack::RequestId.new(app)
  )
)

This wraps every request in a nested call stack, which gets exposed in any stack traces:

lib/rack/request_id.rb:20:in `call'
lib/rack/content_type_cleaner.rb:11:in `call'
lib/rack/invalid_cookie_dropper.rb:24:in `call'
lib/github/timer.rb:47:in `block in call'

go-httppipe uses an approach that simply loops through a slice of http.Handler objects, and returns after one of them calls WriteHeader().

pipe := httppipe.New(
  invalidcookiedropper.New(),
  contenttypecleaner.New(),
  requestid.New(),
  myapp.New(),
)

http.Handle("/", pipe)
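The pipe itself doesn't need much code. Here's a minimal sketch of the idea (not the actual go-httppipe source): wrap the ResponseWriter so the pipe can tell when a handler has started a response, and stop looping at that point.

package httppipe

import "net/http"

// responseRecorder wraps an http.ResponseWriter and records whether a
// handler has started writing a response.
type responseRecorder struct {
  http.ResponseWriter
  wrote bool
}

func (w *responseRecorder) WriteHeader(status int) {
  w.wrote = true
  w.ResponseWriter.WriteHeader(status)
}

func (w *responseRecorder) Write(b []byte) (int, error) {
  w.wrote = true // an implicit 200 counts as a response too
  return w.ResponseWriter.Write(b)
}

// Pipe runs each handler in order until one of them writes a response.
type Pipe struct {
  Handlers []http.Handler
}

func New(handlers ...http.Handler) *Pipe {
  return &Pipe{handlers}
}

func (p *Pipe) ServeHTTP(w http.ResponseWriter, r *http.Request) {
  rec := &responseRecorder{ResponseWriter: w}
  for _, h := range p.Handlers {
    h.ServeHTTP(rec, r)
    if rec.wrote {
      return
    }
  }
  // Nothing wrote a response; falling back to a 404 is one reasonable choice.
  http.NotFound(w, r)
}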

This is how http.StripPrefix currently wraps another handler:

func StripPrefix(prefix string, h Handler) Handler {
  if prefix == "" {
    return h
  }
  return HandlerFunc(func(w ResponseWriter, r *Request) {
    if p := strings.TrimPrefix(r.URL.Path, prefix); len(p) < len(r.URL.Path) {
      r.URL.Path = p
      h.ServeHTTP(w, r)
    } else {
      NotFound(w, r)
    }
  })
}

It could be rewritten like this:

type StripPrefixHandler struct {
  Prefix string
}

func (h *StripPrefixHandler) ServeHTTP(w ResponseWriter, r *Request) {
  if h.Prefix == "" {
    // Nothing to strip; let the next handler in the pipe serve the request.
    return
  }

  p := strings.TrimPrefix(r.URL.Path, h.Prefix)
  if len(p) < len(r.URL.Path) {
    // Prefix matched: rewrite the path and fall through to the next handler.
    r.URL.Path = p
  } else {
    NotFound(w, r)
  }
}

func StripPrefix(prefix string) Handler {
  return &StripPrefixHandler{prefix}
}

Notice that we don't have to worry about passing the response writer and request to the inner handler anymore.

Author: "rick"
Date: Thursday, 29 Aug 2013 07:00

I've been toying with Go off and on for the last few months. I'm finally at a point where I'm using it in a real project at GitHub, so I've been exploring it in more detail. Yesterday I saw some duplicated code that could benefit from class inheritance. This isn't Ruby, so I eventually figured out that Go calls this "embedding." This is something I missed from my first run through the Effective Go book.

Let's start with a basic struct that serves as the super class.

type SuperStruct struct {
  PublicField string
  privateField string
}

func (s *SuperStruct) Foo() {
  fmt.Println(s.PublicField, s.privateField)
}

It's easy to tell what Foo() will do:

func main() {
  sup := &SuperStruct{"public", "private"}
  sup.Foo()
  // prints "public private\n"
}

What happens when we embed SuperStruct into SubStruct?

type SubStruct struct {
  CustomField string
  
  // Notice that we don't bother naming the embedded struct field.
  *SuperStruct
}

At this point, SuperStruct's two fields (PublicField and privateField) and method (Foo()) are available in SubStruct. SubStruct is initialized a little differently though.

func main() {
  sup := &SuperStruct{"public", "private"}
  sub := &SubStruct{"custom", sup}
  
  // you can also initialize with specific field names:
  sub = &SubStruct{CustomField: "custom", SuperStruct: sup}
}

From here, we can access the SuperStruct fields and methods as if they were defined in SubStruct.

func main() {
  sup := &SuperStruct{"public", "private"}
  sub := &SubStruct{"custom", sup}
  sub.Foo()
  // prints "public private\n"
}

We can also override the behavior of an embedded method by defining it on SubStruct. The inner SuperStruct is still reachable through its type name if we need the original.

func (s *SubStruct) Foo() {
  fmt.Println(s.CustomField, s.PublicField)
}

func main() {
  sup := &SuperStruct{"public", "private"}
  sub := &SubStruct{"custom", sup}
  sub.Foo()
  // prints "custom public\n"
}
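And if the override still needs the original behavior, it can delegate to the embedded struct explicitly. Here's a variation of the Foo override above (shown instead of, not alongside, the earlier one):

func (s *SubStruct) Foo() {
  fmt.Println(s.CustomField)

  // Delegate to the embedded implementation through its type name.
  s.SuperStruct.Foo()
  // prints "custom\npublic private\n"
}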
Author: "rick"
Date: Monday, 19 Mar 2012 07:00

At GitHub, we've been going over various policies and patterns we use to ship features. One of the specific things is how we deal with mass assignment issues. There are 3 main ways we've handled it in the past:

  • Add ActiveRecord::Base.attr_accessible to whitelist the attributes we can set. This is a great safety net, but leaves the controller looking unsafe:

def create
  @post = Post.create params[:post]
end

  • Slice the incoming params hash in the controller, so only the attributes you expect ever reach the model:

def create
  @post = Post.create post_hash
end

def post_hash
  params[:post].slice :title, :body
end

  • You can wrap access to your data model around another abstraction layer. You can go with something completely custom, or use something like Django Forms as another approach.

Having a common pattern is a great idea, as are the other organizational patterns in use (testing, code review, etc.). But we felt like we needed something that would force compliance with safe handling of user input in web controllers. Something that works with what we're already doing, but can't be thwarted by someone writing lazy code. Keep in mind, this person may be someone from the past who already shipped their code long before common patterns were in place.

The TaintedHash is what we came up with. It's a simple proxy around a protected inner Hash that only exposes keys requested by name. If you're going to be passing the hash into anything that iterates through its values, you'll have to tell it which keys to expose:

# You can set properties manually:
Post.new :title => params[:post][:title]

# You can still slice
hash = params[:post].slice :title
Post.new(hash)

# You can't do this anymore:
Post.new params[:post]

# ... unless you tell it to expose some keys
params[:post].expose(:title)
Post.new params[:post]
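For illustration, a stripped-down sketch of the proxy idea might look like this (this is not GitHub's actual TaintedHash implementation, just the shape of it):

class TaintedHash
  def initialize(hash = {})
    @original_hash = hash
    @exposed_keys  = []
  end

  # Reading a single key by name is always allowed.
  def [](key)
    @original_hash[key.to_s]
  end

  # Mark keys as safe for bulk access.
  def expose(*keys)
    @exposed_keys.concat keys.map(&:to_s)
    self
  end

  # Anything that iterates over the hash only ever sees exposed keys.
  def to_hash
    @exposed_keys.inject({}) { |hash, key| hash.update(key => @original_hash[key]) }
  end

  def slice(*keys)
    expose(*keys).to_hash
  end
end

params = TaintedHash.new('title' => 'a', 'body' => 'b', 'admin' => true)
params[:title]                        # => "a"
params.expose(:title, :body).to_hash  # => {"title" => "a", "body" => "b"}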

It's a tiny class with no dependencies that hooks into Rails 2.3 with a simple before filter:

def wrap_params_with_tainted_hash
  @_params = TaintedHash.new(@_params.to_hash)
end

It's meant to be very low level and simple. It does work well with existing ActiveRecord accessible attributes:

Post.new params[:post].expose(*Post.attr_accessible)

One other TaintedHash goal is that broken rules need to be easily called out by ack.

# #original_hash and #expose_all are probably easy to find through `ack`
Post.new params.original_hash['post']

Post.new params[:post].expose_all

Currently the only place we use #original_hash at all is to give the relevant params to the Rails url writer. If the right keys aren't exposed, Rails can't build our URLs since there are no exposed values to iterate through.

This has been active on GitHub for over a week. If it works out, we'll probably look at introducing it or something like it to our other various ruby apps that are running in production. The branch did expose areas that weren't tested well enough. To help in the conversion of the entire app, I added code to raise test exceptions in after filters if Hashes had any keys left over. In production, we simply logged any unexposed keys that were missed.

Author: "rick"
Date: Friday, 19 Aug 2011 07:00

So, Kyle and I discovered some interesting IE9 behavior. Redirect responses from DELETE requests are followed with another DELETE. How is this surprising?

Using more of the HTTP methods lets us keep the URLs cleaner. Web browsers don't understand PUT/PATCH/DELETE in forms, so a workaround was needed. Rails looks at a _method parameter on POST requests to determine which HTTP verb the request should be treated as. The GData API supports this behavior through the X-HTTP-Method-Override header.

A typical Rails controller might look like this:

class WidgetsController < ApplicationController
  # DELETE /widgets/1
  def destroy
    @widget.destroy
    redirect_to '/widgets'
  end
end

If you don't like Rails, just close your eyes and think of your favorite web framework...

This action works great for a simple form in a browser. You click "Submit", it POSTs to the server, and then you end up back at the root page. Then, you can add some jQuery to spice things up for newer browsers. Progressive enhancement and all that.

$('.remove-widget').click(function() {
  $.del(this.href, function() {
    // celebrate, disable a spinner, etc
  })
  return false
})

This works great in all modern browsers, except IE9. We discovered that not only does IE9 send a real DELETE request, it also follows the redirect with another DELETE. If that redirect points to another resource, you can get a dangerous cascading effect.

RFC 2616 is not clear about what to do in this case, but strongly suggests that redirects are not automatically followed unless coming from a safe method.

If the 302 status code is received in response to a request other than GET or HEAD, the user agent MUST NOT automatically redirect the request unless it can be confirmed by the user, since this might change the conditions under which the request was issued.

Standard practice for browsers over the years is that redirects from POST requests are followed with a GET request. GET/HEAD requests are usually safe, so this seems like reasonable behavior. It's expected by web developers, and consistent across browsers.

I can't imagine that this behavior in IE9 was on purpose. It feels like an edge case that slipped through an if statement because "DELETE" != "POST". I've submitted feedback to the IE9 team about this issue. I'm curious to see what they say.

So, if your application might be responding to ajax requests with redirects, you should probably start sending back 200 OK...

Update: Eric Law on the IEInternals blog responded to one of Kyle's tweets. Apparently the behavior is correct according to HTTP 1.0, and IE has been following DELETE redirects since at least IE6.

Here's the breakdown of browser behavior when receiving a 302 redirect from a DELETE request:

IE 6-10 DELETE method is preserved
Chrome 13 Converts to GET
Firefox 6 Converts to GET
Safari 5.1 Converts to GET
Opera 11.5 Converts to GET

We didn't see the behavior in IE8, so we assumed it was new to IE9. At least, no one was sending in crazy bug reports from other browsers. This is another example of why developers hate dealing with IE. Kudos to the standards compliance, though!

Discuss this post on Hacker News.

Author: "rick"
NPM rocks
Date: Saturday, 16 Jul 2011 07:00

It is very easy to write and distribute packaged ruby libraries to the world. You slap a gemspec file on it, push it to rubygems.org, and anyone can get it. Tools like Jeweler (though I roll with rakegem these days) and gemcutter made it ridiculously easy to push ruby gems.

But, ruby gems are far from perfect (and no, I'm not talking about the drama around slimgems). Unfortunately, a lot of problems emerged over time for various reasons, and they will be tough to solve.

Node.js is an extremely young programming community that I've been following for well over a year now. It's been interesting to watch the node packaging landscape grow in that time, in contrast to my own early experiences with ruby gems. In that time, npm has emerged as the dominant node packaging system. Npm is written by Isaac Schlueter, based on his experience using Yinst at Yahoo.

The first, and biggest reason that I love npm is that it's not loaded at runtime. You never need to require('npm') for your library to function. This is in stark contrast to ruby libraries, where nearly every one of them requires rubygems.

Why is that? Say you're writing a sweet web service, and you want to require a database adapter:

require 'mysql'

class SweetApp
end

Boom: LoadError. Where is the mysql library? Oh, let's just use rubygems:

require 'rubygems'
require 'mysql'

class SweetApp
end

Now, check out your load path. Depending on your system, it should have an entry like this:

/Library/Ruby/Gems/1.8/gems/mysql-1.0.0/lib

Just to get mysql loaded, we had to load rubygems, and have it find the correct lib path for us. It does this every time your app boots up. After a while, your app likely has 30-100 (or 208, in the case of GitHub) gems loaded, each with its own entry in the load path. Every time you require something, ruby has to scan the whole list until it finds a match. God help you if you try to require something with a common name.

Why doesn't this happen with node? index.js. Let's look at a node port of my sweet web service:

mysql = require('mysql')

SweetApp = function() {}

Assuming you installed the mysql lib with npm, node will check for these files:

./node_modules/mysql.js
./node_modules/mysql/index.js

It can also look in node's load path (though it sounds like this may be removed in the future).

> require.paths
[ '/Users/technoweenie/.node_modules'
, '/Users/technoweenie/.node_libraries'
, '/usr/local/lib/node'
]

Loading packages from npm (or wherever) doesn't add to the load path. NPM just knows how to install packages so that node can easily find them.

Without some kind of index.rb file, ruby forces all ruby libraries to live in separate directories, each with its own load path entry. Or, you can combine the files together like the ruby standard lib:

$ ls /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8
English.rb           debug.rb             forwardable.rb       logger.rb
Env.rb               delegate.rb          ftools.rb            mailread.rb

Newer versions of node have even started cooperating with npm through the package.json file. Your package can provide a package.json file instead of index.js, and node will find it. The following file is from the mysql package, and instructs node to find the local lib/mysql.js when you require('mysql') from your app.

$ cat node_modules/mysql/package.json 
{ "name" : "mysql"
, "version": "0.9.1"
, "main" : "./lib/mysql"
...
}

Newer versions of node and npm also support the idea of cascading node_modules directories. If you have an app at /home/rick/app, node will check these directories for libraries:

/home/rick/app/node_modules
/home/rick/node_modules
/home/node_modules
/node_modules

This makes it easy to bundle libraries with your node apps. You can commit them directly and know they'll run wherever you push them (though this may not work for npm packages that require compilation). You can also set up a package.json file like this:

{
  "name" : "alambic"
, "version" : "0.0.1"
, "dependencies" :
  { "mysql" : "0.9"
  , "coffee-script" : "1.0"
  , "formidable": "1.0.2"
  , "underscore": "1.1.7"
  }
}

Running npm install will install these into the node_modules directory inside my app. I can run this once after updating code on our servers, and it's ready to rock. This feature is reminiscent of Bundler, but again, it doesn't rely on your app using npm at runtime.

You can comment on this through the HN discussion...

Author: "rick"
Date: Sunday, 19 Jun 2011 07:00

After Friday's ZeroMQ Pub Sub post, Jérôme Petazzoni taught me a bit more about ZeroMQ.


@technoweenie Once a PUB/SUB socket is connected, it IS reliable. Use socket identity to be sure not to lose any message on reconnects.

Wow, so even ZeroMQ PUB sockets queue messages to subscribers. It looks like they get buffered in memory. You can configure the ZMQ_HWM option (ZMQ::HWM in ruby) to limit how many messages will be buffered. You can also set the ZMQ_SWAP option to set the size of an on-disk swap for messages that cross the high water mark.
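In the ruby bindings that looks something like this (ZMQ::HWM is mentioned above; I'm assuming ZMQ::SWAP follows the same naming pattern, so double-check the constants against your version of the zmq gem):

require 'zmq'

context = ZMQ::Context.new
pub     = context.socket ZMQ::PUB

# Keep at most 10,000 messages buffered in memory per subscriber...
pub.setsockopt ZMQ::HWM, 10_000

# ...then spill up to 64MB of overflow to an on-disk swap.
pub.setsockopt ZMQ::SWAP, 64 * 1024 * 1024

pub.bind 'tcp://*:5555'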

Armed with this bit of knowledge, I updated the publisher script to set an identity of channel-username:

context = ZMQ::Context.new
chan    = ARGV[0]
user    = ARGV[1]
pub     = context.socket ZMQ::PUB
pub.setsockopt ZMQ::IDENTITY, "#{chan}-#{user}"

pub.bind 'tcp://*:5555'

To really highlight reliable pub/sub, I wrote a custom publisher script that just pings every second.

require 'zmq'
context = ZMQ::Context.new
pub = context.socket ZMQ::PUB
pub.setsockopt ZMQ::IDENTITY, 'ping-pinger'
pub.bind 'tcp://*:5555'

i=0
loop do
  pub.send "ping pinger #{i+=1}" ; sleep 1
end

Updating the subscriber should've been just as simple, but the while statement didn't allow for good error handling:

while msg = STDIN.gets
  msg.strip!
  pub.send "#{chan} #{user} #{msg}"
end

Any interruption in the process would lose a single message. I instead used a method:

def process(line = nil)
  line ||= @socket.recv
  chan, user, msg = line.split ' ', 3
  puts "##{chan} [#{user}]: #{msg}"
  true
rescue SignalException
  process(line) if line
  false
end

This way a signal doesn't interrupt the processing of a message. Here's what the loop looks like now:

subscriber = Subscriber.new ARGV[0]
subscriber.connect ZMQ::Context.new, 'tcp://127.0.0.1:5555'
subscriber.subscribe_to 'rubyonrails', 'ruby-lang', 'ping'

loop do
  unless subscriber.process
    subscriber.close
    puts "Quitting..."
    exit
  end
end

This is what the console output looks like:

#ping [pinger]: 21
#ping [pinger]: 22
^CQuitting...
ruby-1.9.2-p180 ~p/zcollab/pubsub git:(master) ✗$ ruby sub.rb abc
#ping [pinger]: 23
#ping [pinger]: 24

I still run into rare cases where the Interrupt is raised inside the socket.recv call. For a more advanced script, you could also try trapping signals to control how your script exits.

You can comment on this through the HN discussion...

Author: "rick"
Date: Friday, 17 Jun 2011 07:00

I read Nick Quaranto's blog post about Redis Pub Sub, and thought I'd port the examples to ZeroMQ to show how easy it is. As I've said in previous posts, ZeroMQ is a great networking library, and pub/sub is one of the patterns you can use.

Redis is amazing though. I'm not trying to say anything bad about Nick's approach (and Radish is really awesome). Why would you use ZeroMQ over Redis?

  • You want to do quick messaging between hosts, processes, or even threads.
  • You want to use a different transport besides TCP: multicast, in-process, inter-process. The code doesn't change (besides the bind/connect calls).
  • You want to take advantage of other ZeroMQ messaging patterns too (request/reply, push/pull, etc).
  • You don't want certain components to talk to the central Redis servers.
  • You don't want to deal with connection errors. ZeroMQ publishers and subscribers can start up in any order. They'll connect and reconnect behind the scenes.
  • ZeroMQ PUB sockets will buffer messages if a SUB socket drops and reconnects. Read more about reliable pub sub.

Why would you use Redis over ZeroMQ?

  • You only need pub/sub and you already have Redis. Fewer networking components is obviously simpler and better.

At GitHub, we use a lot of Redis, but we have one clear case where ZeroMQ would be better suited: our Service Hooks server. Since the code is open source, the server it runs on is completely isolated from everything else. We could set up another Redis server, but it's overkill just to enable message passing between the main GitHub app and the Services app. We currently use HTTP calls, but could just as easily use ZeroMQ.

Demo

I ported Nick's code to a simple ZeroMQ chat demo. It works the same: A user connects and publishes messages to a channel, and subscribed users receive them.

Publish

This uses the zmq gem to bind a PUB socket to port 5555. You can tweak this to play with some of the other network transports too, like multicast or inproc. I'm not using JSON in this example, though it is of course possible with ZeroMQ.

# pub.rb
require 'zmq'

context = ZMQ::Context.new
chan    = ARGV[0]
user    = ARGV[1]
pub     = context.socket ZMQ::PUB
pub.bind 'tcp://*:5555'

while msg = STDIN.gets
  msg.strip!
  pub.send "#{chan} #{user} #{msg}"
end

One slight difference here is that the channel is sent as part of the message. Redis lets you send the channel as a separate parameter, but ZeroMQ just includes it in the beginning of the message.

You can run the script the same way too:

$ ruby pub.rb rubyonrails technoweenie
Hello world

This sends a ZeroMQ message like this:

rubyonrails technoweenie Hello World

Subscribe

Now, let's write something to receive and display these published messages.

# sub.rb
require 'zmq'

context = ZMQ::Context.new
chans   = %w(rubyonrails ruby-lang)
sub     = context.socket ZMQ::SUB

sub.connect 'tcp://127.0.0.1:5555'
chans.each { |ch| sub.setsockopt ZMQ::SUBSCRIBE, ch }

while line = sub.recv
  chan, user, msg = line.split ' ', 3
  puts "##{chan} [#{user}]: #{msg}"
end

ZeroMQ is a c++ library built just for messaging, so it hides the complexities of receiving messages behind the blocking recv call. Therefore, you don't have to worry about setting up callbacks for message events or anything like that, unless you use an asynchronous ZeroMQ library (EventMachine, Node.js, etc).

This works exactly like the Redis example:

$ ruby pub.rb rubyonrails qrush
Whoa!
`rake routes` right?

$ ruby pub.rb rubyonrails turbage
How do I list routes?
Oh, duh. thanks bro.

$ ruby pub.rb ruby-lang qrush
I think it's Array#include? you really want.

$ ruby sub.rb
#rubyonrails - [qrush]: Whoa!
#rubyonrails - [turbage]: How do I list routes?
#ruby-lang - [qrush]: I think it's Array#include? you really want.
#rubyonrails - [qrush]: `rake routes` right?
#rubyonrails - [turbage]: Oh, duh. thanks bro.

Advanced Pub/Sub

sustrik on HN mentioned a whitepaper on forwarding subscriptions through the network. Check out the Design of PUB/SUB subsystem in ØMQ whitepaper for a look at a larger pub/sub architecture.

If this sounds interesting to you, check out Jakub Stastny's post: "Why Rubyists Should Care About Messaging". If you're hungry for more after that, the ZeroMQ Guide goes into way more detail. It's very well done, but might create an obsession around messaging :)

You can comment on this through the HN discussion...

Author: "rick"
Date: Tuesday, 07 Jun 2011 07:00

After reading the ZeroMQ guide several times, I really wanted to hack on a non-trivial app. Somehow I settled on a deceptively simple Dropbox clone. Though, I say "clone" not because I want to move my content off Dropbox, but because it's a simple way to describe the kind of system I'm attempting to build.

Here's the source to DropBear.

It's more of a gross hack, as my dev process had to be greatly accelerated to be ready for tonight's Riak meetup. Clearly I shouldn't tell my half-baked ideas to Mark Phillips (of Basho). Only the fully baked ones. I have some slides.

Essentially, DropBear clients push their files to a DropBear server using ZeroMQ PUSH/PULL sockets. The server dumps the file in Riak and notifies the other clients, using PUB/SUB sockets to distribute the changes. I basically copied the high level ZeroMQ architecture that mongrel2 uses.

I made two critical errors in writing the DropBear prototype:

First, I originally tried to get the clients and the server to talk through ZeroMQ ROUTER sockets. It almost worked, but I ran into some weird issues. I ended up having to redesign and rewrite DropBear to use the PUSH/PULL and PUB/SUB sockets. Luckily, I met some dotcloud devs at the Riak meetup (who use a ton of ZeroMQ). They explained why my understanding of ROUTER sockets was completely wrong.

Second, I used EventMachine. The ZeroMQ bindings work well, but the callback structure was awkward. I went with EM because I really wanted the clients and server to each be a single process. I actually tried using the Node.js bindings originally, but ran into what looks like a bug with the PUB/SUB sockets. So, I had to port it to EM. However, most of the examples in the guide are tiny scripts that work with a single socket.

Those ruby examples translate really closely to other languages too (c, lua, python). Even the node.js bindings are fairly close (though the blocking recv call is replaced by emitted message events). I love how each script is so tiny, and describes its exact function in a small comment at the top of the file.

It's not so much that the EventMachine bindings are bad, but it feels like this is how ZeroMQ is meant to be used. Talking with Sebastien (from dotcloud) confirmed it. Lots of tiny scripts that use ZeroMQ messages the way Erlang uses messages to communicate.

As a project, DropBear is pretty much a failure right now. But the experience building it taught me a lot about how ZeroMQ should work. It's always fun to play around in new environments, especially when they challenge the way you think about writing code.

Author: "rick"
Date: Sunday, 22 May 2011 07:00

A few weeks ago, I had one of those sleepless nights that comes with travelling several timezones ahead of what you're used to. I picked the ZeroMQ (ØMQ) Guide as reading material to lull me into a deep sleep. Bad move: I had a Bing moment with ØMQ, and stayed up playing with network services on my laptop until it was time for my son to go to school the next day.

I have to admit, the name "ZeroMQ" was a little misleading for me. I think that's because most other message queues are very similar: something pushes messages into a big, centralized queue, and workers pop messages off the front. ØMQ is really just a networking library. Sockets the way you want them to work. I think Zed Shaw put it best in his Pycon talk: ØMQ can replace your internal HTTP REST services. HWhaaat?!?

First, a Quick Primer

ZeroMQ is a c++ library providing asynchronous messaging over a variety of transports (inproc, IPC, TCP, OpenPGM, etc). It has bindings for over 20 languages. ØMQ has simple socket patterns that can be used together to build more complex architectures.

  • Request and Reply sockets are the simplest type, providing synchronous messaging between two systems.
  • Push and Pull sockets let you distribute messages to multiple workers. A Push socket will distribute sent messages to its Pull clients evenly.
  • Publish sockets broadcast messages to any Subscribe sockets that may be listening.
  • Dealer and Router sockets (or XREP/XREQ in older versions of ØMQ) handle asynchronous messaging. They require a bit of knowledge of ØMQ addressing to really grok.
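Here's a minimal sketch of the simplest pair, REQ and REP, as two tiny ruby scripts (using the same zmq gem calls as the pub/sub scripts earlier in this feed):

# rep.rb -- one process answers requests, one at a time, in order
require 'zmq'

context = ZMQ::Context.new
rep     = context.socket ZMQ::REP
rep.bind 'tcp://*:5556'

loop { rep.send "pong: #{rep.recv}" }

# req.rb -- another process sends a request and blocks for the reply
require 'zmq'

context = ZMQ::Context.new
req     = context.socket ZMQ::REQ
req.connect 'tcp://127.0.0.1:5556'

req.send 'ping'
puts req.recv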

Replacing REST

When you're designing large systems, it makes sense to break things out into smaller pieces that communicate across some common protocol. A lot of people champion the REST/JSON combo because it's easy, it works well, and it's available everywhere.

ØMQ is like that too. It works nearly the same in every supported language. You can use JSON, MessagePack, tagged netstrings, etc. You get cool socket types that aren't really possible with REST (pub/sub, push/pull).

You do end up having to build your own protocol a bit. You end up losing the richness of HTTP (verbs, URIs, headers). You're also unable to expose these services outside your private network due to some asserts in the ØMQ code.

Dakee: the Request/Reply Chat

Dakee is a simple chat bot that uses REQ and REP sockets to communicate with another user. The names come from Collabedit, which is what Towski and I used to write the initial versions.

The script binds a REP socket to 5555, and connects a REQ socket to the other user's REP socket.

var context = require('zeromq')
  , req     = context.createSocket('req')
  , rep     = context.createSocket('rep')
  , ip      = process.env.CLIENT_IP || '192.168.1.25'

// ...

req.connect("tcp://" + ip + ":5555")
rep.bindSync("tcp://*:5555")

Received messages are printed to standard output. Messages from standard input are sent out on the REP socket if it needs a response, or to the REQ socket.

It makes for an unusually useless private chat system, but it manages to highlight the behavior of the REQ and REP sockets pretty well. The node.js event loop actually works against ØMQ in this case, allowing you to send multiple messages to the REQ socket. However, the other client won't see these extra messages until they've replied to your first one.

Why is the REQ/REP pair of socket types synchronous like this? One factor is simplicity. The REP socket keeps you from having to know which REQ socket it needs to reply to. If you want more flexibility, you'll have to look at the Dealer and Router socket types.

ZeroMQ in the Wild

The only projects I know of that use ØMQ are Mongrel2 and Storm. Mongrel2 is a web server that uses ZeroMQ to talk to backend handlers written in any language. Storm is still vaporware, but it sounds like it uses ØMQ in a similar fashion.

What next?

I'm continually amazed at the wealth of information in the ZeroMQ Guide. I'd highly recommend checking it out if you want a new perspective on message queue systems. The code examples are mostly in c, but each lists ported examples in other languages. If a language gets all of the samples ported, it is awarded with a full translation (so far only PHP and Lua have succeeded). I'd love to see full Ruby and Node.js translations too (I took the easy ones, sorry!). Porting these examples is a great way to figure out how ØMQ works.

Author: "rick"
Date: Thursday, 28 Apr 2011 07:00

So here I am, writing documentation for some new GitHub API sweetness, when something strikes me. Why are we using PUT requests for updates? Should it bug me that my API uses the PUT verb?

The Conventional Wisdom

I was actively contributing to the Rails Core team when Rails had its sweaty HTTP lovefest in Rails v1.2.x. This introduced REST concepts to a lot of Rails developers in a real, applicable form. Not only can I build a sweet REST service, it's provided for me as long as I stay on the golden path... Huzzah!

class PostsController < ActionController::Base
  # PUT /posts/1.json
  def update
    @post = Post.find(params[:id])
    @post.update_attributes(params[:post])
    respond_to do |format|
      format.html
      format.json { render :json => @post }
    end
  end
end

This wasn't an accident. David (and the rest of us) were all heavily inspired by the Atom Publishing Protocol. Look at how they specify updates to resources:

Client                                     Server
  |                                           |
  |  1.) PUT to Member URI                    |
  |      Member Representation                |
  |------------------------------------------>|
  |                                           |
  |  2.) 200 OK                               |
  |<------------------------------------------|

It's a simplistic flow chart, but it clearly shows how PUT requests are used to update the resources. Joe Gregorio (one of the AtomPub creators) used a similar setup for RESTLog. RESTLog is a blogging system that stored posts as <item> RSS fragments. At the time, I didn't really understand what REST meant; I was more focused on trying to get RSS to work.

Mixed Messages

I think this is where things got confused. AtomPub and RESTLog assume you're using the PUT verb to replace the contents of the resource on every request. However, typical API updates don't require the full XML or JSON data.

POST /items
{"title": "a", "body": "b"}

PUT /items/1
{"title": "a!"}

What does RFC 2616 say about this?

The fundamental difference between the POST and PUT requests is reflected in the different meaning of the Request-URI. The URI in a POST request identifies the resource that will handle the enclosed entity... In contrast, the URI in a PUT request identifies the entity enclosed with the request -- the user agent knows what URI is intended and the server MUST NOT attempt to apply the request to some other resource.

Section 9.6 doesn't really mention partial updates anywhere. It mainly says that PUT requests are idempotent and use the URL to identify the resource. So who says PUT is for complete replacements only? RFC 5789:

In a PUT request, the enclosed entity is considered to be a modified version of the resource stored on the origin server, and the client is requesting that the stored version be replaced. With PATCH, however, the enclosed entity contains a set of instructions describing how a resource currently residing on the origin server should be modified to produce a new version.

Mark Nottingham just pointed me towards a new draft of the HTTP Message Semantics RFC (written just a few days ago!). It actually mentions partial PUT requests:

An origin server SHOULD reject any PUT request that contains a Content-Range header field, since it might be misinterpreted as partial content (or might be partial content that is being mistakenly PUT as a full representation). Partial content updates are possible by targeting a separately identified resource with state that overlaps a portion of the larger resource, or by using a different method that has been specifically defined for partial updates (for example, the PATCH method defined in RFC5789).

So in summary:

  • PUT and Content-Range don't mix (sorry Sean!)
  • Partial PUTs can target internal resources. For instance, you could update a user's address by making a PUT request to the user's address resource.
  • Or, use PATCH.

Should I Care?

In my ad hoc Twitter poll, the responses were divided between those saying I should care (and use PATCH) and those asking what was wrong with PUT.

I've dealt with a lot of API bugs, and I can only think of a single one that had to do with the PUT verb specifically: Browsers can only send GET or POST requests. Depending on the server, user agents can work around this by using POST and specifying the "real" verb with a _method parameter or the X-HTTP-Method-Override header.

At some point, all this POST vs PUT nonsense devolves into spec wankery anyway. Does it really hurt anyone that certain API endpoints expect the PUT verb? Not really. People are still able to ship cool shit. It doesn't really matter to most people if you call your RPC API a REST API.

However, I've been working on GitHub API v3 as a clean slate. It gave me a chance to infuse more REST concepts into everything. So, I weighed my options.

  • How many users will be affected?
  • What's the chance that they will update their code to match changes to the API?
  • Is there an easy way to maintain backwards compatibility?

Fortunately, API v3 documentation was only given out to a few eager beta testers. So, I knew right away that the number of users was small, with a high probability that they'd notice changes and update their code accordingly. I really wanted to avoid the case where I break some old script on a server somewhere that no one remembers the login info for.

In this specific case, I also had a way to keep the old behavior. I hacked up a quick Sinatra extension to let me easily define actions that respond to multiple verbs. I also spoke with the Sinatra team about adding this to Sinatra itself.
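Something along these lines, for example (a sketch, not the actual extension; it assumes a Sinatra version that has a patch routing method):

require 'sinatra/base'

# Tiny Sinatra extension: define one action that answers to several verbs.
module MultiVerb
  def route_for(verbs, path, opts = {}, &block)
    verbs.each { |verb| send verb, path, opts, &block }
  end
end

class API < Sinatra::Base
  register MultiVerb

  # Accept PATCH for partial updates, but keep PUT working for old clients.
  route_for [:patch, :put], '/items/:id' do
    "updated #{params[:id]}"
  end
end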

I launched the new GitHub API this morning, and I'm pretty happy with how it turned out. It's clear that what we have is just enough for people to be productive with, and there's still some work to be done fine-tuning things.

The next big thing I want to tackle on the GitHub API is HATEOAS. I already dipped my toe in those waters for the Tender Support API...

And Now, Your Moment of Zen

Here's the full diagram of the flow that determines how webmachine processes resources. Webmachine is the framework that powers Riak's HTTP API, among other things.

Author: "rick"
Date: Tuesday, 05 Oct 2010 07:00

I've been toying around with the idea of adding a PubSubHubbub layer around the GitHub timeline events, so I wrote Nub Nub. Nub Nub is a bare implementation of PubSubHubbub so I can explore how it might be used inside GitHub.

PubSubHubbub (PuSH) is "a simple, open, web-hook-based pubsub protocol," according to the homepage. There's a lot of talk about hub discovery URLs, Atom feeds, multicast publishing, etc. Let's boil it all down to the essentials:

  • A feed (or PuSH topic) identifies one or more hub servers.
  • A subscriber (a webhook server) subscribes to one or more topics.
  • When the feed updates, it pings the hub server.
  • The hub server then publishes the data to the subscribers.

As far as GitHub goes, this is still too complex. We're not a big feed aggregation service, or a generic PuSH hub service.

  • A GitHub user specifies one or more post-receive URLs that get pinged on every Git push.
  • GitHub publishes to these post-receive URLs on every Git push.

It's the same thing, minus all the talk about feed scanning and updates. We already have an internal queue system that can handle this. We also have people asking for an API to manage service hooks and post receive URLs, so why not provide standard API hooks?

As far as Atom/RSS go, there's not a lot in the PuSH spec that really depends on them. The feed scanning portion does, of course. The data is all internal though, so there's no need to implement the Content Notification methods. But there's no sense in parsing our own feeds, so we can push straight JSON. The specs do say that published items need to be Atom or RSS, but I see no reason we can't support JSON. If the subscriber subscribes to a JSON topic, the hub pushes JSON. If the subscriber subscribes to an Atom topic, the hub pushes Atom.

This strategy can be applied to any other web service that sends out notifications. That's why I wrote Nub Nub. It's a pretty bare bones implementation, with no mention of a preferred web server or data store. It just provides a few methods for making or responding to PuSH calls.
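For a sense of scale, the subscriber half of PuSH is nothing more than a webhook endpoint: echo the verification challenge back, then accept whatever the hub POSTs. A minimal Sinatra-style sketch (not Nub Nub's API; the callback path is made up):

require 'sinatra'

# Subscription verification: the hub asks us to echo hub.challenge back.
get '/push/callback' do
  params['hub.challenge']
end

# Content distribution: the hub POSTs new entries to the same URL.
post '/push/callback' do
  payload = request.body.read
  # parse the Atom/RSS (or, in the scheme above, JSON) payload here
  status 200
end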

Author: "rick"
Date: Tuesday, 03 Aug 2010 07:00

I've been playing around with Riak a bit lately. It's a simple key/value store with S3-style buckets and one-way links between keys. It also has clustering built in, and lets you run map/reduce against a set of data pretty easily. All this, over a simple HTTP API.

The HTTP API is a great way to start playing with Riak, but I found it to be pretty slow. With Riak, there are two more options: use the Erlang client, or write a Protocol Buffer adapter. I'd never done anything with Protocol Buffers, so I figured this was a good opportunity.

Riak PBC Client

Armed with Node.js Protocol Buffer serializing and parsing abilities, I took a look at the Riak PBC API. It has a very simple wire format:

00 00 00 07 09 0A 01 62 12 01 6B
|----Len---|MC|----Message-----|

Each message starts with 4 bytes for the message length, a single byte for the message code, and then the message.

The example above is how a simple request for a key might look.

// the Protocol Buffer schema.
message RpbGetReq {
    required bytes bucket = 1;
    required bytes key = 2;
    optional uint32 r = 3;
}

A Riak request looks something like this:

Schema = require('protobuf_for_node').Schema
schema = new Schema(fs.readFileSync('./riak.desc'))
GetReq = schema["RpbGetReq"]

# <Buffer 0a 01 62 12 01 6b>
data = GetReq.serialize bucket: 'b', key: 'k'
len  = data.length + 1 # account for riak code too
req  = new Buffer(len + 4) # 4 byte message length
req[0] = len >>>  24
req[1] = len >>>  16
req[2] = len >>>   8
req[3] = len &   255
req[4] = 9
data.copy req, 5, 0 # copy serialized data to the buffer

# req is now
# <Buffer 00 00 00 07 09 0a 01 62 12 01 6b>

That assembles the message. Now, we just create a tcp connection to send it to Riak:

conn = net.createConnection 8087, '127.0.0.1'

conn.on 'connect', ->
  conn.write req

Finally, something needs to listen for the data event for a response:

conn.on 'data', (chunk) ->
  len = (chunk[0] << 24) + 
        (chunk[1] << 16) +
        (chunk[2] <<  8) +
         chunk[3]  -  1 # subtract 1 for the message code
  type = lookup_type_from_code(chunk[4])
  msg  = new Buffer(len)
  chunk.copy msg, 0, 5
  data = type.parse msg

Pooling Connections

My initial example started off pretty basic, but quickly grew out of control. I realized that since the socket API is very synchronous, I needed to implement a connection pool so a Node.js process could have simultaneous conversations with Riak. A basic example looks like this:

riak = new (require './protobuf')()
server = http.createServer (request, response) ->
  # get a fresh connection off the pool
  riak.start (conn) ->
    # make a connection, call the given callback when it returns.
    conn.send('PingReq') (data) ->
      response.writeHead 200, 'Content-Type': 'text/plain'
      response.end sys.inspect(data)
      conn.finish() # release the connection back to the pool

# SHORTCUT
server = http.createServer (request, response) ->
  # automatically gets a fresh connection, sends a request, and releases
  # it back to the pool when done.
  riak.send('PingReq') (data) ->
    response.writeHead 200, 'Content-Type': 'text/plain'
    response.end sys.inspect(data)

nori + riak-js

Right now, this isn't in any released version of nori or riak-js. The rough Protocol Buffers client is available in the coffee branch of my riak-js fork.

When Frank released the sweet Riak-JS site, I took a hard look at what purpose nori was solving:

  • I wanted to learn more about Riak (accomplished).
  • I wanted to experiment with a new API style (very similar to riak-js).
  • I wanted a higher level Riak lib, more like an ORM.

The goals aligned pretty closely with riak-js, so there seemed no good reason to double our efforts. I've decided to discontinue nori for the time being, and focus my Riak efforts in a refactoring of riak-js. We want to have a single lib that lets you access Riak from jQuery (maybe), as well as Node.js over the HTTP and PBC APIs.

So, what is the current progress of all this? Here are some quick benchmarks from my iMac i7:

# riak-js http API 
# ab -n 5000 -c 20 
# 734.31 req/sec
sys  = require 'sys'
http = require 'http'
db   = require('riak-js').getClient()

server = http.createServer (req, resp) ->
  db.get('airlines', 'KLM') (flight, meta) ->
    resp.writeHead 200, 'Content-Type': 'text/plain'
    resp.end sys.inspect(flight)

server.listen 8124

# riak-js PBC API
# ab -n 5000 -c 20
# 1682.01 req/sec
sys  = require 'sys'
http = require 'http'
riak = new (require './protobuf')()

server = http.createServer (req, resp) ->
  riak.send('GetReq', bucket: 'airlines', key: 'KLM') (flight) ->
    resp.writeHead 200, 'Content-Type': 'text/plain'
    resp.end sys.inspect(flight)

server.listen 8124

That's over a 2x speedup, not bad.

Author: "rick"
Date: Tuesday, 13 Jul 2010 07:00

Node.js is great at handling lots of asynchronous connections, but sometimes I'd like to limit how many are in use. One real world example is some kind of spider or feed reader. If you have a list of 500 addresses to fetch, you don't want to fetch them all at once. Maybe they're all on one server, or the requests return large files that need some post processing.

A simple queue like Resque is great for this, but I wanted something even simpler. Something that lived in the Node.js process, and could exit cleanly without any of that persistent mess left over.

Chain Gang is the result of my experimentation. My idea is to use the Node.js event system for pub/sub:

First, I specify my unit of work. In this case, I'm fetching a web address, and calling worker.finish() after that's done.

sys:   require 'sys'
http:  require 'http'
client: http.createClient 8080, 'localhost'
# start an active chain gang queue with 3 workers by default.
chain: require('chain-gang').create()

# downloads a web page, runs the callback when it's done.
get_path: (path, cb) ->
  req: client.request('GET', path, {host: 'localhost'})
  req.end()
  req.addListener 'response', (resp) ->
    resp.addListener 'data', (chunk) ->
      sys.puts chunk
    resp.addListener 'end', cb

# returns a chain gang job that downloads a web page and finishes the worker.
job: (timeout, name) ->
  (worker) ->
    get_path "/$timeout/$name", ->
      worker.finish()

Now, I can add the callback, and queue the unit of work:

# queues the job
chain.add job(1, 'foo')

# queues the job with the unique name "foo_request"
chain.add job(1, 'foo'), 'foo_request'

Assuming the chain gang queue is active, it should start executing the jobs immediately.

There are two interesting behaviors that are possible now: Duplicate jobs are not run, and only a fixed number of jobs can run at any given time. To highlight them, I have some sample files:

  • webserver.coffee is a silly web server that waits for a specified amount of time before returning a request. A URL like "/3/foo" will return in 3 seconds, for example.
  • chain-with-dupes.coffee shows what happens when multiple jobs with the same name are queued. In this contrived example, only the first, longer one is completed. The rest are ignored.
  • chain-with-uniques.coffee shows how Chain Gang handles more jobs than workers. They just sit in an array until a free worker can take them.

On a side note, this is my first lib using npm (Node.js package manager). Type npm install chain-gang to get rockin'.

Author: "rick"
Date: Wednesday, 07 Jul 2010 07:00

So, I was interviewed by The Geek Talk recently. Read on to learn the awful truth behind my early programming days :)

Also, I'm moving to San Francisco this weekend. I'm really looking forward to working side by side with my fellow GitHubbers. Portland's a great place, and I have a feeling I'll be back.

Author: "rick"
Date: Monday, 28 Jun 2010 07:00

My first node.js project at GitHub is a replacement download server. I wanted to remove the extra moving pieces required to get it to work. One of the steps involves writing a file from the output of git archive. My initial attempt looked like this:

fs:      require('fs')
child:   require('child_process')

git:     child.spawn 'git', ['archive', 'other options']
stream:  fs.createWriteStream outputFilename

# writes the file from git archive to the file stream
git.stdout.addListener 'data', (data) ->
  # if the file stream isn't flushed, pause git's stdout
  if !stream.write(data)
    git.stdout.pause()

# once the file stream is flushed, resume git's stdout
stream.addListener 'drain', ->
  git.stdout.resume()

git.addListener 'exit', (code) ->
  stream.end()

However, git archive's tar format does not come compressed. That means I have to pipe the output to another ChildProcess object. How do I do that without a lot of code duplication? I put the common callbacks into defined functions:

fs:      require('fs')
child:   require('child_process')

# writes data to the local file system.
streamer: (data) ->
  if !stream.write(data)
    input.pause()

# pipes the data to the gzip process.
gzipper: (data) ->
  if !gzip.stdin.write(data)
    git.stdout.pause()

# closes the written file stream.  
closer: (code) ->
  stream.end()

git:     child.spawn 'git', ['archive', 'other options']
stream:  fs.createWriteStream outputFilename

stream.addListener 'drain', ->
  input.resume()

# if this is a tarball, pipe `git archive` through `gzip -n`
if outputFilename.match(/\.tar\.gz$/)
  gzip:  child.spawn 'gzip', ['-n', '-c']
  input: gzip.stdout
  gzip.stdout.addListener 'data', streamer
  gzip.addListener        'exit', closer
  gzip.stdin.addListener  'drain', ->
    git.stdout.resume()
  git.addListener 'exit', (code) ->
    gzip.stdin.end()
else
  input: git.stdout
  git.addListener 'exit', closer

git.stdout.addListener 'data', (if gzip then gzipper else streamer)

That's the code to write either git archive --format=zip or git archive --format=tar | gzip to a file. It works, but the code is more complicated than I'd like.

Ryan suggested I use tee for outputting the file, and /bin/sh to assemble the pipes.

Now, the code is even simpler than my first attempt:

child: require('child_process')

cmd: 'git archive ...'

if outputFilename.match(/\.tar\.gz$/)
  cmd += ' | gzip -n -c'

arch:    child.spawn '/bin/sh', ['-c', "$cmd | tee $outputFilename"]
arch.addListener 'exit', (code) ->
  # do something
Author: "rick"
Date: Wednesday, 23 Jun 2010 07:00

If you're reading this, I've completed migrating my blog from Mephisto to Jekyll.

I had fun working on it, but clearly I failed to foster a good community around it. It was an unconventional Rails app in a time before things like Rack, Sinatra, MongoDB, RESTful controllers, etc. It's nice to see similar ideas in newer projects, though.

Thanks to everyone who helped work on it, especially Justin for the partnership on the design. Mephisto was where our working relationship started, which eventually led to Lighthouse and ENTP...

Author: "rick"
Date: Monday, 17 May 2010 07:00

I'm going to be taking Chris' place in the Building an API panel at Railsconf in June. I'll be speaking about the GitHub API (of course), as well as touching on my experiences building APIs for Lighthouse and Tender.

Don't despair: Chris will still be doing his Redis, Rails, and Resque talk.

Author: "rick"
Date: Monday, 10 May 2010 07:00

Map Reduce?

I took the Riak Fast Track and really liked messing around with map reduce functions. So, I wrote nori, a node.js client.

Riak is a key/value store inspired by the Dynamo whitepaper. It has buckets that contain resources identified by keys, all exposed over a REST API. Therefore, it feels a lot like S3, with added map reduce and link walking powers.

Riak is written in Erlang, but Basho decided to also support javascript for map reduce. This makes node.js a natural fit for Riak. Node.js is of course great at handling non-blocking HTTP requests, and function.toString() lets us pass javascript functions through Nori. This means it would be trivial to write local tests of your map reduce functions with local data (without having to go through Riak). Look at how closely my implementation matches the sample functions in the fast track.

Overall, the Fast Track was pretty good. I would have liked some coverage of link walking, but at some point you have to cut the "fast track" short. It was short enough to digest in a sitting (though, it did turn a chillaxin' Sunday afternoon into an epic node.js hackfest).

Author: "rick"
Date: Monday, 10 May 2010 07:00

Where we're going, we don't need Rhodes

Oh man, @xinuc is breaking my heart over here. The name Rhodes is taken by some javascript mobile phone framework. Now, I need a new name. I'm leaning towards Noh-Varr if no one else has any suggestions.

Update: Okay, Jed rocks, the new project's name will be nori.

Author: "rick"