Reverse proxy roundup
The first line of defense in scaling out a web solution is a load balancer, often implemented as an HTTP reverse proxy. In my particular situation, I wanted to be able to load balance based on HTTP headers (Host is a must, URL is a bonus) on a FreeBSD platform.
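To make the requirement concrete, here is a rough sketch in Python of what "load balance based on the Host header" means. It is not taken from any of the proxies below; the class name, hostnames, and backend addresses are all made up for illustration, and it assumes header names arrive already normalized and that every pool is non-empty.

```python
import itertools


class HostBalancer:
    """Pick a backend per request by Host header, round-robin within each pool."""

    def __init__(self, pools):
        # pools: {"hostname": ["backend:port", ...]} (hypothetical layout)
        self._cycles = {
            host.lower(): itertools.cycle(backends)
            for host, backends in pools.items()
        }

    def pick(self, headers):
        # Drop an optional ":port" suffix and ignore case.
        # (IPv6 literals are not handled in this sketch.)
        host = headers.get("Host", "").split(":")[0].lower()
        cycle = self._cycles.get(host)
        return next(cycle) if cycle is not None else None


balancer = HostBalancer({
    "www.example.com": ["10.0.0.2:8080", "10.0.0.3:8080"],
    "static.example.com": ["10.0.0.4:8080"],
})
print(balancer.pick({"Host": "www.example.com:80"}))
```

A plain TCP proxy cannot do this, because it never parses the request far enough to see the Host header.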
- Apache:
- Pros:
- We were already running Apache, so it was an obvious initial solution to the problem. Not too bad to configure, well documented, and pretty stable. It served us well for quite some time.
- Cons:
- The prefork MPM is definitely not ideal for this purpose and doesn't give us the scalability we want. Screwing around with other MPMs was not ideal because Apache is also doing other things that we didn't want to disturb. Maintaining a separately configured installation just for proxying isn't a good solution. The thread-using MPMs would only be a small win, anyway.
- Lighttpd:
- Pros:
- Easy enough to configure, supports all of the features we need, pretty good performance and scalability. Single process event-driven model that can use kqueue.
- Cons:
- Spotty documentation. Totally bizarre configuration language. Not terribly mature compared to other solutions. Lots of bugs. The current version (1.4.11) leaks memory like a sieve (#758), so it's unfortunately not an option at this point.
- Pound:
- Pros:
- Solid documentation (but only in man page format). All of the features we need, not a whole lot of other junk. Good track record. This is what we're currently using for now, but it's no permanent solution. Better than Apache, and it doesn't seem to leak RAM.
- Cons:
- Uses LOTS of threads. That means lots of RAM, limited scalability and performance. Documentation is only available as a man page, so I had to install it to really evaluate it.
The other solutions I looked into, but didn't end up trying, were:
- Squid:
- Way too complicated. Too hard to figure out how to get it to do what we wanted. Might be a good solution though, so I'll probably look into it again.
- Perlbal:
- No usable docs. FreeBSD port exists, but sucks (no rc file, no sample config). Couldn't figure out how to get it to reverse proxy with the features that I needed. Written in Perl, which I do grok, but don't like hacking on.
- HAProxy:
- Looks awesome, but doesn't have the features we need. Seems to only be able to do IP based load balancing, not by headers. More than just a TCP proxy, but not enough of an HTTP proxy.
- Balance:
- Just a TCP proxy.
- Pen:
- Just a TCP proxy.
Apache has some other benefits that you did not mention. All of the Apache modules out there are available to you. Specifically, mod_security is a great way to do a conceptually simple but flexible layer-7 firewall. That will block bad URLs, GETs, POSTs, SQL injection, XSS, and other nastiness. (Also look at fail2ban as a supplement.)
A mature and well-understood reverse-proxy + HTTP firewall combination seems like a decent trade-off to me. For those features, you might justify a separate chroot Apache installation (mod_security supports this as a configuration option) — that’s no different from a separate Pound installation, right? In fact, a reverse proxy and firewall _should_ be standalone for security and availability reasons. (At least with Apache, you get the large mindshare, maintainability, and documentation). Is prefork the fastest thing out there? No. But since it is standalone now, you can experiment with other MPMs. Upgrading the hardware might be an option too if you agree that a dedicated chrooted mod_proxy + mod_security + fail2ban combination is worth the cost.
Comment by Jason Smith — 2006-08-04 @ 10:07 pm
mod_rewrite is also worth mentioning. It is great for a front-end proxy when you need to start integrating several components together.
Comment by Jason Smith — 2006-08-04 @ 10:09 pm
Apache 2.0, and especially 2.2, are good as a caching proxy. By using its caching features you should get pretty good performance, depending on your app of course.
2.0 caching is buggier and less feature complete than 2.2.
It didn’t look like Pound could do caching.
Maybe with caching Apache could work OK for you.
Comment by Rene Dudfield — 2006-08-05 @ 12:16 am
lighty 1.5.0 is currently getting its final polish and has a completely new mod_proxy_core, which will integrate the features from the different backend plugins, will support HTTP, FastCGI, SCGI and CGI, and will provide load-balancing, fail-over and keep-alive on top of them.
http://blog.lighttpd.net/articles/2006/07/15/the-new-mod_proxy_core and the newer articles document the process.
Comment by Jan Kneschke — 2006-08-05 @ 12:22 am
It’s great that lighty 1.5.0 is going to be out soon, but honestly I don’t give a shit about features unless it no longer leaks memory. Lighty sounds nice in theory, but I can’t run broken code in production.
Comment by bob — 2006-08-05 @ 3:10 am
I think you’d want to use Squid really… you’ll also get caching for free - it even supports caching of parts of the HTML… there was a press release once that Zope and Squid supported that… don’t know how that works though.
And I don’t know, but it doesn’t seem to be that hard to configure… I’ve done it :)
Comment by Damjan — 2006-08-05 @ 8:10 am
Caching doesn’t apply here. I only want reverse proxying, nothing else.
Comment by bob — 2006-08-05 @ 1:44 pm
Hi Bob,
Have you considered PLB (Pure Load Balancer)? http://plb.sunsite.dk/index.html
“It uses an asynchronous non-forking/non-blocking model, and provides failover abilities. When a backend server goes down, it automatically removes it from the server pool, and tries to bring it back to life later.”
There is also python director: http://pythondirector.sourceforge.net/
“async i/o based, so much less overhead than fork/thread based balancers. Can use either twisted or python’s standard asyncore library (twisted is recommended, and asyncore support will be removed in a future version).”
Dunno if they can do balancing based on HTTP headers. PLB’s docs seem pretty scarce, while it seems you can easily write (in Python) a custom balancing algorithm for Python Director.
Let us know how it goes.
Comment by michele — 2006-08-05 @ 6:49 pm
I hadn’t heard of either PLB or Python Director. PLB is basically just a TCP load balancer, and so is Python Director. Neither of them knows anything about HTTP.
For some reason PLB reads in the HTTP headers in full before dispatching to a server (from what I understand by glancing at the source), but it doesn’t act based on those headers.
Of course I could hack something to do what I want, but that’s really a last resort… I have other code that needs to be written. A better reverse proxy than Pound is a relatively small win overall, so writing a bunch of code to replace it would be counter-productive (especially considering the maintenance it’d require over time).
Comment by bob — 2006-08-05 @ 8:35 pm
Oops, I hadn’t noticed they are both TCP based, although for Python Director it is clearly stated… need to sleep more.
Finally, I agree that spending time implementing an ad-hoc alternative to Pound is quite pointless.
Thanks.
Comment by michele — 2006-08-06 @ 1:34 am
I am using Squid to proxy incoming requests to separate applications running on the same server: Zope, Apache, and CherryPy.
It’s much easier than you think. In squid.conf, you need to set the following:
http_port 80
redirect_program /path/to/program-i-will-explain-below
change the “http_access deny all” line to “http_access allow all”.
httpd_accel_host virtual
httpd_accel_port 0
httpd_accel_uses_host_header on
Now all access control and redirection will be controlled by the program that redirect_program points to.
It’s pretty simple, actually. Squid will open up 5 (configurable through redirect_children) instances of the program and write data about incoming requests to stdin. There are 4 fields: url, source_ip, ident, and method. All the program needs to do is respond with the real information and Squid will retrieve it on behalf of the client.
For example, if you want http://www.foo.com/ to actually go to http://10.0.0.2:8080/, you write a program that behaves like this:
stdin: http://www.foo.com/ 192.168.1.1 - GET
stdout: http://10.0.0.2:8080/ 192.168.1.1 - GET
As you can see, this gives you an enormous amount of flexibility. You can redirect based on source IP, and you can merge the URL space of separate servers. You can have http://www.foo.com/bar go to an Apache instance and http://foo.com/baz go to Zope.
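A minimal sketch of such a program in Python, assuming the four-field line format described above and echoing the remaining fields back as in the example. The routing table, the 127.0.0.1:8081 backend, and the function names are made up for illustration; the demo at the bottom feeds one line through in-memory streams so it can run without Squid.

```python
import io
from urllib.parse import urlsplit, urlunsplit

# Hypothetical routing table: (public host, path prefix, backend host:port).
# First match wins, so more specific prefixes go first.
ROUTES = [
    ("www.foo.com", "/bar", "127.0.0.1:8081"),
    ("www.foo.com", "/", "10.0.0.2:8080"),
]


def rewrite_line(line, routes=ROUTES):
    """Rewrite the URL field of one request line; leave the other fields alone."""
    fields = line.split()
    if not fields:
        return ""
    parts = urlsplit(fields[0])
    for host, prefix, backend in routes:
        # Simple prefix match on the path; unmatched requests pass through as-is.
        if parts.hostname == host and parts.path.startswith(prefix):
            fields[0] = urlunsplit(parts._replace(netloc=backend))
            break
    return " ".join(fields)


def serve(infile, outfile, routes=ROUTES):
    """Answer one line per request line read from infile."""
    for line in infile:
        outfile.write(rewrite_line(line, routes) + "\n")
        outfile.flush()  # the caller waits for each answer, so never buffer


# Demo with in-memory streams instead of a live Squid.
demo_out = io.StringIO()
serve(io.StringIO("http://www.foo.com/ 192.168.1.1 - GET\n"), demo_out)
print(demo_out.getvalue(), end="")  # http://10.0.0.2:8080/ 192.168.1.1 - GET
```

When run under Squid, the script would call serve(sys.stdin, sys.stdout) instead of the demo. Flushing after every line matters because Squid keeps the process alive and waits for each reply.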
Comment by James Oakley — 2006-08-11 @ 7:45 pm
You should check out Squid more thoroughly. A company I used to work at used Squid as the basis of a pretty large CDN (content delivery network). I’m familiar with some of the more obscure but performance-boosting options and some of the configuration pitfalls. Shoot me a line, or maybe I’ll post an article on it if I ever get my blog back up.
-arg
Comment by Andy Gross — 2006-08-15 @ 10:52 am
It is quite strange that you consider HTTP servers while looking for an HTTP proxy. There are a few mature proxy servers that can do the job better than any HTTP server. Besides Squid, I’d recommend taking a look at DeleGate (http://www.delegate.org/)
Comment by Stranger — 2006-08-16 @ 7:13 pm
DeleGate was not considered because I’ve never heard of it and it didn’t show up in any of the searches I did.
The majority of proxy servers I found did not suit the requirements.
Comment by bob — 2006-08-17 @ 10:10 am
Bob, I haven’t been able to confirm the memory leak in Lighttpd. I’ve been running it for over a month on my Textdrive server, which runs FreeBSD 6.0. However, I’m not sure if they’ve installed it via ports or otherwise.
Comment by SuperJared — 2006-08-24 @ 1:00 pm
Perhaps it’s reverse proxying that causes the leak? That’s the only thing my Lighttpd installation was doing at the time. Are you sure it’s 1.4.11?
Either way, I wasn’t the first one with the problem, and I’m not particularly interested in touching Lighttpd again after that experience.
Comment by bob — 2006-08-24 @ 3:14 pm
Hi guys,
I just found Varnish:
http://varnish.projects.linpro.no/
and also a blog with some comments on it:
http://www.mnot.net/blog/2006/08/21/caching_performance
In my opinion, Varnish looks quite promising.
Comment by Hans — 2006-11-12 @ 9:00 am
This stuff is covered heavily in both “Scalable Internet Architectures” and “Building Scalable Web Sites”.
Comment by Shannon -jj Behrens — 2006-11-16 @ 1:12 am
Unfortunately the state of the art advances faster than the presses, so it’s entirely likely that whatever recommendations are given in literature for specific software choices are irrelevant.
On the other hand, I’m sure these books have great advice with regard to the architecture of scalable sites. However, I don’t think they’re particularly relevant to specific load balancer/proxy choices.
Comment by bob — 2006-11-16 @ 1:54 am
I believe that HTProxy does (now?) support the features (cookie based session affinity, in particular) that you’re looking for. I’m currently trialling it myself, and while I’m nowhere near production (or even development), it seems quite interesting.
Comment by johnf — 2007-11-23 @ 6:48 am
argh!
HAProxy, not HTProxy.
Comment by johnf — 2007-11-23 @ 6:50 am
I would highly suggest Pound or Lighttpd as a reverse proxy. As of version 2.4e, Pound is extremely fast and stable. Lighttpd did have some problems in the past, but most of those have been fixed. Memory management has been greatly improved. I have to agree about the documentation, but there are examples like the following to help everyone out:
Pound Reverse Proxy “how to”
http://calomel.org/pound.html
Light webserver “how to”
http://calomel.org/lighttpd.html
Comment by Calomel — 2007-12-12 @ 8:31 am
Just to add to the list, PHK of FreeBSD, MD5, and phkmalloc fame has started on a project called “Varnish”:
http://www.varnish-cache.org/
There’s a good video on why they’re doing things the way they are:
http://varnish.projects.linpro.no/wiki/VarnishInTheNews
You can also skim the arch doc (though I also recommend the video if you have time):
http://varnish.projects.linpro.no/wiki/ArchitectNotes
Comment by David Magda — 2008-01-23 @ 7:55 pm