In 2003, Nick Kew released a new module that complements Apache's mod_proxy and is essential for reverse-proxying. Since then he has received regular questions and requests for help on proxying with Apache. In this article he attempts to give a comprehensive overview of proxying and mod_proxy_html.
This article was originally published at ApacheWeek in January 2004, and moved to ApacheTutor with minor updates in October 2006.
A proxy server is a gateway for users to the Web at large. Users configure the proxy in their browser settings, and all HTTP requests are routed via the proxy. Proxies are typically operated by ISPs and network administrators, and serve several purposes, for example caching pages to improve performance.
A reverse proxy is a gateway for servers, and enables one web server to provide content from another transparently. As with a standard proxy, a reverse proxy may serve to improve performance of the web by caching; this is a simple way to mirror a website. But the most common reason to run a reverse proxy is to enable controlled access from the Web at large to servers behind a firewall.
The proxied server may be a webserver itself, or it may be an application server using a different protocol, or an application server with just rudimentary HTTP that needs to be shielded from the web at large. Since 2004, reverse proxying has been the preferred method of deploying Java/Tomcat applications on the Web, replacing the old mod_jk (itself a special-purpose reverse proxy module).
The standard Apache module mod_proxy supports both types of proxy operation. Under Apache 1.x, mod_proxy only supported HTTP/1.0, but from Apache 2.0, it supports HTTP/1.1. This distinction is particularly important in a proxy, because one of the most significant changes between the two protocol versions is that HTTP/1.1 introduces rich new cache control mechanisms.
This article deals with running a reverse proxy with Apache 2. Users of earlier versions of Apache are encouraged to upgrade and take advantage of the altogether richer architecture and improved application support. At the time of writing, the reason most commonly cited for not upgrading is difficulties running PHP on Apache 2. I cannot speak from personal experience, but several well-informed sources tell me the difficulty lies with non-thread-safe code in PHP, and that it works well with Apache 2 if it is built with the non-threaded Prefork MPM.
So far, we have spoken loosely of mod_proxy. However, it's a little more complicated than that. In keeping with Apache's modular architecture, mod_proxy is itself modular, and a typical proxy server will need to enable several modules. Those relevant to proxying and this article include:
Having mentioned the modules, I'm going to ignore caching for the remainder of this article. You may want to add it if you are concerned about the load on your network or origin servers, but the details are outside the scope of this article. I'm also going to ignore all non-HTTP protocols, and load balancing.
With the exception of mod_proxy_html, the above are all included in the core Apache distribution. They can easily be enabled in the Apache build process. For example:
$ ./configure --enable-so --enable-mods-shared="proxy proxy_http proxy_ftp proxy_connect headers"
$ make
# make install
Of course, you may want other build options too, and you could just as well build the modules as static.
If you are adding proxying to an existing installation, you should use apxs instead:
# apxs -c -i [module-name].c
Note that mod_proxy itself is in two source files (mod_proxy.c and proxy_util.c).
This leaves mod_proxy_html, which is not included in the core distribution. mod_proxy_html is a third-party module, and requires a third-party library libxml2. At the time of writing, libxml2 is installed as standard or packaged for most operating systems. If you don't have it, you can download it from xmlsoft.org and install it yourself. For the purposes of this article, we'll assume libxml2 is installed as /usr/lib/libxml2.so, with headers in /usr/include/libxml2/libxml/.
# apxs -c -I/usr/include/libxml2 -i mod_proxy_html.c
Company example.com has a website at www.example.com, which has a public IP address and DNS entry, and can be accessed from anywhere on the Internet.
The company also has a couple of application servers which have private IP addresses and unregistered DNS entries, and are inside the firewall. The application servers are visible within the network, including to the webserver, as "internal1.example.com" and "internal2.example.com". But because they have no public DNS entries, anyone looking at internal1.example.com from outside the company network will get a "no such host" error.
A decision is taken to enable Web access to the application servers. They should not be exposed to the Internet directly; instead they should be integrated with the webserver, so that http://www.example.com/app1/any-path-here is mapped internally to http://internal1.example.com/any-path-here and http://www.example.com/app2/other-path-here is mapped internally to http://internal2.example.com/other-path-here. This is a typical reverse-proxy situation.
As with any modules, the first thing to do is to load them in httpd.conf (this is not necessary if we build them statically into Apache).
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so
#LoadModule proxy_ftp_module modules/mod_proxy_ftp.so
#LoadModule proxy_connect_module modules/mod_proxy_connect.so
LoadModule headers_module modules/mod_headers.so
LoadModule deflate_module modules/mod_deflate.so
LoadFile /usr/lib/libxml2.so
LoadModule proxy_html_module modules/mod_proxy_html.so
For Windows users this is slightly different: you'll need to load libxml2.dll rather than libxml2.so, and you'll probably need to load iconv.dll and zlib.dll as prerequisites to libxml2 (you can download them from zlatkovic.com, the same site that maintains Windows binaries of libxml2). The LoadFile directive is the same.
Of course, you may not need all the modules. Two that are not required in our typical scenario are shown commented out above.
Having loaded the modules, we can now configure the Proxy. But before doing so, we have an important security warning:
Do not set "ProxyRequests On". Setting ProxyRequests On turns your server into an open proxy. There are 'bots scanning the Web for open proxies. When they find you, they'll start using you to route around blocks and filters to access questionable or illegal material. At worst, they might be able to route email spam through your proxy. Your legitimate traffic will be swamped, and you'll find your server getting blocked by things like family filters.
Of course, you may also want to run a forward proxy with appropriate security measures, but that lies outside the scope of this article. The author runs both forward and reverse proxies on the same server (but under different Virtual Hosts).
The fundamental configuration directive to set up a reverse proxy is ProxyPass. We use it to set up proxy rules for each of the application servers:
ProxyPass /app1/ http://internal1.example.com/
ProxyPass /app2/ http://internal2.example.com/
Now as soon as Apache re-reads the configuration (the recommended way to do this is with "apachectl graceful"), proxy requests will work, so http://www.example.com/app1/some-path maps to http://internal1.example.com/some-path as required.
However, this is not the whole story. ProxyPass just sends traffic straight through. So when the application servers generate references to themselves (or to other internal addresses), they will be passed straight through to the outside world, where they won't work.
For example, an HTTP redirection often takes place when a user (or author) forgets a trailing slash in a URL. So a request for http://www.example.com/app1/foo is proxied to http://internal1.example.com/foo, which generates a response:
HTTP/1.1 302 Found
Location: http://internal1.example.com/foo/
(etc)
But from the outside world, the net effect of this is a "No such host" error. The proxy needs to re-map the Location header to its own address space and return a valid URL:
HTTP/1.1 302 Found
Location: http://www.example.com/app1/foo/
The command to enable such rewrites in the HTTP Headers is ProxyPassReverse. The Apache documentation suggests the form:
ProxyPassReverse /app1/ http://internal1.example.com/
ProxyPassReverse /app2/ http://internal2.example.com/
However, there is a slightly more complex alternative form that I recommend as more robust:
<Location /app1/>
    ProxyPassReverse /
</Location>
<Location /app2/>
    ProxyPassReverse /
</Location>
The reason for recommending this is that a problem arises with some application servers. Suppose for example we have a redirect:
HTTP/1.1 302 Found
Location: /some/path/to/file.html
This is a violation of the HTTP protocol and so should never happen: HTTP only permits full URLs in Location headers. However, it is also a source of much confusion, not least because the CGI spec has a similar Location header with different semantics where relative paths are allowed. There are a lot of broken servers out there! In this instance, the first form of ProxyPassReverse will return the incorrect response
HTTP/1.1 302 Found
Location: /some/path/to/file.html
which, even allowing for error-correcting browsers, is outside the Proxy's address space and won't work. The second form fixes this to
HTTP/1.1 302 Found
Location: /app2/some/path/to/file.html
which is still broken, but will at least work in error-correcting browsers. Most browsers will deal with this.
If your backend server uses cookies, you may also need the ProxyPassReverseCookiePath and ProxyPassReverseCookieDomain directives. These are similar to ProxyPassReverse, but deal with the different form of cookie headers. These require mod_proxy from Apache 2.2 (recommended), or a patched version of 2.0.
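As a rough sketch of how the cookie directives might be applied to the first application server: the domain and path values below are assumptions based on this article's example scenario (cookies set by internal1.example.com with path /), not something every backend will need.

```apache
<Location /app1/>
    # Rewrite the domain attribute of Set-Cookie headers from the backend
    # (assumes the backend sets cookies for its own internal hostname)
    ProxyPassReverseCookieDomain internal1.example.com www.example.com
    # Rewrite the path attribute so the cookie is scoped to the proxied URL space
    # (assumes the backend sets cookies with path /)
    ProxyPassReverseCookiePath / /app1/
</Location>
```

Both directives take the internal value first and the public value second, mirroring the mapping that ProxyPass sets up.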
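As we have seen, ProxyPassReverse remaps URLs in the HTTP headers to ensure they work from outside the company network. There is, however, a separate problem when links appear in HTML pages served. Consider the following cases:
1. A relative link with no hostname and no leading slash. The browser resolves it against the current page, so it works correctly as it stands.
2. A link that starts with a slash but has no hostname. The browser resolves it against http://www.example.com/, which loses the /app1/ or /app2/ prefix and is therefore incorrect.
3. A full link to an internal address such as http://internal2.example.com/. From outside the company network, the browser gets a "no such host" error.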
The same problem of course applies to included content such as images, stylesheets, scripts or applets, and other contexts where URLs occur in HTML.
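To fix this requires us to parse the HTML and rewrite the links. This is the purpose of mod_proxy_html. It works as an output filter, parsing the HTML and rewriting links as it is served. Two configuration directives are required to set it up: SetOutputFilter proxy-html, which inserts the filter into the output chain, and ProxyHTMLURLMap from-pattern to-pattern, which defines a rewrite rule.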
mod_proxy_html is based on a SAX parser: specifically the HTMLparser module from libxml2 running in SAX mode (any other parse mode would of course be very much slower, especially for larger documents). It has full knowledge of all URI attributes that can occur in HTML 4 and XHTML 1. Whenever a URL is encountered, it is matched against applicable ProxyHTMLURLMap directives. If it starts with any from-pattern, that will be rewritten to the to-pattern. Rules are applied in the reverse order to their appearance in httpd.conf, and matching stops as soon as a match is found.
Here's how we set up a reverse proxy for HTML. Firstly, full links to the internal servers should be rewritten regardless of where they arise, so we have:
ProxyHTMLURLMap http://internal1.example.com /app1
ProxyHTMLURLMap http://internal2.example.com /app2
Note that in this instance we omitted the "trailing" slash. Since the matching logic is starts-with, we use the minimal matching pattern. We have now globally fixed case 3 above.
Case 2 above requires a little more care. Because the link doesn't include the hostname, the rewrite rule must be context-sensitive. As with ProxyPassReverse above, we deal with that using
<Location /app1/>
    ProxyHTMLURLMap / /app1/
</Location>
<Location /app2/>
    ProxyHTMLURLMap / /app2/
</Location>
The above is a simple case taken from mod_proxy_html version 1. With the more complex URL mapping and rewriting enabled by Version 2, you may need a bit of help setting up a complex ruleset, perhaps involving a series of complex regexps, chained and blocking rules, etc. To help with setting up and troubleshooting your rulesets, mod_proxy_html 2 provides a "debug" mode, in which all the 'interesting' things it does are written to the Apache error log. To analyse and fix your rulesets, set
ProxyHTMLLogVerbose On
LogLevel Info (or LogLevel Debug)
Now run your testcases through your rulesets, and examine the Apache error log for details of exactly how they were processed.
Do not leave ProxyHTMLLogVerbose On for normal use. Although the effect is marginal, it is an overhead.
The previous section sets up remapping of HTML URLs, but leaves any URL encountered in a Stylesheet or Script untouched. mod_proxy_html doesn't parse Javascript or CSS, so dealing with URLs in them requires text-based search-and-replace. This is enabled by the directive ProxyHTMLExtended On.
Because the extended mode is text-based, it can no longer guarantee to match exact URLs. It's up to you to devise matching rules that can pick out URLs, just as if you were writing an old-fashioned Perl or PHP regexp-based filter (though of course it's still massively more efficient than performing search-and-replace on an entire document in-memory). To help with this, ProxyHTMLURLMap supports both simple text-based and regular expression search-and-replace, according to the flags given on each rule. You can also use the flags to specify rules separately for HTML links, scripting events, and embedded scripts and stylesheets.
A second key consideration with extended URL mapping is that whereas an HTML link contains exactly one URL, a script or stylesheet may contain many. So instead of stopping after a successful match, the processor will apply all applicable mapping rules. This can be stopped with the L (last) flag.
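To make the flags concrete, here is a hedged sketch of an extended ruleset for the first application server. The regular expression is adapted from the pattern style shown in the mod_proxy_html documentation, and the flag meanings in the comments (R regexp, i case-insensitive, h skip HTML links, e skip scripting events, L last match) are as I understand them from that documentation; check them against the version you have installed.

```apache
ProxyHTMLExtended On

<Location /app1/>
    SetOutputFilter proxy-html

    # Plain starts-with rule: rewrites full internal URLs wherever they appear
    ProxyHTMLURLMap http://internal1.example.com /app1

    # Regexp rule aimed at CSS url(...) references in embedded stylesheets:
    #   R = regular expression match, i = case-insensitive,
    #   h = do not apply to HTML links, e = do not apply to scripting events
    ProxyHTMLURLMap url\(http://internal1\.example\.com([^\)]*)\) url(/app1$1) Rihe
</Location>
```

Adding L to a rule's flags stops processing after that rule matches, which matters in scripts and stylesheets where all applicable rules are otherwise applied.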
We just set up a proxy to parse and where necessary correct HTML. But of course, the web isn't just HTML. Surely feeding non-HTML content through an HTML parser is at best inefficient, if not totally broken?
Yes indeed. mod_proxy_html deals with that by checking the Content-Type header, and removing itself from the processing chain when a document is not HTML (text/html) or XHTML (application/xhtml+xml). This happens in the filter initialisation phase, before any data are processed by the filter.
But that still leaves a problem. Consider compressed HTML:
Content-Type: text/html
Content-Encoding: gzip
Feeding that into an HTML parser is clearly broken!
There are two solutions to this. One is to uncompress the incoming data with mod_deflate, run it through the HTML filter, and recompress it on the way out. Compression radically reduces network traffic, but uncompressing and recompressing increases the processor load on the proxy. It is worthwhile if and only if bandwidth between the proxy and the backend is at a premium: this is common on the 'net at large, but unlikely to be the case on a company internal network.
SetOutputFilter INFLATE;proxy-html;DEFLATE
The alternative solution is to refuse to support compression. Stripping any Accept-Encoding request header does the job. So invoking mod_headers, we add a directive
RequestHeader unset Accept-Encoding
This should only apply to the proxy, so we put it inside our <Location> containers.
A similar situation arises in the case of encrypted (https) content. But in this case, there is no such workaround: if we could decrypt the data to process it then so could any other man-in-the-middle, and the security would be worthless. This can only be circumvented by installing mod_ssl and a certificate on the proxy, so that the actual secure session is between the browser and the proxy, not the origin server.
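A minimal sketch of terminating SSL on the proxy might look like the following. The certificate and key paths are hypothetical placeholders, and the example assumes the backend is still reached over plain HTTP on the internal network.

```apache
LoadModule ssl_module modules/mod_ssl.so
Listen 443

<VirtualHost *:443>
    ServerName www.example.com

    # The secure session ends here, at the proxy
    SSLEngine on
    # Hypothetical file locations; substitute your own certificate and key
    SSLCertificateFile /etc/ssl/certs/www.example.com.crt
    SSLCertificateKeyFile /etc/ssl/private/www.example.com.key

    ProxyRequests Off
    # Traffic to the backend is unencrypted, so the HTML filter can process it
    ProxyPass /app1/ http://internal1.example.com/
    <Location /app1/>
        ProxyPassReverse /
        SetOutputFilter proxy-html
        ProxyHTMLURLMap http://internal1.example.com /app1
        ProxyHTMLURLMap / /app1/
        RequestHeader unset Accept-Encoding
    </Location>
</VirtualHost>
```

Because the rewritten links are host-relative (/app1/...), they inherit the https scheme from the page the browser requested.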
We are now in a position to write a complete configuration for our reverse proxy. Here is a bare minimum that ignores extended URL mapping:
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so
LoadModule headers_module modules/mod_headers.so
LoadFile /usr/lib/libxml2.so
LoadModule proxy_html_module modules/mod_proxy_html.so
ProxyRequests off
ProxyPass /app1/ http://internal1.example.com/
ProxyPass /app2/ http://internal2.example.com/
ProxyHTMLURLMap http://internal1.example.com /app1
ProxyHTMLURLMap http://internal2.example.com /app2
<Location /app1/>
    ProxyPassReverse /
    SetOutputFilter proxy-html
    ProxyHTMLURLMap / /app1/
    ProxyHTMLURLMap /app1 /app1
    RequestHeader unset Accept-Encoding
</Location>
<Location /app2/>
    ProxyPassReverse /
    SetOutputFilter proxy-html
    ProxyHTMLURLMap / /app2/
    ProxyHTMLURLMap /app2 /app2
    RequestHeader unset Accept-Encoding
</Location>
Of course, there's more than one way to do it. Our configuration would actually have been simpler if we'd used Virtual Hosts for each application server. But that takes you beyond the realm of Apache configuration and into DNS. If you don't fully understand that (or if you think "why can't I see my domain" is a webserver question), then please don't try using virtual hosts for this.
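For comparison, a virtual-host version for the first application server might be sketched as follows. The public hostname app1.example.com is an assumption introduced for illustration; it would need its own public DNS entry pointing at the proxy, which is exactly the DNS work mentioned above.

```apache
<VirtualHost *:80>
    # Hypothetical public name for the first application server
    ServerName app1.example.com

    ProxyRequests Off
    # The whole virtual host maps one-to-one onto the backend
    ProxyPass / http://internal1.example.com/
    ProxyPassReverse / http://internal1.example.com/

    SetOutputFilter proxy-html
    # Only full internal URLs need rewriting; links that start with a slash
    # already resolve correctly because the path spaces coincide
    ProxyHTMLURLMap http://internal1.example.com http://app1.example.com

    RequestHeader unset Accept-Encoding
</VirtualHost>
```

The simplification comes from the path mapping being the identity: there is no /app1/ prefix to add, so the context-sensitive rules inside <Location> containers are no longer needed.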
We haven't dealt with caching in this article. In a company-intranet situation, the connection from the proxy to the application servers is the local LAN, which is probably fast and has ample capacity. In such cases, caching at the proxy will have little effect, and can probably be omitted.
If we want to cache pages, we can of course do so with mod_cache, but that is beyond the scope of this article.
Another powerful use for a proxy is to transform the content on-the-fly according to the user's preferences. This author's flagship mod_accessibility product (from which mod_proxy_html is a spinoff) serves to transform HTML and XHTML on-demand to enhance usability and accessibility.
A reverse proxy is not the natural place for a "family filter", but is ideal for defining access controls and imposing security restrictions. We could, for example, configure the proxy to recognise a custom header from an origin server and block content based on it. This delegates control to the application servers.
(A) It doesn't really, but it may appear to. Here are the possible causes:
Changing the FPI (the <!DOCTYPE ...> line) may affect some browsers. FIX: set the doctype explicitly if this bothers you.
mod_proxy_html has the side-effect of transforming content to utf-8 (Unicode) encoding. This should not be a problem: utf-8 is well-supported by browsers, and offers comprehensive support for internationalisation. If it appears to cause a problem, that's almost certainly a bug in the application server, or possibly a misconfigured browser. FIX: filter through mod_charset_lite to your chosen charset.
mod_proxy_html will perform some minor normalisations. If your HTML includes elements that are closed implicitly, it will explicitly close them. In other words:
<p>Hello, World!
will be transformed to
<p>Hello, World!</p>
If this affects the rendition in your browser, it almost certainly means you are using malformed HTML and relying on error-correction in a browser. FIX: validate your HTML! The online Page Valet service will both validate and show your markup normalised by the DTD, while a companion tool AccessValet will show markup normalised by the same parser used in the proxy, and highlight other problems. Both are available at http://valet.webthing.com/